CN113421231B - Bleeding point detection method, device and system - Google Patents


Info

Publication number
CN113421231B
Authority
CN
China
Prior art keywords
detection
image
training
bleeding
features
Prior art date
Legal status
Active
Application number
CN202110638425.5A
Other languages
Chinese (zh)
Other versions
CN113421231A (en)
Inventor
徐雨亭
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202110638425.5A
Publication of CN113421231A
Application granted
Publication of CN113421231B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0012 Biomedical image inspection
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/90 Determination of colour characteristics
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10068 Endoscopic image
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing
    • G06T 2207/30101 Blood vessel; Artery; Vein; Vascular

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a bleeding point detection method, device, and system. The method includes: determining optical flow detection information between an original detection image and the detection image of the previous frame, and generating an optical flow detection image corresponding to the optical flow detection information; inputting the original detection image and the optical flow detection image into a target network model, extracting context semantic detection features from the original detection image through the target network model, extracting motion detection features from the optical flow detection image, and fusing the context semantic detection features and the motion detection features to obtain fused detection features; inputting the fused detection features and the context semantic detection features into the target network model so as to output, through the target network model, a binary mask detection image comprising a bleeding area and a non-bleeding area; and determining the bleeding point position inside the target object based on the bleeding area in the binary mask detection image. Through this technical scheme, the bleeding point position can be located quickly and accurately, improving detection accuracy.

Description

Bleeding point detection method, device and system
Technical Field
The present application relates to the field of image processing, and in particular, to a bleeding point detection method, device, and system.
Background
An endoscope is a commonly used medical instrument composed of a light-guide beam structure and a group of lenses. After the endoscope enters the interior of a target object, it can acquire an original image of the interior of the target object, and the target object can be inspected and treated based on that original image.
If a bleeding condition exists inside the target object, the bleeding point needs to be found in time so that it can be treated. However, finding a bleeding point quickly and accurately is difficult: it requires a high level of professional knowledge and a great deal of practical experience. A less experienced doctor may fail to find the bleeding point, and even with the original image acquired by the endoscope, may be unable to locate the bleeding point inside the target object from that image alone.
Disclosure of Invention
The present application provides a bleeding point detection method, comprising:
acquiring an original detection image inside a target object through an endoscope;
determining optical flow detection information between the original detection image and a detection image of a frame before the original detection image, and generating an optical flow detection image corresponding to the optical flow detection information;
inputting an original detection image and an optical flow detection image into a trained target network model, extracting context semantic detection features from the original detection image through the target network model, extracting motion detection features from the optical flow detection image, and fusing the context semantic detection features and the motion detection features to obtain fused detection features; the context semantic detection features represent context semantic information of bleeding points in an original detection image, and the motion detection features represent motion information of the bleeding points in an optical flow detection image; inputting the fusion detection feature and the context semantic detection feature to the target network model to output a binary mask detection image through the target network model; the binary mask detection image comprises a bleeding area and a non-bleeding area, the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value; determining bleeding point locations inside the target object based on bleeding regions in the binary mask detection image.
The present application provides a bleeding point detection device, the device comprising: an acquisition module, configured to acquire an original detection image inside a target object through an endoscope; a generating module, configured to determine optical flow detection information between the original detection image and the detection image of the frame previous to the original detection image, and generate an optical flow detection image corresponding to the optical flow detection information; a processing module, configured to input the original detection image and the optical flow detection image into a trained target network model, so as to extract context semantic detection features from the original detection image through the target network model, extract motion detection features from the optical flow detection image, and fuse the context semantic detection features and the motion detection features to obtain fused detection features, wherein the context semantic detection features represent context semantic information of bleeding points in the original detection image, and the motion detection features represent motion information of the bleeding points in the optical flow detection image; and further configured to input the fused detection features and the context semantic detection features into the target network model so as to output a binary mask detection image through the target network model, the binary mask detection image comprising a bleeding area and a non-bleeding area, wherein the pixel value of each pixel point in the bleeding area is a first value and the pixel value of each pixel point in the non-bleeding area is a second value; and a determination module, configured to determine the bleeding point position inside the target object based on the bleeding area in the binary mask detection image.
In a possible implementation, the apparatus further includes a training module, configured to train the target network model by: acquiring an original training image inside a target object through an endoscope; determining optical flow training information between the original training image and a training image of a frame before the original training image, and generating an optical flow training image corresponding to the optical flow training information; inputting an original training image and an optical flow training image into an initial network model, extracting context semantic training features from the original training image through the initial network model, extracting motion training features from the optical flow training image, and fusing the context semantic training features and the motion training features to obtain fused training features; wherein the context semantic training features represent context semantic information of bleeding points in the original training image, and the motion training features represent motion information of the bleeding points in the optical flow training image; inputting the fusion training features and the context semantic training features to the initial network model to output a binary mask training image through the initial network model; the binary mask training image comprises a bleeding area and a non-bleeding area, wherein the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value; determining a loss value based on the binary mask training image and the configured binary mask calibration image; the binary mask calibration image is a calibration image of the original training image, the binary mask calibration image comprises a bleeding area and a non-bleeding area, the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value; and training the initial network model based on the loss value to obtain a target network model.
The present application provides a bleeding point detection system, the bleeding point detection system includes: the endoscope, the light source device, the camera host, the display device and the storage device; wherein: the endoscope is used for acquiring an original detection image in a target object and inputting the original detection image to the camera host;
the light source device is used for providing a light source for the endoscope;
the camera host is used for determining optical flow detection information between the original detection image and a detection image of a previous frame of the original detection image and generating an optical flow detection image corresponding to the optical flow detection information; inputting the original detection image and the optical flow detection image into a trained target network model, extracting context semantic detection features from the original detection image through the target network model, extracting motion detection features from the optical flow detection image, and fusing the context semantic detection features and the motion detection features to obtain fused detection features; inputting the fusion detection features and the context semantic detection features into the target network model to output a binary mask detection image through the target network model, wherein the binary mask detection image comprises a bleeding area and a non-bleeding area; determining bleeding point locations inside the target object based on bleeding regions in the binary mask detection image;
the display device is used for displaying positions of bleeding points in the target object;
the storage device is used for storing the original detection image and the binary mask detection image.
According to the technical scheme, in the embodiments of the application, the original detection image and the optical flow detection image are input into the target network model, the binary mask detection image is output through the target network model, and the bleeding point position inside the target object is determined based on the bleeding area in the binary mask detection image. That is, the bleeding point position inside the target object is determined based on the original detection image and the optical flow detection image, so that the bleeding point can be located quickly, accurately, and in real time, and the doctor can be prompted to find and treat the bleeding point as soon as possible. By inputting the optical flow detection image into the target network model, the optical flow information (i.e., the motion information of the bleeding point) is made available to the model, so that the motion information guides the target network model to make a more accurate judgment, improving the accuracy of bleeding point position detection.
Drawings
FIG. 1 is a schematic diagram of a bleeding point detection system in one embodiment of the present application;
FIG. 2 is a flowchart of the training process of a target network model in one embodiment of the present application;
FIG. 3 is a schematic illustration of the HSV color space in one embodiment of the present application;
FIGS. 4A and 4B are schematic structural diagrams of an initial network model in one embodiment of the present application;
FIG. 5 is a schematic flowchart of a bleeding point detection method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a camera host according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an intelligent processing unit in one embodiment of the present application;
FIGS. 8A and 8B are examples of a detection image and an optical flow diagram in one embodiment of the present application;
FIG. 8C is a block diagram of a target network model in one embodiment of the present application;
FIG. 8D is a schematic illustration of the display effect in one embodiment of the present application;
FIG. 9 is a schematic structural diagram of a bleeding point detection device according to an embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein is meant to encompass any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Referring to fig. 1, a schematic diagram of a bleeding point detection system (which may also be referred to as an endoscope system) is shown. The bleeding point detection system may include: an endoscope, a light source device, a camera host (also called a camera system host), a display device, and a storage device, wherein the display device and the storage device are external devices. Of course, fig. 1 is only an example of a bleeding point detection system, and the structure of the bleeding point detection system is not limited.
The endoscope can capture an original image of the inside of a target object (a subject such as a patient) and input the original image to the camera host. The light source device can provide a light source for the endoscope, i.e., emit illumination light from the front end of the endoscope, so that the endoscope can capture a relatively clear original image of the inside of the target object.
After the camera host receives the original image, the original image can be input to the storage device, the storage device stores the original image, and in the subsequent process, a user (such as a doctor and the like) can access the original image in the storage device, or access a video (a video consisting of a large number of original images) in the storage device.
After the camera host receives the original image, the original image can be input to the display device, the display device displays the original image, and a user can observe the original image displayed by the display device in real time.
After receiving the original image, the camera host may further perform related processing on the original image (the processing is described in the subsequent embodiments) to obtain a binary mask image corresponding to the original image, and the camera host may determine the bleeding point position inside the target object based on the bleeding area in the binary mask image.
After the camera host obtains the binary mask image corresponding to the original image, the binary mask image can be input into the storage device, the storage device stores the binary mask image, and in the subsequent process, a user can access the binary mask image in the storage device. After the camera host obtains the binary mask image corresponding to the original image, the binary mask image can be input to the display device, the binary mask image is displayed by the display device, and a user can observe the binary mask image displayed by the display device in real time.
After the camera host determines the bleeding point position in the target object, the bleeding point position in the target object can be input to the display device, the display device displays the bleeding point position in the target object, and a user can observe the bleeding point position in the target object displayed by the display device in real time.
In the application scenario, the bleeding point detection method provided in the embodiment of the present application may relate to a training process and a detection process. In the training process, a target network model may be trained using training data, and in the detection process, a bleeding point position inside a target object may be detected based on the trained target network model, and the training process and the detection process are described below with reference to specific embodiments.
Referring to fig. 2, which is a flowchart of the training process of the target network model, the process may be applied to the camera host (camera system host) and may include:
step 201, an original training image of the interior of a target object is acquired through an endoscope.
For example, the endoscope may acquire an original image of the inside of the target object and input the original image to the camera host. For convenience of distinction, the original image in the training process is referred to as the original training image; accordingly, the camera host obtains the original training image of the inside of the target object.
Step 202, determining optical flow training information between the original training image and a training image of a frame previous to the original training image, and generating an optical flow training image corresponding to the optical flow training information.
For example, the endoscope may collect multiple frames of original training images inside the target object and input all of them to the camera host, so that the camera host receives multiple frames of original training images. When receiving each frame of original training image, the camera host may first find the training image of the frame immediately preceding that original training image, and then determine the optical flow information between the original training image and that previous-frame training image.
For example, an optical flow method may be used to determine optical flow training information between an original training image and a training image of a frame before the original training image, and the optical flow method may be a pyramid LK optical flow method or other types of optical flow methods. Alternatively, the optical flow training information between the original training image and the training image of the frame preceding the original training image may be determined by using a FlowNet (optical flow neural network) method based on deep learning. Of course, the above-described method is merely an example, and the determination method of the optical flow training information is not limited.
When an optical flow method is used to determine the optical flow training information, the optical flow method uses the temporal changes of pixels in the image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby computes the motion information of objects between adjacent frames. Therefore, the optical flow training information between the original training image and the training image of the frame preceding it can be determined by an optical flow method.
In determining the optical flow training information by using the FlowNet method, a FlowNet for finding the motion information of an object between adjacent frames may be trained in advance, and therefore, the optical flow training information between an original training image and a training image of a frame preceding the original training image may be determined by using the FlowNet method.
For example, the optical flow training information may include a motion direction (i.e., the motion direction of the moving object) and a motion speed (i.e., the motion speed of the moving object, determined based on distance and duration). For each pixel point in the original training image, a reference point matching the pixel point is found in the previous training image, and the motion direction and motion speed corresponding to the pixel point are determined based on the positional relationship between the pixel point and the reference point. Thus, the optical flow training information includes the motion direction and motion speed corresponding to each pixel point in the original training image.
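For illustration only (this is not part of the claimed embodiments), the per-pixel motion direction and motion speed described above can be derived from a dense optical flow field. The following Python sketch assumes OpenCV's Farneback dense optical flow; the function and parameter values are illustrative assumptions, and a pyramid LK or FlowNet-based method could be substituted:

```python
import cv2

def compute_flow_direction_speed(prev_bgr, curr_bgr):
    """Estimate per-pixel motion direction (radians) and motion speed
    (pixels per frame) between the previous frame and the current frame."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    # Dense Farneback optical flow; the embodiments do not limit the optical
    # flow method, so this is only one possible choice.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    # Convert the (dx, dy) flow field into per-pixel speed (magnitude)
    # and direction (angle in radians).
    speed, direction = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return direction, speed
```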
Based on this, generating the optical flow training image corresponding to the optical flow training information may include, but is not limited to: mapping the motion direction to the hue component of the HSV color space, mapping the motion speed to the value (brightness) component of the HSV color space, and generating the optical flow training image based on the hue component and the value component.
For example, for the first pixel point in the original training image, the motion direction corresponding to the pixel point is mapped to the hue component of the HSV color space, and the motion speed corresponding to the pixel point is mapped to the value component of the HSV color space; the hue component and the value component form the pixel value of the first pixel point of the optical flow training image. By analogy, the pixel values of all pixel points of the optical flow training image can be obtained, and these pixel values form the optical flow training image, that is, the optical flow training image is generated.
For example, HSV expresses a color image with three components: hue (H), saturation (S, i.e., color purity), and value (V, i.e., brightness). Referring to fig. 3, the HSV color space can be represented by a cylinder: the cross-section of the cylinder can be regarded as a polar coordinate system, H is represented by the polar angle, S by the polar radius, and V by the height along the central axis of the cylinder. On this basis, for each pixel point in the original training image, the motion direction corresponding to the pixel point can be mapped to the hue component of the HSV color space (the hue component corresponds to the polar angle), and the motion speed corresponding to the pixel point can be mapped to the value component of the HSV color space (the value component corresponds to the height along the central axis); the mapping process is not limited. After the hue component and the value component are obtained, the pixel values in the HSV color space are obtained, and all the pixel values form the optical flow training image, i.e., an image in the HSV color space.
For the optical flow training image of HSV color space, the optical flow training image may be a visualized image, where different colors in the optical flow training image represent different directions of motion in the optical flow training information, and different brightnesses in the optical flow training image represent different velocities of motion in the optical flow training information. In summary, for each pixel point in the optical flow training image, the color of the pixel point represents the motion direction in the optical flow training information, and the brightness of the pixel point represents the motion speed in the optical flow training information.
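Continuing the illustration, and again only as a non-limiting sketch, the direction and speed maps can be converted into an HSV optical flow image in which hue encodes motion direction and value (brightness) encodes motion speed; the fixed saturation and the normalization range used below are assumptions:

```python
import cv2
import numpy as np

def flow_to_hsv_image(direction, speed):
    """Build an optical flow image in the HSV color space: hue encodes the
    motion direction, value (brightness) encodes the motion speed."""
    h, w = direction.shape
    hsv = np.zeros((h, w, 3), dtype=np.uint8)
    # OpenCV stores hue in [0, 180); map the angle in [0, 2*pi) onto it.
    hsv[..., 0] = (direction * 180.0 / (2.0 * np.pi)).astype(np.uint8)
    hsv[..., 1] = 255  # fixed saturation so the color purely reflects direction
    # Normalize speed to [0, 255]: brighter pixels correspond to faster motion.
    hsv[..., 2] = cv2.normalize(speed, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return hsv  # this HSV image serves as the visualized optical flow image
```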
Step 203, inputting the original training image and the optical flow training image to an initial network model, so as to extract context semantic training features from the original training image through the initial network model, extract motion training features from the optical flow training image, and fuse the context semantic training features and the motion training features to obtain fusion training features. The context semantic training features may represent context semantic information of bleeding points in the original training image, the motion training features may represent motion information of the bleeding points in the optical flow training image, and the fusion training features may represent fusion information of the context semantic information and the motion information.
And 204, inputting the fusion training feature and the context semantic training feature into an initial network model so as to output a binary mask training image through the initial network model. For example, the binary mask training image may include a bleeding area and a non-bleeding area, where the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value.
For example, an initial network model (i.e., a network model to be trained) may be pre-constructed, the initial network model may be a deep learning network model, or may be another type of network model, and the structure of the initial network model is not limited as long as the input of the initial network model is an original training image and an optical flow training image, and the output of the initial network model is a binary mask training image.
For example, the initial network model may include a first sub-network, a second sub-network, and a third sub-network, where the first sub-network is used to process the original training images and the features corresponding to the original training images, the second sub-network is used to process the optical flow training images and the features corresponding to the optical flow training images, and the third sub-network is used to process the features corresponding to the original training images and the features corresponding to the optical flow training images.
The first sub-network may be a semantic segmentation network, for example a deep-learning-based semantic segmentation network, or may be another type of network. The second sub-network may likewise be a semantic segmentation network, for example a deep-learning-based semantic segmentation network, or another type of network, which is not limited. The first sub-network and the second sub-network have similar network structures, which may be the same or different; for example, the second sub-network may be lighter than the first sub-network. The third sub-network may include at least one MGA (Motion Guided Attention) network layer, which is connected between the first sub-network and the second sub-network.
For steps 203 and 204, the original training image may be input to a first subnetwork, the contextual semantic training features may be extracted from the original training image by the first subnetwork, and the contextual semantic training features may be input to a third subnetwork. For example, the first sub-network may include at least one network layer, such as a convolutional layer, a pooling layer, an excitation layer, and the like, and may use feature vectors output by one network layer as the context semantic training features, or may use feature vectors output by multiple network layers as the context semantic training features.
The optical flow training image may be input to the second sub-network, the motion training features extracted from the optical flow training image by the second sub-network, and the motion training features input to the third sub-network. For example, the second sub-network may include at least one network layer, such as a convolutional layer, a pooling layer, an excitation layer, and the like; the feature vector output by one network layer may be used as the motion training feature, or the feature vectors output by multiple network layers may be used as the motion training features, that is, the number of motion training features may be one or more.
And after the context semantic training characteristics and the motion training characteristics are input into a third sub-network, the context semantic training characteristics and the motion training characteristics are fused through the third sub-network to obtain fusion training characteristics. After the fused training features are obtained, the third subnetwork inputs the fused training features to the first subnetwork. For example, the third sub-network includes at least one network layer, and for each network layer, the input of the network layer is the context semantic training feature and the motion training feature, the output of the network layer is the fusion training feature, and the network layer is configured to fuse the context semantic training feature and the motion training feature to obtain the fusion training feature.
In the above process, the fused training features and the contextual semantic training features have been input to the first sub-network, which may generate a binary mask training image based on the fused training features and the contextual semantic training features. For example, after the fusion training feature and the context semantic training feature are input to a certain network layer of the first sub-network, the network layer may perform an operation on the fusion training feature and the context semantic training feature to obtain a binary mask training image.
Referring to fig. 4A, a schematic diagram of an initial network model is shown, in which a first sub-network includes a network layer a1 and a network layer a2, a second sub-network includes a network layer b1, and a third sub-network includes a network layer c1. The original training image can be input to the network layer a1, the network layer a1 processes the original training image to obtain a feature vector d1, the feature vector d1 is a context semantic training feature, the network layer a1 inputs the feature vector d1 to the network layer c1, and the feature vector d1 is input to the network layer a2. The optical flow training image can be input to the network layer b1, the network layer b1 processes the optical flow training image to obtain a feature vector d2, the feature vector d2 is a motion training feature, and the network layer b1 inputs the feature vector d2 to the network layer c1. After receiving the feature vector d1 and the feature vector d2, the network layer c1 fuses the feature vector d1 and the feature vector d2 to obtain a feature vector d3, wherein the feature vector d3 is a fusion training feature, and the network layer c1 inputs the feature vector d3 to the network layer a2. After receiving the feature vector d1 and the feature vector d3, the network layer a2 may generate a feature vector d4 based on the feature vector d1 and the feature vector d3, for example, the feature vector d4 is the sum of the feature vector d1 and the feature vector d3, and generate a binary mask training image matched with the feature vector d 4.
Referring to fig. 4B, which is a schematic diagram of an initial network model, a first sub-network includes a network layer a1, a network layer a2 and a network layer a3, a second sub-network includes a network layer B1 and a network layer B2, and a third sub-network includes a network layer c1 and a network layer c2. The network layer a1 processes the original training image to obtain a feature vector e1 (context semantic training feature), and inputs the feature vector e1 to the network layer c1 and the network layer a2. The network layer b1 processes the optical flow training image to obtain a feature vector e2 (motion training feature), and inputs the feature vector e2 to the network layer c1 and the network layer b2. The network layer c1 fuses the feature vector e1 and the feature vector e2 to obtain a feature vector e3 (fusion training feature), and inputs the feature vector e3 to the network layer a2. The network layer a2 generates a feature vector e4 based on the feature vector e1 and the feature vector e3, processes the feature vector e4 to obtain a feature vector e5 (context semantic training feature), and inputs the feature vector e5 to the network layer c2 and the network layer a3. The network layer b2 processes the feature vector e2 to obtain a feature vector e6 (motion training feature), and inputs the feature vector e6 to the network layer c2. The network layer c2 fuses the feature vector e5 and the feature vector e6 to obtain a feature vector e7 (fusion training feature), and inputs the feature vector e7 to the network layer a3. The network layer a3 generates a feature vector e8 (e.g., the sum of the feature vector e5 and the feature vector e 7) based on the feature vector e5 and the feature vector e7, and generates a binary mask training image matched with the feature vector e 8.
Of course, fig. 4A and 4B are only examples, and the initial network model is not limited as long as the initial network model can extract the context semantic training features from the original training image, extract the motion training features from the optical flow training image, fuse the context semantic training features and the motion training features to obtain the fusion training features, and output the binary mask training image based on the fusion training features and the context semantic training features.
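As a hedged illustration of the kind of two-branch structure shown in fig. 4A (not the patent's exact model; the layer widths, depths, kernel sizes, and the 0.5 threshold below are all assumptions), the wiring of the first, second, and third sub-networks could be sketched in PyTorch as follows:

```python
import torch
import torch.nn as nn

class DualBranchBleedingSegmenter(nn.Module):
    """Illustrative wiring in the spirit of fig. 4A: a1/a2 form the first
    sub-network, b1 the second sub-network, and c1 the third sub-network."""

    def __init__(self, channels=64):
        super().__init__()
        # First sub-network: processes the original image (context semantics).
        self.a1 = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.a2 = nn.Conv2d(channels, 1, kernel_size=1)  # mask head
        # Second sub-network: processes the optical flow image (motion).
        self.b1 = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True))
        # Third sub-network (MGA-style gate c1): 1x1 convolution followed by Sigmoid.
        self.c1 = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, image, flow_image):
        d1 = self.a1(image)        # context semantic feature (feature vector d1)
        d2 = self.b1(flow_image)   # motion feature (feature vector d2)
        d3 = d1 * self.c1(d2)      # fusion feature (feature vector d3)
        d4 = d1 + d3               # sum of context semantic and fusion features
        prob = torch.sigmoid(self.a2(d4))
        # Thresholding prob (e.g. at 0.5) yields the binary mask: bleeding pixels
        # take the first value (e.g. 255), non-bleeding pixels the second value (e.g. 0).
        return prob
```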
In the above embodiment, for the context semantic training feature, as compared with a bleeding point and a non-bleeding point in a target object, the context semantic (i.e., the surrounding environment) of the bleeding point is different from the context semantic of the non-bleeding point, and the original training image can reflect the context semantic information of each pixel point, therefore, after the context semantic training feature is extracted from the original training image, the context semantic training feature can represent the context semantic information of the bleeding point and the context semantic information of the non-bleeding point in the original training image, that is, the context semantic training feature can reflect the difference between the bleeding point and the non-bleeding point, that is, the context semantic information of each pixel point in the original training image represents the difference between the bleeding point and the non-bleeding point.
For the motion training feature, compared with a bleeding point and a non-bleeding point in a target object, the bleeding point has motion in a picture, the non-bleeding point does not have motion in the picture, and the optical flow training image can reflect motion information of each pixel point, so after the motion training feature is extracted from the optical flow training image, the motion training feature can represent the motion information of the bleeding point and the motion information of the non-bleeding point in the optical flow training image, that is, the motion training feature can reflect the difference between the bleeding point and the non-bleeding point, that is, the difference between the bleeding point and the non-bleeding point is reflected by the motion information of each pixel point in the optical flow training image.
For the fusion training feature, the context semantic training feature and the motion training feature are fused, the context semantic training feature can represent context semantic information of bleeding points and context semantic information of non-bleeding points, and the motion training feature can represent motion information of the bleeding points and motion information of the non-bleeding points, so that the fusion training feature can represent fusion information of the context semantic information and the motion information, and the fusion information can reflect the difference between the bleeding points and the non-bleeding points, namely, the difference between the bleeding points and the non-bleeding points is represented through the context semantic information and the motion information.
For a binary mask training image, the binary mask training image is a binary image, which can be understood as a black-and-white image, that is, the pixel value of a pixel point is only a first value (e.g., 255) or a second value (e.g., 0). In the generation process of the binary mask training image, the binary mask training image can be divided into a bleeding area and a non-bleeding area based on the context semantic information and the motion information, the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value. For a bleeding area, the context semantic information of the bleeding area represents the bleeding surroundings, and the motion information of the bleeding area represents that there is motion in the area. For a non-bleeding area, the context semantic information of the non-bleeding area indicates the non-bleeding surroundings, and the motion information of the non-bleeding area indicates that there is no motion in the area.
In summary, after the original training image and the optical flow training image are input to the initial network model, the initial network model outputs a binary mask training image, and subsequent processing may be performed based on the binary mask training image.
In one possible embodiment, the context semantic training features and the motion training features are fused by the third sub-network to obtain fused training features, which may include but are not limited to: convolving the motion training features to obtain the motion training features after convolution, and mapping the motion training features after convolution by adopting an activation function (such as a Sigmoid function) to obtain the mapped motion training features; and then, performing multiplication operation on the context semantic training characteristics and the mapped motion training characteristics to obtain fusion training characteristics.
For example, the above fusion operation can be expressed by the following formula:
f′a = fa ⊗ Sigmoid(h(fm))
In the above equation, fm represents the motion training feature, a feature vector of dimension C′ × H × W; h(fm) represents convolving fm, for example with a 1×1 convolution kernel; Sigmoid(h(fm)) represents mapping the convolved motion training feature h(fm) with a Sigmoid function; fa represents the context semantic training feature; ⊗ denotes multiplication of the elements at corresponding positions; and f′a represents the fusion training feature.
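Written out in code, this formula corresponds to the gating used inside network layer c1 in the sketch above; the channel count in this standalone illustration is an assumption:

```python
import torch
import torch.nn as nn

# h: a 1x1 convolution applied to the motion feature fm (64 channels assumed here).
h = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=1)

def fuse(fa, fm):
    """f'a = fa (x) Sigmoid(h(fm)): gate the context semantic feature fa with an
    attention map derived from the motion feature fm, element-wise per position."""
    attention = torch.sigmoid(h(fm))  # Sigmoid(h(fm))
    return fa * attention             # element-wise product gives the fusion feature f'a
```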
Step 205, determining a loss value based on the binary mask training image and the configured binary mask calibration image. For example, the binary mask calibration image may be a calibration image of an original training image, and the binary mask calibration image may include a bleeding area and a non-bleeding area, where pixel values of pixel points in the bleeding area are all first values, and pixel values of pixel points in the non-bleeding area are all second values.
For example, before step 205, a calibration image of the original training image may be configured, and the calibration image may be recorded as a binary mask calibration image. For example, when the user calibrates the original training image, the user may obtain a bleeding area and a non-bleeding area in the original training image, so that the bleeding area and the non-bleeding area in the original training image may be calibrated, and a binary mask calibration image may be generated based on the calibration information. When the binary mask calibration image is generated, the pixel values of the pixel points corresponding to the bleeding area are all the first values (for example, 255), and the pixel values of the pixel points corresponding to the non-bleeding area are all the second values (for example, 0), so that the binary mask calibration image can be generated, and the binary mask calibration image is a binary image and can be understood as a black and white image. Obviously, after the binary mask calibration image is obtained, all the first-valued areas are bleeding areas, and all the second-valued areas are non-bleeding areas.
After the binary mask training image and the binary mask calibration image are obtained, a loss value can be determined based on a difference value between the binary mask training image and the binary mask calibration image. Obviously, the smaller the loss value is, the closer the binary mask training image and the binary mask calibration image are, that is, the more accurate the binary mask training image output by the initial network model is, and the larger the loss value is, the farther the binary mask training image and the binary mask calibration image are, that is, the more inaccurate the binary mask training image output by the initial network model is.
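The embodiments do not prescribe a particular loss function; as one hedged example, a pixel-wise binary cross-entropy between the network's bleeding probability map and the binary mask calibration image could serve as the loss value:

```python
import torch.nn.functional as F

def mask_loss(pred_prob, calib_mask):
    """pred_prob: predicted bleeding probability in [0, 1], shape (N, 1, H, W).
    calib_mask: binary mask calibration image, first value (255) for bleeding
    pixels and second value (0) for non-bleeding pixels."""
    target = (calib_mask / 255.0).float()              # map {0, 255} to {0, 1}
    return F.binary_cross_entropy(pred_prob, target)   # smaller value = closer masks
```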
And step 206, training the initial network model based on the loss value to obtain a target network model.
For example, the network parameters (i.e., network weights) of the initial network model may be updated based on the loss values, resulting in an updated network model. For example, the network parameters of the initial network model are updated through a back propagation algorithm to obtain an updated network model, and the updating process is not limited. An example of a back propagation algorithm may be a gradient descent method, i.e. the network parameters of the initial network model are updated by a gradient descent method.
It is then determined whether the updated network model has converged. And if not, determining the updated network model as the initial network model, and returning to execute the operation of inputting the original training image and the optical flow training image to the initial network model. And if so, determining the updated network model as the trained target network model.
For example, if the iteration number of the initial network model has reached the preset number threshold, it is determined that the updated network model has converged, otherwise, it is determined that the updated network model has not converged.
For another example, if the iteration duration of the initial network model has reached the preset duration threshold, it is determined that the updated network model has converged, otherwise, it is determined that the updated network model has not converged.
For another example, if the loss value is less than the predetermined loss value threshold, it is determined that the updated network model has converged, and if the loss value is not less than the predetermined loss value threshold, it is determined that the updated network model has not converged.
Of course, the above are just a few examples, and the determination manner of whether the network model has converged is not limited.
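Putting the above steps together, one possible training loop is sketched below; the optimizer, learning rate, iteration limit, and loss threshold are illustrative assumptions rather than values taken from the embodiments:

```python
import torch
import torch.nn.functional as F

def train(model, data_loader, max_iters=10000, loss_threshold=1e-3, lr=1e-3):
    """Update the network parameters by back-propagation until a convergence
    condition is met (iteration count reached or loss below a threshold)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent
    iters, converged = 0, False
    while not converged:
        for image, flow_image, calib_mask in data_loader:
            prob = model(image, flow_image)                 # binary mask training output
            target = (calib_mask / 255.0).float()
            loss = F.binary_cross_entropy(prob, target)     # loss vs. calibration image
            optimizer.zero_grad()
            loss.backward()                                 # back-propagation
            optimizer.step()                                # parameter update
            iters += 1
            if iters >= max_iters or loss.item() < loss_threshold:
                converged = True                            # stop training
                break
    return model  # the trained target network model
```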
After the trained target network model is obtained, bleeding point detection may be implemented based on it. For the detection process, referring to fig. 5, which is a flowchart of the bleeding point detection method, the method may be applied to the camera host (camera system host) and may include:
step 501, acquiring an original detection image of the interior of a target object through an endoscope.
For example, the endoscope may collect an original image of the inside of the target object, and input the original image to the camera host, and for convenience of distinguishing, the original image in the detection process is recorded as an original detection image.
Step 502, determining optical flow detection information between the original detection image and a detection image of a frame previous to the original detection image, and generating an optical flow detection image corresponding to the optical flow detection information.
For example, when each frame of original detection image is received, the camera host may first search for a previous frame of detection image of the original detection image (i.e., a first frame of original detection image located in front of the original detection image), and then determine optical flow information between the original detection image and the previous frame of detection image of the original detection image.
The optical flow detection information may include a motion direction and a motion speed, and for each pixel point in the original detection image, a reference point matching the pixel point is found from a previous detection image, and based on a positional relationship between the pixel point and the reference point, the motion direction and the motion speed corresponding to the pixel point are determined. Obviously, the optical flow detection information includes the motion direction and the motion speed corresponding to each pixel point in the original detection image.
Based on this, generating the optical flow detection image corresponding to the optical flow detection information may include, but is not limited to: and mapping the motion direction to a hue component of an HSV color space, mapping the motion speed to a brightness component of the HSV color space, and generating an optical flow detection image based on the hue component and the brightness component.
For the optical flow detection image of HSV color space, the optical flow detection image may be a visualized image, different colors in the optical flow detection image represent different moving directions in the optical flow detection information, and different brightness in the optical flow detection image represents different moving speeds in the optical flow detection information.
Illustratively, step 502 is similar to step 202, and is not repeated here.
Step 503, inputting the original detection image and the optical flow detection image to a trained target network model, so as to extract context semantic detection features from the original detection image through the target network model, extract motion detection features from the optical flow detection image, and fuse the context semantic detection features and the motion detection features to obtain fusion detection features. The context semantic detection features represent context semantic information of bleeding points in the original detection image, and the motion detection features represent motion information of the bleeding points in the optical flow detection image.
Step 504, inputting the fusion detection feature and the context semantic detection feature to a target network model, so as to output a binary mask detection image through the target network model. For example, the binary mask detection image may include a bleeding area and a non-bleeding area, where the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value.
Illustratively, the network structure of the target network model is similar to that of the initial network model, and therefore, the target network model may include a first sub-network, a second sub-network and a third sub-network, the first sub-network being used for processing the original detection image and the features corresponding to the original detection image, the second sub-network being used for processing the optical flow detection image and the features corresponding to the optical flow detection image, and the third sub-network being used for processing the features corresponding to the original detection image and the features corresponding to the optical flow detection image.
Based on the detection result, the original detection image is input to a first sub-network, the context semantic detection features are extracted from the original detection image through the first sub-network, and the context semantic detection features are input to a third sub-network.
The optical flow detection image may be input to a second sub-network, motion detection features extracted from the optical flow detection image by the second sub-network, and the motion detection features input to a third sub-network. After the context semantic detection feature and the motion detection feature are input to the third sub-network, the context semantic detection feature and the motion detection feature can be fused through the third sub-network to obtain a fused detection feature. After obtaining the fused detection feature, the third sub-network may input the fused detection feature to the first sub-network.
In the above process, the fused detection feature and the contextual semantic detection feature have been input to the first sub-network, which generates a binary mask detection image based on the fused detection feature and the contextual semantic detection feature.
In the above embodiment, the context semantic detection feature can represent context semantic information of a bleeding point and context semantic information of a non-bleeding point in the original detection image, that is, the context semantic detection feature can reflect a difference between the bleeding point and the non-bleeding point. The motion detection feature can represent motion information of a bleeding point and motion information of a non-bleeding point in the optical flow detection image, that is, the motion detection feature can represent the difference between the bleeding point and the non-bleeding point. The fusion detection feature may represent fusion information of context semantic information and motion information, and the fusion information may reflect a difference between a bleeding point and a non-bleeding point, that is, the difference between the bleeding point and the non-bleeding point is reflected by the context semantic information and the motion information.
The binary mask detection image is a binary image, the binary mask detection image is divided into a bleeding area and a non-bleeding area based on context semantic information and motion information, the context semantic information of the bleeding area represents a bleeding surrounding environment, and the motion information of the bleeding area represents that the area has motion. The context semantic information of the non-bleeding area indicates the non-bleeding surroundings, and the motion information of the non-bleeding area indicates that there is no motion in the area.
In summary, after the original detection image and the optical flow detection image are input to the target network model, the target network model outputs a binary mask detection image, and subsequent processing may be performed based on the binary mask detection image.
Illustratively, fusing the context semantic detection feature and the motion detection feature through the third sub-network to obtain a fused detection feature, which may include but is not limited to: convolving the motion detection features to obtain the motion detection features after convolution, and mapping the motion detection features after convolution by adopting an activation function (such as a Sigmoid function) to obtain the mapped motion detection features; and then, performing multiplication operation on the context semantic detection features and the mapped motion detection features to obtain fusion detection features. For example, the above fusion operation can be expressed by the following formula:
f′a = fa ⊗ Sigmoid(h(fm))
where fm represents the motion detection feature, h(fm) represents the convolution of fm, Sigmoid represents the mapping using a Sigmoid function, fa represents the context semantic detection feature, ⊗ denotes multiplication of the elements at corresponding positions, and f′a represents the fusion detection feature.
Illustratively, steps 503-504 are similar to steps 203-204 and are not repeated here.
In step 505, a bleeding point position inside the target object is determined based on the bleeding region in the binary mask detection image, i.e., the bleeding region is determined as the bleeding point position inside the target object.
Illustratively, the binary mask detection image includes a bleeding area and a non-bleeding area, the pixel values of the pixels in the bleeding area are all first values, and the pixel values of the pixels in the non-bleeding area are all second values, so that the area formed by all the pixels with the first values is determined as the bleeding point position (also referred to as a bleeding point area) inside the target object, that is, the bleeding point position is the area formed by all the pixels with the first values.
In one possible embodiment, a minimum circumscribed circle that matches the bleeding point position may also be determined, and the bleeding point area inside the target object may be determined based on the minimum circumscribed circle. The bleeding cumulative time during which the bleeding point area is greater than a preset area threshold may then be counted. For example, within a preset time period, steps 501 to 505 are performed once per cycle of length K to obtain the bleeding point position of the current cycle; if the bleeding point area corresponding to the bleeding point position of the current cycle is greater than the preset area threshold, the bleeding cumulative time is updated (i.e., K is added to the bleeding cumulative time), and if it is not greater than the preset area threshold, the bleeding cumulative time is kept unchanged. The bleeding point position (i.e., that of the current cycle), the bleeding point area (i.e., that of the current cycle), and the bleeding cumulative time (i.e., the cumulative time within the preset time period) may then be displayed on the preview page.
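The cycle-based accumulation described above might be sketched as follows; the detection callable, the cycle length K in seconds, and the thresholds are illustrative assumptions rather than values given by the patent:

```python
import time
from typing import Callable

def accumulate_bleeding_time(detect_area: Callable[[], float],
                             period_s: float, K: float,
                             area_threshold: float) -> float:
    """Within a preset time period, run one detection pass (steps 501-505)
    every K seconds and accumulate the time during which the bleeding-point
    area exceeds the preset area threshold."""
    cumulative = 0.0
    elapsed = 0.0
    while elapsed < period_s:
        area = detect_area()        # bleeding-point area of the current cycle
        if area > area_threshold:
            cumulative += K         # add K to the bleeding cumulative time
        # otherwise the bleeding cumulative time is kept unchanged
        time.sleep(K)
        elapsed += K
    return cumulative
```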
According to the above technical solution, in the embodiment of the present application, the original detection image and the optical flow detection image can be input into the target network model, the binary mask detection image is output by the target network model, and the bleeding point position inside the target object is determined based on the bleeding area in the binary mask detection image. In other words, the bleeding point position inside the target object is determined based on the original detection image and the optical flow detection image, so that the bleeding point position can be located quickly and accurately in real time, reminding the doctor to find and treat the bleeding point as soon as possible. By inputting the optical flow detection image into the target network model, the optical flow information (i.e., the motion information of the bleeding point) is made available to the model, so that the motion information guides the target network model to make a more accurate judgment, which improves the accuracy of bleeding point position detection.
The following describes the above technical solution of the embodiment of the present application with reference to a specific application scenario.
Referring to fig. 6, which is a schematic structural diagram of the camera host, the camera host may include an image input unit, an image processing unit, an intelligent processing unit, a video encoding unit, a control unit, and an operation unit.
The image input unit may receive an original image (e.g., an original training image or an original inspection image) transmitted from the endoscope and transmit the original image to the image processing unit.
The image processing unit carries out ISP operation on the original image, the original image after the ISP operation is transmitted to the intelligent processing unit, the video coding unit and the display device, and the display device displays the original image. The ISP operations may include, but are not limited to, at least one of: luminance transformation, sharpening, moire removal, and scaling.
The intelligent processing unit carries out intelligent analysis processing on the original image and transmits the processed original image to the image processing unit and the video coding unit. The intelligent analysis process may include, but is not limited to, at least one of: scene classification based on deep learning, instrument detection, instrument head detection, gauze detection, moire classification, dense fog classification and the like. After receiving the original image processed by the intelligent processing unit, the image processing unit can also perform processing such as brightness conversion, moire pattern removal, frame folding, scaling and the like on the original image.
The video coding unit performs coding compression on the received original image and transmits the original image to the storage device.
The control unit controls various functions of the bleeding point detection system, which may include, but are not limited to, at least one of: lighting mode of light source, image processing mode, intelligent processing mode and video coding mode.
The operation unit may include, but is not limited to, a switch, a button, a touch panel, and the like, and the operation unit may receive an external instruction signal, output the external instruction signal to the control unit, and determine an illumination mode, an image processing mode, an intelligent processing mode, a video encoding mode, and the like of the light source based on the external instruction signal.
The schematic structure of the intelligent processing unit can be seen from fig. 7, and the intelligent processing unit may include a preprocessing module, an optical flow estimation module, a semantic segmentation module, a pre-training module, and a post-processing module.
For the original detection image collected by the endoscope, the original detection image may be input to the preprocessing module. The preprocessing module preprocesses the original detection image (the preprocessing mode is not limited; for example, the original detection image may be down-sampled), and the preprocessed original detection image is input to the optical flow estimation module.
The optical flow estimation module determines optical flow detection information corresponding to the original detection image, generates an optical flow detection image corresponding to the optical flow detection information, and inputs the original detection image and the optical flow detection image to the semantic segmentation module.
The pre-training module trains a target network model and inputs the target network model to the semantic segmentation module.
The semantic segmentation module extracts context semantic detection features in an original detection image based on a target network model, extracts motion detection features in an optical flow detection image, fuses the context semantic detection features and the motion detection features to obtain fusion detection features, obtains a binary mask detection image based on the context semantic detection features and the fusion detection features, and inputs the binary mask detection image to the post-processing module.
The post-processing module determines a bleeding point position in the target object based on the binary mask detection image, determines a bleeding point area and bleeding accumulation time in the target object, encodes the bleeding point position, the bleeding point area, the bleeding accumulation time and other contents into a code stream and sends the code stream to a display device, and the display device displays the bleeding point position, the bleeding point area, the bleeding accumulation time and other contents on a preview page in a superposition mode.
For the optical flow estimation module, the optical flow estimation module is used to find a motion region (i.e., a region where a bleeding point is located), and information of the motion region is used as a guide to provide prior assistance for a subsequent semantic segmentation module. The input of the optical flow estimation module is an original detection image and a detection image of a frame before the original detection image, and the output of the optical flow estimation module is an optical flow detection image. For example, the optical flow estimation module determines optical flow detection information between the two frames of images, the optical flow detection information including a moving direction and a moving speed, and converts the optical flow detection information into a visualized optical flow graph (i.e., an optical flow detection image), as shown in fig. 8A, which shows an example of the detection image, and a visualized optical flow graph corresponding to the detection image, as shown in fig. 8B. When the optical flow detection information is converted into a visual optical flow graph, the motion direction may be mapped to a hue component of an HSV color space, and the motion speed may be mapped to a lightness component of the HSV color space, thereby obtaining an optical flow detection image. Different colors in the optical flow detection image represent different moving directions in the optical flow detection information, and different luminances in the optical flow detection image represent different moving velocities in the optical flow detection information.
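A conventional way to obtain such a visualized optical flow graph is sketched below with OpenCV. The Farneback estimator and its parameters are assumptions for illustration (the patent does not prescribe a particular optical flow algorithm), but the direction-to-hue and speed-to-brightness mapping follows the description above:

```python
import cv2
import numpy as np

def flow_to_image(prev_gray: np.ndarray, cur_gray: np.ndarray) -> np.ndarray:
    """Estimate dense optical flow between two consecutive grayscale frames and
    visualize it: motion direction -> hue, motion speed -> brightness (value)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = (angle * 180 / np.pi / 2).astype(np.uint8)   # direction -> hue
    hsv[..., 1] = 255                                          # full saturation
    hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255,
                                cv2.NORM_MINMAX).astype(np.uint8)  # speed -> brightness
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)                # visualized optical flow graph
```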
For the semantic segmentation module, which can generate a binary mask detection image using the target network model, the input of the semantic segmentation module is the original detection image (i.e., RGB color image) and the optical flow detection image (i.e., moving optical flow graph), and the output of the semantic segmentation module is the binary mask detection image.
Referring to fig. 8C, which is a schematic structural diagram of the target network model, the target network model may adopt a two-branch network structure; the network structures of the upper and lower branches are similar, and both adopt a DeepLab network structure. The upper branch is called the first sub-network and the lower branch is called the second sub-network; a third sub-network is used for the connection between the two branches, and the third sub-network may comprise a plurality of MGA network layers. The first sub-network is used for extracting context semantic detection features (used for segmenting a target area) from the original detection image, the second sub-network is used for extracting motion detection features (i.e., motion attention information) from the optical flow detection image, and the third sub-network is used for fusing the context semantic detection features and the motion detection features, so that the motion detection features extracted by the second sub-network are fused into the first sub-network and the segmentation result of the first sub-network is guided through an attention mechanism.
Referring to fig. 8C, the first sub-network may include a plurality of network layers, for example, an Encoder network layer, a head-conv network layer (head convolution layer), a residual-1 network layer (residual layer), a residual-2 network layer, a residual-3 network layer, a residual-4 network layer, an ASPP (Atrous Spatial Pyramid Pooling) network layer, a Decoder network layer, a conv1 network layer (convolution layer), a concat network layer (concatenation layer), a conv2 network layer, a conv3 network layer, a conv4 network layer, a conv5 network layer, and a Sigmoid network layer (activation layer) in fig. 8C. Of course, the above is only an example of the first sub-network, and the structure of the first sub-network is not limited. The second sub-network may include a plurality of network layers; fig. 8C takes an Encoder network layer, a head-conv network layer, a residual-1 network layer, a residual-2 network layer, a residual-3 network layer, a residual-4 network layer, an ASPP network layer, and a Decoder network layer as examples. Of course, the above is only an example of the second sub-network, and the structure of the second sub-network is not limited. The third sub-network may include a plurality of network layers; fig. 8C takes an MGA-0 network layer, an MGA-1 network layer, an MGA-2 network layer, an MGA-3 network layer, an MGA-4 network layer, and an MGA-5 network layer as examples, which are only examples of the third sub-network and do not limit the structure of the third sub-network.
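For readers unfamiliar with the ASPP network layer, the following is a generic sketch of Atrous Spatial Pyramid Pooling; the dilation rates and channel sizes are assumptions and are not specified by the patent:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Generic ASPP sketch: parallel convolutions with different dilation
    rates, concatenated and projected back to the output channel count."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch,
                      kernel_size=3 if r > 1 else 1,
                      padding=r if r > 1 else 0,
                      dilation=r)
            for r in rates)
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [torch.relu(b(x)) for b in self.branches]     # multi-scale context
        return torch.relu(self.project(torch.cat(feats, dim=1)))
```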
For the first sub-network, the input of the first network layer is the original detection image, and the first network layer processes the original detection image (the processing mode is not limited). The input from the second network layer to the last network layer is either one feature vector (i.e., a feature vector of the first sub-network itself) or two feature vectors (i.e., a feature vector of the first sub-network and a feature vector of the third sub-network). If the input is one feature vector, the network layer processes the feature vector directly; if the input is two feature vectors, the network layer first determines the sum of the two feature vectors and then processes the sum of the two feature vectors.
The output of the first network layer to the next-to-last network layer is a feature vector, which may be output only to the next network layer or to the next network layer and the third sub-network (i.e., the MGA network layer of the third sub-network) simultaneously. The output of the last network layer is a binary mask detection image.
For the second subnetwork, the input to the first network layer is the optical flow detection image, which is processed by the first network layer. The input to the last network layer from the second network layer is a feature vector (i.e., the feature vector of the second subnetwork itself) that is processed by the network layer.
The output of the first network layer to the next to last network layer is the feature vector, which is output to the next network layer and the third subnetwork (i.e. the MGA network layer of the third subnetwork) simultaneously. The output of the last network layer is the feature vector, which is output to the MGA network layer of the third subnetwork.
For the third sub-network, the input of each MGA network layer is two feature vectors (the feature vector input by the first sub-network and the feature vector input by the second sub-network), which are denoted as the context semantic detection feature and the motion detection feature. The MGA network layer fuses the context semantic detection feature and the motion detection feature to obtain a fusion detection feature, and the fusion detection feature is input to the first sub-network.
The MGA network layer fuses the context semantic detection characteristics and the motion detection characteristics to obtain fused detection characteristics, and the following formula is adopted:
f'_a = f_a ⊗ Sigmoid(h(f_m))
where f_m represents the motion detection feature, h(f_m) is the convolution of f_m, Sigmoid(h(f_m)) represents the mapping of h(f_m) by the Sigmoid function, f_a represents the context semantic detection feature, ⊗ represents element-wise multiplication of the elements at corresponding positions, and f'_a represents the fusion detection feature.
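Putting the two branches and the MGA fusion together, a much-simplified sketch of the data flow might look as follows. The number of stages, the plain convolutional stages standing in for the DeepLab-style encoder-decoder, and all names are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class TwoBranchSegmenter(nn.Module):
    """Much-simplified sketch: an appearance branch (original detection image),
    a motion branch (optical flow detection image), and per-stage motion-guided
    fusion whose result is added back into the appearance branch."""
    def __init__(self, stages: int = 4, channels: int = 64):
        super().__init__()
        self.app_stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.mot_stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.app_stages = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(stages))
        self.mot_stages = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(stages))
        self.mga_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in range(stages))  # h(.) of each MGA layer
        self.head = nn.Conv2d(channels, 1, kernel_size=1)  # produces the binary mask logits

    def forward(self, image: torch.Tensor, flow_image: torch.Tensor) -> torch.Tensor:
        f_a = torch.relu(self.app_stem(image))       # context semantic detection features
        f_m = torch.relu(self.mot_stem(flow_image))  # motion detection features
        for app, mot, h in zip(self.app_stages, self.mot_stages, self.mga_convs):
            f_m = torch.relu(mot(f_m))
            fused = f_a * torch.sigmoid(h(f_m))      # f'_a = f_a (x) Sigmoid(h(f_m))
            f_a = torch.relu(app(f_a + fused))       # sum of the two feature vectors, then process
        return torch.sigmoid(self.head(f_a))         # per-pixel bleeding probability -> binary mask
```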
For the post-processing module, after the semantic segmentation module obtains the binary mask detection image, the semantic segmentation module may output the binary mask detection image to the post-processing module, and the post-processing module determines a bleeding point position inside the target object based on a bleeding region in the binary mask detection image. And the post-processing module can also determine a minimum circumscribed circle matched with the bleeding point position, determine the bleeding point area inside the target object based on the minimum circumscribed circle, and count the bleeding cumulative time of which the bleeding point area is greater than a preset area threshold value.
For example, after the post-processing module obtains the bleeding point position (bleeding point region) inside the target object, the circle center position and the diameter size of the bleeding point region may be determined, the minimum circumscribed circle of the bleeding point region may be determined based on the circle center position and the diameter size, the bleeding point area inside the target object may be determined based on the minimum circumscribed circle, and the bleeding point area, the bleeding point position, the circle center position, the diameter size, and other contents may be recorded. The post-processing module can also count, in seconds, the bleeding cumulative time during which the bleeding point area is greater than a preset area threshold.
The post-processing module may further determine whether to generate a bleeding abnormal event according to the bleeding point area and the bleeding cumulative time, for example, if the bleeding point area is greater than a preset area threshold and the bleeding cumulative time is greater than a preset duration, the bleeding abnormal event may be generated, otherwise, the bleeding abnormal event may not be generated.
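A sketch of this post-processing with OpenCV is given below. Taking the largest connected contour as the bleeding-point region and computing the area from the minimum circumscribed circle are illustrative choices, and the thresholds are assumptions:

```python
import cv2
import numpy as np

def post_process(mask: np.ndarray, area_threshold: float,
                 cumulative_time_s: float, duration_threshold_s: float):
    """mask: single-channel uint8 binary mask detection image in which bleeding
    pixels carry the non-zero first value. Returns the circle centre, diameter,
    bleeding-point area, and whether a bleeding abnormal event should be generated,
    or None if no bleeding region is found."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    bleeding_region = max(contours, key=cv2.contourArea)        # assumed: largest contour
    (cx, cy), radius = cv2.minEnclosingCircle(bleeding_region)  # minimum circumscribed circle
    area = float(np.pi * radius ** 2)                           # bleeding-point area from the circle
    abnormal = area > area_threshold and cumulative_time_s > duration_threshold_s
    return {"center": (cx, cy), "diameter": 2.0 * radius,
            "area": area, "abnormal_event": abnormal}
```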
The post-processing module may send the bleeding point position (e.g., represented by the circle center position and the diameter size), the bleeding point area, the bleeding cumulative time, and the like to the display device, and the display device displays the bleeding point position, the bleeding point area, and the bleeding cumulative time on a preview page, as shown in fig. 8D, which is a schematic diagram of a display effect showing the bleeding point area, the bleeding cumulative time, and other contents. Alternatively, the post-processing module may send the bleeding point position, the bleeding point area, the bleeding cumulative time, the bleeding abnormal event, and the like to the display device, and the display device displays them on the preview page. By superimposing the bleeding point position, the bleeding point area, and the bleeding cumulative time on the preview page and displaying bleeding abnormal events, the system reminds the doctor to find the bleeding point as soon as possible and suture it to stop bleeding, accurately assisting in locating the bleeding point position in real time and presenting it to the doctor.
For the pre-training module, the initial network model may be trained based on the original training image and the binary mask calibration image to obtain a target network model, and a structure of the initial network model may be as shown in fig. 8C. For example, an optical flow training image is obtained based on an original training image, and the original training image and the optical flow training image are input to an initial network model to obtain a binary mask training image. And determining a loss value based on the binary mask training image and the binary mask calibration image, and training the initial network model based on the loss value. For example, the parameter set of the initial network model is Θ, the parameter set Θ of the initial network model needs to be adjusted to obtain an adjusted parameter set, and the target network model can be obtained based on the adjusted parameter set.
Parameters related to the network structure may include, but are not limited to: the number of convolution layers, the connection mode of the convolution layers, the number of convolution filters per convolution layer, the size of the convolution kernels, the weight parameters W_CN of the convolution filters, and the bias parameters B_CN of the convolution filters. The parameter set Θ may include some or all of the above parameters; for example, the parameter set Θ includes the weight parameters W_CN and the bias parameters B_CN, which is not limited here.
For the pre-training module, the target network model may be trained by the following steps:
Step S1: a large number of original training images are collected, and the original training images are calibrated to obtain binary mask calibration images. An original training image and its binary mask calibration image form a group of training data, so a training set Ω may comprise multiple groups of training data, and each group of training data comprises an original training image and a binary mask calibration image.
Step S2: network parameters Θ_0 are configured for the initial network model, i.e., the parameter set is Θ_0, and the training-related hyperparameters (such as the learning rate and the parameters of the gradient descent algorithm) are set.
Step S3: the initial network model with network parameters Θ_0 performs forward calculation on the training set Ω to obtain the binary mask training images corresponding to the original training images in the training set Ω, and a loss value is determined based on the binary mask training images and the binary mask calibration images in the training set Ω.
Step S4: based on the loss value, Θ is adjusted by using a back propagation algorithm to obtain Θ_i.
Step S5: steps S3 to S4 are repeated until the network model converges, and the parameter Θ_final is output; the network model formed based on the parameter Θ_final is the trained target network model.
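Steps S2 to S5 might be sketched as the following training loop; the binary cross-entropy loss, the Adam optimizer, and a fixed epoch count standing in for the convergence test are assumptions, since the patent only requires a loss value between the binary mask training image and the binary mask calibration image:

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 50, lr: float = 1e-4) -> nn.Module:
    """Sketch of steps S2-S5: configure parameters, forward pass over the training
    set, loss against the binary mask calibration image, back-propagation, repeat."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # hyperparameters of step S2 (assumed)
    criterion = nn.BCELoss()                                 # loss between training and calibration masks (assumed)
    for _ in range(epochs):                                  # stands in for "repeat until convergence"
        for image, flow_image, calib_mask in loader:         # one group of training data from the set Ω
            pred_mask = model(image, flow_image)             # binary mask training image (step S3)
            loss = criterion(pred_mask, calib_mask)          # loss value (step S3)
            optimizer.zero_grad()
            loss.backward()                                  # back propagation adjusts Θ (step S4)
            optimizer.step()
    return model                                             # parameters Θ_final (step S5)
```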
Based on the same application concept as the above method, in the embodiment of the present application, a bleeding point detecting device is provided, as shown in fig. 9, which is a schematic structural diagram of the device, and the device may include: an acquisition module 91, configured to acquire an original detection image of the inside of a target object through an endoscope; a generating module 92, configured to determine optical flow detection information between the original detection image and a detection image of a frame previous to the original detection image, and generate an optical flow detection image corresponding to the optical flow detection information; an obtaining module 93, configured to input an original detection image and an optical flow detection image to a trained target network model, so as to extract a context semantic detection feature from the original detection image through the target network model, extract a motion detection feature from the optical flow detection image, and fuse the context semantic detection feature and the motion detection feature to obtain a fused detection feature; the context semantic detection features represent context semantic information of bleeding points in an original detection image, and the motion detection features represent motion information of the bleeding points in an optical flow detection image; inputting the fusion detection feature and the context semantic detection feature into the target network model to output a binary mask detection image through the target network model; the binary mask detection image comprises a bleeding area and a non-bleeding area, the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value; a determining module 94 for determining bleeding point locations inside the target object based on the bleeding areas in the binary mask detection image.
For example, when the generating module 92 generates the optical flow detection image corresponding to the optical flow detection information, it is specifically configured to: if the optical flow detection information comprises a motion direction and a motion speed, mapping the motion direction to a hue component of an HSV color space, and mapping the motion speed to a brightness component of the HSV color space; generating the optical flow detection image based on the hue component and the brightness component.
Illustratively, the target network model includes a first sub-network, a second sub-network and a third sub-network, and the obtaining module 93 is specifically configured to: extracting context semantic detection features from the original detection image through a first sub-network, and inputting the context semantic detection features to a third sub-network; extracting motion detection features from the optical flow detection image through a second sub-network, and inputting the motion detection features to a third sub-network; and fusing the context semantic detection feature and the motion detection feature through a third sub-network to obtain a fused detection feature, and inputting the fused detection feature to the first sub-network.
For example, the obtaining module 93 may fuse the context semantic detection feature and the motion detection feature through the third sub-network, and when obtaining a fused detection feature, is specifically configured to: convolving the motion detection features to obtain convolved motion detection features, and mapping the convolved motion detection features by adopting an activation function to obtain mapped motion detection features; and performing multiplication operation on the context semantic detection features and the mapped motion detection features to obtain fusion detection features.
Illustratively, the apparatus further comprises a training module for training the target network model by: acquiring an original training image inside a target object through an endoscope; determining optical flow training information between an original training image and a training image of a frame before the original training image, and generating an optical flow training image corresponding to the optical flow training information; inputting an original training image and an optical flow training image into an initial network model, extracting context semantic training features from the original training image through the initial network model, extracting motion training features from the optical flow training image, and fusing the context semantic training features and the motion training features to obtain fused training features; the context semantic training features represent context semantic information of bleeding points in the original training image, and the motion training features represent motion information of the bleeding points in the optical flow training image; inputting the fusion training features and the context semantic training features to the initial network model to output a binary mask training image through the initial network model; the binary mask training image comprises a bleeding area and a non-bleeding area, wherein the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value; determining a loss value based on the binary mask training image and the configured binary mask calibration image; the binary mask calibration image is a calibration image of the original training image, the binary mask calibration image comprises a bleeding area and a non-bleeding area, the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value; and training the initial network model based on the loss value to obtain a target network model.
Based on the same application concept as the method, the embodiment of the present application provides a bleeding point detection system, including: the endoscope, the light source device, the camera host, the display device and the storage device; wherein: the endoscope is used for acquiring an original detection image in a target object and inputting the original detection image to the camera host; the light source device is used for providing a light source for the endoscope; the camera host is used for determining optical flow detection information between the original detection image and a detection image of a previous frame of the original detection image and generating an optical flow detection image corresponding to the optical flow detection information; inputting the original detection image and the optical flow detection image into a trained target network model, extracting context semantic detection features from the original detection image through the target network model, extracting motion detection features from the optical flow detection image, and fusing the context semantic detection features and the motion detection features to obtain fused detection features; inputting the fusion detection features and the context semantic detection features into the target network model to output a binary mask detection image through the target network model, wherein the binary mask detection image comprises a bleeding area and a non-bleeding area; determining bleeding point locations inside the target object based on bleeding regions in the binary mask detection image; the display device is used for displaying positions of bleeding points in the target object; the storage device is used for storing the original detection image and the binary mask detection image.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Furthermore, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A bleed point detection method, the method comprising:
acquiring an original detection image inside a target object through an endoscope;
determining optical flow detection information between the original detection image and a detection image of a frame before the original detection image, and generating an optical flow detection image corresponding to the optical flow detection information;
inputting the original detection image and the optical flow detection image to a trained target network model, extracting context semantic detection features from the original detection image through the target network model, extracting motion detection features from the optical flow detection image, and fusing the context semantic detection features and the motion detection features to obtain fused detection features; the context semantic detection features represent context semantic information of bleeding points and context semantic information of non-bleeding points in the original detection image, and the motion detection features represent motion information of the bleeding points and motion information of the non-bleeding points in the optical flow detection image;
inputting the fusion detection feature and the context semantic detection feature to the target network model to output a binary mask detection image through the target network model; the binary mask detection image comprises a bleeding area and a non-bleeding area, the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value;
determining bleeding point locations inside the target object based on bleeding regions in a binary mask detection image.
2. The method of claim 1,
the generating of the optical flow detection image corresponding to the optical flow detection information includes:
if the optical flow detection information comprises a motion direction and a motion speed, mapping the motion direction to a hue component of an HSV color space, and mapping the motion speed to a brightness component of the HSV color space;
generating the optical flow detection image based on the hue component and the brightness component.
3. The method of claim 1,
the target network model comprises a first sub-network, a second sub-network and a third sub-network;
the extracting context semantic detection characteristics from the original detection image through a target network model, extracting motion detection characteristics from the optical flow detection image, and fusing the context semantic detection characteristics and the motion detection characteristics to obtain fused detection characteristics, including:
extracting context semantic detection features from the original detection image through a first sub-network of the target network model, and inputting the context semantic detection features to the third sub-network;
extracting motion detection features from the optical flow detection image through a second sub-network of the target network model and inputting the motion detection features to the third sub-network;
and fusing the context semantic detection features and the motion detection features through the third sub-network to obtain fused detection features, and inputting the fused detection features to the first sub-network.
4. The method of claim 3, wherein fusing the context semantic detection feature and the motion detection feature via the third sub-network to obtain a fused detection feature comprises:
convolving the motion detection features to obtain convolved motion detection features, and mapping the convolved motion detection features by using an activation function to obtain mapped motion detection features; and performing multiplication operation on the context semantic detection features and the mapped motion detection features to obtain fusion detection features.
5. The method according to any one of claims 1-4, wherein after determining a bleeding point location inside the target object based on the bleeding region in the binary mask detection image, further comprising:
determining a minimum circumscribed circle that matches the bleeding point location;
determining a bleeding point area inside the target object based on the minimum circumscribed circle;
counting the bleeding accumulation time of which the area of the bleeding point is larger than a preset area threshold;
displaying the bleeding point location, the bleeding point area, and the bleeding accumulation time on a preview page.
6. The method according to any one of claims 1 to 4,
the training process of the target network model comprises the following steps:
acquiring an original training image inside a target object through an endoscope;
determining optical flow training information between the original training image and a training image of a frame before the original training image, and generating an optical flow training image corresponding to the optical flow training information;
inputting the original training image and the optical flow training image to an initial network model, extracting context semantic training features from the original training image through the initial network model, extracting motion training features from the optical flow training image, and fusing the context semantic training features and the motion training features to obtain fused training features; wherein the context semantic training features represent context semantic information of bleeding points in the original training image, and the motion training features represent motion information of the bleeding points in the optical flow training image;
inputting the fusion training features and the context semantic training features to the initial network model to output a binary mask training image through the initial network model; the binary mask training image comprises a bleeding area and a non-bleeding area, wherein the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value;
determining a loss value based on the binary mask training image and the configured binary mask calibration image; the binary mask calibration image is a calibration image of the original training image, the binary mask calibration image comprises a bleeding area and a non-bleeding area, the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value;
and training the initial network model based on the loss value to obtain a target network model.
7. The method of claim 6, wherein training the initial network model based on the loss values to obtain a target network model comprises:
updating the network parameters of the initial network model based on the loss value to obtain an updated network model, and determining whether the updated network model is converged;
if not, determining the updated network model as an initial network model, and returning to execute the operation of inputting the original training image and the optical flow training image to the initial network model;
and if so, determining the updated network model as a target network model.
8. A bleeding point detection device, the device comprising:
the acquisition module is used for acquiring an original detection image inside a target object through an endoscope;
a generating module, configured to determine optical flow detection information between the original detection image and a detection image of a frame previous to the original detection image, and generate an optical flow detection image corresponding to the optical flow detection information;
an obtaining module, configured to input the original detection image and the optical flow detection image to a trained target network model, so as to extract a context semantic detection feature from the original detection image through the target network model, extract a motion detection feature from the optical flow detection image, and fuse the context semantic detection feature and the motion detection feature to obtain a fused detection feature; the context semantic detection features represent context semantic information of bleeding points and context semantic information of non-bleeding points in the original detection image, and the motion detection features represent motion information of the bleeding points and motion information of the non-bleeding points in the optical flow detection image; inputting the fusion detection feature and the context semantic detection feature into the target network model so as to output a binary mask detection image through the target network model; the binary mask detection image comprises a bleeding area and a non-bleeding area, the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value;
a determination module for determining a bleeding point location inside the target object based on the bleeding region in the binary mask detection image.
9. The apparatus of claim 8, further comprising:
a training module, configured to train to obtain the target network model by:
acquiring an original training image inside a target object through an endoscope;
determining optical flow training information between the original training image and a training image of a frame before the original training image, and generating an optical flow training image corresponding to the optical flow training information;
inputting the original training image and the optical flow training image into an initial network model, extracting context semantic training features from the original training image through the initial network model, extracting motion training features from the optical flow training image, and fusing the context semantic training features and the motion training features to obtain fused training features; wherein the context semantic training features represent context semantic information of bleeding points in the original training image, and the motion training features represent motion information of the bleeding points in the optical flow training image;
inputting the fusion training features and the context semantic training features to the initial network model to output a binary mask training image through the initial network model; the binary mask training image comprises a bleeding area and a non-bleeding area, wherein the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value;
determining a loss value based on the binary mask training image and the configured binary mask calibration image; the binary mask calibration image is a calibration image of the original training image, the binary mask calibration image comprises a bleeding area and a non-bleeding area, the pixel value of each pixel point in the bleeding area is a first value, and the pixel value of each pixel point in the non-bleeding area is a second value;
and training the initial network model based on the loss value to obtain a target network model.
10. A bleed point detection system, comprising: the endoscope, the light source device, the camera host, the display device and the storage device; wherein:
the endoscope is used for acquiring an original detection image in a target object and inputting the original detection image to the camera host;
the light source device is used for providing a light source for the endoscope;
the camera host is used for determining optical flow detection information between the original detection image and a detection image of the previous frame of the original detection image and generating an optical flow detection image corresponding to the optical flow detection information; inputting the original detection image and the optical flow detection image into a trained target network model, extracting context semantic detection features from the original detection image through the target network model, extracting motion detection features from the optical flow detection image, and fusing the context semantic detection features and the motion detection features to obtain fused detection features; the context semantic detection features represent context semantic information of bleeding points and context semantic information of non-bleeding points in the original detection image, and the motion detection features represent motion information of the bleeding points and motion information of the non-bleeding points in the optical flow detection image; inputting the fusion detection features and the context semantic detection features into the target network model to output a binary mask detection image through the target network model, wherein the binary mask detection image comprises a bleeding area and a non-bleeding area; determining bleeding point locations inside the target object based on bleeding regions in the binary mask detection image;
the display device is used for displaying the position of a bleeding point in the target object;
the storage device is used for storing the original detection image and the binary mask detection image.
CN202110638425.5A 2021-06-08 2021-06-08 Bleeding point detection method, device and system Active CN113421231B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110638425.5A CN113421231B (en) 2021-06-08 2021-06-08 Bleeding point detection method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110638425.5A CN113421231B (en) 2021-06-08 2021-06-08 Bleeding point detection method, device and system

Publications (2)

Publication Number Publication Date
CN113421231A CN113421231A (en) 2021-09-21
CN113421231B true CN113421231B (en) 2023-02-28

Family

ID=77787975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110638425.5A Active CN113421231B (en) 2021-06-08 2021-06-08 Bleeding point detection method, device and system

Country Status (1)

Country Link
CN (1) CN113421231B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511615A (en) * 2021-12-31 2022-05-17 深圳市中科微光医疗器械技术有限公司 Method and device for calibrating image

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106373137A (en) * 2016-08-24 2017-02-01 安翰光电技术(武汉)有限公司 Digestive tract hemorrhage image detection method used for capsule endoscope
CN109063609A (en) * 2018-07-18 2018-12-21 电子科技大学 A kind of anomaly detection method based on Optical-flow Feature in conjunction with full convolution semantic segmentation feature
CN109711338A (en) * 2018-12-26 2019-05-03 上海交通大学 The object example dividing method of Fusion Features is instructed using light stream
CN110246142A (en) * 2019-06-14 2019-09-17 深圳前海达闼云端智能科技有限公司 A kind of method, terminal and readable storage medium storing program for executing detecting barrier
CN110378348A (en) * 2019-07-11 2019-10-25 北京悉见科技有限公司 Instance of video dividing method, equipment and computer readable storage medium
CN112907621A (en) * 2021-02-24 2021-06-04 华南理工大学 Moving object extraction method based on difference and semantic information fusion

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898583B (en) * 2018-06-12 2019-08-13 东软医疗系统股份有限公司 The detection method and device of micro- blutpunkte
CN110490881A (en) * 2019-08-19 2019-11-22 腾讯科技(深圳)有限公司 Medical image dividing method, device, computer equipment and readable storage medium storing program for executing
CN110874594B (en) * 2019-09-23 2023-06-30 平安科技(深圳)有限公司 Human body appearance damage detection method and related equipment based on semantic segmentation network
CN111383214B (en) * 2020-03-10 2021-02-19 长沙慧维智能医疗科技有限公司 Real-time endoscope enteroscope polyp detection system


Also Published As

Publication number Publication date
CN113421231A (en) 2021-09-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant