WO2023060637A1

WO2023060637A1 - Measurement method and measurement apparatus based on deep learning of tight box mark

Info

Publication number: WO2023060637A1
Application number: PCT/CN2021/125152
Authority: WO
Inventors: 王娟; 夏斌
Original assignee: 深圳硅基智能科技有限公司
Priority date: 2021-10-11
Filing date: 2021-10-21
Publication date: 2023-04-20
Also published as: CN115331050A; CN113780477A; CN113920126B; CN115359070A; CN115423818A; CN113780477B; CN115578577A; CN113920126A

Abstract

A measurement method based on deep learning of a tight box mark, comprising: obtaining an input image (S220); inputting the input image into a network module (20) trained on the basis of the tight box mark of a target to obtain a first output and a second output (S240), the first output comprising a probability that each pixel point in the input image belongs to each category, the second output comprising an offset between the position of each pixel point in the input image and the tight box mark of the target of each category, the network module (20) comprising a backbone network (21) for extracting a feature map of the input image, a segmentation network (22) based on weak supervised learning, and a regression network (23) based on bounding box regression, the segmentation network (22) taking the feature map as an input to obtain the first output, the regression network (23) taking the feature map as an input to obtain the second output, and the resolution of the feature map being consistent with that of the input image; and recognizing the target on the basis of the first output and the second output to obtain the tight box mark of the target of each category (S260). Therefore, the target can be recognized and accurately measured.

Description

Measuring method and measuring device based on deep learning of tight frame

technical field

The present disclosure generally relates to the field of recognition technology based on deep learning, and specifically relates to a measurement method and a measurement device based on deep learning of tight frames.

Background technique

The image often includes information of various targets, and the information of the target in the image can be automatically analyzed based on image processing technology. For example, in the medical field, tissue objects in medical images can be identified, and then the size of the tissue objects can be measured to monitor changes in the tissue objects.

In recent years, artificial intelligence technology represented by deep learning has developed significantly, and its application in target recognition or measurement has attracted more and more attention. Researchers use deep learning techniques to identify or further measure objects in images. Specifically, in some researches based on deep learning, labeled data is often used to train a neural network based on deep learning to recognize and segment the target in the image, and then the target can be measured. However, the above target recognition or measurement methods often require accurate pixel-level labeled data for neural network training, and collecting pixel-level labeled data often requires a lot of manpower and material resources. In addition, although some object recognition methods are not based on pixel-level annotation data, they only recognize objects in the image. The boundary recognition of objects is not accurate enough or the accuracy is often low near the boundary of the object, which is not suitable for precise The measured scene. In this case, the accuracy of measuring the target in the image still needs to be improved.

Contents of the invention

The present disclosure is made in view of the above-mentioned situation, and an object thereof is to provide a measurement method and a measurement device based on tight-framework deep learning that can identify a target and accurately measure the target.

To this end, the first aspect of the present disclosure provides a measurement method based on deep learning of a tight frame, which is a measurement method that uses a network module trained based on a target-based tight frame to identify the target so as to achieve measurement. Tightly framed as the minimum bounding rectangle of the target, the measuring method comprises: acquiring an input image comprising at least one target, the at least one target belonging to at least one category of interest; inputting the input image into the network module To obtain a first output and a second output, the first output includes the probability that each pixel in the input image belongs to each category, and the second output includes the position of each pixel in the input image and each The offset of the tight frame of the target of the category, the offset in the second output is used as the target offset, wherein the network module includes a backbone network, a segmentation network based on image segmentation based on weakly supervised learning, and a frame regression based on A regression network, the backbone network is used to extract the feature map of the input image, the segmentation network uses the feature map as input to obtain the first output, and the regression network takes the feature map as input to obtain Obtaining the second output, wherein the feature map is consistent with the resolution of the input image; identifying the target based on the first output and the second output to obtain a tight frame of each category of target mark.

In this disclosure, a network module including a backbone network, a segmentation network based on weakly supervised learning for image segmentation, and a regression network based on bounding box regression is constructed. The network module is trained based on the tight frame of the target. The backbone network receives the input image and Extract a feature map consistent with the resolution of the input image, input the feature map into the segmentation network and the regression network to obtain the first output and the second output, and then obtain the tight frame of the target in the input image based on the first output and the second output so that Realize measurement. In this case, the trained network module based on the tight box of the object can accurately predict the tight box of the object in the input image, and then can accurately measure it based on the tight box of the object.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, the network module is trained by the following method: constructing a training sample, the input image data of the training sample includes multiple images to be trained, and the multiple The image to be trained includes an image containing an object belonging to at least one category, and the label data of the training sample includes the gold standard of the category to which the object belongs and the gold standard of the tight frame of the object; through the network module based on The input image data of the training sample, obtaining the predicted segmentation data output by the segmentation network corresponding to the training sample and the predicted offset output by the regression network; based on the label data corresponding to the training sample, the The predicted segmentation data and the predicted offset determine a training loss for the network module; and the network module is trained based on the training loss to optimize the network module. Thus, an optimized network module can be obtained.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, determining the training loss of the network module based on the label data corresponding to the training samples, the predicted segmentation data and the predicted offset , including: obtaining the segmentation loss of the segmentation network based on the predicted segmentation data and label data corresponding to the training sample; obtaining the regression based on the predicted offset corresponding to the training sample and the real offset based on the label data The regression loss of the network, wherein, the real offset is the offset of the position of the pixel point of the image to be trained and the tight frame mark of the target in the label data; and based on the segmentation loss and the regression loss , to obtain the training loss of the network module. In this case, the predicted segmentation data of the segmentation network can be approximated to the label data by the segmentation loss, and the predicted offset of the regression network can be approximated by the real offset by the regression loss.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, the target offset is an offset normalized based on the average width and average height of targets of each category, or the target offset Shift is the shift normalized based on the average size of objects of each class. As a result, the accuracy of recognition or measurement of a target whose size does not vary greatly can be improved.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, multi-instance learning is used to obtain multiple training packages based on the gold standard of the tight frame of the target in each image to be trained by category, and based on each category A plurality of packets to be trained to obtain the segmentation loss, wherein the plurality of packets to be trained include a plurality of positive packets and a plurality of negative packets, and the number of two sides opposite to the gold standard of the tight frame connecting the target All the pixel points on each of the straight lines are divided into a positive package, and the multiple straight lines include at least one set of first parallel lines parallel to each other and second parallel lines perpendicular to each set of first parallel lines respectively. Parallel lines, the negative bag is a single pixel point in the area outside the gold standard of the tight frame of all objects of a category, the segmentation loss includes a unary item and a pairwise item, and the unary item describes each bag to be trained The degree of belonging to the gold standard of each category, the pair item describes the degree to which the pixel of the image to be trained and the pixels adjacent to the pixel belong to the same category. In this case, the segmentation loss can be obtained based on the positive and negative packets of multi-instance learning, and the tight frame is constrained by both the positive and negative packets through the unary loss, and the predicted segmentation results are smoothed through the pairwise loss.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, the angle of the first parallel line is any one of the extension line of the first parallel line and the gold standard of the tight frame of the target As for the included angle of the extension lines of the non-intersecting sides, the angle of the first parallel line is greater than -90° and less than 90°. In this case, being able to divide positive packets from different angles optimizes the segmentation network. Thereby, the accuracy of the segmented data prediction by the segmented network can be improved.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, pixels falling within the gold standard of at least one target’s tight frame are selected from the images to be trained by category as positive samples of each category And obtain the matching tight frame corresponding to the positive sample to screen the positive samples of each category based on the matching tight frame, and then use the filtered positive samples of each category to optimize the regression network, wherein the The matching tight frame is the tight framed gold standard that has the smallest true deviation relative to the position of the positive sample among the tight framed gold standards that the positive sample falls into. In this way, the regression network can be optimized by using the positive samples of each category screened based on the matching tight frame.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, the position of a pixel point is expressed as (x, y), and the tight frame of a target corresponding to the pixel point is expressed as b=(xl, yt, xr, yb), the offset of the tight frame b of the target relative to the position of the pixel is expressed as t=(tl, tt, tr, tb), then tl, tt, tr, tb satisfy the formula: tl =(x-xl)/S _c1 , tt=(y-yt)/S _c2 , tr=(xr-x)/S _c1 , tb=(yb-y)/S _c2 , where xl, yt represent the target The position of the upper left corner of the tight box, xr, yb represent the position of the lower right corner of the tight frame of the target, S _c1 represents the average width of the target of the c-th category, and S _c2 represents the average height of the target of the c-th category. Thereby, a normalized offset can be obtained.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, the pixel points of the image to be trained are selected from the pixels of the image to be trained according to the category and the expected cross-over-union ratio corresponding to the pixels of the image to be trained. The pixels whose expected intersection ratio is greater than the preset expected intersection ratio are optimized for the regression network, wherein multiple frames of different sizes constructed with the pixels of the image to be trained as the center point are used to obtain the plurality of The maximum value of the intersection and union ratios between the frame and the matching tight frame of the pixel point is used as the expected intersection and union ratio, and the matching tight frame is the gold standard of the tight frame where the pixels of the image to be trained fall into The gold standard of the tight frame with the smallest true offset relative to the position of the pixel. In this way, positive samples that meet the preset expected intersection ratio can be obtained.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, the expected intersection and union ratio satisfies the formula:

Among them, r ₁ , r ₂ are the relative positions of the pixels of the image to be trained in the matching tight frame, 0<r ₁ , r ₂ <1, IoU ₁ (r ₁ , r ₂ )=4r ₁ r ₂ , IoU ₂ (r ₁ ,r ₂ )=2r ₁ /(2r ₁ (1-2r ₂ )+1), IoU ₃ (r ₁ ,r ₂ )=2r ₂ /(2r ₂ (1-2r ₁ )+ 1), IoU ₄ (r ₁ , r ₂ )=1/(4(1-r ₁ )(1-r ₂ )). Thus, a desired cross-merge ratio can be obtained.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, the target is identified based on the first output and the second output to obtain the tight frame of each category of target as : Obtain the position of the pixel point with the highest local probability belonging to each category from the first output as the first position, and acquire based on the position corresponding to the first position in the second output and the target offset of the corresponding category Tight boxes for the targets of each category. In this case, an object or objects of each category can be identified.

In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the sizes of multiple targets of the same category differ from each other by less than 10 times. As a result, the accuracy of target recognition can be further improved.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, the backbone network includes an encoding module and a decoding module, the encoding module is configured to extract image features at different scales, and the decoding module is configured to is to map the extracted image features at different scales back to the resolution of the input image to output the feature map. Thereby, it is possible to obtain a feature map matching the resolution of the input image.

In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the input image is a fundus image, and the target is an optic cup and/or an optic disc. Thereby, a tight frame of the optic cup and/or optic disc can be obtained.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, the target is identified based on the first output and the second output to obtain the optic cup and /or the tight frame of the optic disc so as to realize the measurement as follows: from the first output, the position of the pixel point with the highest probability belonging to each category is obtained as the first position, based on the second output corresponding to the first position and the target offset of the corresponding category to obtain the tight frame of the target of each category. Thereby, the optic cup and/or optic disc can be identified.

In addition, in the measurement method involved in the first aspect of the present disclosure, optionally, based on the tight frame of the optic cup in the fundus image and/or the tight frame of the optic disc in the fundus image and/or optic disc measurements are taken to obtain cup and/or optic disc dimensions, and a cup to optic disc ratio is obtained based on the cup and optic disc dimensions in said fundus image. Thus, the ratio of the optic cup to the optic disc can be obtained.

The second aspect of the present disclosure provides a measurement device based on deep learning of a tight frame, which is a measurement device that uses a network module trained on a target-based tight frame to identify the target so as to achieve measurement. The tight frame is the minimum circumscribed rectangle of the target, the measurement device includes an acquisition module, a network module and a recognition module; the acquisition module is configured to acquire an input image comprising at least one target, and the at least one target belongs to at least one category of interest The network module is configured to receive the input image and obtain a first output and a second output based on the input image, the first output includes the probability that each pixel in the input image belongs to each category, the The second output includes the offset of the position of each pixel in the input image and the tight frame of the target of each category, and the offset in the second output is used as the target offset, wherein the network module includes a backbone network, a segmentation network based on image segmentation based on weakly supervised learning, and a regression network based on border regression, the backbone network is used to extract the feature map of the input image, and the segmentation network takes the feature map as input to obtain the the first output, the regression network takes the feature map as input to obtain the second output, wherein the feature map is consistent with the resolution of the input image; and the identification module is configured to be based on the The first output and the second output identify the objects to obtain tight boxes for each category of objects.

In addition, in the measurement device according to the first aspect of the present disclosure, optionally, the input image is a fundus image, and the target is an optic cup and/or an optic disc. Thereby, a tight frame of the optic cup and/or optic disc can be obtained.

In addition, in the measurement device according to the first aspect of the present disclosure, optionally, the target is identified based on the first output and the second output to obtain the optic cup and /or the tight frame of the optic disc so as to realize the measurement as follows: from the first output, the position of the pixel point with the highest probability belonging to each category is obtained as the first position, based on the second output corresponding to the first position and the target offset of the corresponding category to obtain the tight frame of the target of each category. Thereby, the optic cup and/or optic disc can be identified.

According to the present disclosure, there are provided a measurement method and a measurement device based on tight frame deep learning that can identify a target and accurately measure the target.

Description of drawings

The present disclosure will now be explained in further detail by way of example only with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram showing an application scenario of a tight-frame-based deep learning measurement method involved in an example of the present disclosure.

FIG. 2( a ) is a schematic diagram showing a fundus image related to an example of the present disclosure.

FIG. 2( b ) is a schematic diagram showing a recognition result of a fundus image according to an example of the present disclosure.

FIG. 3 is a schematic diagram showing one example of a network module involved in an example of the present disclosure.

FIG. 4 is a schematic diagram showing another example of a network module related to an example of the present disclosure.

FIG. 5 is a flowchart illustrating a training method of a network module related to an example of the present disclosure.

FIG. 6 is a schematic diagram showing positive packets involved in an example of the present disclosure.

FIG. 7 is a schematic diagram showing a frame constructed centering on a pixel point involved in an example of the present disclosure.

FIG. 8 is a flow chart showing a measurement method of tight-frame-based deep learning related to an example of the present disclosure.

FIG. 9 is a block diagram illustrating a measurement device for tight-framework-based deep learning according to an example of the present disclosure.

FIG. 10 is a flowchart illustrating a measurement method for a fundus image according to an example of the present disclosure.

Fig. 11 is a block diagram showing a measurement device for a fundus image according to an example of the present disclosure.

Detailed ways

Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the drawings. In the following description, the same reference numerals are given to the same components, and repeated descriptions are omitted. In addition, the drawings are only schematic diagrams, and the ratio of dimensions between components, the shape of components, and the like may be different from the actual ones. It should be noted that the terms "comprising" and "having" and any variations thereof in the present disclosure, such as a process, method, system, product or device that includes or has a series of steps or units, are not necessarily limited to the clearly listed instead, may include or have other steps or elements not explicitly listed or inherent to the process, method, product or apparatus. All methods described in this disclosure can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context.

The present disclosure relates to a measurement method and a measurement device based on tight frame deep learning, which can identify targets and improve the accuracy of target measurement. The measurement method involved in the present disclosure may also be referred to as an identification method or an auxiliary measurement method. The measurement method involved in the present disclosure may be applicable to any application scenario in which the width and/or height of an object in an image is accurately measured.

The measurement method involved in the present disclosure is a measurement method for realizing measurement by using a network module trained based on a target tight frame to identify a target. The tight frame can be the smallest bounding rectangle of the target. In this case, the target is in contact with the four sides of the tight frame and does not overlap with the area outside the tight frame (that is, the target is tangent to the four sides of the tight frame). Thus, the tight frame can represent the width and height of the object. In addition, training the network module based on the tight frame of the target can reduce the time and labor cost of collecting pixel-level annotation data, and the network module can accurately identify the tight frame of the target.

The input images referred to in the present disclosure may come from cameras, CT scans, PET-CT scans, SPECT scans, MRI, ultrasound, X-rays, angiograms, fluorographs, images taken by capsule endoscopy, or combinations thereof. In some examples, the input image may be an image of a tissue object (eg, a fundus image). In some examples, the input image can be a natural image. The natural image may be an image observed or captured in a natural scene. Thus, it is possible to measure objects in natural images. For example, the size of a human face in a natural image or the height of a pedestrian can be measured. The following describes examples of the present disclosure by taking the input image as an example of a fundus image captured by a fundus camera, and such description does not limit the scope of the present disclosure.

FIG. 1 is a schematic diagram showing an application scenario of a tight-frame-based deep learning measurement method involved in an example of the present disclosure. FIG. 2( a ) is a schematic diagram showing a fundus image related to an example of the present disclosure. FIG. 2( b ) is a schematic diagram showing a recognition result of a fundus image according to an example of the present disclosure. In some examples, the measurement method involved in the present disclosure can be applied to the application scenario as shown in FIG. 1 . In the application scenario, the image of the target object 51 including the corresponding position of the target can be collected by the acquisition device 52 (such as a camera) as an input image (see FIG. 1), and the input image is input to the network module 20 to identify the target in the input image and obtain The tight frame B of the target (see Figure 1), and then the target can be measured based on the tight frame B. Taking the fundus image as an example, inputting the fundus image shown in FIG. 2(a) into the network module 20 can obtain the recognition result shown in FIG. 2(b). The tight frame mark, wherein, the tight frame mark B11 is the tight frame mark of the optic disc, and the tight frame mark B12 is the tight frame mark of the optic cup. In this case, the optic cup and optic disc can be measured based on the tight frame.

The network module 20 involved in the present disclosure may be based on multitasking. In some examples, the network module 20 may be a deep learning based neural network. In some examples, the network module 20 may include two tasks, one task may be a segmentation network 22 (described later) for image segmentation based on weakly supervised learning, and the other task may be a regression network 23 based on bounding box regression (described later). describe).

In some examples, segmentation network 22 may segment an input image to obtain objects (eg, optic cup and/or optic disc). In some examples, segmentation network 22 may be based on Multiple-Instance learning (MIL) and used to supervise tight labels. In some examples, the problem solved by segmentation network 22 may be a multi-label classification problem. In some examples, the input image may contain objects of at least one category of interest (may be simply referred to as category). Thus, the segmentation network 22 is able to recognize input images containing objects of at least one class of interest. In some examples, the input image may also be free of any objects. In some examples, the number of objects for each category of interest may be at least greater than one.

In some examples, regression network 23 may be used to predict tight boxes by category. In some examples, the regression network 23 may further predict the tight frame by predicting the offset of the tight frame relative to each pixel of the input image.

In some examples, the network module 20 may also include a backbone network 21 . The backbone network 21 can be used to extract the feature map of the input image (that is, the original image input to the network module 20). In some examples, backbone network 21 may extract high-level features for object representation. In some examples, the resolution of the feature map can be consistent with the input image (ie, the feature map can be single-scale and consistent with the size of the input image). As a result, the accuracy of recognition or measurement of a target whose size does not vary greatly can be improved. In some examples, image features of different scales can be continuously fused to obtain a feature map consistent with the scale of the input image. In some examples, feature maps may serve as input to segmentation network 22 and regression network 23 .

In some examples, backbone network 21 may include an encoding module and a decoding module. In some examples, the encoding module can be configured to extract image features at different scales. In some examples, the decoding module may be configured to map image features extracted at different scales back to the resolution of the input image to output a feature map. Thereby, it is possible to obtain a feature map matching the resolution of the input image.

FIG. 3 is a schematic diagram showing one example of the network module 20 related to the example of the present disclosure. In some examples, as shown in FIG. 3 , the network module 20 may include a backbone network 21 , a segmentation network 22 and a regression network 23 . The backbone network 21 can receive an input image and output a feature map. The feature maps can be used as input to segmentation network 22 and regression network 23 to obtain corresponding outputs. Specifically, the segmentation network 22 may use the feature map as input to obtain a first output, and the regression network 23 may use the feature map as input to obtain a second output. In this case, the input image can be input to the network module 20 to obtain the first output and the second output.

In some examples, the first output may be the result of image segmentation prediction. In some examples, the second output may be the result of the bounding box regression prediction.

In some examples, the first output may include the probability that each pixel in the input image belongs to each category. In some examples, the probability that each pixel belongs to each category can be obtained through an activation function. In some examples, the first output can be a matrix. In some examples, the size of the matrix corresponding to the first output may be M×N×C, where M×N may represent the resolution of the input image, M and N may correspond to the rows and columns of the input image respectively, and C may represent The number of categories. For example, for the two types of fundus images whose targets are the optic cup and the optic disc, the size of the matrix corresponding to the first output may be M×N×2.

In some examples, the value corresponding to the pixel at each position in the input image in the first output may be a vector, and the number of elements in the vector may be consistent with the number of categories. For example, for a pixel at position k in the input image, the corresponding value in the first output may be a vector p _k , the vector p _k may include C elements, and C may be the number of categories. In some examples, the values of the elements of the vector p _k may be values from 0 to 1.

In some examples, the second output may include an offset between the position of each pixel point in the input image and the tight bounding box of each category of objects. That is, the second output may include the offset of the tight box for the object of the definite class. In other words, what the regression network 23 predicts may be the offset of a tight box for an object of a definite class. In this case, when the overlapping degree of objects of different categories is high, the tight frames of the objects of the corresponding categories can be distinguished, and then the tight frames of the objects of the corresponding categories can be obtained. As a result, recognition or measurement of objects with a high overlap between objects of different classes can be compatible. In some examples, the offset in the second output may be used as the target offset.

In some examples, the target offset may be a normalized offset. In some examples, the object offset may be an offset normalized based on the average size of objects of each class. In some examples, the object offset is an offset that may be normalized based on the average width and average height of objects of each category. A target offset and a predicted offset (described later) may correspond to a real offset (described later). That is, if the real offset when training the network module 20 (which can be referred to as the training phase) is normalized, then the target offset (corresponding to the measurement phase) and The prediction offset (corresponding to the training stage) can also be automatically normalized accordingly. As a result, the accuracy of recognition or measurement of a target whose size does not vary greatly can be improved.

In some examples, the average size of the objects can be obtained by averaging the average width and average height of the objects. In some examples, the average size of objects, or the average width and average width of objects may be empirical values. In some examples, the average size of the target can be obtained by statistically collecting samples corresponding to the input image. In some examples, the width and height of the tight boxes of objects in the sample's label data may be averaged separately by class to obtain an average width and an average height. In some examples, the average width and average height may be averaged to obtain an average size of objects of that class. In some examples, the samples may be training samples (described later). Thereby, the average width and the average width, or the average size of the objects can be acquired through the training samples.

In some examples, the second output can be a matrix. In some examples, the size of the matrix corresponding to the second output can be M×N×A, where A can represent the size of all target offsets, M×N can represent the resolution of the input image, and M and N can correspond to The rows and columns of the input image. In some examples, if the size of a target offset is a 4×1 vector (that is, it can be represented by 4 numbers), then A can be C×4, and C can represent the number of categories. For example, for the two types of fundus images whose targets are the optic cup and the optic disc, the size of the matrix corresponding to the second output may be M×N×8.

In some examples, the value corresponding to the pixel at each position in the input image in the second output may be a vector. For example, the pixel at the kth position in the input image, the corresponding value in the second output can be expressed as: v _k =[v _k1 , v _k2 , . . . , v _kC ]. Among them, C can be the number of categories, and each element in _vk can be expressed as the target displacement of the target of each category. Thus, the target displacement and the corresponding category can be conveniently represented. In some examples, the elements of v _k may be 4-dimensional vectors.

In some examples, the backbone network 21 may be based on a U-net network. In this embodiment, the coding module of the backbone network 21 may include a unit layer and a pooling layer (pooling layers). The decoding module of the backbone network 21 may include a unit layer, an up-sampling layer (up-sampling layers, Up-sampling) and a skip connection unit (skip connection units, Skip-connection).

In some examples, the unit layers may include convolutional layers, batch normalization layers, and rectified linear unit layers (ReLu). In some examples, the pooling layers (Pooling) may be max pooling layers (Max-pooling). In some examples, skip connection units may be used to combine image features from deep layers and image features from shallow layers.

In addition, the segmentation network 22 may be a feed-forward neural network. In some examples, segmentation network 22 may include multiple layers of units. In some examples, segmentation network 22 may include multiple unit layers and convolutional layers (Conv). In addition, the regression network 23 may include dilated convolution layers (dilated convolution layers, Dilated Conv) and corrected linear unit layers (batch normalization layers, BN). In some examples, regression network 23 may include dilated convolutional layers, rectified linear unit layers, and convolutional layers.

FIG. 4 is a schematic diagram showing another example of the network module 20 related to the example of the present disclosure. It should be noted that, in order to describe the network structure of the network module 20 more clearly, in FIG. 4 , the network layers in the network module 20 are distinguished by the numbers in the arrows, where the arrow 1 represents the convolutional layer and the batch normalization layer And the network layer composed of the corrected linear unit layer (that is, the unit layer), the arrow 2 indicates the network layer composed of the expansion convolution layer and the corrected linear unit, the arrow 3 indicates the convolution layer, the arrow 4 indicates the maximum pooling layer, and the arrow 5 indicates In the upsampling layer, arrow 6 represents the skip connection unit.

As an example of the network module 20. As shown in Figure 4, an input image with a resolution of 256×256 can be input to the network module 20, and the image features are extracted through the unit layer (see arrow 1) and the maximum pooling layer (see arrow 4) of different levels of the encoding module , and through the different levels of unit layers (see arrow 1), upsampling layer (see arrow 5) and skip connection unit (see arrow 6) of the decoding module to continuously fuse image features of different scales to obtain the same scale as the input image Then the feature map 221 is input into the segmentation network 22 and the regression network 23 respectively to obtain the first output and the second output.

In addition, as shown in Figure 4, the segmentation network 22 can be composed of a unit layer (see arrow 1) and a convolutional layer (see arrow 3) in turn, and the regression network 23 can be composed of a plurality of dilated convolutional layers and corrected linear unit layers in turn. composed of network layers (see arrow 2), and convolutional layers (see arrow 3). Among them, the unit layer can be composed of convolution layer, batch normalization layer and rectified linear unit layer.

In some examples, the size of the convolution kernel of the convolution layer in the network module 20 may be set to 3×3. In some examples, the size of the convolution kernel of the maximum pooling layer in the network module 20 may be set to 2×2, and the convolution step may be set to 2. In some examples, the scale-factor of the up-sampling layer in the network module 20 may be set to 2. In some examples, as shown in FIG. 4, the expansion coefficients (dilation-factor) of multiple dilated convolutional layers in the network module 20 can be set to 1, 1, 2, 4, 8 and 16 in sequence (see arrow 2 above numbers). In some examples, as shown in FIG. 4 , the number of max pooling layers may be five. Thus, the size of the input image can be divided by 32 (32 may be 2 to the fifth power).

As described above, the measurement method involved in the present disclosure is a measurement method that uses the network module 20 trained based on the tight frame of the target to recognize the target and realize the measurement. Hereinafter, the training method of the network module 20 involved in the present disclosure (which may be referred to as the training method for short) will be described in detail with reference to the accompanying drawings. FIG. 5 is a flowchart showing a training method of the network module 20 according to an example of the present disclosure.

In some examples, the segmentation network 22 and the regression network 23 in the network module 20 can be trained simultaneously on an end-to-end basis.

In some examples, the segmentation network 22 and the regression network 23 in the network module 20 can be jointly trained to simultaneously optimize the segmentation network 22 and the regression network 23 . In some examples, through joint training, the segmentation network 22 and the regression network 23 can adjust the network parameters of the backbone network 21 through backpropagation, so that the feature map output by the backbone network 21 can better express the characteristics of the input image and input the segmentation Network 22 and Regression Network 23. In this case, both the segmentation network 22 and the regression network 23 perform processing based on the feature maps output by the backbone network 21 .

In some examples, segmentation network 22 may be trained using multi-instance learning. In some examples, the pixel points used for training the regression network 23 may be selected by using the expected cross-over-union ratio corresponding to the pixel points of the image to be trained (described later).

In some examples, as shown in FIG. 5 , the training method may include constructing a training sample (step S120), inputting the training sample into the network module 20 to obtain prediction data (step S140), and determining the network module 20 based on the training sample and the prediction data. and optimize the network module 20 based on the training loss (step S160). Thus, an optimized (also called trained) network module 20 can be obtained.

In some examples, in step S120, the training samples may include input image data and label data. In some examples, the input image data may include multiple images to be trained. For example, the image to be trained may be a fundus image to be trained.

In some examples, the plurality of images to be trained may include images containing objects. In some examples, the plurality of images to be trained may include images containing objects and images not containing objects. In some examples, a target may belong to at least one category. In some examples, the number of objects of each category in the image to be trained may be greater than or equal to one. For example, taking the fundus image as an example, if the optic cup and the optic disc are identified or measured, the target in the fundus image may be an optic disc and an optic cup. The examples of the present disclosure do not intend to limit the number of objects, the categories to which the objects belong, and the number of objects of each category.

In some examples, the label data may include a gold standard of the class to which the target belongs (the gold standard of the class may also sometimes be referred to as the true class) and a gold standard of the target's tight frame (the gold standard of the tight frame may also sometimes be called the true class). box mark). It should be noted that, unless otherwise specified, the tight frame of the target in the label data or the category of the target in the training method can be the gold standard by default.

In some examples, a labeling tool can be used to mark the tight frame (that is, the smallest bounding rectangle) of the target in the training image, and set the corresponding category for the tight frame to represent the true category of the target to obtain label data .

In some examples, in step S140, the prediction data corresponding to the training samples can be obtained through the network module 20 based on the input image data of the training samples. The predicted data may include predicted segmentation data output by the segmentation network 22 and predicted offsets output by the regression network 23.

In addition, the predicted split data may correspond to the first output, and the predicted offset may correspond to the second output (ie, may correspond to the target offset). That is, the predicted segmentation data may include the probability that each pixel in the image to be trained belongs to each category, and the predicted offset may include the offset between the position of each pixel in the image to be trained and the tight frame mark of each category. In some examples, corresponding to the object offset, the predicted offset may be an offset normalized based on the average size of objects of each category. As a result, the accuracy of recognition or measurement of a target whose size does not vary greatly can be improved. Preferably, the sizes of multiple objects of the same category may differ from each other by less than 10 times. For example, the sizes of multiple objects of the same category may differ from each other by 1 time, 2 times, 3 times, 5 times, 7 times, 8 times or 9 times, etc. As a result, the accuracy of target recognition or measurement can be further improved.

In order to more clearly describe the offset between the position of the pixel point and the tight frame of the target, and the offset after normalization, the following formulas are used to describe it. It should be noted that the predicted offset, the target offset and the actual offset are one type of offset, and are also applicable to the following formula (1).

Specifically, the position of a pixel can be expressed as (x, y), the tight frame of a target corresponding to the pixel is expressed as b=(xl, yt, xr, yb), and the tight frame b of the target is relative to the The offset of the position of the pixel (that is, the offset between the position of the pixel and the tight frame of the target) is expressed as t=(tl, tt, tr, tb), then tl, tt, tr, tb can satisfy the formula (1 ):

tl=(x-xl)/S _c1 ,

tt=(y-yt)/S _c2 ,

tr=(xr-x)/S _c1 ,

tb=(yb-y)/S _c2 ,

Among them, xl, yt can represent the position of the upper left corner of the tight frame of the target, xr, yb can represent the position of the lower right corner of the tight frame of the target, c can represent the index of the category to which the target belongs, S _c1 can represent the cth category The average width of the target, S _c2 can represent the average height of the target of the c-th category. Thereby, a normalized offset can be obtained. In some examples, S _c1 and S _c2 may both be the average size of objects of the c-th category.

But the example of the present disclosure is not limited thereto. In other examples, the tight frame of the target can also be represented by the position of the lower left corner and the upper right corner, or the tight frame of the target can be represented by the position, length and width of any corner . In addition, in some other examples, normalization may also be performed in other ways, for example, the offset may be normalized by using the length and width of the tight frame of the target.

In addition, the pixel points in formula (1) may be the pixel points of the image to be trained or the input image. That is, formula (1) can be applied to the real offset corresponding to the image to be trained in the training phase and the target offset corresponding to the input image in the measurement phase.

Specifically, for the training phase, the pixels can be the pixels in the image to be trained, the tight frame b of the target can be the gold standard of the tight frame of the target in the image to be trained, and the offset t can be the real offset (also can be called the gold standard for offset). Thus, the regression loss of the regression network 23 can be subsequently obtained based on the predicted offset and the actual offset. In addition, if the pixel point is the pixel point in the image to be trained, and the offset t is the prediction offset, then the tight frame of the predicted target can be deduced according to the formula (1).

In addition, for the measurement stage, the pixel points can be the pixel points in the input image, and the offset t can be the target offset, then the tight frame of the target in the input image can be deduced according to the formula (1) and the target offset (also That is, the object offset and the position of the pixel point can be substituted into formula (1) to obtain the object's tight frame). Thereby, a tight frame of the object in the input image can be obtained.

In some examples, in step S160 , the training loss of the network module 20 can be determined based on the label data corresponding to the training samples, the predicted segmentation data and the predicted offset, and then the network module 20 is trained based on the training loss to optimize the network module 20 .

As mentioned above, the network module 20 may include a segmentation network 22 and a regression network 23 . In some examples, based on the segmentation loss and the regression loss, the training loss of the network module 20 can be obtained. Thus, the network module 20 can be optimized based on the training loss. In some examples, the training loss may be the sum of segmentation loss and regression loss. In some examples, the segmentation loss may indicate the extent to which the pixels in the image to be trained in the predicted segmentation data belong to each real category, and the regression loss may indicate the closeness of the predicted offset to the actual offset.

In some examples, the segmentation loss of the segmentation network 22 can be obtained based on the predicted segmentation data and label data corresponding to the training samples. In this way, the predicted segmented data of the segmented network 22 can be approximated to the label data by the segmented loss. In some examples, segmentation loss can be obtained using multi-instance learning. In multi-instance learning, multiple bags to be trained can be obtained by category based on the real tight frame of the target in each image to be trained (that is, each category can correspond to multiple bags to be trained). Segmentation loss can be obtained based on multiple bags to be trained for each category. In some examples, the plurality of packets to be trained may include a plurality of positive packets and a plurality of negative packets. Thus, segmentation loss can be obtained based on the positive and negative packets of multi-instance learning. It should be noted that unless otherwise specified, the following positive and negative packages are for each category.

In some examples, multiple positive packets may be obtained based on the area within the true tight bounding box of the object. As shown in FIG. 6 , the area A2 in the image P1 to be trained is the area within the real tight frame B21 of the target T1.

In some examples, all the pixel points on each of the multiple straight lines connecting two opposite sides of the real tight frame of the target can be divided into a positive bag (that is, a straight line can correspond to a positive bag) . Specifically, the two ends of each straight line may be at the upper end and the lower end, or the left end and the right end of the real tight frame. As an example, as shown in FIG. 6 , the pixel points on the straight line D1 , straight line D2 , straight line D3 , straight line D4 , straight line D5 , straight line D6 , straight line D7 and straight line D8 can be divided into a positive packet respectively. However, the examples of the present disclosure are not limited thereto, and in other examples, other ways may also be used to divide positive packets. For example, the pixels at a specific position of the real tight frame can be divided into a positive bag.

In some examples, the plurality of straight lines may include at least one set of first parallel lines that are parallel to each other. For example, the plurality of straight lines may include one set of first parallel lines, two sets of first parallel lines, three sets of first parallel lines, or four sets of first parallel lines, and the like. In some examples, the number of straight lines in the first parallel line may be greater than or equal to two.

In some examples, the plurality of straight lines may include at least one set of first parallel lines that are parallel to each other and second parallel lines that are respectively perpendicular to each set of first parallel lines. Specifically, if the multiple straight lines include a set of first parallel lines, then the multiple straight lines may also include a set of second parallel lines perpendicular to the set of first parallel lines; if the multiple straight lines include multiple sets of first parallel lines, Then the multiple straight lines may further include multiple sets of second parallel lines perpendicular to each set of first parallel lines. As shown in Figure 6, a group of first parallel lines may include parallel straight lines D1 and straight lines D2, and a group of second parallel lines corresponding to the group of first parallel lines may include parallel straight lines D3 and straight lines D4, wherein the straight line D1 Can be perpendicular to straight line D3; Another group of first parallel lines can include parallel straight line D5 and straight line D6, and a group of second parallel lines corresponding to this group of first parallel lines can include parallel straight line D7 and straight line D8, wherein, The straight line D5 may be perpendicular to the straight line D7. In some examples, the number of straight lines in the first parallel line and the second parallel line may be greater than or equal to two.

As noted above, in some examples, the plurality of straight lines may include multiple sets of first parallel lines (ie, the plurality of straight lines may include parallel lines at different angles). In this case, it is possible to optimize the segmentation network 22 by dividing positive packets from different angles. Thereby, the accuracy of the segmented data predicted by the segmenting network 22 can be improved.

In some examples, the angle of the first parallel line may be the angle between the extension line of the first parallel line and the extension line of any non-intersecting side of the real tight frame, and the angle of the first parallel line may be greater than -90 ° and less than 90°. For example, the included angle may be -89°, -75°, -50°, -25°, -20°, 0°, 10°, 20°, 25°, 50°, 75° or 89°. Specifically, if the extension line of the non-intersecting side rotates clockwise less than 90° to the angle formed by the extension line of the first parallel line, the angle can be greater than 0° and less than 90°, if the extension line of the non-intersecting side The angle formed by the counterclockwise rotation less than 90° (that is, clockwise rotation greater than 270°) to the extension line of the first parallel line can be greater than -90° and less than 0°, if the unintersected side is parallel to the first If the lines are parallel, the included angle can be 0°. As shown in Figure 6, the angles of the straight lines D1, D2, D3 and D4 can be 0°, and the angles of the straight lines D5, D6, D7 and D8 (that is, the angle C1) can be 25°. In some examples, the angle of the first parallel can be a hyperparameter that can be optimized during training.

In addition, the angle of the first parallel line can also be described in a manner of rotation of the image to be trained. The angle of the first parallel line may be the angle of rotation. Specifically, the angle of the first parallel line can be the rotation angle of rotating the image to be trained so that any side of the image to be trained that does not intersect the first parallel line is parallel to the first parallel line, wherein the first parallel line is parallel to The angle can be greater than -90° and less than 90°, the rotation angle for clockwise rotation can be positive degrees, and the rotation angle for counterclockwise rotation can be negative degrees. However, the examples of the present disclosure are not limited thereto. In some other examples, the angle of the first parallel line may also be in other ranges according to the way of describing the angle of the first parallel line. For example, if the description is based on the side of the real tight frame intersecting the first parallel line, the angle of the first parallel line may also be greater than 0° and less than 180°.

In some examples, multiple negative bags may be obtained based on regions outside the true tight bounding box of the object. As shown in FIG. 6 , the area A1 in the image P1 to be trained is an area outside the real tight frame B21 of the target T1 . In some examples, the negative packet may be a single pixel in an area outside the true tight frame of all objects of a category (that is, one pixel may correspond to one negative packet).

As mentioned above, in some examples, a segmentation loss may be obtained based on a number of bags to be trained for each category. In some examples, the segmentation loss may include a unary term (also known as a unary loss) and a pairwise term (also known as a pairwise loss). In some examples, unary terms may describe the degree to which each bag to be trained belongs to each true class. In this case, the tight box can be constrained by both the positive and negative packets through the unary loss. In some examples, the pair item may describe the degree to which the pixel of the image to be trained and the pixels adjacent to the pixel belong to the same category. In this case, the pairwise loss smoothes the predicted split results.

In some examples, the segmentation loss by category can be obtained, and the segmentation loss (ie, the total segmentation loss) can be obtained based on the segmentation loss of the category. In some examples, the total segmentation loss L _seg can satisfy the formula:

Among them, L _c can represent the segmentation loss of category c, and C can represent the number of categories. For example, if the optic cup and the optic disc in the fundus image are identified, C can be 2, and if only the optic cup or only the optic disc is identified, then C can be 1.

In some examples, the segmentation loss _Lc for class c can satisfy the formula:

Among them, φ _c can represent a unary term,

Can represent paired items, P can represent the degree (also can be called probability) that each pixel point of segmentation network 22 prediction belongs to each category,

can represent a collection of multiple positive bags,

Can represent a collection of multiple negative packets, and λ can represent a weight factor. The weight factor λ can be a hyperparameter, which can be optimized during the training process. In some examples, a weighting factor λ can be used to switch the two losses (ie unary and pairwise terms).

Generally speaking, in multi-instance learning, if each positive bag of a category includes at least one pixel belonging to the category, then the pixel with the highest probability of belonging to the category in each positive bag can be used as the positive sample of the category; If there is no pixel belonging to this category in each negative bag of a category, even the pixel with the highest probability in the negative bag is a negative sample of this category. Based on this situation, in some examples, the unary term φ _c corresponding to category c can satisfy the formula:

Among them, P _c (b) can represent the probability that a package to be trained belongs to category c (also can be called the degree of belonging to category c or the probability of package to be trained), b can represent a package to be trained,

can represent a collection of multiple positive bags,

can represent a collection of multiple negative bags,

max can represent the maximum value function,

It can represent the cardinality of a set of multiple positive packets (that is, the number of elements in the set), β can represent a weight factor, and γ can represent a focusing parameter. In some examples, the value of the unary term is minimum when P _c (b) corresponding to the positive packet is equal to 1 and P _c (b) corresponding to the negative packet is equal to 0. That is, the unary loss is the smallest. In some examples, weighting factor β may be between 0-1. In some examples, the focus parameter γ may be greater than or equal to zero.

In some examples, P _c (b) may be the maximum probability of belonging to category c among the pixels of a package to be trained. In some examples, P _c (b) can satisfy the formula: P _c (b)=max _k∈b (p _kc ), where p _kc can indicate that the pixel at the kth position of the package b to be trained belongs to category c The probability.

In some examples, the maximum probability of belonging to a class among the pixels of a packet to be trained may be obtained based on a smooth maximum approximation function (ie, obtain P _c (b)). Thus, a relatively stable maximum probability can be obtained.

In some examples, the maximum smoothing approximation function may be at least one of an α-softmax function and an α-quasimax function. In some examples, for the maximum value function f(x)=max _1≤i≤n x _i , max may represent the maximum value function, and n may represent the number of elements (may correspond to the number of pixels in the package to be trained) , _xi can represent the value of the element (can correspond to the probability that the pixel at the i-th position of the package to be trained belongs to a category. In this case, the α-softmax function can satisfy the formula:

Wherein, α can be a constant. In some examples, the larger α is, the closer to the maximum value of the maximum function. In addition, the α-quasimax function can satisfy the formula:

Wherein, α can be a constant. In some examples, the larger α is, the closer to the maximum value of the maximum function.

As mentioned above, in some examples, the pairwise term may describe the degree to which a pixel of the image to be trained and its neighbors belong to the same category. That is, the pairwise term can evaluate the closeness of the probability that adjacent pixels belong to the same class. In some examples, the pairwise term for class c

The formula can be satisfied:

Among them, ε can represent the set of all pairs of adjacent pixels, (k, k _' ) can represent a pair of adjacent pixels, k and k' can represent the positions of two pixels of adjacent pixel pairs, p _kc can represent the probability that the pixel at the kth position belongs to class c, and p _k'c can represent the probability that the pixel at the k'th position belongs to class c. In some examples, adjacent pixel points may be eight-neighborhood or four-neighborhood pixel points. In some examples, adjacent pixel points of each pixel point in the image to be trained may be acquired to obtain a set of adjacent pixel point pairs.

As mentioned above, training loss can include regression loss. In some examples, the regression loss of the regression network 23 can be obtained based on the predicted offset corresponding to the training samples and the actual offset corresponding to the label data. In this case, the predicted offset of the regression network 23 can be approximated to the true offset by the regression loss.

In some examples, the real offset may be the offset between the position of the pixel of the image to be trained and the real tight frame of the target in the label data. In some examples, corresponding to the predicted offset, the real offset may be an offset normalized based on the average size of objects of each category. For specific content, refer to the relevant description about the offset in the above formula (1).

In some examples, corresponding pixel points in the image to be trained may be selected as positive samples to train the regression network 23 . That is, the regression network 23 can be optimized by using positive samples. Specifically, the regression loss can be obtained based on the positive samples, and then the regression network 23 can be optimized using the regression loss.

In some examples, the regression loss can satisfy the formula:

Among them, C can represent the number of categories, M _c can represent the number of positive samples of the c-th category, t _ic can represent the true offset corresponding to the i-th positive sample of the c-th category, and v _ic can represent the c-th positive sample The prediction offset corresponding to the i-th positive sample of the category, s(x) can represent the sum of smooth L1 losses of all elements in x. In some examples, for x is t _ic -v _ic , s(t _ic -v _ic ) can represent the prediction offset corresponding to the i-th positive sample of the c-th category using smooth L1 loss and the i-th positive sample The degree to which the corresponding true offsets agree. Here, the positive samples may be pixels in the image to be trained that are selected for training the regression network 23 (that is, for calculating the regression loss). Thereby, the regression loss can be obtained.

In some examples, the true offset corresponding to the positive sample may be the offset corresponding to the true tight frame. In some examples, the true offset corresponding to the positive sample may be the offset corresponding to the matching tight box. Therefore, it can be applied to the situation where positive samples fall into multiple real tight frames.

In some examples, the smooth L1 loss function can satisfy the formula:

Among them, σ can represent a hyperparameter, which is used to switch between the smooth L1 loss function and the smooth L2 loss function, and x can represent the variable of the smooth L1 loss function.

As mentioned above, in some examples, corresponding pixel points in the image to be trained may be selected as positive samples to train the regression network 23 .

In some examples, the positive samples can be the pixels in the image to be trained that fall into at least one real tight frame of the target (that is, the pixels that fall into the real tight frame of at least one target can be selected from the image to be trained pixel as a positive sample). In this case, optimizing the regression network 23 based on the pixels falling within the true tight frame of at least one object can improve the efficiency of the regression network 23 optimization. In some examples, pixels falling within at least one real tight bounding box of an object may be selected from the image to be trained by category as positive samples of each category. In some examples, the regression loss of each category can be obtained based on the positive samples of each category.

As mentioned above, pixels that fall within at least one real tight frame of an object can be selected from the images to be trained by category as positive samples for each category. In some examples, the aforementioned positive samples of each category may be screened, and the regression network 23 may be optimized based on the screened positive samples. That is, the positive samples used to calculate the regression loss can be filtered positive samples.

In some examples, after obtaining the positive samples of each category (that is, after selecting the pixels that fall into the real tight frame of at least one target as the positive samples from the image to be trained), the corresponding positive samples can be obtained. Match the tight frame, and then filter the positive samples of each category based on the matched tight frame. In this way, the regression network 23 can be optimized by using the positive samples of each category screened based on matching tight frames.

In some examples, the real tight frame that a pixel (for example, a positive sample) falls into can be filtered to obtain a matching tight frame for the pixel. In some examples, the matching tight frame may be the real tight frame in which the pixel of the image to be trained falls into and which has the smallest real offset relative to the position of the pixel. For the positive sample, the matching tight frame may be the real tight frame in which the positive sample falls into the true tight frame with the smallest real offset relative to the position of the positive sample.

Specifically, in a category, if a pixel point (such as a positive sample) only falls within the real tight frame of an object to be measured, then the real tight frame is used as the matching tight frame (that is, the matching tight frame can be the real tight frame that the pixel falls into), if the pixel falls into the real tight frame of multiple objects to be measured, then the real tight frame of multiple objects to be measured can be compared to the pixel The true tight frame with the smallest true offset of the position is taken as the matching tight frame. In this way, the matching tight frame corresponding to the pixel can be obtained.

In some examples, the smallest real offset (that is, the real tight frame with the smallest real offset) can be obtained by comparing L1 normal forms of the real offsets. In this case, the smallest real offset can be obtained based on the L1 normal form, and then the matching tight frame can be obtained. Specifically, the absolute value of the elements of each real offset in the multiple real offsets can be calculated and then summed to obtain multiple offset values, and the real offset with the smallest offset value can be obtained by comparing multiple offset values as the smallest true offset.

In some examples, the positive samples of each category may be screened by using the expected intersection and union comparison corresponding to the pixel points (for example, positive samples). In this case, pixels far away from the center of the true tight frame or the matching tight frame can be filtered out. In this way, it is possible to reduce the adverse effect of pixels away from the center on the optimization of the regression network 23 and to improve the efficiency of the optimization of the regression network 23 .

In some examples, the expected intersection ratio corresponding to the positive sample can be obtained based on the matching tight frame, and the positive samples of each category can be screened based on the expected intersection ratio. Specifically, after obtaining the positive samples of each category, the matching tight frame corresponding to the positive sample can be obtained, and then based on the matching tight frame, the expected intersection ratio corresponding to the positive sample can be obtained and the positive intersection ratio of each category can be compared based on the expected intersection. The samples are screened, and finally the regression network 23 can be optimized by using the screened positive samples of each category. But the examples of the present disclosure are not limited thereto. In some examples, the pixel points of the images to be trained can be screened by categories and using the expected intersection and comparison of the pixels of the images to be trained (that is, the pixels of the images to be trained can not be selected first. In the case of selecting at least the pixels that fall into the real tight frame of one target as positive samples, use the expected intersection and comparison to filter the pixels of the image to be trained). In addition, the pixels that do not fall into any real tight frame (that is, there is no pixel matching the tight frame) can be identified. In this way, subsequent screening of the pixel can be facilitated. For example, the expected intersection ratio of a pixel point may be set to 0 to identify the pixel point. Specifically, the pixels of the image to be trained can be screened by category and based on the expected intersection and comparison corresponding to the pixels of the image to be trained, and the regression network 23 can be optimized based on the screened pixels.

In some examples, the regression network 23 may be optimized by selecting pixels whose expected intersection ratio is greater than a preset expected intersection ratio from the pixels of the image to be trained. In some examples, the regression network 23 may be optimized by selecting positive samples whose expected intersection ratio is greater than a preset expected intersection ratio from the positive samples of each category. In this way, pixels (for example, positive samples) meeting the preset expected intersection ratio can be obtained. In some examples, the preset expected intersection ratio may be greater than 0 and less than or equal to 1. For example, the preset expected cross-merging ratio may be 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1. In some examples, the preset desired intersection ratio may be a hyperparameter. The preset expected intersection and union ratio can be adjusted during the training process of the regression network 23 .

In some examples, the expected intersection ratio corresponding to the pixel point can be obtained based on the matching tight frame of the pixel point (for example, the positive sample). In some examples, if the pixel does not correspond to a matching tight frame, the pixel may be ignored or the expected intersection ratio corresponding to the pixel may be set to 0. In this case, it is possible to make pixels that do not have a matching tight frame not used for the training of the regression network 23 or reduce the contribution to the regression loss. It should be noted that, unless otherwise specified, the following description of the expected intersection ratio corresponding to a pixel point is also applicable to the expected intersection ratio corresponding to a positive sample.

In some examples, the expected intersection-over-union ratio may be a maximum value among intersection-over-union ratios (Intersection-over-union, Iou) between the matching tight frame of the pixel point and multiple borders constructed around the pixel point as the center. Thus, a desired cross-merge ratio can be obtained. However, the examples of the present disclosure are not limited thereto. In other examples, the expected intersection ratio may be the maximum value of the intersection ratios between the real tight frame of the pixel and multiple borders constructed around the pixel. In some examples, a plurality of frames may be constructed with a pixel point of the image to be trained as the center point, and the maximum value among the intersection ratios of the plurality of frames and the matching tight frame labels of the pixel point may be obtained as the expected intersection ratio. In some examples, the multiple borders may be of different sizes. Specifically, each frame in the plurality of frames may have a different width or height from other frames.

FIG. 7 is a schematic diagram showing a frame constructed centering on a pixel point involved in an example of the present disclosure. In order to describe the desired cross-join ratio more clearly, it will be described below in conjunction with FIG. 7 . As shown in FIG. 7 , the pixel M1 has a tight matching frame B31 , and the frame B32 is an exemplary frame constructed centering on the pixel M1 .

In some examples, let W be the width of the matching tight frame, H be the height of the matching tight frame, (r ₁ W, r ₂ H) represent the position of the pixel, r ₁ , r ₂ are the pixels in the matching tight frame The relative position of the target, and satisfy the conditions: 0<r1, r2<1. Multiple borders can be constructed based on pixels. As an example, as shown in FIG. 7 , the position of the pixel M1 can be expressed as (r ₁ W, r ₂ H), and the width and height of the matching tight frame B31 can be W and H respectively.

In some examples, the tight matching frame may be divided into four regions by the two centerlines of the tight matching frame. The four areas may be an upper left area, an upper right area, a lower left area, and a lower right area. For example, as shown in FIG. 7 , the center line D9 and center line D10 of the matching tight frame B31 can divide the matching tight frame B31 into an upper left area A3 , an upper right area A4 , a lower left area A5 and a lower right area A6 .

The following describes the desired cross-over-union ratio by taking the pixels in the upper left region (that is, r ₁ , r ₂ satisfying the condition: 0<r ₁ , r ₂ ≤0.5) as an example. For example, as shown in FIG. 7 , the pixel point M1 may be a point in the upper left area A3.

First, construct multiple borders centered on pixels. Specifically, for r ₁ , r ₂ satisfy the condition: 0<r ₁ , r ₂ ≤0.5, the four boundary conditions corresponding to pixel M1 can be respectively:

w ₁ =2r ₁ W, h ₁ =2r ₂ H;

w ₂ =2r ₁ W, h ₂ =2(1-r ₂ )H;

w ₃ =2(1-r ₁ )W, h ₃ =2r ₂ H;

w ₄ =2(1-r ₁ )W, h ₄ =2(1-r ₂ )H;

Among them, w ₁ and h ₁ can represent the width and height of the first boundary condition, w ₂ and h ₂ can represent the width and height of the second boundary condition, w ₃ and h ₃ can represent the width of the third boundary condition and height, _w4 and _h4 can represent the width and height of the fourth boundary condition.

Second, calculate the intersection-union ratio between the frame and the matching tight frame under each boundary condition. Specifically, the intersection and union ratios corresponding to the above four boundary conditions can satisfy formula (2):

IoU ₁ (r ₁ ,r ₂ )=4r ₁ r ₂ ,

IoU ₂ (r ₁ ,r ₂ )=2r ₁ /(2r ₁ (1-2r ₂ )+1),

IoU ₃ (r ₁ ,r ₂ )=2r ₂ /(2r ₂ (1-2r ₁ )+1),

IoU ₄ (r ₁ ,r ₂ )=1/(4(1-r ₁ )(1-r ₂ )),

Among them, IoU ₁ (r ₁ ,r ₂ ) can represent the IoU ratio corresponding to the first boundary condition, IoU ₂ (r ₁ ,r ₂ ) can represent the IoU ratio corresponding to the second boundary condition, and IoU ₃ (r ₁ , r ₂ ) can represent the intersection ratio corresponding to the third boundary condition, and IoU ₄ (r ₁ , r ₂ ) can represent the intersection ratio corresponding to the fourth boundary condition. In this case, the intersection and union ratio corresponding to each boundary condition can be obtained.

Finally, the largest intersection and union ratio among multiple boundary conditions is the expected intersection and union ratio. In some examples, for r ₁ , r ₂ satisfy the condition: 0<r ₁ , r ₂ ≤0.5, and the expected cross-over-union ratio can satisfy formula (3):

In addition, the expected intersection-over-union ratios for pixels located in other regions (ie, the upper-right region, the lower-left region, and the lower-right region) can be obtained based on a similar method for the upper-left region. In some examples, for r ₁ satisfying the condition: 0.5≤r ₁ <1, r ₁ in formula (3) can be replaced by 1-r ₁ , and for r ₂ satisfying the condition: 0.5≤r ₂ <1, the formula The r ₂ in (3) is replaced by 1-r ₂ . In this way, the expected intersection ratio of pixels located in other regions can be obtained. That is, the pixels located in other regions can be mapped to the upper left region through coordinate conversion, and then the expected intersection ratio can be obtained based on the consistent manner of the upper left region. Therefore, for r ₁ , r ₂ satisfies the conditions: 0<r ₁ , r ₂ <1, and the expected intersection and union ratio can satisfy the formula (4):

Among them, IoU ₁ (r ₁ , r ₂ ), IoU ₂ (r ₁ , r ₂ ), IoU ₂ (r ₁ , r ₂ ) and IoU ₂ (r ₁ , r ₂ ) can be obtained by formula (2). Thus, a desired cross-merge ratio can be obtained.

As mentioned above, in some examples, the expected intersection ratio corresponding to the pixel point can be obtained based on the matching tight frame of the pixel point (eg positive sample). However, the examples of the present disclosure are not limited thereto. In some other examples, during the process of screening the positive samples of each category or the pixel points of the image to be trained, no matching tight frame may be obtained. Specifically, the expected intersection ratio corresponding to the pixel point can be obtained based on the real tight frame corresponding to the pixel point (such as a positive sample), and the pixel points of each category can be screened based on the expected intersection ratio. In this case, the expected intersection and union ratio may be the maximum value among the expected intersection and union ratios corresponding to each real tight frame. In obtaining the expected intersection ratio corresponding to the pixel based on the real tight frame, you can refer to the relevant description of obtaining the expected intersection ratio corresponding to the pixel based on the matching tight frame of the pixel.

Hereinafter, the measurement method involved in the present disclosure will be described in detail with reference to the accompanying drawings. The network module 20 involved in the measurement method can be trained by the above-mentioned training method. FIG. 8 is a flow chart showing a measurement method of tight-frame-based deep learning related to an example of the present disclosure.

In some examples, as shown in FIG. 8 , the measurement method may include obtaining an input image (step S220), inputting the input image into the network module 20 to obtain a first output and a second output (step S240), and based on the first output and The second output identifies the objects to obtain the tight frames of the objects of each category (step S260).

In some examples, in step S220, the input image may include at least one object. In some examples, at least one object may belong to at least one category of interest (category of interest may be simply referred to as category). Specifically, if the input image includes one object, the object may belong to one category of interest, and if the input image includes multiple objects, the multiple objects may belong to at least one category of interest. In some examples, the input image may also not include objects. In this case, it is possible to judge an input image in which no object exists.

In some examples, in step S240, the first output may include the probability that each pixel in the input image belongs to each category. In some examples, the second output may include an offset between the position of each pixel point in the input image and the tight bounding box of each category of objects. In some examples, the offset in the second output may be used as the target offset. In some examples, the network module 20 may include a backbone network 21 , a segmentation network 22 and a regression network 23 . In some examples, segmentation network 22 may be image segmentation based on weakly supervised learning. In some examples, regression network 23 may be based on bounding box regression. In some examples, backbone network 21 may be used to extract feature maps of input images. In some examples, segmentation network 22 may take the feature map as input to obtain a first output, and regression network 23 may take the feature map as input to obtain a second output. In some examples, the resolution of the feature map may be consistent with the input image. For details, refer to the relevant description of the network module 20 .

As mentioned above, the first output may include the probability that each pixel in the input image belongs to each category, and the second output may include the offset between the position of each pixel in the input image and the tight frame of each category of objects. In some examples, in step S260 , based on the first output, an object offset of a category corresponding to a pixel at a corresponding position may be selected from the second output, and a tight frame of each category of objects may be obtained based on the object offset. Therefore, the target can be accurately measured based on the tight frame of the target.

In some examples, the position of the pixel with the highest local probability belonging to each category can be obtained from the first output as the first position, and each pixel can be obtained based on the position corresponding to the first position in the second output and the target offset of the corresponding category. A tight box for the category's target. In this case, an object or objects of each category can be identified. In some examples, a non-maximum suppression method (Non-Maximum Suppression, NMS) may be used to obtain the first position. In some examples, the number of first positions corresponding to each category may be greater than or equal to one. But the example of the present disclosure is not limited thereto. For an input image with only one object in each category, in some examples, the position of the pixel with the highest probability belonging to each category can be obtained from the first output as the first position, based on the second In the output, the position corresponding to the first position and the target offset of the corresponding class obtain the tight frame of the target of each class. That is, the first position can be obtained by using the maximum value method. In some examples, the first position may also be obtained by using a smooth maximum suppression method.

In some examples, tight boxes for objects of various categories may be obtained based on the first position and the object offset. In some examples, the first position and the target offset can be substituted into equation (1) to infer the tight frame of the target. Specifically, the first position can be used as the position (x, y) of the pixel point in the formula (1) and the target offset can be used as the offset t to obtain the tight frame b of the target.

In some examples, the measuring method may further include measuring the size of each target based on the tight frame of the target (not shown). Thereby, the target can be accurately measured based on the tight frame of the target. In some examples, the dimensions of the object may be the width and height of the tight box of the object.

Hereinafter, the measurement device 100 based on deep learning of tight frames involved in the present disclosure will be described in detail with reference to the accompanying drawings. The measurement device 100 may also be called an identification device or an auxiliary measurement device. The measuring device 100 according to the present disclosure is used to implement the above-mentioned measuring method. FIG. 9 is a block diagram illustrating a measurement device 100 based on tight-framework deep learning according to an example of the present disclosure.

As shown in FIG. 9 , in some examples, the measurement device 100 may include an acquisition module 10 , a network module 20 and an identification module 30 .

In some examples, acquisition module 10 may be configured to acquire an input image. For details, refer to the relevant description in step S220. In some examples, network module 20 may be configured to receive an input image and obtain a first output and a second output based on the input image. For details, refer to the relevant description of the network module 20 . In some examples, the identification module 30 may be configured to identify objects based on the first output and the second output to obtain tight frames of objects of each category. For details, refer to the relevant description in step S260. In some examples, the measurement device 100 may further include a measurement module (not shown). The measurement module can be configured to measure the size of each target based on the tight frame of the target. Thereby, the target can be accurately measured based on the tight frame of the target. In some examples, the dimensions of the object may be the width and height of the tight box of the object.

The measurement method and measurement device 100 involved in this disclosure construct a network module 20 including a backbone network 21, a segmentation network 22 for image segmentation based on weakly supervised learning, and a regression network 23 based on frame regression. The network module 20 is a tight frame based on the target. The backbone network 21 receives an input image (such as a fundus image) and extracts a feature map consistent with the resolution of the input image, and inputs the feature map into the segmentation network 22 and the regression network 23 respectively to obtain the first output and the second output, Then, based on the first output and the second output, a tight frame of the target in the input image is obtained to realize the measurement. In this case, the network module 20 based on the training of the target's tight frame can accurately predict the target's tight frame in the input image, and then can accurately measure based on the target's tight frame. In addition, predicting the normalized offset through the regression network 23 can improve the accuracy of identifying or measuring objects with small size changes. In addition, by using the expected cross-over-union ratio to screen the pixels for optimizing the regression network 23 , it is possible to reduce the negative impact of pixels far away from the center on the optimization of the regression network 23 and to improve the efficiency of the regression network 23 optimization. In addition, the regression network 23 predicts the offset of a definite category, which can further improve the accuracy of target recognition or measurement.

Hereinafter, the measurement method involved in the present disclosure will be further described in detail by taking the input image as an example of a fundus image. The measurement method for fundus images can also be referred to as the measurement method for fundus images based on deep learning of tight frames. In addition, the fundus images described in the examples of the present disclosure are used to illustrate the technical solutions of the present disclosure more clearly, and do not constitute limitations on the technical solutions provided in the present disclosure. Unless otherwise specified, the measurement method for the input image, the measurement device 100, and the corresponding training method are all applicable to the fundus image. As an example of a fundus image, for example, FIG. 2( a ) shows a fundus image captured by a fundus camera.

The measurement method for the fundus image involved in this embodiment can use the network module 20 trained based on the tight frame of the target to identify at least one target in the fundus image so as to realize the measurement. The fundus image may include at least one object, which may be the optic cup and/or optic disc. That is, the network module 20 that can be trained based on the tight frame of the target can identify the optic cup and/or optic disc in the fundus image so as to realize the measurement of the optic cup and/or optic disc. Thereby, the optic cup and/or optic disc in the fundus image can be measured based on the tight frame. In some other examples, it is also possible to identify microvascular tumors in the fundus image so as to realize the measurement of microvascular tumors.

In some examples, as shown in FIG. 10 , the measurement method for the fundus image may include acquiring the fundus image (step S420), inputting the fundus image into the network module 20 to obtain the first output and the second output (step S440), and based on The first output and the second output identify the target to obtain the tight frame of the optic cup and/or optic disc in the fundus image to achieve measurement (step S460).

In some examples, in step S420, a fundus image may be acquired. In some examples, a fundus image may include at least one object. In some examples, at least one object may be identified to identify the object and the category to which the object belongs (ie, the category of interest). For fundus images, the category of interest (also referred to as category for short) may be the optic cup and/or the optic disc. The target for each category can be the optic cup or optic disc. Specifically, if the optic disc or optic cup in the fundus image is identified, the category of interest can be the optic cup or optic disc, and if the optic disc and optic cup in the fundus image are identified, the category of interest can be the optic cup and video discs. In some examples, the fundus image may also not include the optic disc or cup. In this case, it is possible to judge a fundus image in which no optic disc or optic cup exists.

In some examples, in step S440, the fundus image may be input into the network module 20 to obtain the first output and the second output. The first output can include the probability that each pixel in the fundus image belongs to each category (that is, the optic cup and/or optic disc), and the second output can include the position of each pixel in the fundus image and the tight frame of each category of targets. target offset. In some examples, the offset in the second output may be used as the target offset. For the fundus image, the backbone network 21 in the network module 20 can be used to extract the feature map of the fundus image. In some examples, the feature map may be consistent with the resolution of the fundus image. The decoding module in the network module 20 is configured to map the image features extracted at different scales back to the resolution of the fundus image to output a feature map. For details, refer to the relevant description of the network module 20 .

In addition, for fundus images, the training samples of the network module 20 may include fundus image data (that is, multiple fundus images to be trained) and label data corresponding to the fundus image data. The label data may include a gold standard for the class to which the optic cup and/or optic disc belongs, and a gold standard for the tight frame of the optic cup and/or optic disc.

In some examples, in step S460, the target may be identified based on the first output and the second output to obtain a tight frame of the optic cup and/or optic disc in the fundus image to achieve measurement. Thereby, the optic cup and/or optic disc can be accurately measured subsequently based on the tight frame. In some examples, based on the first output, the target offset of the category corresponding to the pixel point at the corresponding position (that is, the optic cup and/or optic disc) can be selected from the second output, and the optic cup and/or optic disc can be obtained based on the target offset. or the tight frame of the optic disc. For the fundus image, preferably, the position of the pixel point with the highest probability belonging to each category can be obtained from the first output as the first position, based on the position corresponding to the first position in the second output and the target offset of the corresponding category to obtain Tight frame for optic cup and/or optic disc. In some examples, the first position may be obtained using a maximum value method. For details, refer to the related description of step S260.

In some examples, a tight frame for the optic cup and/or optic disc may be obtained based on the first position and the target offset. For details, refer to the related description of step S260.

In some examples, the measurement method for the fundus image may further include obtaining a ratio of the optic cup to the optic disc based on the tight frames of the optic cup and the optic disc in the fundus image (not shown). Thus, the ratio of the optic cup to the optic disc can be accurately measured based on the tight framing of the optic cup and optic disc.

In some examples, after obtaining the tight frame of the optic cup and/or optic disc in step S460, the optic cup and/or optic disc can be measured based on the tight frame of the optic cup and/or the tight frame of the optic disc in the fundus image to obtain the size of the optic cup and/or optic disc (the size may be, for example, the vertical diameter and the horizontal diameter). Thus, the size of the optic cup and/or optic disc can be accurately measured. In some examples, the cup and/or disc size can be obtained by taking the height of the tight frame as the vertical diameter of the optic cup and/or optic disc and the width of the tight frame as the horizontal diameter of the optic cup and/or disc.

In some examples, after obtaining the sizes of the optic cup and optic disc based on the tight frame, the ratio of the optic cup to the optic disc (also referred to as the cup-to-disc ratio) can be obtained. In this case, the cup-to-disc ratio can be obtained based on the tight frame, so that the cup-to-disk ratio can be accurately measured.

In some examples, the cup-to-disk ratio may include a vertical cup-to-disk ratio and a horizontal cup-to-disk ratio. The vertical cup-to-disk ratio may be the ratio of the vertical diameters of the optic cup and optic disc. The horizontal cup-to-disk ratio may be the ratio of the horizontal diameters of the optic cup and optic disc. In some examples, let b _oc =(xl _oc ,yt _oc ,xr _oc ,yb _oc ) for the tight frame of the optic cup in the fundus image and b _od =(xl _od ,yt _od , xr _od , yb _od ), where the first two values of b _oc and b _od can represent the position of the upper left corner of the tight frame, and the latter two values can represent the position of the lower right corner of the tight frame, then

The vertical cup-to-disk ratio can satisfy the formula: VCDR=(yb _oc -yt _oc )/(yb _od -yt _od ),

The horizontal cup-to-disk ratio may satisfy the formula: HCDR=(xr _oc −xl _oc )/(xr _od −xl _od ).

Hereinafter, the measurement device 200 for fundus images involved in the present disclosure will be described in detail with reference to the accompanying drawings. The measurement device 200 for fundus images can also be referred to as the measurement device 200 for fundus images based on deep learning of tight frames. The measuring device 200 for fundus images involved in the present disclosure is used to implement the above-mentioned measuring method for fundus images. FIG. 11 is a block diagram showing a measurement device 200 for a fundus image according to an example of the present disclosure.

As shown in FIG. 11 , in some examples, the measuring device 200 for fundus images may include an acquisition module 50 , a network module 20 and an identification module 60 . In some examples, acquisition module 50 may be configured to acquire fundus images. For details, refer to the relevant description in step S420. In some examples, network module 20 may be configured to receive a fundus image and obtain a first output and a second output based on the fundus image. For details, refer to the relevant description of the network module 20 and step S440. In some examples, the identification module 60 may be configured to identify the target based on the first output and the second output to obtain a tight frame of the optic cup and/or optic disc in the fundus image for measurement. For details, refer to the relevant description in step S460. In some examples, the measuring device 200 may further include a cup-to-disk ratio module (not shown). The cup-to-disk ratio module may be configured to obtain a cup-to-disc ratio based on the tight framing of the cup and disc in the fundus image. For details, refer to the related description of obtaining the ratio of the optic cup and optic disc based on the tight frame of the optic cup and optic disc in the fundus image.

Although the present disclosure has been described in detail with reference to the drawings and examples, it should be understood that the above description does not limit the present disclosure in any form. Those skilled in the art can make modifications and changes to the present disclosure as needed without departing from the true spirit and scope of the present disclosure, and these modifications and changes all fall within the scope of the present disclosure.

Claims

A measurement method based on deep learning of a tight frame, characterized in that it is a measurement method that uses a network module trained based on a target-based tight frame to identify the target so as to achieve measurement, and the tight frame is the A minimum bounding rectangle of a target, the measuring method comprising: acquiring an input image comprising at least one target, the at least one target belonging to at least one category of interest; inputting the input image into the network module to obtain a first output and The second output, the first output includes the probability that each pixel in the input image belongs to each category, and the second output includes the position of each pixel in the input image and the tight frame of the target of each category Target offset, using the offset in the second output as the target offset, wherein the network module includes a backbone network, a segmentation network based on image segmentation based on weakly supervised learning, and a regression network based on frame regression, the The backbone network is used to extract the feature map of the input image, the segmentation network takes the feature map as input to obtain the first output, and the regression network takes the feature map as input to obtain the second output , wherein the feature map is consistent with the resolution of the input image; the target is identified based on the first output and the second output to obtain a tight frame of each category of target.
The measuring method according to claim 1, characterized in that:

The network module is trained by the following method:

Constructing a training sample, the input image data of the training sample includes a plurality of images to be trained, the plurality of images to be trained include images containing objects belonging to at least one category, and the label data of the training samples includes the objects to which the objects belong The gold standard of the category and the gold standard of the tight frame of the target; through the network module based on the input image data of the training sample, the prediction segmentation data output by the segmentation network corresponding to the training sample and the predicted segmentation data output by the segmentation network are obtained The prediction offset of the regression network output; determine the training loss of the network module based on the label data corresponding to the training sample, the prediction segmentation data and the prediction offset; and based on the training loss to the network Modules are trained to optimize the network modules.
The measuring method according to claim 2, characterized in that:

The determining the training loss of the network module based on the label data corresponding to the training sample, the predicted segmentation data, and the predicted offset includes: obtaining the predicted segmentation data and label data corresponding to the training sample. The segmentation loss of the segmentation network; based on the prediction offset corresponding to the training sample and the actual offset based on the label data, the regression loss of the regression network is obtained, wherein the actual offset is the image to be trained The position of the pixel point is offset from the gold standard of the tight frame of the target in the label data; and based on the segmentation loss and the regression loss, the training loss of the network module is obtained.
The measuring method according to any one of claims 1 to 3, characterized in that:

The object offset is an offset normalized based on an average width and an average height of objects of each category, or the object offset is an offset normalized based on an average size of objects of each category.
The measuring method according to claim 3, characterized in that:

Using multi-instance learning, a plurality of bags to be trained are obtained according to categories based on the gold standard of tight frames of objects in each image to be trained, and the segmentation loss is obtained based on a plurality of bags to be trained in each category, wherein the plurality of bags to be trained The training bag includes a plurality of positive bags and a plurality of negative bags, and all pixels on each of the straight lines in the multiple straight lines connecting the two opposite sides of the tight frame mark of the target are divided into a positive bag, so The multiple straight lines include at least one group of first parallel lines parallel to each other and second parallel lines perpendicular to each group of first parallel lines respectively, and the negative package is the gold standard of a tight frame for all targets of a category A single pixel point in the area outside the region, the segmentation loss includes a unary item and a paired item, the unary item describes the degree to which each bag to be trained belongs to the gold standard of each category, and the paired item describes the to-be-trained The degree to which a pixel of an image belongs to the same category as its adjacent pixels.
The measuring method according to claim 5, characterized in that:

The angle of the first parallel line is the angle between the extension line of the first parallel line and the extension line of any non-intersecting side of the gold standard of the tight frame of the target, and the first parallel line The angle of is greater than -90° and less than 90°.
The measuring method according to claim 2, characterized in that:

Select by category from the image to be trained the pixels that fall into the gold standard of the tight frame of at least one target as positive samples of each category and obtain the matching tight frame corresponding to the positive sample to be based on the matching tight frame Screen the positive samples of each category, and then use the screened positive samples of each category to optimize the regression network, wherein the matching tight frame is relative to the gold standard of the tight frame that the positive sample falls into. The true offset of the position of the positive samples is the gold standard for the smallest tight frame.
The measuring method according to claim 1, 3 or 7, characterized in that:

Let the position of the pixel point be expressed as (x, y), the tight frame mark of a target corresponding to the pixel point is represented as b=(xl, yt, xr, yb), and the tight frame mark b of the target is relative to the pixel point The offset of the position of is expressed as t=(tl, tt, tr, tb), then tl, tt, tr, tb satisfy the formula:

tl=(x-xl)/S c1 ,

tt=(y-yt)/S c2 ,

tr=(xr-x)/S c1 ,

tb=(yb-y)/S c2 ,

Among them, xl, yt represent the position of the upper left corner of the tight frame of the target, xr, yb represent the position of the lower right corner of the tight frame of the target, S c1 represents the average width of the target of the c-th category, and S c2 represents the c-th category The average height of the target.
The measuring method according to claim 2, characterized in that:

According to the category and using the expected intersection and union ratio corresponding to the pixel of the image to be trained to filter out the pixels with the expected intersection and union ratio greater than the preset expected intersection and union ratio from the pixels of the image to be trained, the regression network is selected. Optimizing, wherein, a plurality of borders of different sizes constructed with the pixel point of the image to be trained as the center point, obtaining the maximum value in the intersection ratio of the plurality of borders and the matching tight frame mark of the pixel point respectively, and As the expected intersection-over-union ratio, the matching tight frame is the gold standard of the tight frame in which the pixel of the image to be trained falls into and has the smallest deviation relative to the true position of the pixel.
The measuring method according to claim 9, characterized in that:

The expected intersection and union ratio satisfies the formula:

Among them, r 1 , r 2 are the relative positions of the pixels of the image to be trained in the matching tight frame, 0<r 1 , r 2 <1, IoU 1 (r 1 , r 2 )=4r 1 r 2 , IoU 2 (r 1 ,r 2 )=2r 1 /(2r 1 (1-2r 2 )+1), IoU 3 (r 1 ,r 2 )=2r 2 /(2r 2 (1-2r 1 )+ 1), IoU 4 (r 1 , r 2 )=1/(4(1-r 1 )(1-r 2 )).
The measuring method according to claim 1, characterized in that:

The tight box for identifying the target based on the first output and the second output to obtain targets of each category is:

Obtain the position of the pixel point with the highest local probability belonging to each category from the first output as the first position, and obtain each pixel based on the position corresponding to the first position in the second output and the target offset of the corresponding category. A tight box for the category's target.
The measuring method according to claim 1, characterized in that:

The sizes of multiple objects of the same class differ from each other by a factor of less than 10.
The measuring method according to claim 1, characterized in that:

The backbone network includes an encoding module configured to extract image features at different scales and a decoding module configured to map image features extracted at different scales back to the resolution of the input image to output the feature map.
The measuring method according to claim 1, characterized in that:

The input image is a fundus image, and the target is an optic cup and/or an optic disc.
The measurement method according to claim 14, characterized in that:

The tight box for identifying the target based on the first output and the second output to obtain targets of each category is:

Obtain the position of the pixel point with the highest probability belonging to each category from the first output as the first position, and acquire each category based on the position corresponding to the first position in the second output and the target offset of the corresponding category A tight frame for the target of .
The measurement method according to claim 14, characterized in that:

Based on the tight frame of the optic cup in the fundus image and/or the tight frame of the optic disc in the fundus image, the optic cup and/or optic disc are measured to obtain the size of the optic cup and/or optic disc, based on the Cup and Disc Sizes in Fundus Images Obtain the cup and disc ratio.
A measurement device based on deep learning of a tight frame, characterized in that it is a measurement device that uses a network module trained based on a target-based tight frame to identify the target so as to achieve measurement, and the tight frame is the The minimum circumscribed rectangle of the target, the measurement device includes an acquisition module, a network module and a recognition module; the acquisition module is configured to acquire an input image comprising at least one target, and the at least one target belongs to at least one category of interest; the The network module is configured to receive the input image and obtain a first output and a second output based on the input image, the first output includes the probability that each pixel in the input image belongs to each category, and the second output Including the offset of the position of each pixel in the input image and the tight frame of each category of targets, using the offset in the second output as the target offset, wherein the network module includes a backbone network, based on A segmentation network for image segmentation of weakly supervised learning, and a regression network based on border regression, the backbone network is used to extract the feature map of the input image, and the segmentation network uses the feature map as input to obtain the first output, the regression network takes as input the feature map to obtain the second output, wherein the feature map is consistent with the resolution of the input image; and the identification module is configured to be based on the first output and the second output to identify the objects to obtain a tight box for each category of objects.
The measuring device according to claim 17, characterized in that:

The input image is a fundus image, and the target is an optic cup and/or an optic disc.
The measuring device according to claim 18, characterized in that:

The tight box for identifying the target based on the first output and the second output to obtain targets of each category is:

Obtain the position of the pixel point with the highest probability belonging to each category from the first output as the first position, and acquire each category based on the position corresponding to the first position in the second output and the target offset of the corresponding category A tight frame for the target of .