CN113158822A - Eye detection data classification method and device based on cross-modal relationship reasoning - Google Patents

Eye detection data classification method and device based on cross-modal relationship reasoning

Info

Publication number
CN113158822A
CN113158822A (application number CN202110336212.7A)
Authority
CN
China
Prior art keywords
data
optic disc
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110336212.7A
Other languages
Chinese (zh)
Other versions
CN113158822B (en)
Inventor
乔宇
张秀兰
宋迪屏
李飞
熊健
何军军
付彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Zhongshan Ophthalmic Center
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Zhongshan Ophthalmic Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS, Zhongshan Ophthalmic Center filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202110336212.7A priority Critical patent/CN113158822B/en
Publication of CN113158822A publication Critical patent/CN113158822A/en
Priority to PCT/CN2021/117444 priority patent/WO2022205780A1/en
Application granted granted Critical
Publication of CN113158822B publication Critical patent/CN113158822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 - Eye characteristics, e.g. of the iris
    • G06V40/197 - Matching; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 - Eye characteristics, e.g. of the iris
    • G06V40/193 - Preprocessing; Feature extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Ophthalmology & Optometry (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application is applicable to the technical field of artificial intelligence, and provides a method and a device for classifying eye detection data based on cross-modal relationship reasoning. The method comprises the following steps: acquiring visual field (VF) data and optic disc data; inputting the VF data and the optic disc data into a trained convolutional neural network model to obtain a classification result corresponding to the VF data and the optic disc data. The processing of the VF data and the optic disc data by the convolutional neural network model comprises: extracting data features from the VF data and the optic disc data respectively to obtain VF data features and optic disc data features; jointly processing the VF data features and the optic disc data features to obtain enhanced features of the VF data and enhanced features of the optic disc data; fusing the enhanced features of the VF data and the enhanced features of the optic disc data to obtain fused features; and classifying the fused features to obtain the classification result. By this method, a more accurate classification result can be obtained.

Description

Eye detection data classification method and device based on cross-modal relationship reasoning
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for classifying eye detection data based on cross-modal relationship reasoning, a terminal device, and a computer-readable storage medium.
Background
Glaucoma is the leading cause of irreversible blindness worldwide; it is a heterogeneous disease that damages the optic nerve and results in vision loss. Since glaucoma is characterized by irreversible and progressive loss of vision, early detection and timely treatment are critical to preventing visual field loss and blindness.
Clinically, a medical professional performs a structural assessment of the patient's eye, such as an Optical Coherence Tomography (OCT) examination. OCT is a non-contact, non-invasive imaging modality that provides an objective and quantitative assessment of various retinal structures. Medical staff review and analyze the images produced by the OCT examination (OCT image data for short) to obtain a diagnosis. Because reviewing and analyzing OCT images is time-consuming, existing methods often use Convolutional Neural Networks (CNNs) to analyze the OCT image data and output a classification result indicating whether the eye shows glaucoma, and medical staff then combine this classification result with other data to reach a diagnosis. However, because the clinical appearance of glaucoma is complex, the accuracy of the classification results output by such CNNs is low and their assistance to medical staff is limited; a new way of determining classification results is therefore needed.
Disclosure of Invention
The embodiment of the application provides a method for classifying eye detection data based on cross-modal relationship reasoning, which can solve the problem that the accuracy of classification results output by the conventional CNNs is low.
In a first aspect, an embodiment of the present application provides a method for classifying eye detection data based on cross-modal relationship inference, including:
acquiring visual field (VF) data and optic disc data;
inputting the VF data and the optic disc data into a trained convolutional neural network model to obtain a classification result corresponding to the VF data and the optic disc data, wherein the processing of the VF data and the optic disc data by the convolutional neural network model comprises: extracting data features from the VF data and the optic disc data respectively to obtain VF data features and optic disc data features; jointly processing the VF data features and the optic disc data features to obtain enhanced features of the VF data and enhanced features of the optic disc data; fusing the enhanced features of the VF data and the enhanced features of the optic disc data to obtain fused features; and classifying the fused features to obtain the classification result.
In a second aspect, an embodiment of the present application provides an apparatus for classifying eye detection data based on cross-modal relationship inference, including:
a data acquisition unit, configured to acquire visual field (VF) data and optic disc data;
a classification result output unit, configured to input the VF data and the optic disc data into a trained convolutional neural network model to obtain a classification result corresponding to the VF data and the optic disc data, wherein the processing of the VF data and the optic disc data by the convolutional neural network model comprises: extracting data features from the VF data and the optic disc data respectively to obtain VF data features and optic disc data features; jointly processing the VF data features and the optic disc data features to obtain enhanced features of the VF data and enhanced features of the optic disc data; fusing the enhanced features of the VF data and the enhanced features of the optic disc data to obtain fused features; and classifying the fused features to obtain the classification result.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method according to the first aspect.
In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the method of the first aspect.
Compared with the prior art, the embodiment of the application has the advantages that:
In the embodiments of the application, the output classification result is related to the enhanced features of the optic disc data and the enhanced features of the VF data, and these enhanced features are obtained by jointly processing the VF data features and the optic disc data features. In other words, the output classification result depends not only on the extracted VF data features and optic disc data features, but also on the relationship between them, and the VF data features and optic disc data features each reflect, to a certain extent, whether the tested eye has glaucoma. Therefore, the accuracy of the classification result output by the embodiments of the present application is higher than that of a convolutional neural network model trained on single-modality data, for example, higher than that of a model trained only on OCT image data.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below.
Fig. 1 is a flowchart of a method for classifying eye detection data based on cross-modal relationship inference according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a convolutional neural network model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of determining a first global relationship vector according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a visual field area corresponding to a position of a retinal nerve fiber layer according to an embodiment of the present disclosure;
fig. 5 is a schematic view of a calculation flow of a local relationship vector of 2 feature region pairs according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating a calculation process of a VF enhancement feature according to an embodiment of the present application;
FIG. 7 is a schematic diagram of PDPs according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of OCT image data provided by an embodiment of the present application;
fig. 9 is a schematic structural diagram of an apparatus for classifying eye detection data based on cross-modal relationship inference according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise.
Embodiment one:
Existing CNNs for this task are trained on OCT image data; because they are trained on single-modality data, the accuracy of their classification results is low and their assistance to medical staff is limited. To improve the accuracy of classification results, the embodiments of the present application provide a new method for classifying eye detection data based on cross-modal relationship reasoning. In this method, a convolutional neural network model is trained with Visual Field (VF) data and optic disc data, so that the trained model can extract data features from the input VF data and optic disc data respectively to obtain the corresponding VF data features and optic disc data features, determine the corresponding enhanced features based on the extracted data features, perform feature fusion on the enhanced features of the VF data and the enhanced features of the optic disc data, and finally classify the fused features to obtain the classification result. The output classification result is related to the enhanced features of the optic disc data and of the VF data, which are obtained by jointly processing the VF data features and the optic disc data features. That is, the output classification result depends not only on the extracted VF data features and optic disc data features but also on the relationship between them, and these features reflect, to a certain extent, whether the tested eye has glaucoma.
The method for classifying eye detection data based on cross-modal relationship inference provided in the embodiments of the present application is described below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a method for classifying eye detection data based on cross-modal relationship inference, which is provided in an embodiment of the present application, and is detailed as follows:
In step S11, visual field (VF) data and optic disc data are acquired.
Specifically, if VF testing is performed on the user's eye, corresponding VF data are obtained; if optic disc examination is performed on the user's eye, corresponding optic disc data are obtained. The optic disc is the optic nerve head, also called the optic papilla. It should be noted that the VF data and the optic disc data correspond to the same tested eye: the data obtained after VF testing of that eye are the VF data, and the data obtained after optic disc examination of the same eye are the optic disc data; the VF data and the optic disc data respectively reflect the condition of the same tested eye in terms of visual field and optic disc.
In this embodiment, the VF data and the optic disc data may be obtained locally or from the cloud. The VF data and the optic disc data may be obtained by processing raw data; for example, if the raw data is a visual field test report, the report is processed to obtain the VF data.
Step S12, inputting the VF data and the optic disc data into the trained convolutional neural network model to obtain a classification result corresponding to the VF data and the optic disc data, wherein the processing of the VF data and the optic disc data by the convolutional neural network model includes: extracting data features from the VF data and the optic disc data respectively to obtain VF data features and optic disc data features; jointly processing the VF data features and the optic disc data features to obtain enhanced features of the VF data and enhanced features of the optic disc data; fusing the enhanced features of the VF data and the enhanced features of the optic disc data to obtain fused features; and classifying the fused features to obtain the classification result.
Referring to Fig. 2, which shows a schematic structural diagram of the convolutional neural network model, the model includes a VF data feature extraction module 21, an optic disc data feature extraction module 22, a multi-modal enhanced feature processing module 23, a feature fusion module 24, and a classifier 25. The VF data feature extraction module 21 is configured to extract data features from the VF data; the optic disc data feature extraction module 22 is configured to extract data features from the optic disc data; the multi-modal enhanced feature processing module 23 is configured to jointly process the VF data features and the optic disc data features to obtain the enhanced features of the VF data and the enhanced features of the optic disc data; the feature fusion module 24 is configured to fuse the enhanced features of the VF data and the enhanced features of the optic disc data to obtain the fused features; and the classifier 25 is configured to classify the fused features to obtain the classification result.
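As a purely illustrative sketch (not the patented implementation), the module layout described above can be organized as follows; the class name, channel count, use of a 1x1 convolution for fusion, and the choice of PyTorch are assumptions introduced here for clarity, and the backbones and relation module are passed in as opaque components.

# Illustrative sketch only: module layout assumed from Fig. 2 (modules 21-25).
import torch
import torch.nn as nn

class CrossModalGlaucomaNet(nn.Module):
    def __init__(self, vf_backbone, disc_backbone, relation_module, channels=256, num_classes=2):
        super().__init__()
        self.vf_backbone = vf_backbone          # module 21: VF data feature extraction
        self.disc_backbone = disc_backbone      # module 22: optic disc (OCT) feature extraction
        self.relation_module = relation_module  # module 23: multi-modal enhanced features
        self.fusion = nn.Conv2d(2 * channels, channels, kernel_size=1)  # module 24 (assumed 1x1 conv)
        self.classifier = nn.Sequential(        # module 25
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, num_classes))

    def forward(self, vf, disc):
        v = self.vf_backbone(vf)                 # VF data features V
        o = self.disc_backbone(disc)             # optic disc data features O
        z_vo, z_ov = self.relation_module(v, o)  # enhanced VF / optic disc features
        fused = self.fusion(torch.cat([z_vo, z_ov], dim=1))
        return self.classifier(fused)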
In the embodiments of the application, the output classification result is related to the enhanced features of the optic disc data and the enhanced features of the VF data, and these enhanced features are obtained by jointly processing the VF data features and the optic disc data features. In other words, the output classification result depends not only on the extracted VF data features and optic disc data features, but also on the relationship between them, and the VF data features and optic disc data features each reflect, to a certain extent, whether the tested eye has glaucoma. Therefore, the accuracy of the classification result output by the embodiments of the present application is higher than that of a convolutional neural network model trained on single-modality data, for example, higher than that of a model trained only on OCT image data.
In some embodiments, jointly processing the VF data features and the optic disc data features to obtain the enhanced features of the VF data includes:
A1, determining the VF data features converted from the optic disc data features according to the VF data features and the optic disc data features;
A2, obtaining a VF enhanced feature according to the VF data features and the VF data features converted from the optic disc data features, wherein the VF enhanced feature is the enhanced feature of the VF data.
Jointly processing the VF data features and the optic disc data features to obtain the enhanced features of the optic disc data includes:
B1, determining the optic disc data features converted from the VF data features according to the VF data features and the optic disc data features;
B2, obtaining an optic disc enhanced feature according to the optic disc data features and the optic disc data features converted from the VF data features, wherein the optic disc enhanced feature is the enhanced feature of the optic disc data.
In A1, A2, B1 and B2 above, the VF data features converted from the optic disc data features are data features expressed in terms of the optic disc data features and used to compensate the originally acquired VF data features. Similarly, the optic disc data features converted from the VF data features are used to compensate the originally acquired optic disc data features. When glaucoma occurs, the structure of the optic nerve head (ONH) changes and the eye also shows corresponding visual function defects, so there is a strong correlation between the structural damage and the visual function damage caused by glaucoma; that is, thinning of the retinal nerve fiber layer (RNFL) and VF defects are spatially consistent, meaning that the optic disc data features and the VF data features are strongly correlated. In addition, for early-stage glaucoma there is a time lag when VF testing is used for detection, whereas RNFL thickness is a sensitive indicator of early glaucomatous change (for example, in pre-perimetric glaucoma a significant decrease in RNFL thickness can be observed before visual field damage occurs); for middle- and late-stage glaucoma, on the other hand, VF testing is an effective means of monitoring progression. It follows that the optic disc data features and the VF data features are complementary, and comprehensively evaluating the relationship between structural damage and functional damage (for example, by determining the VF data features converted from the optic disc data features) helps the convolutional neural network model to understand glaucoma and thus output a more accurate classification result, where the classification result is glaucoma or non-glaucoma.
In some embodiments, the VF data features converted from the optic disc data features are determined as follows:
C1, determining, according to the VF data features and the optic disc data features, a first global relationship vector for converting the optic disc data features into the VF data features, and determining a first local relationship vector for converting the optic disc data features into the VF data features.
Specifically, for each VF data feature, the relationship between that VF data feature and each optic disc data feature is determined, and the first global relationship vector is then determined from the determined relationships. For example, assuming the VF data features are a1 and a2 and the optic disc data features are b1 and b2: for a1, the relationship between a1 and b1 and the relationship between a1 and b2 are determined; for a2, the relationship between a2 and b1 and the relationship between a2 and b2 are determined; finally, the corresponding first global relationship vector is determined from the relationship between a1 and b1, the relationship between a1 and b2, the relationship between a2 and b1, and the relationship between a2 and b2.
C2, determining the VF data features converted from the optic disc data features according to the first global relationship vector and the first local relationship vector.
In C1 and C2, because the VF data features converted from the optic disc data features are determined from both the first global relationship vector and the first local relationship vector, the converted features carry both the global relationship (for example, the relationship between every VF data feature and every optic disc data feature) and the local relationship, which ensures the accuracy of the VF data features converted from the optic disc data features.
It should be noted that, taking the global relationship vector for converting the VF data features into the optic disc data features as the second global relationship vector, its determination is similar to that of the first global relationship vector and is not repeated here.
In some embodiments, determining, in C1, the first global relationship vector for converting the optic disc data features into the VF data features according to the VF data features and the optic disc data features includes:
C11, splicing the optic disc data features and the VF data features to obtain a spliced vector.
C12, performing a mapping operation on the spliced vector to obtain a mapping value of the spliced vector.
The mapping value is a scalar.
C13, determining, according to the mapping values, the first global relationship vector for converting the optic disc data features into the VF data features.
Specifically, after the mapping values corresponding to the spliced vectors of the VF data features and the optic disc data features are obtained, the mapping values are normalized to obtain the first global relationship vector. Referring to Fig. 3, V denotes the feature map corresponding to the VF data features, with C channels and size H × W, and O denotes the feature map corresponding to the optic disc data features, also with C channels and size H × W. V and O are used as inputs to a global relationship reasoning module in the multi-modal enhanced feature processing module 23, which is configured to determine the first global relationship vector and the second global relationship vector. In some embodiments, if the feature dimensions of V and O are not consistent, V and O are processed to make their dimensions consistent, for example by applying a convolutional layer, a Reshape operation and a Repeat operation (functions that rearrange and duplicate matrix elements) to V and O respectively, so that both become C × HW × HW. The processed VF data features and optic disc data features are then spliced to obtain an OCT-VF fused feature map with dimension 2C × HW × HW; the fused feature map contains HW × HW feature points, each feature point is a group of OCT-VF feature pairs, and each feature pair has 2C channels. Each optic disc data feature forms point-to-point feature pairs with all the VF data features, and likewise each VF data feature forms point-to-point feature pairs with all the optic disc data features. A global pairwise relation function (given as an image formula in the original document) maps each group of OCT-VF feature pairs to a scalar that represents the relationship within that pair. Finally, Softmax normalization is performed along the rows and along the columns respectively to obtain two global relationship vectors: in Fig. 3, a row corresponds to one VF data feature paired with all the optic disc data features, and a column corresponds to one optic disc data feature paired with all the VF data features; the first global relationship vector obtained by row-wise normalization is used to convert the optic disc data features into VF data features, and the second global relationship vector obtained by column-wise normalization is used to convert the VF data features into optic disc data features. That is, αg in Fig. 3 denotes the first global relationship vector and βg denotes the second global relationship vector.
Note that the dimensions of V and O in Fig. 3 are consistent, but in practice inconsistent dimensions may also be used; this is not limited here.
In some embodiments, to improve the processing efficiency of the convolutional neural network model, each mapping value is divided by M before it is normalized, where M is greater than 1. M may also be a value related to the feature dimension; for example, the first global relationship vector and the second global relationship vector may be determined by the corresponding formulas, which appear as image formulas in the original document. In those formulas, do and dv denote the dimensions of the optic disc data features and of the VF data features respectively (in the example of Fig. 3 they are determined by H and W), a global pairwise relation function maps each feature pair to a scalar, αg is the first global relationship vector for converting the optic disc data features into the VF data features, βg is the second global relationship vector for converting the VF data features into the optic disc data features, Wg, Wv and Wo are learnable convolutional layer weights, vi denotes the i-th VF data feature, and oj denotes the j-th optic disc data feature.
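Since the exact formulas appear only as images in the original document, the following is a hedged sketch of how a global pairwise relation of this kind is commonly computed (a scaled pairwise score over all VF/optic-disc feature positions followed by row-wise and column-wise Softmax); the tensor shapes and the use of a scaled dot product as the pairwise relation function are assumptions, not the patent's exact formulas.

# Hedged sketch: global relationship vectors between VF and optic disc feature maps.
# The pairwise relation function here (scaled dot product) is an assumption;
# the patent's own formulas are given only as images.
import torch
import torch.nn.functional as F

def global_relations(v, o, w_v, w_o):
    """v, o: feature maps of shape (C, H, W); w_v, w_o: 1x1 conv weights of shape (C, C)."""
    c, h, w = v.shape
    v_flat = w_v @ v.reshape(c, h * w)             # (C, HW) projected VF features
    o_flat = w_o @ o.reshape(c, h * w)             # (C, HW) projected optic disc features
    scores = v_flat.t() @ o_flat / (h * w) ** 0.5  # (HW, HW) pairwise relation scalars
    alpha_g = F.softmax(scores, dim=1)             # row-wise: optic disc -> VF conversion
    beta_g = F.softmax(scores, dim=0)              # column-wise: VF -> optic disc conversion
    return alpha_g, beta_g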
Medical research shows that there is a positional correspondence between visual field regions and the retinal nerve fiber layer. Fig. 4 shows the partition map proposed by Garway-Heath et al., which divides the VF data features and the optic disc data features into 6 regions each: the leftmost panel of Fig. 4 is the visual field pattern deviation probability map, the middle panel is the partition map of the optic disc, and the rightmost panel is the OCT scan around the whole circumference of the optic disc. Regions with the same number have a structure-function correspondence; for example, the region numbered "1" in the pattern deviation probability map corresponds to the region labelled "121°-230°" in the optic disc partition map. That is, if glaucomatous visual field damage is observed in region 1 of the VF, thinning of the RNFL will be observed in region 1 of the OCT scan. Accordingly, in some embodiments, determining, in C1, the first local relationship vector for converting the optic disc data features into the VF data features according to the VF data features and the optic disc data features includes:
C11', dividing the VF data features into 6 VF regions and dividing the optic disc data features into 6 optic disc regions.
C12', determining 6 feature region pairs from the 6 VF regions and the 6 optic disc regions, wherein each feature region pair corresponds one-to-one to a VF region and an optic disc region.
C13', for the VF region and the optic disc region of each feature region pair, determining the relationship vector between the VF data features in the VF region and the optic disc data features in the optic disc region, to obtain the first local relationship vector for converting the optic disc data features into the VF data features.
In C11'-C13' above, a medical prior (the OCT-VF partition mapping relationship) is introduced into the design of the neural network, so that a more accurate relationship map between OCT and VF can be learned. In the embodiments of the present application, the first local relationship vector is determined by a guided region relationship module in the multi-modal enhanced feature processing module 23, whose processing flow is shown in Fig. 5. The processing is similar to that of the global relationship reasoning module, except that the computation of the local relationship vectors is restricted to the OCT-VF feature sets that have a partition mapping relationship. Figs. 5(a) and 5(b) show the computation flows of the local relationship vectors for the 1st and 2nd feature region pairs respectively. Taking the partition relationship map of the 1st feature region pair as an example, only the VF data features of the 1st feature region pair and the optic disc data features of the 1st feature region pair are selected for computation, and the pairwise relation function of the 1st feature region pair (given as an image formula in the original document) is applied; αr(1) is the first local relationship vector in the 1st feature region pair for converting the optic disc data features into the VF data features, and βr(1) is the second local relationship vector in the 1st feature region pair for converting the VF data features into the optic disc data features.
Each first local relationship vector for converting the optic disc data features into the VF data features can be determined according to the corresponding formulas, which appear as image formulas in the original document; in those formulas, Wr, Wv and Wo are learnable convolutional layer weights, Cr is a normalization parameter, k ∈ {1, 2, 3, 4, 5, 6} is the partition number, and the remaining symbols denote the VF data features of the k-th partition and the optic disc data features of the k-th partition.
It should be noted that the computation of each second local relationship vector for converting the VF data features into the optic disc data features is similar to that of the first local relationship vectors for converting the optic disc data features into the VF data features described above, and is not repeated here.
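Again as an illustrative sketch under the same caveat (the patent's local-relation formulas are image-only), the region-guided relation can be understood as the same pairwise scoring restricted to the six Garway-Heath partitions; the partition masks and the scoring function below are assumptions.

# Hedged sketch: local relationship vectors restricted to the 6 OCT-VF partitions.
# Partition masks and the scoring function are assumptions for illustration.
import torch
import torch.nn.functional as F

def local_relations(v_flat, o_flat, vf_masks, disc_masks):
    """v_flat, o_flat: projected features of shape (C, HW);
    vf_masks, disc_masks: lists of 6 boolean index tensors over the HW positions."""
    alphas, betas = [], []
    for k in range(6):                                 # one relation map per feature region pair
        v_k = v_flat[:, vf_masks[k]]                   # VF features of the k-th partition
        o_k = o_flat[:, disc_masks[k]]                 # optic disc features of the k-th partition
        scores = v_k.t() @ o_k / v_k.shape[0] ** 0.5   # pairwise relation scalars
        alphas.append(F.softmax(scores, dim=1))        # optic disc -> VF within partition k
        betas.append(F.softmax(scores, dim=0))         # VF -> optic disc within partition k
    return alphas, betas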
Fig. 6 is a schematic diagram of the computation process of a VF enhanced feature. In Fig. 6, "Matmul" denotes matrix multiplication and "Element-wise sum" denotes element-wise addition; αg is the first global relationship vector (i.e. the global relationship map for converting the optic disc data features), αr(1) is the first local relationship vector in the 1st feature region pair (i.e. the 1st partition relationship map for converting the optic disc data features into VF features), αr(6) is the 6th partition relationship map for converting the optic disc data features, and V and O are the VF data features and the optic disc data features respectively. The arrows pointing to "Matmul" denote a Conv (convolution) + Reshape operation, the three arrows leading from "Matmul" denote Reshape operations, and the arrow leading from "Fusion" denotes a Conv operation. O is first passed through a convolutional layer and a Reshape operation and is then matrix-multiplied with αg, αr(1), ..., αr(6) respectively, according to formula (7) and the accompanying formulas, which appear as image formulas in the original document. There, σ ∈ {g, r(k)}, one symbol denotes the VF data feature converted from the optic disc data feature under the σ-th relationship map, and another denotes the optic disc data feature converted from the VF data feature under the σ-th relationship map; the converted features corresponding to the first local relationship vectors of the 6 feature region pairs are added to obtain the partition-enhanced feature. That is, after the first global relationship vector and the first local relationship vector have been determined, step A2 can determine the VF data features converted from the optic disc data features according to formula (7) above.
The partition-enhanced feature is then fused with the global-enhanced feature (the fusion formulas appear as image formulas in the original document), and the fused result is added to the original features:
ZVO = V + Vo→v ....... (11);
ZOV = O + Ov→o ....... (12);
wherein Vo→v denotes the VF data features converted from the optic disc data features, Fusion denotes the fusion function, Ov→o denotes the optic disc data features converted from the VF data features, ZVO denotes the VF enhanced feature obtained from the VF data features and the VF data features converted from the optic disc data features, and ZOV denotes the optic disc enhanced feature obtained from the optic disc data features and the optic disc data features converted from the VF data features.
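As a hedged sketch of the enhancement step described above, the converted feature can be obtained by multiplying the optic disc features with the relationship maps, combining the global and partition results, and adding the outcome back to the original features; the additive combination used below is only one of the fusion modes listed next and is an assumption about the choice.

# Hedged sketch: building the VF enhanced feature Z_VO from relationship maps.
# Additive fusion of the global- and partition-enhanced features is assumed here.
import torch

def vf_enhanced_feature(v_flat, o_flat, alpha_g, alpha_r):
    """v_flat, o_flat: (C, HW) feature maps; alpha_g: (HW, HW) global relationship map;
    alpha_r: list of 6 (HW, HW) partition relationship maps (zero outside each partition)."""
    v_from_o_global = o_flat @ alpha_g.t()                   # global conversion: optic disc -> VF
    v_from_o_region = sum(o_flat @ a.t() for a in alpha_r)   # partition-enhanced conversion
    v_from_o = v_from_o_global + v_from_o_region             # Fusion(): additive fusion assumed
    z_vo = v_flat + v_from_o                                 # Z_VO = V + V_{o->v}  (formula (11))
    return z_vo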
The fusion mode includes at least one of the following: addition (averaging or weighted averaging), taking the element-wise maximum, concatenation, and the like.
In the addition mode, the partition-enhanced feature and the global-enhanced feature are added element by element to obtain the fused vector.
In the element-wise maximum mode, the partition-enhanced feature and the global-enhanced feature are compared element by element, and the larger value is taken as the output value.
In the concatenation mode, the partition-enhanced feature and the global-enhanced feature are connected together; for example, if the two vectors to be concatenated are both 1 × 30, a 1 × 60 fused feature is obtained.
It should be noted that when the partition-enhanced feature and the global-enhanced feature are fused in any of the above modes, weights corresponding to the partition-enhanced feature and the global-enhanced feature may also be set; setting different weights increases the flexibility of the converted data features (the VF data features or the optic disc data features).
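The fusion modes and optional weights just described could be sketched as follows; the function name and the default weights are illustrative assumptions.

# Hedged sketch of the fusion modes: addition, element-wise maximum, concatenation.
import torch

def fuse(partition_feat, global_feat, mode="add", w_p=0.5, w_g=0.5):
    if mode == "add":          # (weighted) element-wise addition / averaging
        return w_p * partition_feat + w_g * global_feat
    if mode == "max":          # element-wise maximum
        return torch.maximum(partition_feat, global_feat)
    if mode == "concat":       # concatenation, e.g. 1x30 + 1x30 -> 1x60
        return torch.cat([partition_feat, global_feat], dim=-1)
    raise ValueError(f"unknown fusion mode: {mode}")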
In some embodiments, the VF data are pattern deviation probability map (PDPs) data, i.e. the acquired VF data are PDPs data, and the optic disc data are OCT image data obtained by performing a circular scan around the optic disc with Optical Coherence Tomography (OCT).
Acquiring the PDPs data includes: extracting the PDPs data from a visual field test report.
In this embodiment, the PDPs data are extracted from the PDF file or the TIF image corresponding to the visual field test report. Specifically, whether the visual field test report meets the requirements is judged according to the reliability indices in the report, and if so, the corresponding PDPs data are extracted from the qualifying report. Because the PDPs data retain the visual field partition (location) information, they provide more detailed and comprehensive visual field function information; using PDPs data as the VF data therefore improves the accuracy of the obtained classification results.
In some embodiments, extracting the PDPs data from the visual field test report includes:
D1, dividing the designated position in the visual field test report into N blocks and determining the gray value of each block, where N is greater than or equal to 9.
The designated position is the region of the visual field test report in which the PDPs are displayed, and N is determined according to the number of content items contained in that region, where the number of content items includes the number of icon marks, i.e. the number of test points in the visual field test; N is usually 10.
D2, determining the icon identifier corresponding to each block according to the gray value of the block and a preset mapping table, to obtain the PDPs data, wherein the preset mapping table stores the correspondence between block gray values and icon identifiers, and each icon identifier uniquely marks one icon in the PDPs.
Fig. 7 shows a schematic diagram of PDPs. In Fig. 7, 4 abnormality probability icons are also displayed beside the pattern deviation probability map; the darker the icon, the smaller the corresponding probability value P, i.e. the lower the probability that the visual field at that point is normal. Referring to Fig. 7, 0 denotes a blank grid and 1-5 denote the 5 probability icons (4 abnormal + 1 normal). For example, the icon identifier "5" indicates P < 2% (less than 2% of normal people would have such low visual field sensitivity, i.e. the visual field at that point is abnormal with 98% probability), the icon identifier "4" indicates P < 1% (abnormal with 99% probability), the icon identifier "3" indicates P < 0.5% (abnormal with 99.5% probability), and so on.
It should be noted that, in practice, other information may also be used as the icon identifier, which is not limited here.
In D1 and D2 above, the gray value of each block divided from the designated position is compared with the gray values stored in the preset mapping table to find the matching gray value, the icon identifier corresponding to that gray value is determined, and the icon identifiers corresponding to the gray values of all the blocks form the PDPs data, which are two-dimensional discrete data and can also be regarded as a gray map.
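A minimal sketch of the D1/D2 extraction described above might look like this; the 10 × 10 block grid, the gray values in the mapping table, and the helper names are illustrative assumptions rather than values taken from the patent.

# Hedged sketch of D1/D2: block gray values -> icon identifiers -> PDPs matrix.
# The 10x10 grid and the gray values in GRAY_TO_ICON are illustrative assumptions.
import numpy as np

GRAY_TO_ICON = {255: 0, 230: 1, 180: 2, 130: 3, 80: 4, 30: 5}  # assumed preset mapping table

def extract_pdps(report_region: np.ndarray, n_blocks: int = 10) -> np.ndarray:
    """report_region: grayscale crop of the PDP area of the visual field test report."""
    h, w = report_region.shape
    bh, bw = h // n_blocks, w // n_blocks
    pdps = np.zeros((n_blocks, n_blocks), dtype=np.int64)
    for i in range(n_blocks):
        for j in range(n_blocks):
            block = report_region[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            gray = int(block.mean())                            # representative gray value of the block
            # pick the table entry whose gray value is closest to the block's value
            nearest = min(GRAY_TO_ICON, key=lambda g: abs(g - gray))
            pdps[i, j] = GRAY_TO_ICON[nearest]
    return pdps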
In some embodiments, after the PDPs data and the OCT image data are acquired, the method includes:
E1, performing first preprocessing on the PDPs data, the first preprocessing including normalization.
The normalization maps each icon identifier into the interval 0-1; for example, the 6 icon identifiers 0-5 may be normalized to the 6 values 0, 0.075, 0.693, 0.8825, 0.9107 and 0.9924 in the interval 0-1. Of course, these values are only an example, and in practice other values in the interval 0-1 may be used, which is not limited here. Because the PDPs data undergo the first preprocessing, including normalization, it is simpler for the VF data feature extraction module to extract data features from the preprocessed PDPs data. In some embodiments, the VF data feature extraction module includes at least two convolutional layers, and different convolutional layers usually have different parameters, including the number of channels, kernel size, stride, padding, dilated (hole) convolution, and so on. The more layers the VF data feature extraction module has, the more parameters it learns, and the more semantic and abstract the data features it extracts become.
E2, performing second preprocessing on the OCT image data, the second preprocessing including normalization and scale scaling.
The normalization included in the second preprocessing refers to normalizing the image pixel values of the OCT image data, specifically: (1) counting the mean and variance of the OCT image data in the training data set in advance; (2) subtracting the counted mean from the image pixel values of the OCT image data in the eye detection data to be processed and dividing the result by the counted variance.
The scale scaling included in the second preprocessing scales the OCT image data to a specified size. It should be noted that the OCT image data samples used to train the optic disc data feature extraction module have this specified size, and those samples are also normalized. In some embodiments, to improve the generalization performance of the optic disc data feature extraction module, the OCT image data samples used to train it are acquired with different optical coherence tomography instruments. OCT image data are shown in Fig. 8, from which the user can observe the thickness of the retinal nerve fiber layer (RNFL). Because the OCT image data undergo the second preprocessing, including normalization and scale scaling, before their data features are extracted, the optic disc data feature extraction module does not need to handle a large data range or varying sizes and can therefore quickly extract the corresponding data features from the preprocessed OCT image data.
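A hedged sketch of the second preprocessing is given below; the use of precomputed training-set statistics follows the description above, while the specific statistic values, the 224 × 224 target size, and the variable names are assumptions.

# Hedged sketch of E2: normalize OCT pixel values with training-set statistics, then resize.
# DATASET_MEAN/DATASET_VAR and the 224x224 target size are illustrative assumptions.
import numpy as np
from PIL import Image

DATASET_MEAN = 0.31   # assumed: mean of training-set OCT pixel values (0-1 range)
DATASET_VAR = 0.04    # assumed: variance of training-set OCT pixel values

def preprocess_oct(image: Image.Image, size=(224, 224)) -> np.ndarray:
    resized = image.convert("L").resize(size)              # scale to the specified size
    pixels = np.asarray(resized, dtype=np.float32) / 255.0
    return (pixels - DATASET_MEAN) / DATASET_VAR           # subtract mean, divide by variance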
In some embodiments, the optic disc data feature extraction module includes at least two convolutional layers, and these convolutional layers use batch normalization and instance normalization when processing the data extracted from the second-preprocessed OCT image data, to obtain the corresponding data features.
The optic disc data feature extraction module includes at least two convolutional layers, and different convolutional layers usually have different parameters, including the number of channels, kernel size, stride, padding, dilated (hole) convolution, and so on. The more layers the optic disc data feature extraction module has, the more parameters it learns, and the more semantic and abstract the data features it extracts become.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Embodiment two:
Corresponding to the method for classifying eye detection data based on cross-modal relationship reasoning described in the foregoing embodiment, Fig. 9 shows a structural block diagram of the device for classifying eye detection data based on cross-modal relationship reasoning provided by the embodiments of the present application; for convenience of description, only the parts relevant to the embodiments of the present application are shown.
Referring to Fig. 9, the device 9 for classifying eye detection data based on cross-modal relationship reasoning includes a data acquisition unit 91 and a classification result output unit 92, wherein:
A data acquisition unit 91, configured to acquire visual field (VF) data and optic disc data.
A classification result output unit 92, configured to input the VF data and the optic disc data into the trained convolutional neural network model to obtain a classification result corresponding to the VF data and the optic disc data, wherein the processing of the VF data and the optic disc data by the convolutional neural network model comprises: extracting data features from the VF data and the optic disc data respectively to obtain VF data features and optic disc data features; jointly processing the VF data features and the optic disc data features to obtain enhanced features of the VF data and enhanced features of the optic disc data; fusing the enhanced features of the VF data and the enhanced features of the optic disc data to obtain fused features; and classifying the fused features to obtain the classification result.
In the embodiments of the application, the output classification result is related to the enhanced features of the optic disc data and the enhanced features of the VF data, and these enhanced features are obtained by jointly processing the VF data features and the optic disc data features. In other words, the output classification result depends not only on the extracted VF data features and optic disc data features, but also on the relationship between them, and the VF data features and optic disc data features each reflect, to a certain extent, whether the tested eye has glaucoma. Therefore, the accuracy of the classification result output by the embodiments of the present application is higher than that of a convolutional neural network model trained on single-modality data, for example, higher than that of a model trained only on OCT image data.
In some embodiments, jointly processing the VF data features and the optic disc data features to obtain the enhanced features of the VF data includes:
determining the VF data features converted from the optic disc data features according to the VF data features and the optic disc data features; and obtaining a VF enhanced feature according to the VF data features and the VF data features converted from the optic disc data features, wherein the VF enhanced feature is the enhanced feature of the VF data.
Jointly processing the VF data features and the optic disc data features to obtain the enhanced features of the optic disc data includes:
determining the optic disc data features converted from the VF data features according to the VF data features and the optic disc data features; and obtaining an optic disc enhanced feature according to the optic disc data features and the optic disc data features converted from the VF data features, wherein the optic disc enhanced feature is the enhanced feature of the optic disc data.
In some embodiments, the VF data features converted from the optic disc data features are determined as follows:
determining, according to the VF data features and the optic disc data features, a first global relationship vector for converting the optic disc data features into the VF data features, and determining a first local relationship vector for converting the optic disc data features into the VF data features; and determining the VF data features converted from the optic disc data features according to the first global relationship vector and the first local relationship vector.
In some embodiments, determining the first global relationship vector for converting the optic disc data features into the VF data features according to the optic disc data features and the VF data features includes:
splicing the optic disc data features and the VF data features to obtain a spliced vector; performing a mapping operation on the spliced vector to obtain a mapping value of the spliced vector; and determining, according to the mapping values, the first global relationship vector for converting the optic disc data features into the VF data features.
In some embodiments, determining the first local relationship vector for converting the optic disc data features into the VF data features according to the optic disc data features and the VF data features includes:
dividing the VF data features into 6 VF regions and dividing the optic disc data features into 6 optic disc regions; determining 6 feature region pairs from the 6 VF regions and the 6 optic disc regions, wherein each feature region pair corresponds one-to-one to a VF region and an optic disc region; and, for the VF region and the optic disc region of each feature region pair, determining the relationship vector between the VF data features in the VF region and the optic disc data features in the optic disc region, to obtain the first local relationship vector for converting the optic disc data features into the VF data features.
In some embodiments, splicing the optic disc data features and the VF data features to obtain the spliced vector includes:
if the dimension of the VF data features is not consistent with the dimension of the optic disc data features, adjusting the dimension of the VF data features and/or the dimension of the optic disc data features to obtain VF data features and optic disc data features with consistent dimensions; and splicing the VF data features and the optic disc data features with consistent dimensions to obtain the spliced vector.
In some embodiments, the VF data are pattern deviation probability map data, and the optic disc data are OCT image data obtained by performing a circular scan around the optic disc with optical coherence tomography (OCT).
In some embodiments, the device 9 for classifying eye detection data based on cross-modal relationship reasoning further includes:
a first preprocessing unit, configured to perform first preprocessing on the pattern deviation probability map data, the first preprocessing including normalization; and
a second preprocessing unit, configured to perform second preprocessing on the OCT image data, the second preprocessing including normalization and scale scaling;
when the VF data and the optic disc data are input into the pre-trained convolutional neural network model, the classification result output unit 92 is specifically configured to input the first-preprocessed pattern deviation probability map data and the second-preprocessed OCT image data into the pre-trained convolutional neural network model.
It should be noted that the information interaction between, and the execution processes of, the above devices/units, as well as their specific functions and technical effects, are based on the same concept as the method embodiments of the present application; for details, reference may be made to the method embodiment portions, which are not repeated here.
Example three:
fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 10, the terminal device 10 of this embodiment includes: at least one processor 100 (only one processor is shown in fig. 10), a memory 101, and a computer program 102 stored in the memory 101 and executable on the at least one processor 100, the processor 100 implementing the steps in any of the various method embodiments described above when executing the computer program 102:
acquiring visual field VF data and optic disc data;
inputting the VF data and the optic disc data into a trained convolutional neural network model to obtain a classification result corresponding to the VF data and the optic disc data, wherein the processing of the VF data and the optic disc data by the convolutional neural network model comprises: extracting data features from the VF data and the optic disc data respectively to obtain VF data features and optic disc data features, jointly processing the VF data features and the optic disc data features to obtain enhanced features of the VF data and enhanced features of the optic disc data, performing feature fusion on the enhanced features of the VF data and the enhanced features of the optic disc data to obtain fused features, and classifying the fused features to obtain the classification result.
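To make the overall flow concrete, the following hedged sketch strings the stages together: two backbone branches extract the VF and optic disc features, a simplified joint-processing step produces the enhanced features, and the fused features are classified. The ResNet-18 backbones, the additive enhancement, the 3-channel 224 x 224 inputs, and the two-class output are all assumptions made for the example and stand in for the cross-modal relationship reasoning described above.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class CrossModalClassifier(nn.Module):
    def __init__(self, feat_dim: int = 512, num_classes: int = 2):
        super().__init__()
        # Separate feature extractors for the VF map and the optic disc OCT image
        # (ResNet-18 with the final fully connected layer removed, weights uninitialized).
        self.vf_backbone = nn.Sequential(*list(resnet18().children())[:-1])
        self.disc_backbone = nn.Sequential(*list(resnet18().children())[:-1])
        # Placeholder cross-modal conversion modules for the joint processing step.
        self.disc_to_vf = nn.Linear(feat_dim, feat_dim)
        self.vf_to_disc = nn.Linear(feat_dim, feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, vf_img: torch.Tensor, disc_img: torch.Tensor) -> torch.Tensor:
        vf_feat = self.vf_backbone(vf_img).flatten(1)        # VF data features
        disc_feat = self.disc_backbone(disc_img).flatten(1)  # optic disc data features
        vf_enh = vf_feat + self.disc_to_vf(disc_feat)        # enhanced VF features
        disc_enh = disc_feat + self.vf_to_disc(vf_feat)      # enhanced optic disc features
        fused = torch.cat([vf_enh, disc_enh], dim=1)         # feature fusion
        return self.classifier(fused)                        # classification result

For instance, CrossModalClassifier()(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)) would return a 4 x 2 tensor of class scores.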
The terminal device 10 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 100 and the memory 101. Those skilled in the art will appreciate that fig. 10 is merely an example of the terminal device 10 and does not constitute a limitation of the terminal device 10, which may include more or fewer components than those shown, combine certain components, or use different components, such as an input-output device or a network access device.
The processor 100 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 101 may, in some embodiments, be an internal storage unit of the terminal device 10, such as a hard disk or memory of the terminal device 10. In other embodiments, the memory 101 may also be an external storage device of the terminal device 10, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the terminal device 10. Further, the memory 101 may include both an internal storage unit and an external storage device of the terminal device 10. The memory 101 is used to store an operating system, application programs, a boot loader, data, and other programs, such as the program code of the computer program. The memory 101 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
An embodiment of the present application further provides a network device, where the network device includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the various method embodiments described above when executing the computer program.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product, which when running on a mobile terminal, enables the mobile terminal to implement the steps in the above method embodiments when executed.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing apparatus/terminal apparatus, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash disk, a removable hard disk, a magnetic disk, or an optical disc. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (11)

1. A method for classifying eye detection data based on cross-modal relationship reasoning is characterized by comprising the following steps:
acquiring visual field VF data and optic disc data;
inputting the VF data and the optic disc data into a trained convolutional neural network model to obtain a classification result corresponding to the VF data and the optic disc data, wherein the processing of the VF data and the optic disc data by the convolutional neural network model comprises: extracting data features from the VF data and the optic disc data respectively to obtain VF data features and optic disc data features, jointly processing the VF data features and the optic disc data features to obtain enhanced features of the VF data and enhanced features of the optic disc data, performing feature fusion on the enhanced features of the VF data and the enhanced features of the optic disc data to obtain fused features, and classifying the fused features to obtain the classification result.
2. The method for classifying eye detection data based on cross-modal relationship inference according to claim 1, wherein jointly processing the VF data features and the optic disc data features to obtain the enhanced features of the VF data comprises:
determining the VF data features obtained by converting the optic disc data features, according to the VF data features and the optic disc data features;
obtaining VF enhanced features according to the VF data features and the VF data features obtained by converting the optic disc data features, wherein the VF enhanced features are the enhanced features of the VF data;
and wherein jointly processing the VF data features and the optic disc data features to obtain the enhanced features of the optic disc data comprises:
determining the optic disc data features obtained by converting the VF data features, according to the VF data features and the optic disc data features;
and obtaining optic disc enhanced features according to the optic disc data features and the optic disc data features obtained by converting the VF data features, wherein the optic disc enhanced features are the enhanced features of the optic disc data.
3. The method for classifying eye detection data based on cross-modal relationship inference according to claim 2, wherein the VF data features obtained by converting the optic disc data features are determined as follows:
determining, from the VF data features and the optic disc data features, a first global relationship vector for converting the optic disc data features to the VF data features, and determining a first local relationship vector for converting the optic disc data features to the VF data features;
and determining, from the first global relationship vector and the first local relationship vector, the VF data features obtained by converting the optic disc data features.
4. The method for classifying eye detection data based on cross-modal relationship inference according to claim 3, wherein determining, from the optic disc data features and the VF data features, the first global relationship vector for converting the optic disc data features to the VF data features comprises:
splicing the optic disc data features and the VF data features to obtain a spliced vector;
performing a mapping operation on the spliced vector to obtain a mapping value of the spliced vector;
and determining, from the mapping value, the first global relationship vector for converting the optic disc data features to the VF data features.
5. The method for classifying eye detection data based on cross-modal relationship inference according to claim 3, wherein determining, from the optic disc data features and the VF data features, the first local relationship vector for converting the optic disc data features to the VF data features comprises:
dividing the VF data features into 6 VF regions, and dividing the optic disc data features into 6 optic disc regions;
determining 6 feature region pairs from the 6 VF regions and the 6 optic disc regions, wherein each feature region pair corresponds one-to-one to one VF region and one optic disc region;
and, for the VF region and the optic disc region of each feature region pair, determining a relationship vector between the VF data features in that VF region and the optic disc data features in that optic disc region, to obtain the first local relationship vector for converting the optic disc data features to the VF data features.
6. The method for classifying eye detection data based on cross-modal relationship inference according to claim 4, wherein splicing the optic disc data features and the VF data features to obtain the spliced vector comprises:
if the dimension of the VF data features is inconsistent with the dimension of the optic disc data features, adjusting the dimension of the VF data features and/or the dimension of the optic disc data features to obtain VF data features and optic disc data features of consistent dimensions;
and splicing the dimension-consistent VF data features and optic disc data features to obtain the spliced vector.
7. The method for classifying eye detection data based on cross-modal relationship inference according to any one of claims 1 to 6, wherein the VF data is pattern deviation probability map data, and the optic disc data is OCT map data obtained by performing a circular scan of the optic disc with optical coherence tomography (OCT).
8. The method for classifying eye detection data based on cross-modal relationship inference according to claim 7, comprising, after acquiring the pattern deviation probability map data and the OCT map data:
performing first preprocessing on the pattern deviation probability map data, wherein the first preprocessing comprises normalization;
performing second preprocessing on the OCT map data, wherein the second preprocessing comprises normalization and scale scaling;
and wherein inputting the VF data and the optic disc data into the pre-trained convolutional neural network model comprises: inputting the pattern deviation probability map data after the first preprocessing and the OCT map data after the second preprocessing into the pre-trained convolutional neural network model.
9. A classification device for eye detection data based on cross-modal relationship inference, comprising:
a data acquisition unit, configured to acquire visual field VF data and optic disc data;
a classification result output unit, configured to input the VF data and the optic disc data into a trained convolutional neural network model to obtain a classification result corresponding to the VF data and the optic disc data, wherein the processing of the VF data and the optic disc data by the convolutional neural network model comprises: extracting data features from the VF data and the optic disc data respectively to obtain VF data features and optic disc data features, jointly processing the VF data features and the optic disc data features to obtain enhanced features of the VF data and enhanced features of the optic disc data, performing feature fusion on the enhanced features of the VF data and the enhanced features of the optic disc data to obtain fused features, and classifying the fused features to obtain the classification result.
10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN202110336212.7A 2021-03-29 2021-03-29 Method and device for classifying eye detection data based on cross-modal relation reasoning Active CN113158822B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110336212.7A CN113158822B (en) 2021-03-29 2021-03-29 Method and device for classifying eye detection data based on cross-modal relation reasoning
PCT/CN2021/117444 WO2022205780A1 (en) 2021-03-29 2021-09-09 Method and apparatus for classifying eye examination data on basis of cross-modal relationship inference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110336212.7A CN113158822B (en) 2021-03-29 2021-03-29 Method and device for classifying eye detection data based on cross-modal relation reasoning

Publications (2)

Publication Number Publication Date
CN113158822A true CN113158822A (en) 2021-07-23
CN113158822B CN113158822B (en) 2023-09-29

Family

ID=76885216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110336212.7A Active CN113158822B (en) 2021-03-29 2021-03-29 Method and device for classifying eye detection data based on cross-modal relation reasoning

Country Status (2)

Country Link
CN (1) CN113158822B (en)
WO (1) WO2022205780A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022205780A1 (en) * 2021-03-29 2022-10-06 中国科学院深圳先进技术研究院 Method and apparatus for classifying eye examination data on basis of cross-modal relationship inference

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170112372A1 (en) * 2015-10-23 2017-04-27 International Business Machines Corporation Automatically detecting eye type in retinal fundus images
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN110619332A (en) * 2019-08-13 2019-12-27 中国科学院深圳先进技术研究院 Data processing method, device and equipment based on visual field inspection report
CN111696100A (en) * 2020-06-17 2020-09-22 上海鹰瞳医疗科技有限公司 Method and device for determining smoking degree based on fundus image

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784687A (en) * 2020-07-22 2020-10-16 上海理工大学 Glaucoma fundus image detection method based on deep learning
CN112288720A (en) * 2020-10-29 2021-01-29 苏州体素信息科技有限公司 Deep learning-based color fundus image glaucoma screening method and system
CN113158822B (en) * 2021-03-29 2023-09-29 中国科学院深圳先进技术研究院 Method and device for classifying eye detection data based on cross-modal relation reasoning


Also Published As

Publication number Publication date
CN113158822B (en) 2023-09-29
WO2022205780A1 (en) 2022-10-06

Similar Documents

Publication Publication Date Title
Yang et al. Clinical skin lesion diagnosis using representations inspired by dermatologist criteria
Bilal et al. Diabetic retinopathy detection and classification using mixed models for a disease grading database
CN111985536B Gastroscopic pathology image classification method based on weakly supervised learning
Hyeon et al. Diagnosing cervical cell images using pre-trained convolutional neural network as feature extractor
Usman et al. Diabetic retinopathy detection using principal component analysis multi-label feature extraction and classification
Fan et al. Effect of image noise on the classification of skin lesions using deep convolutional neural networks
De Guzman et al. Design and evaluation of a multi-model, multi-level artificial neural network for eczema skin lesion detection
CN111784665B (en) OCT image quality evaluation method, system and device based on Fourier transform
US20200065967A1 (en) Computer system, method, and program for diagnosing subject
Bassi et al. Deep learning diagnosis of pigmented skin lesions
CN113158821B (en) Method and device for processing eye detection data based on multiple modes and terminal equipment
Shamrat et al. Analysing most efficient deep learning model to detect COVID-19 from computer tomography images
Panda et al. Glauconet: patch-based residual deep learning network for optic disc and cup segmentation towards glaucoma assessment
Gorgulu et al. Use of fuzzy logic based decision support systems in medicine
CN107590806B (en) Detection method and system based on brain medical imaging
Mustafa et al. Hybrid Color Texture Features Classification Through ANN for Melanoma.
CN113158822B (en) Method and device for classifying eye detection data based on cross-modal relation reasoning
Santos et al. A hybrid of deep and textural features to differentiate glomerulosclerosis and minimal change disease from glomerulus biopsy images
Pradhan et al. Transfer learning based classification of diabetic retinopathy stages
CN115829980A (en) Image recognition method, device, equipment and storage medium for fundus picture
Prabha et al. Machine Learning Technique for Analysing the Diabetic Retinopathy
Iqbal et al. Privacy-preserving collaborative AI for distributed deep learning with cross-sectional data
CN113392895A (en) Knee joint cartilage damage detection method and system
Gonçalves et al. Automatic detection of Visceral Leishmaniasis in humans using Deep Learning
CN111179226A (en) Visual field map identification method and device and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant