CN116385806A - Method, system, equipment and storage medium for classifying strabismus type of eye image - Google Patents

Method, system, equipment and storage medium for classifying strabismus type of eye image

Info

Publication number: CN116385806A
Application number: CN202310613349.1A
Authority: CN (China)
Prior art keywords: network model, extraction network, feature, model, classification
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN116385806B (en)
Inventors: 刘陇黔, 张海仙, 吴达文, 李彦霏, 杨国渊, 毛轶绩, 封毅, 魏文远
Current Assignee: West China Hospital of Sichuan University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: West China Hospital of Sichuan University
Application filed by West China Hospital of Sichuan University
Priority to CN202310613349.1A; granted and published as CN116385806B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, system, equipment and storage medium for classifying the strabismus type of eye images, relating to strabismus classification of eye images in the field of artificial intelligence, and aims to solve the technical problem of low accuracy in classifying the strabismus types of eye images in the prior art. The method takes as input text data containing the patient's basic information and image data comprising an eye image of the patient; a feature extraction module based on the residual-block connection mechanism of the ResNet50V2 model extracts features from the eye image data; a feature fusion module based on a joint multi-head attention mechanism then fuses the image features extracted by ResNet50V2 with the normalized text features; and finally a multi-classification module based on a hierarchical classification method outputs the classification result (ten classes covering normal and strabismus). Through its multi-modal and hierarchical classification architecture, the whole model improves multi-class precision, reduces inter-class errors, and has strong practical significance and clinical value.

Description

Method, system, equipment and storage medium for classifying strabismus type of eye image
Technical Field
The invention relates to the technical field of artificial intelligence and to methods for classifying the strabismus type of eye images, and in particular to a method, system, equipment and storage medium for classifying the strabismus type of eye images.
Background
Strabismus is a clinically common eye disease with a prevalence of about 3%; it can cause monocular suppression and retinal abnormalities in the patient, resulting in permanent visual impairment. In addition, strabismus can have serious psychosocial consequences for the patient. In summary, strabismus has a significant and long-term impact on patients in terms of visual function, appearance, learning ability, work opportunities, mental health, and so on. The onset of strabismus is insidious, and many young strabismus patients would obtain better chances of cure if diagnosed early; screening and diagnosing strabismus as early as possible is therefore important. Currently, strabismus screening and diagnosis are performed mainly by an ophthalmologist through several manual tests, such as the cover-uncover test and the prism cover test, which require a high degree of cooperation between patient and doctor and a long examination time. These tests depend heavily on the skill and experience of the doctor, and the examination results are subjective; moreover, China currently has a huge shortfall of ophthalmology resources of uneven quality, so risks of missed diagnosis and misdiagnosis exist. Therefore, developing a reliable artificial intelligence system with deep learning methods to realize rapid, automatic strabismus screening and diagnosis, provide comparatively more objective diagnostic results, and start therapeutic intervention as soon as possible is of great significance for protecting the visual function of strabismus patients and improving their quality of life.
In the current field of artificial-intelligence eye-image classification, where eye images are classified to obtain the strabismus classification result, there are two main research approaches: eye key-region segmentation algorithms based on traditional stepwise learning, and classification algorithms based on end-to-end learning. In research on eye key-region segmentation algorithms, researchers often adopt a pre-trained face detection model to extract the eye region of a face image, obtain the coordinates of key regions such as the pupil centers and corneal light spots for computation, and then compare the result of the coordinate computation numerically against a preset threshold to judge whether strabismus exists and its type. Choi et al. proposed an image-processing-based strabismus screening model that uses first eye position (primary gaze) images, samples all pixel points on the eye contour edge obtained by a segmentation algorithm, and applies the least-squares method to obtain the coordinates of the pupil center; the similarity of the positions of the two eyes in the photograph is measured by calculating the distance from the pupil center to the inner and outer canthi, and whether strabismus exists is judged. Ma et al. obtained the coordinates of the corneal centers and corneal reflection points of the two eyes by a similar method, calculated the horizontal and vertical offsets of the corneal reflection points relative to the corresponding corneal centers from the relative positions of the coordinates, and judged whether strabismus exists. Kang et al. were the first to use first, second and third eye position images, obtaining the coordinates of corneal reflection points, pupil centers, inner and outer canthi, and upper and lower eyelid margin points with a U-Net-based segmentation algorithm; the coordinates on the second and third eye position images are translated onto the first eye position image according to a reference frame for coordinate computation, realizing a multi-class task over esotropia, exotropia, hypertropia and hypotropia. In research on end-to-end classification algorithms, Zheng et al. trained a deep learning model based on the R-CNN architecture with first eye position images of horizontal strabismus and orthotropia, realizing a horizontal-strabismus-versus-normal classification task; Lin et al. trained a deep learning model based on the InceptionResNetV2 architecture with first eye position images of various types of strabismus and orthotropia, realizing a strabismus-versus-normal classification task.
Research on eye key-region segmentation algorithms based on traditional stepwise learning judges whether strabismus exists, and its type, by comparing the result of the eye feature-point coordinate computation with a preset threshold; such methods select the threshold from statistics over a small range, which is subjective, and verify it on small data sets, so the selected threshold is prone to bias, difficult to popularize at scale, and of low accuracy in classifying strabismus types. Research on classification algorithms based on end-to-end learning trains on a large amount of image data and generalizes better, but it currently remains at strabismus-versus-normal classification tasks, offers limited help to clinical practice, and likewise has low accuracy in classifying strabismus types.
Disclosure of Invention
The aim of the invention is to solve the technical problem of low accuracy in classifying the strabismus types of eye images in the prior art; to this end, the invention provides a method, system, equipment and storage medium for classifying the strabismus type of eye images.
To achieve this aim, the invention adopts the following technical scheme:
a method for classifying strabismus type of eye images, comprising the following steps:
Step S1, obtaining sample data
Acquiring sample data, wherein the sample data comprises eye image sample data and text sample data;
Step S2, constructing a feature extraction network model
Constructing a feature extraction network model, wherein the feature extraction network model comprises a feature pre-extraction network model, a feature coarse extraction network model, a feature fine extraction network model and a classification network, and the feature pre-extraction network model comprises a ResNet50V2 image extraction model and a text extraction model;
Step S3, training the feature extraction network model
Training the feature extraction network model constructed in the step S2 by adopting the sample data acquired in the step S1;
taking eye image sample data as input of a ResNet50V2 image extraction model, taking text sample data as input of a text extraction model, taking output of the ResNet50V2 image extraction model and the text extraction model as input of a characteristic coarse extraction network model, taking output of the characteristic coarse extraction network model as input of a characteristic fine extraction network model, taking output of the characteristic fine extraction network model as input of a classification network, and outputting a classification result by the classification network;
step S4, strabismus real-time classification
Real-time eye image data and text data are acquired and input into the feature extraction network model trained in step S3, and the feature extraction network model outputs the classification result.
Further, in step S2, the ResNet50V2 image extraction model includes a zero-padding layer, a two-dimensional convolution layer, a zero-padding layer, a maximum pooling layer, a residual module, a batch normalization layer, a linear rectification unit, an average pooling layer, and a full connection layer, which are sequentially connected.
Still further, the residual module includes a plurality of residual blocks connected in sequence; the last residual block includes two basic blocks, and the remaining residual blocks include three basic blocks.
Further, each basic block comprises a first batch normalization layer and a first linear rectification unit which are sequentially connected; the output of the first linear rectification unit is divided into two parallel paths: one path passes sequentially through the first two-dimensional convolution layer, the second batch normalization layer, the second linear rectification unit, zero padding, the second two-dimensional convolution layer, the third batch normalization layer and the third linear rectification unit, and is input into the third two-dimensional convolution layer; the other path is input into the fourth two-dimensional convolution layer; and the outputs of the third and fourth two-dimensional convolution layers are fused as the output of the whole basic block.
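As an illustrative sketch only (not part of the claims), the basic block described above could be assembled with the Keras functional API roughly as follows; the filter counts, stride and input shape are assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def basic_block(x, filters, stride=1):
    # First batch normalization layer and first linear rectification unit;
    # their output feeds both parallel paths (pre-activation design).
    preact = layers.ReLU()(layers.BatchNormalization()(x))

    # Path 1: conv -> BN -> ReLU -> zero padding -> conv -> BN -> ReLU -> conv
    y = layers.Conv2D(filters, 1, use_bias=False)(preact)
    y = layers.ReLU()(layers.BatchNormalization()(y))
    y = layers.ZeroPadding2D(1)(y)
    y = layers.Conv2D(filters, 3, strides=stride, use_bias=False)(y)
    y = layers.ReLU()(layers.BatchNormalization()(y))
    y = layers.Conv2D(4 * filters, 1)(y)           # third two-dimensional convolution

    # Path 2: fourth two-dimensional convolution (projection shortcut)
    s = layers.Conv2D(4 * filters, 1, strides=stride)(preact)

    # Fuse the outputs of the two paths as the output of the whole basic block
    return layers.Add()([y, s])

inp = tf.keras.Input(shape=(56, 56, 64))
out = basic_block(inp, filters=64)
block = tf.keras.Model(inp, out)    # (56, 56, 64) -> (56, 56, 256)
```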
Further, in step S3, when the feature extraction network model is trained, the residual block of the feature pre-extraction network model is activated in a nonlinear manner by means of the ReLU function, according to the formula:

$$x_{l+1} = f\big(x_l + \mathcal{F}(x_l, W_l)\big)$$

wherein $\mathcal{F}$ represents the residual function of the sequence of residual units, $x_l$ represents the input of the residual unit, $W_l$ is the series of weights and biases associated with the residual unit, $l$ is the network-layer index of the residual unit, and $f$ represents the activation function, for which a ReLU is typically used.
Further, in step S2, the feature coarse extraction network model and the feature fine extraction network model are both feature extraction networks with a joint attention mechanism, comprising a plurality of self-attention network blocks arranged in sequence;

the image information is taken as K and V in the attention mechanism, and the text information as Q; throughout the attention mechanism, the model scores the degree of attention paid to the image information conditioned on the text information, obtaining the matrix $\mathrm{softmax}(QK^{T}/\sqrt{d_k})$, which acts as a weight on the result V obtained by analysing the image information, yielding the final attention result.
Further, in step S3, when the feature extraction network model is trained, the forward attention learned by the feature coarse extraction network model and the feature fine extraction network model is calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

wherein $d_k$ represents the dimension of the matrix, i.e. the dimension of the information contained in one sample of the matrices Q and K, and $T$ represents the transpose.
A classification system for an eye image strabismus type, comprising:
the sample data acquisition module is used for acquiring sample data, wherein the sample data comprises eye image sample data and text sample data;
the feature extraction network model construction module is used for constructing a feature extraction network model, wherein the feature extraction network model comprises a feature pre-extraction network model, a feature coarse extraction network model, a feature fine extraction network model and a classification network, and the feature pre-extraction network model comprises a ResNet50V2 image extraction model and a text extraction model;
the feature extraction network model training module is used for training the feature extraction network model constructed by the feature extraction network model construction module by adopting the sample data acquired by the sample data acquisition module;
taking eye image sample data as input of a ResNet50V2 image extraction model, taking text sample data as input of a text extraction model, taking output of the ResNet50V2 image extraction model and the text extraction model as input of a characteristic coarse extraction network model, taking output of the characteristic coarse extraction network model as input of a characteristic fine extraction network model, taking output of the characteristic fine extraction network model as input of a classification network, and outputting a classification result by the classification network;
The strabismus real-time classification module is used for acquiring real-time eye image data and text data and inputting them into the feature extraction network model trained by the feature extraction network model training module; the feature extraction network model outputs the classification result.
A computer device, characterized in that it comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method described above.
A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, causes the processor to perform the steps of the method described above.
The beneficial effects of the invention are as follows:
1. Compared with prior research on eye key-region segmentation algorithms based on traditional stepwise learning, which is prone to bias from small data sets and difficult to popularize at scale, the invention can train the network with a large amount of bimodal (image + text) data, reducing the model's bias and giving it generality.
2. Compared with prior research on classification algorithms based on end-to-end learning, which is limited to strabismus-versus-normal classification tasks and of limited help to clinical practice, the invention realizes strabismus classification covering all clinically common strabismus types, and thus has stronger practical significance and clinical value.
3. Compared with conventional artificial-intelligence strabismus screening and diagnosis research, which screens and diagnoses strabismus using eye-position pictures alone, the invention, through multi-modal feature fusion drawing on the two information sources of text and image, lets the model comprehensively learn the electronic-case features corresponding to the pictures, further improving the precision of the classification model and the accuracy of the classification result.
4. The invention takes into account that some horizontal strabismus subtypes and vertical strabismus subtypes are highly similar in their features, so that a directly performed classification task would have low accuracy on similar subclasses. By means of a hierarchical classification method, the invention resolves the class confusion caused by the high feature similarity of some horizontal and vertical strabismus subtypes, thereby improving multi-class precision and reducing inter-class errors.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a network architecture diagram of feature extractor ResNet50V2 of the present invention;
FIG. 3 is a schematic diagram of the structure of a residual block in the present invention;
fig. 4 is a diagram of a feature extraction network of a joint attention mechanism in the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention.
Thus, all other embodiments, which can be made by one of ordinary skill in the art without undue burden from the invention, are intended to be within the scope of the invention.
Example 1
This embodiment provides a method for classifying the strabismus type of eye images. In the data preprocessing stage, the text data are normalized to the same scale and range, reducing the correlation between features and thereby improving their distinguishability and interpretability; since the image training set has few samples, data augmentation is used to enlarge the data set so that the classes are as balanced as possible, improving the model's ability to recognize minority classes. In the feature extraction and fusion stage, features are extracted from the eye image data through the residual-block connection mechanism of the ResNet50V2 model, and a feature fusion module based on a joint multi-head attention mechanism then fuses the image features extracted by ResNet50V2 with the normalized text features, so that the model comprehensively learns the features of the pictures and the patients' associated electronic cases, further improving classification precision. The multi-class output stage adopts a multi-classification module based on a hierarchical classification method: the categories are first divided into the three major classes of normal, horizontal strabismus and vertical strabismus, and the major classes are then subdivided under the guidance of the major-class classification, splitting horizontal strabismus and vertical strabismus into several subtypes. This hierarchical classification resolves the class confusion caused by the high feature similarity of some horizontal and vertical strabismus subtypes, thereby improving multi-class precision and reducing inter-class errors. As shown in fig. 1, the specific classification steps are:
Step S1, obtaining sample data
Sample data is acquired, wherein the sample data comprises eye image sample data and text sample data.
The eye image sample data is an image of the eyes, and the text sample data includes the patient's sex, duration of disease, and age of onset. Because the value ranges of different features differ greatly, instability and oscillation can occur during gradient updates, reducing the convergence speed and accuracy of the model; the text sample data therefore need to be normalized. Normalizing the text sample data to the same scale and range reduces the correlation between features, thereby improving their distinguishability and interpretability. This embodiment therefore scales the text sample data into [0, 1], making comparisons between different features more reasonable and accurate. Specifically, the sexes male and female are encoded as 1 and 0, respectively, and the duration of disease and age of onset each have the minimum value in the sample subtracted and are divided by the difference between the maximum and minimum values.
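A minimal sketch of this preprocessing, assuming exactly the three text fields named above (the function and parameter names are illustrative, not taken from the patent):

```python
import numpy as np

def normalize_text_features(sex, duration, onset_age,
                            duration_min, duration_max,
                            onset_min, onset_max):
    """Encode sex as 1/0 and min-max scale the two numeric fields to [0, 1]."""
    sex_code = 1.0 if sex == "male" else 0.0
    duration_norm = (duration - duration_min) / (duration_max - duration_min)
    onset_norm = (onset_age - onset_min) / (onset_max - onset_min)
    return np.array([sex_code, duration_norm, onset_norm], dtype=np.float32)
```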
To address the small number of samples in the image training set, this embodiment uses simple, standardized data augmentation such as scaling, rotation, flipping and contrast enhancement to enlarge the data set, so that the classes in the data set are as balanced as possible, improving the model's ability to recognize minority classes while preventing overfitting and reducing the model's variance.
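One possible realization of these augmentations with standard Keras preprocessing layers (a sketch; the magnitudes are assumptions, not values given in the patent):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Scaling, rotation, flipping and contrast enhancement applied on the fly
# during training to enlarge and balance the image data set.
augment = tf.keras.Sequential([
    layers.RandomZoom(0.1),           # scaling
    layers.RandomRotation(0.05),      # rotation, as a fraction of 2*pi
    layers.RandomFlip("horizontal"),  # flipping
    layers.RandomContrast(0.2),       # contrast enhancement
])
```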
Step S2, constructing a feature extraction network model
Constructing a feature extraction network model, wherein the feature extraction network model comprises a feature pre-extraction network model, a feature coarse extraction network model, a feature fine extraction network model and a classification network, and the feature pre-extraction network model comprises a ResNet50V2 image extraction model and a text extraction model.
The feature pre-extraction network model is used for respectively extracting features of eye image sample data of an eye image and text sample data of basic information of a patient.
First, the ResNet50V2 image extraction model is used as the feature extractor for the image data, using the residual-block connection scheme; the final fully-connected layer used for classification is removed, and the result of the preceding layer is extracted as the feature vector of this modality, to facilitate the subsequent feature fusion. As shown in fig. 2, the ResNet50V2 image extraction model includes a zero-padding layer, a two-dimensional convolution layer, a zero-padding layer, a maximum pooling layer, a residual module, a batch normalization layer, a linear rectification unit, an average pooling layer, and a full connection layer, which are sequentially connected. The residual module comprises a plurality of residual blocks connected in sequence; the last residual block comprises two basic blocks, and the remaining residual blocks comprise three basic blocks. Each basic block comprises a first batch normalization layer and a first linear rectification unit which are sequentially connected; the output of the first linear rectification unit is divided into two parallel paths: one path passes sequentially through the first two-dimensional convolution layer, the second batch normalization layer, the second linear rectification unit, zero padding, the second two-dimensional convolution layer, the third batch normalization layer and the third linear rectification unit, and is input into the third two-dimensional convolution layer; the other path is input into the fourth two-dimensional convolution layer; and the outputs of the third and fourth two-dimensional convolution layers are fused as the output of the whole basic block.
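Such a feature extractor can be sketched from the stock Keras ResNet50V2, dropping the final classification layer and keeping the pooled output of the preceding layer; the 224×224 RGB input shape is an assumption:

```python
import tensorflow as tf

# ResNet50V2 without its final fully-connected classification layer;
# "avg" pooling exposes the preceding layer's result as the image
# feature vector used later for feature fusion.
image_backbone = tf.keras.applications.ResNet50V2(
    include_top=False,
    weights=None,               # or "imagenet" for pretrained weights
    input_shape=(224, 224, 3),
    pooling="avg",
)
image_in = tf.keras.Input(shape=(224, 224, 3))
image_features = image_backbone(image_in)    # shape: (batch, 2048)
```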
The ResNet50V2 image extraction model uses residual-block connections, which can solve the problems of gradient vanishing and gradient explosion in deep neural networks, thereby realizing a deeper network structure and higher classification precision. The residual block is the core of the ResNet50V2 image extraction model; its structure is shown in fig. 3.
For the preprocessed low-dimensional text data, the text extraction model in this embodiment adopts two fully-connected layers to extract features from the text data, obtaining its feature vectors. These low-dimensional feature vectors better describe the semantic information and structural characteristics of the text data and provide stronger support for the subsequent feature fusion.
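A sketch of the two fully-connected layers, with an assumed width of 64 units and the three normalized text fields as input:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Two fully-connected layers extract a low-dimensional feature vector
# from the normalized text data (sex, duration of disease, age of onset).
text_in = tf.keras.Input(shape=(3,))
t = layers.Dense(64, activation="relu")(text_in)
text_features = layers.Dense(64, activation="relu")(t)
```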
For the two modalities with relatively large differences, image data and text data, this embodiment adopts late fusion: features are extracted from the two modalities separately, and after the feature vectors of both modalities are obtained, feature fusion based on a joint multi-head attention mechanism is performed. To this end, as shown in fig. 4, the feature coarse extraction network model and the feature fine extraction network model are feature extraction networks with a joint attention mechanism; unlike the self-attention structure of the conventional Transformer, these networks take the image information as K and V in the attention mechanism and the text information as Q. The meanings of the Q, K and V matrices are as follows:
Q represents all the information of the text and is the content to be learned. For the network, the Q matrix indicates, as text information, which region of the corresponding picture the network attends to.

K is the keyword information, i.e. the prompt available to the model before it has seen the content. In the network it gives the region of the image that the text attends to.

V is the learning content, i.e. the information the model represents according to the keywords; for this reason V is typically initialized identically to Q. In the network it corresponds to the image information acted upon by the attention vector obtained from the text acting on the image.

$d_k$, the dimension of the matrix, refers to the dimension of the information contained in one sample of the matrices Q and K.
The feature coarse extraction network model and the feature fine extraction network model each comprise a plurality of self-attention network blocks arranged in sequence;

the image information is taken as K and V in the attention mechanism, and the text information as Q; throughout the attention mechanism, the model scores the degree of attention paid to the image information conditioned on the text information, obtaining the matrix $\mathrm{softmax}(QK^{T}/\sqrt{d_k})$, which acts as a weight on the result V obtained by analysing the image information, yielding the final attention result.
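In Keras terms this joint attention can be sketched with a MultiHeadAttention layer whose query comes from the text features and whose key and value come from the image features; the token counts, feature sizes and head count below are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder token sequences: N = 4 text tokens, M = 49 image tokens.
text_tokens = tf.keras.Input(shape=(4, 64))
image_tokens = tf.keras.Input(shape=(49, 256))

# Text acts as Q; image acts as K and V, so the text scores how much
# attention each image region receives.
cross_attention = layers.MultiHeadAttention(num_heads=8, key_dim=32)
fused = cross_attention(query=text_tokens,
                        value=image_tokens,
                        key=image_tokens)
# fused has shape (batch, 4, 64): the output scale follows the text
# (query) side, which is what shrinks the input to subsequent layers.
```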
The feature coarse extraction network model and the feature fine extraction network model have the following characteristic: since K and V are obtained from images, their feature dimension M is at the pixel level and relatively large, whereas the text feature dimension N, obtained from the text information, is far smaller than M. When the Q and K matrices are fused as features, each dimension of Q is mapped onto K and corresponds to a region, namely the part on which the attention mechanism focuses. After the first attention layer, the scale of the vector sent into the subsequent network becomes N × d_v, which greatly reduces the subsequent computational complexity; the time complexity is

$$O(N \cdot d_v)$$

wherein N is the dimension of the text information matrix, $d_v$ is the dimension of the image region onto which one piece of text information is mapped, and $N \cdot d_v$ expresses that each dimension of the text matrix is mapped to a corresponding region of the image.
The general strabismus classification task can be divided into normal, horizontal strabismus and vertical strabismus. The major classes of horizontal strabismus and vertical strabismus can each be divided into several subclasses, and the final purpose of this embodiment is to subdivide into these subclasses. Because some horizontal strabismus subtypes and vertical strabismus subtypes are highly similar in their features, a directly performed classification task would have low accuracy on similar subclasses, while a neural network's learning accuracy is higher the smaller the number of classes. If the major classes are first separated by a three-way classification and the subclasses are then subdivided under the guidance of the major-class classification, the overall classification accuracy is relatively high. The classification network of this embodiment therefore adopts a multi-level classification based on a fully-connected layer, using an existing classification network structure, and divides the whole classification task into two parts: parent-class classification and subclass classification. The parent class guides the subclass, but the training processes of the two are independent during model training: they have different loss functions and iterate separately. As for the gradient updates of the parameters, the parameters after the coarse feature extraction network in the classification network are determined only by the back-propagated loss of the subclass classification, while the network before the coarse feature extraction network is updated jointly by the losses of both classifications. Specifically:
For the parent-class classification: classifying the strabismus task into the three classes of normal, horizontal strabismus and vertical strabismus is a relatively simple and accurate task; the overall differences between the classes are large, and the complete features of the data need not be learned. This embodiment adopts a design that divides the feature extraction network into two parts, the front part being the coarse feature extraction network and the back part being the fine feature extraction network; when the coarse feature extraction network finishes and the coarse features enter the fine feature extraction network, a new network branch is created, and a linear layer is attached to the features to perform the three-way classification task, namely the parent-class classification task. Although the parent-class classification results are also updated continuously with the network, within the same training step the parent-class classification result is taken to be accurate and used to guide the subclass classification task.
For the subclass classification: after the whole feature extraction network finishes, a linear layer and a softmax function are added at the end to perform the final classification task. The result of the parent-class classification is obtained before this classification is performed, and at this point the parent-class result is fully trusted to be correct. All results in this classification that do not accord with the parent-class classification are therefore discarded, the result with the largest softmax value among the remainder is selected as the classification result, and finally one of the following categories is output: normal, comitant exotropia, A-pattern exotropia, V-pattern exotropia, other incomitant exotropia, comitant esotropia, A-pattern esotropia, V-pattern esotropia, other incomitant esotropia, left hypertropia, or right hypertropia.
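A sketch of the inference rule just described: the three-way parent result masks the eleven-way softmax output, and the largest remaining value is taken. The class-index layout is an assumption for illustration:

```python
import numpy as np

# Assumed index layout of the 11 final categories:
# 0 normal; 1-8 horizontal strabismus subtypes; 9-10 vertical subtypes.
PARENT_TO_CHILDREN = {
    0: [0],                         # normal
    1: [1, 2, 3, 4, 5, 6, 7, 8],    # horizontal strabismus subtypes
    2: [9, 10],                     # vertical strabismus subtypes
}

def hierarchical_predict(parent_probs, child_probs):
    """Discard subclass results that contradict the parent classification."""
    parent = int(np.argmax(parent_probs))
    allowed = PARENT_TO_CHILDREN[parent]
    masked = np.zeros_like(child_probs)
    masked[allowed] = child_probs[allowed]   # keep only consistent classes
    return int(np.argmax(masked))
```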
Step S3, training a feature extraction network model
Training the feature extraction network model constructed in the step S2 by adopting the sample data acquired in the step S1;
the method comprises the steps of taking eye image sample data as input of a ResNet50V2 image extraction model, taking text sample data as input of a text extraction model, taking output of the ResNet50V2 image extraction model and the text extraction model as input of a characteristic coarse extraction network model, taking output of the characteristic coarse extraction network model as input of a characteristic fine extraction network model, taking output of the characteristic fine extraction network model as input of a classification network, and outputting a classification result by the classification network.
When the feature extraction network model is trained, the residual block of the feature pre-extraction network model is activated nonlinearly by the ReLU function, reducing the redundancy of the information in the data. The upper-layer features are used directly in the x part, and the cross-layer connections realize feature sharing and information transfer between the front and back convolution layers, which can speed up the model's learning and accelerate convergence through parameter optimization. This structure also makes the mapping F(x) more sensitive to changes in the output. The specific formula is:
$$x_{l+1} = f\big(x_l + \mathcal{F}(x_l, W_l)\big)$$

wherein $\mathcal{F}$ represents the residual function of the sequence of residual units, $x_l$ represents the input of the residual unit, $W_l$ is the series of weights and biases associated with the residual unit, $l$ is the network-layer index of the residual unit, and $f$ represents the activation function, for which a ReLU is typically used.
In this task the machine can see only the patient's eye photograph, but the range of the region of interest in the picture is corrected and fused through the patient's related electronic-case data, so that the model comprehensively learns the picture together with the patient's related electronic-case features. Accordingly, when the feature extraction network model is trained, the forward attention of the feature coarse extraction network model and the feature fine extraction network model is calculated as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

wherein $d_k$ represents the dimension of the matrix, i.e. the dimension of the information contained in one sample of the matrices Q and K, and $T$ represents the transpose.
Because the network obtains the data of the two different modalities through different encoding networks, the QKV matrices are obtained by applying different fully-connected layers to the data, and learning likewise proceeds by updating these fully-connected layers. Denoting the image data by G, the text data by H, and the linear-layer parameters of the three matrices by $W_Q$, $W_K$ and $W_V$, the actual propagation formula is:

$$Q = H W_Q, \quad K = G W_K, \quad V = G W_V$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{(H W_Q)(G W_K)^{T}}{\sqrt{d_k}}\right) G W_V$$

where G represents the image data, H the text data, $d_k$ the dimension of the matrix, $W_Q$, $W_K$ and $W_V$ the linear-layer parameters of the three matrices QKV, and $T$ the transpose.
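The propagation formula can be written out directly; a small NumPy sketch under assumed dimensions (G, H and the weight matrices are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, d_g, d_h, d_k = 49, 4, 256, 64, 32    # assumed sizes
G = rng.normal(size=(M, d_g))               # image data
H = rng.normal(size=(N, d_h))               # text data
W_q = rng.normal(size=(d_h, d_k))
W_k = rng.normal(size=(d_g, d_k))
W_v = rng.normal(size=(d_g, d_k))

Q, K, V = H @ W_q, G @ W_k, G @ W_v         # Q from text; K, V from image
scores = Q @ K.T / np.sqrt(d_k)             # (N, M) attention scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                           # (N, d_k): scale follows the text
```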
The multi-head attention mechanism employed in this embodiment splits data of dimension $d_{model}$ into h groups of QKV of dimension $d_{model}/h$ each, where h is the number of heads of the multi-head attention mechanism, rather than using a single set of QKV. Each QKV obtains part of the dimensional features of the data and learns its own attention over them, and finally all heads are concatenated together. This structural design lets each attention head optimize a different feature portion of the data, balancing the bias that a single attention mechanism might produce, giving the data richer representations and noticeably improving the model's performance.
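The head splitting itself is only a reshape; a sketch in the same NumPy notation, assuming the model dimension is divisible by the head count h:

```python
import numpy as np

def split_heads(x, h):
    """(seq, d_model) -> (h, seq, d_model // h): each head receives part
    of the feature dimensions and runs its own attention over them."""
    seq, d_model = x.shape
    return x.reshape(seq, h, d_model // h).transpose(1, 0, 2)

def merge_heads(x):
    """Concatenate all heads back together: (h, seq, d) -> (seq, h * d)."""
    h, seq, d = x.shape
    return x.transpose(1, 0, 2).reshape(seq, h * d)
```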
Step S4, strabismus real-time classification
Real-time eye image data and text data are acquired and input into the feature extraction network model trained in step S3, and the feature extraction network model outputs the classification result.
Example 2
The embodiment provides a classification system of an eye image strabismus type, which specifically includes:
and the sample data acquisition module is used for acquiring sample data, wherein the sample data comprises eye image sample data and text sample data.
The eye image sample data is an eye image, and the text sample data includes the patient's sex, duration of disease, and age of onset. Because the value ranges of different features differ greatly, instability and oscillation can occur during gradient updates, reducing the convergence speed and accuracy of the model; the text sample data therefore need to be normalized. Normalizing the text sample data to the same scale and range reduces the correlation between features, thereby improving their distinguishability and interpretability. This embodiment therefore scales the text sample data into [0, 1], making comparisons between different features more reasonable and accurate. Specifically, the sexes male and female are encoded as 1 and 0, respectively, and the duration of disease and age of onset each have the minimum value in the sample subtracted and are divided by the difference between the maximum and minimum values.
To address the small number of samples in the image training set, this embodiment uses simple, standardized data augmentation such as scaling, rotation, flipping and contrast enhancement to enlarge the data set, so that the classes in the data set are as balanced as possible, improving the model's ability to recognize minority classes while preventing overfitting and reducing the model's variance.
The feature extraction network model construction module is used for constructing a feature extraction network model, wherein the feature extraction network model comprises a feature pre-extraction network model, a feature coarse extraction network model, a feature fine extraction network model and a classification network, and the feature pre-extraction network model comprises a ResNet50V2 image extraction model and a text extraction model.
The feature pre-extraction network model is used for respectively extracting features of eye image sample data of an eye image and text sample data of basic information of a patient.
First, the ResNet50V2 image extraction model is used as the feature extractor for the image data, using the residual-block connection scheme; the final fully-connected layer used for classification is removed, and the result of the preceding layer is extracted as the feature vector of this modality, to facilitate the subsequent feature fusion. As shown in fig. 2, the ResNet50V2 image extraction model includes a zero-padding layer, a two-dimensional convolution layer, a zero-padding layer, a maximum pooling layer, a residual module, a batch normalization layer, a linear rectification unit, an average pooling layer, and a full connection layer, which are sequentially connected. The residual module comprises a plurality of residual blocks connected in sequence; the last residual block comprises two basic blocks, and the remaining residual blocks comprise three basic blocks. Each basic block comprises a first batch normalization layer and a first linear rectification unit which are sequentially connected; the output of the first linear rectification unit is divided into two parallel paths: one path passes sequentially through the first two-dimensional convolution layer, the second batch normalization layer, the second linear rectification unit, zero padding, the second two-dimensional convolution layer, the third batch normalization layer and the third linear rectification unit, and is input into the third two-dimensional convolution layer; the other path is input into the fourth two-dimensional convolution layer; and the outputs of the third and fourth two-dimensional convolution layers are fused as the output of the whole basic block.
The ResNet50V2 image extraction model uses residual-block connections, which can solve the problems of gradient vanishing and gradient explosion in deep neural networks, thereby realizing a deeper network structure and higher classification precision. The residual block is the core of the ResNet50V2 image extraction model; its structure is shown in fig. 3.
For the preprocessed low-dimensional text data, the text extraction model in this embodiment adopts two fully-connected layers to extract features from the text data, obtaining its feature vectors. These low-dimensional feature vectors better describe the semantic information and structural characteristics of the text data and provide stronger support for the subsequent feature fusion.
For the two modalities with relatively large differences, image data and text data, this embodiment adopts late fusion: features are extracted from the two modalities separately, and after the feature vectors of both modalities are obtained, feature fusion based on a joint multi-head attention mechanism is performed. To this end, the feature coarse extraction network model and the feature fine extraction network model are feature extraction networks with a joint attention mechanism; unlike the self-attention structure of the conventional Transformer, these networks take the image information as K and V in the attention mechanism and the text information as Q. The meanings of the Q, K and V matrices are as follows:
Q represents all the information of the text and is the content to be learned.

K is the keyword information, i.e. the prompt available to the model before it has seen the content.

V is the learning content, i.e. the information the model represents according to the keywords; for this reason V is typically initialized identically to Q.

$d_k$, the dimension of the matrix, refers to the dimension of the information contained in one sample of the matrices Q and K.
The feature coarse extraction network model and the feature fine extraction network model each comprise a plurality of self-attention network blocks arranged in sequence;

the image information is taken as K and V in the attention mechanism, and the text information as Q; throughout the attention mechanism, the model scores the degree of attention paid to the image information conditioned on the text information, obtaining the matrix $\mathrm{softmax}(QK^{T}/\sqrt{d_k})$, which acts as a weight on the result V obtained by analysing the image information, yielding the final attention result.
The feature coarse extraction network model and the feature fine extraction network model have the following characteristic: since K and V are obtained from images, their feature dimension M is at the pixel level and relatively large, whereas the text feature dimension N, obtained from the text information, is far smaller than M. When the Q and K matrices are fused as features, each dimension of Q is mapped onto K and corresponds to a region, namely the part on which the attention mechanism focuses. After the first attention layer, the scale of the vector sent into the subsequent network becomes N × d_v, which greatly reduces the subsequent computational complexity; the time complexity is

$$O(N \cdot d_v)$$

wherein N is the dimension of the text information matrix, $d_v$ is the dimension of the image region onto which one piece of text information is mapped, and $N \cdot d_v$ expresses that each dimension of the text matrix is mapped to a corresponding region of the image.
The general strabismus classification task can be divided into normal, horizontal strabismus and vertical strabismus. The major classes of horizontal strabismus and vertical strabismus can each be divided into several subclasses, and the final purpose of this embodiment is to subdivide into these subclasses. Because some horizontal strabismus subtypes and vertical strabismus subtypes are highly similar in their features, a directly performed classification task would have low accuracy on similar subclasses, while a neural network's learning accuracy is higher the smaller the number of classes. If the major classes are first separated by a three-way classification and the subclasses are then subdivided under the guidance of the major-class classification, the overall classification accuracy is relatively high. The classification network of this embodiment therefore adopts a multi-level classification based on a fully-connected layer and divides the whole classification task into two parts: parent-class classification and subclass classification. Both classifications perform their task by adding a linear layer after the feature extraction network. The parent class guides the subclass, but the training processes of the two are independent during model training: they have different loss functions and iterate separately. As for the gradient updates of the parameters, the parameters after the coarse feature extraction network in the classification network are determined only by the back-propagated loss of the subclass classification, while the network before the coarse feature extraction network is updated jointly by the losses of both classifications. Specifically:
For the parent-class classification: classifying the strabismus task into the three classes of normal, horizontal strabismus and vertical strabismus is a relatively simple and accurate task; the overall differences between the classes are large, and the complete features of the data need not be learned. This embodiment adopts a design that divides the feature extraction network into two parts, the front part being the coarse feature extraction network and the back part being the fine feature extraction network; when the coarse feature extraction network finishes and the coarse features enter the fine feature extraction network, a new network branch is created, and a linear layer is attached to the features to perform the three-way classification task, namely the parent-class classification task. Although the parent-class classification results are also updated continuously with the network, within the same training step the parent-class classification result is taken to be accurate and used to guide the subclass classification task.
For the subclass classification: after the whole feature extraction network finishes, a linear layer and a softmax function are added at the end to perform the final classification task. The result of the parent-class classification is obtained before this classification is performed, and at this point the parent-class result is fully trusted to be correct. All results in this classification that do not accord with the parent-class classification are therefore discarded, the result with the largest softmax value among the remainder is selected as the classification result, and finally one of the following categories is output: normal, comitant exotropia, A-pattern exotropia, V-pattern exotropia, other incomitant exotropia, comitant esotropia, A-pattern esotropia, V-pattern esotropia, other incomitant esotropia, left hypertropia, or right hypertropia.
The feature extraction network model training module is used for training the feature extraction network model constructed by the feature extraction network model construction module by adopting the sample data acquired by the sample data acquisition module;
the method comprises the steps of taking eye image sample data as input of a ResNet50V2 image extraction model, taking text sample data as input of a text extraction model, taking output of the ResNet50V2 image extraction model and the text extraction model as input of a characteristic coarse extraction network model, taking output of the characteristic coarse extraction network model as input of a characteristic fine extraction network model, taking output of the characteristic fine extraction network model as input of a classification network, and outputting a classification result by the classification network.
When the feature extraction network model is trained, the residual block of the feature pre-extraction network model is activated nonlinearly by the ReLU function, reducing the redundancy of the information in the data. The upper-layer features are used directly in the x part, and the cross-layer connections realize feature sharing and information transfer between the front and back convolution layers, which can speed up the model's learning and accelerate convergence through parameter optimization. This structure also makes the mapping F(x) more sensitive to changes in the output. The specific formula is:
$$x_{l+1} = f\big(x_l + \mathcal{F}(x_l, W_l)\big)$$

wherein $\mathcal{F}$ represents the residual function of the sequence of residual units, $x_l$ represents the input of the residual unit, $W_l$ is the series of weights and biases associated with the residual unit, $l$ is the network-layer index of the residual unit, and $f$ represents the activation function, for which a ReLU is typically used.
In this task the machine can see only the patient's eye photograph, but the range of the region of interest in the picture is corrected and fused through the patient's related electronic-case data, so that the model comprehensively learns the picture together with the patient's related electronic-case features. Accordingly, when the feature extraction network model is trained, the forward attention of the feature coarse extraction network model and the feature fine extraction network model is calculated as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

wherein $d_k$ represents the dimension of the matrix, i.e. the dimension of the information contained in one sample of the matrices Q and K, and $T$ represents the transpose.
Because the network receives data of two different modalities through two different encoding networks, the Q, K and V matrices are obtained by applying separate fully connected layers to the data, and learning likewise comes from updating these fully connected layers. Denoting the image data by G and the text data by H, the linear layer parameters of the three matrices are $W^{Q}$, $W^{K}$ and $W^{V}$, and the actual propagation formula is:
$$\mathrm{Attention}(G, H) = \mathrm{softmax}\!\left(\frac{(H W^{Q})\,(G W^{K})^{T}}{\sqrt{d_k}}\right) G W^{V}$$

wherein G represents the image data, H represents the text data, $d_k$ represents the matrix dimension, $W^{Q}$, $W^{K}$ and $W^{V}$ represent the linear layer parameters of the three matrices Q, K and V, and $T$ represents the transpose.
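A sketch of this cross-modal propagation, with Q projected from the text data H and K, V projected from the image data G through separate linear layers; the shared feature dimension and the use of PyTorch are assumptions for illustration:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Attention(G, H) = softmax((H W^Q)(G W^K)^T / sqrt(d_k)) (G W^V)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)   # W^Q, applied to text data H
        self.W_k = nn.Linear(d_model, d_model)   # W^K, applied to image data G
        self.W_v = nn.Linear(d_model, d_model)   # W^V, applied to image data G

    def forward(self, G: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.W_q(H), self.W_k(G), self.W_v(G)
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (H W^Q)(G W^K)^T / sqrt(d_k)
        return torch.softmax(scores, dim=-1) @ V        # weights act on G W^V
```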
The multi-head attention mechanism employed in this embodiment splits the data dimension $d$ into $h$ groups of Q, K and V matrices of dimension $d/h$ each, where $h$ denotes the number of heads, instead of using a single group. Each QKV group obtains a portion of the dimensional features of the data and learns an attention mechanism over that portion, and finally all heads are concatenated together. This structural design lets each attention mechanism optimize a different feature part of the data, balancing the bias that a single shared attention mechanism might produce and giving the data a more diverse representation, which noticeably improves the model's performance. A multi-head sketch is given below.
And the strabismus real-time classification module is used for acquiring real-time eye image data and text data, inputting the eye image data and the text data into the feature extraction network model trained by the feature extraction network model training module, and outputting a classification result by the feature extraction network model.
Example 3
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method for classifying the strabismus type of an eye image.
The computer device may be a computing device such as a desktop computer, a notebook computer, a handheld computer, or a cloud server. The computer device may perform human-computer interaction with a user through a keyboard, a mouse, a remote control, a touch pad, a voice control device, or the like.
The memory includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card memory (e.g., an SD or DX memory card), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory may be an internal storage unit of the computer device, such as the hard disk or internal memory of the computer device. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the computer device. Of course, the memory may also include both an internal storage unit and an external storage device of the computer device. In this embodiment, the memory is typically used to store the operating system and the various kinds of application software installed on the computer device, for example the program code of the method for classifying the strabismus type of an eye image. In addition, the memory may be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to run the program code stored in the memory or to process data, for example the program code of the method for classifying the strabismus type of an eye image.
Example 4
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method for classifying the strabismus type of an eye image.
Wherein the computer-readable storage medium stores a program executable by at least one processor, so as to cause the at least one processor to perform the steps of the method for classifying the strabismus type of an eye image as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the method for classifying the strabismus type of an eye image according to the embodiments of the present application.

Claims (10)

1. A method for classifying the strabismus type of an eye image, characterized by comprising the following steps:
step S1, obtaining sample data
Acquiring sample data, wherein the sample data comprises eye image sample data and text sample data;
s2, constructing a feature extraction network model
Constructing a feature extraction network model, wherein the feature extraction network model comprises a feature pre-extraction network model, a feature coarse extraction network model, a feature fine extraction network model and a classification network, and the feature pre-extraction network model comprises a ResNet50V2 image extraction model and a text extraction model;
step S3, training a feature extraction network model
Training the feature extraction network model constructed in the step S2 by adopting the sample data acquired in the step S1;
taking the eye image sample data as the input of the ResNet50V2 image extraction model, taking the text sample data as the input of the text extraction model, taking the outputs of the ResNet50V2 image extraction model and the text extraction model as the input of the feature coarse extraction network model, taking the output of the feature coarse extraction network model as the input of the feature fine extraction network model, taking the output of the feature fine extraction network model as the input of the classification network, and outputting a classification result by the classification network;
Step S4, strabismus real-time classification
And (3) acquiring real-time eye image data and text data, inputting the eye image data and the text data into the feature extraction network model trained in the step (S3), and outputting a classification result by the feature extraction network model.
2. The method for classifying the strabismus type of an eye image according to claim 1, wherein: in step S2, the ResNet50V2 image extraction model comprises a zero-padding layer, a two-dimensional convolution layer, a zero-padding layer, a maximum pooling layer, a residual module, a batch normalization layer, a linear rectification unit, an average pooling layer and a fully connected layer, connected in sequence.
3. The method for classifying the strabismus type of an eye image according to claim 2, wherein: the residual module comprises a plurality of residual blocks connected in sequence, wherein the last residual block comprises two basic blocks and each of the remaining residual blocks comprises three basic blocks.
4. The method for classifying the strabismus type of an eye image according to claim 3, wherein: each basic block comprises a first batch normalization layer and a first linear rectification unit connected in sequence; the output of the first linear rectification unit is divided into two parallel paths: one path passes in sequence through a first two-dimensional convolution layer, a second batch normalization layer, a second linear rectification unit, zero padding, a second two-dimensional convolution layer, a third batch normalization layer and a third linear rectification unit before entering a third two-dimensional convolution layer, and the other path enters a fourth two-dimensional convolution layer; the outputs of the third two-dimensional convolution layer and the fourth two-dimensional convolution layer are fused as the output of the whole basic block.
5. The method for classifying the strabismus type of an eye image according to claim 1, wherein: in step S3, when the feature extraction network model is trained, the residual blocks of the feature pre-extraction network model are activated nonlinearly by means of a ReLU function, according to the specific formula:
$$x_{l+1} = f\left(x_l + \mathcal{F}(x_l, W_l)\right)$$

wherein $\mathcal{F}$ represents the residual mapping computed by the sequence of layers in the residual unit, $x_l$ represents the input of the residual unit, $W_l$ is the series of weights and biases associated with the residual unit, $l$ is the network layer number of the residual unit, and $f$ represents the activation function.
6. The method for classifying the strabismus type of an eye image according to claim 1, wherein: in step S2, the feature coarse extraction network model and the feature fine extraction network model are both feature extraction networks with a joint attention mechanism and comprise a plurality of self-attention network blocks arranged in sequence;
the image information is taken as K and V in the attention mechanism and the text information is taken as Q; throughout the attention mechanism, the text information scores the model's degree of attention to the image information to obtain the weight matrix $\mathrm{softmax}\!\left(QK^{T}/\sqrt{d_k}\right)$, and this matrix acts as a weight on the result V obtained from analysing the image information to give the final attention calculation result.
7. The method for classifying the strabismus type of an eye image according to claim 6, wherein: in step S3, when the feature extraction network model is trained, the forward attention of the feature coarse extraction network model and the feature fine extraction network model is calculated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

wherein $d_k$ represents the matrix dimension, i.e. the dimensionality of the information contained in one sample in the matrices Q and K, and $T$ represents the transpose.
8. A system for classifying the strabismus type of an eye image, comprising:
the sample data acquisition module is used for acquiring sample data, wherein the sample data comprises eye image sample data and text sample data;
the feature extraction network model construction module is used for constructing a feature extraction network model, wherein the feature extraction network model comprises a feature pre-extraction network model, a feature coarse extraction network model, a feature fine extraction network model and a classification network, and the feature pre-extraction network model comprises a ResNet50V2 image extraction model and a text extraction model;
the feature extraction network model training module is used for training the feature extraction network model constructed by the feature extraction network model construction module by adopting the sample data acquired by the sample data acquisition module;
taking the eye image sample data as the input of the ResNet50V2 image extraction model, taking the text sample data as the input of the text extraction model, taking the outputs of the ResNet50V2 image extraction model and the text extraction model as the input of the feature coarse extraction network model, taking the output of the feature coarse extraction network model as the input of the feature fine extraction network model, taking the output of the feature fine extraction network model as the input of the classification network, and outputting a classification result by the classification network;
and the strabismus real-time classification module is used for acquiring real-time eye image data and text data, inputting the eye image data and the text data into the feature extraction network model trained by the feature extraction network model training module, and outputting a classification result by the feature extraction network model.
9. A computer device, characterized by: comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized by: a computer program is stored which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
CN202310613349.1A 2023-05-29 2023-05-29 Method, system, equipment and storage medium for classifying strabismus type of eye image Active CN116385806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310613349.1A CN116385806B (en) 2023-05-29 2023-05-29 Method, system, equipment and storage medium for classifying strabismus type of eye image


Publications (2)

Publication Number Publication Date
CN116385806A true CN116385806A (en) 2023-07-04
CN116385806B CN116385806B (en) 2023-09-08

Family

ID=86969727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310613349.1A Active CN116385806B (en) 2023-05-29 2023-05-29 Method, system, equipment and storage medium for classifying strabismus type of eye image

Country Status (1)

Country Link
CN (1) CN116385806B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9462945B1 (en) * 2013-04-22 2016-10-11 VisionQuest Biomedical LLC System and methods for automatic processing of digital retinal images in conjunction with an imaging device
CN108229580A (en) * 2018-01-26 2018-06-29 浙江大学 Sugared net ranking of features device in a kind of eyeground figure based on attention mechanism and Fusion Features
WO2020048183A1 (en) * 2018-09-04 2020-03-12 上海海事大学 Vessel type identification method based on coarse-to-fine cascaded convolutional neural network
WO2020098257A1 (en) * 2018-11-14 2020-05-22 平安科技(深圳)有限公司 Image classification method and device and computer readable storage medium
CN111309919A (en) * 2020-03-23 2020-06-19 智者四海(北京)技术有限公司 System and training method of text classification model
CN112052845A (en) * 2020-10-14 2020-12-08 腾讯科技(深圳)有限公司 Image recognition method, device, equipment and storage medium
CN113065577A (en) * 2021-03-09 2021-07-02 北京工业大学 Multi-modal emotion classification method for targets
US20220058449A1 (en) * 2020-08-20 2022-02-24 Capital One Services, Llc Systems and methods for classifying data using hierarchical classification model
CN114462567A (en) * 2021-12-15 2022-05-10 西安邮电大学 Attention mechanism-based neural network model
CN114724231A (en) * 2022-04-13 2022-07-08 东北大学 Glaucoma multi-modal intelligent recognition system based on transfer learning
CN115019380A (en) * 2022-06-07 2022-09-06 广州医科大学 Strabismus intelligent identification method, device, terminal and medium based on eye image
CN115424319A (en) * 2022-08-16 2022-12-02 温州医科大学附属眼视光医院 Strabismus recognition system based on deep learning
CN115512153A (en) * 2022-09-21 2022-12-23 哈尔滨理工大学 Retina OCT image classification method, system, computer equipment and storage medium based on multi-scale residual error network
CN115937604A (en) * 2022-12-27 2023-04-07 西南大学 anti-NMDAR encephalitis prognosis classification method based on multi-modal feature fusion
CN116168403A (en) * 2023-01-17 2023-05-26 智慧眼科技股份有限公司 Medical data classification model training method, classification method, device and related medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
马超; 刘亚淑; 骆功宁; 王宽全: "3D MR image segmentation based on cascaded random forests and active contours" (基于级联随机森林与活动轮廓的3D MR图像分割), 自动化学报 (Acta Automatica Sinica), no. 05 *
黎彪; 丁雅?; 邵毅: "Research progress on the application of artificial intelligence in pediatric ophthalmology" (人工智能在小儿眼科领域的应用研究进展), 国际眼科杂志 (International Eye Science), no. 08 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant