CN111724374A

CN111724374A - Evaluation method of analysis result and terminal

Info

Publication number: CN111724374A
Application number: CN202010572829.4A
Authority: CN
Inventors: 林晨; 喻碧莺
Original assignee: Ke Junlong
Current assignee: Lin Chen; Wisdom Medical Shenzhen Co ltd
Priority date: 2020-06-22
Filing date: 2020-06-22
Publication date: 2020-09-29
Anticipated expiration: 2040-06-22
Also published as: CN111724374B

Abstract

The invention discloses a method and a terminal for evaluating an analysis result. A preset number of files is acquired as a test set; a first marking result of a first device on the test set and a second marking result of an AI model on the test set are acquired; a gold standard is obtained, and a t-test is performed on a first difference value between the first marking result and the gold standard and a second difference value between the second marking result and the gold standard to obtain a first test result; whether the first test result is greater than a threshold value is judged, and if so, the second marking result is considered to be accurate. Because the first device and the AI model mark the same test set, and the difference between each marking result and the gold standard of the test set is calculated and subjected to a t-test, the accuracy of the AI model compared with the first device is obtained, thereby realizing accuracy evaluation.

Description

Evaluation method of analysis result and terminal

Technical Field

The invention relates to the field of statistical methods, in particular to an evaluation method and a terminal for an analysis result.

Background

Existing methods for evaluating the accuracy of an AI analysis result mainly calculate the difference between the AI analysis result and the gold standard. However, this approach does not reveal how the differences are distributed in space. For example, in some scenes the AI measurement results tend to differ in the horizontal direction while differing very little in the vertical direction. Such a spatial distribution is an important hint for further improving the AI method, but existing methods cannot show it. An accuracy evaluation method that compares the AI with the manual method therefore needs to be designed.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: the evaluation method and the terminal for the analysis result are provided, and the AI analysis result can be accurately evaluated.

In order to solve the technical problems, the invention adopts a technical scheme that:

a method of evaluating an analysis result, comprising the steps of:

S1, acquiring a preset number of files as a test set;

S2, acquiring a first marking result of the first device on the test set and a second marking result of the AI model on the test set;

S3, obtaining a gold standard, and carrying out a t-test on a first difference value between the first marking result and the gold standard and a second difference value between the second marking result and the gold standard to obtain a first test result;

S4, judging whether the first test result is larger than a threshold value, and if so, determining that the second marking result is accurate.

In order to solve the technical problem, the invention adopts another technical scheme as follows:

an evaluation terminal for analyzing results, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

s1, acquiring a preset number of files as a test set;

The invention has the beneficial effects that: a first device and an AI model mark the same test set, and each marking result is compared with a gold standard to calculate a difference value, so that the differences between the different marking modes and the gold standard can be compared directly; carrying out a t-test on the difference values quantifies the accuracy standard, so that whether the marking result of the AI model is accurate can be judged directly from the result of the t-test, thereby realizing accuracy evaluation of the AI analysis result.

Drawings

FIG. 1 is a flowchart illustrating the steps of a method for evaluating an analysis result according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of an evaluation terminal for analyzing a result according to an embodiment of the present invention;

FIG. 3 is a scatter plot of an embodiment of the present invention;

Description of reference numerals:

1. an evaluation terminal for analysis results; 2. a processor; 3. a memory;

Detailed Description

In order to explain technical contents, achieved objects, and effects of the present invention in detail, the following description is made with reference to the accompanying drawings in combination with the embodiments.

Referring to fig. 1, an evaluation method of an analysis result includes the steps of:

s1, acquiring a preset number of files as a test set;

From the above description, the beneficial effects of the present invention are: a first device and an AI model mark the same test set, and each marking result is compared with a gold standard to calculate a difference value, so that the differences between the different marking modes and the gold standard can be compared directly; carrying out a t-test on the difference values quantifies the accuracy standard, so that whether the marking result of the AI model is accurate can be judged directly from the result of the t-test, thereby realizing accuracy evaluation of the AI analysis result.

Further, the method for calculating the first difference and the second difference in step S3 is as follows:

acquiring the coordinate of the first marking result, the coordinate of the second marking result and the coordinate of the gold standard;

and calculating a first difference value between the coordinate of the first marking result and the coordinate of the gold standard and a second difference value between the coordinate of the second marking result and the coordinate of the gold standard by utilizing a trigonometric function.

From the above description, the coordinates of the first marking result, the second marking result and the gold standard quantify the differences among them, so that the difference between the first marking result and the gold standard can be conveniently compared with the difference between the second marking result and the gold standard.

Further, the step S3 further includes:

generating a scatter diagram according to the first difference value and the second difference value by taking the gold standard as a circle center and taking the difference value as a radius;

or generating a scatter diagram according to a third difference value between the coordinate of the first marking result and the coordinate of the second marking result by taking the first marking result as a circle center;

or generating a scatter diagram according to the third difference value by taking the second marking result as a circle center.

According to the above description, generating the scatter diagram from the first difference value, the second difference value and the gold standard visualizes the differences between the first mark and the gold standard and between the second mark and the gold standard, so that the accuracy of the first and second marking results can be seen directly; in addition, when the gold standard cannot be acquired, the first mark or the second mark can be used as the circle center and the scatter diagram can be generated from the third difference value between the first mark and the second mark, which shows the difference between the two marks directly and makes it more convenient to judge the accuracy of the second mark.

Further, the step S2 further includes:

acquiring a third marking result of the second device on the test set and a fourth marking result of the first device on the test set, wherein the fourth marking result is generated at a time different from that of the first marking result;

the step S4 is followed by:

calculating a first intra-group correlation coefficient between the first marked result and the fourth marked result, a second intra-group correlation coefficient between the first marked result and the third marked result, and a third intra-group correlation coefficient between the first marked result and the second marked result;

carrying out a t-test on the first intra-group correlation coefficient and the third intra-group correlation coefficient to obtain a second test result, and carrying out a t-test on the second intra-group correlation coefficient and the third intra-group correlation coefficient to obtain a third test result;

and judging whether the second test result and the third test result are both larger than a threshold value, and if so, determining that the second marking result has repeatability.

As can be seen from the above description, adding a second device to obtain the third marking result, and obtaining a fourth marking result from the first device at a time different from that of the first marking result, adds comparison groups and makes the evaluation result more reliable; carrying out a t-test on the intra-group correlation coefficients of the comparison groups further evaluates the repeatability of the second marking result, i.e. the marking result of the AI, so that the evaluation dimensions are more complete.

Further, the calculating the first, second and third intra-group correlation coefficients specifically includes:

calculating the first intra-group correlation coefficient, the second intra-group correlation coefficient and the third intra-group correlation coefficient by using a bootstrap method to obtain a plurality of first intra-group correlation coefficients, a plurality of second intra-group correlation coefficients and a plurality of third intra-group correlation coefficients, respectively.

As can be seen from the above description, calculating the intra-group correlation coefficient by the bootstrap method yields a large number of intra-group correlation coefficients from one group of samples, so that the intra-group correlation coefficients of different groups can subsequently be subjected to a t-test to obtain a comparison result and finally achieve repeatability evaluation.

Referring to fig. 2, an evaluation terminal for analyzing a result includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:

s1, acquiring a preset number of files as a test set;

Further, when the processor performs the calculation of the first difference and the second difference in step S3:

Further, the step S3 further includes:

Further, the step S2 further includes:

the step S4 is followed by:

Referring to fig. 1, a first embodiment of the present invention is:

an evaluation method of an analysis result specifically comprises the following steps:

s1, acquiring a preset number of files as a test set;

in an alternative embodiment, the document is an image;

a first marker marks the test set through the first device to obtain a first marking result;

the gold standard is the correct position of the mark. Taking the gold standard of the macular fovea in an ultra-wide-angle fundus image as an example, the same patient can undergo two examinations, OCTA (Optical Coherence Tomography Angiography) and ultra-wide-angle fundus photography. The position of the macular fovea and its position relative to the retinal blood vessels are determined in the OCTA tomographic image. Because OCTA and ultra-wide-angle fundus photography capture the same blood vessels, a linear regression equation is established from the relative position between the macular fovea and the retinal blood vessels in the OCTA tomographic image; the accurate position of the macular fovea is then obtained from the position of the retinal blood vessels on the ultra-wide-angle fundus image, giving the gold standard of the macular fovea;

calculating the corresponding gold standard of each file in the test set;

on this basis, the calculation method for obtaining the first difference and the second difference in step S3 is as follows:

acquiring the coordinate of the first marking result, the coordinate of the second marking result and the coordinate of the gold standard; specifically, the marked files in the test set can be placed in the same coordinate system in the same manner, and the first marking result, the second marking result and the coordinates of the gold standard are obtained;

in an optional implementation manner, the sizes of the pictures in the test set are the same, and if the sizes of the pictures are 3000 × 4000 (pixels), the positions of the pixel points where the marking points are located can be directly used as the coordinates of the marking result;

calculating, from the coordinates, the first difference between the coordinates of the first marking result and the coordinates of the gold standard and the second difference between the coordinates of the second marking result and the coordinates of the gold standard by utilizing a trigonometric function; for example, if the coordinate of the mark for a file in the second marking result is $[X_{AI}, Y_{AI}]$ and the coordinate of the mark for the same file in the first marking result is $[X_{\text{Person 1}}, Y_{\text{Person 1}}]$, the difference between them is:

$$\sqrt{(X_{AI}-X_{\text{Person 1}})^{2}+(Y_{AI}-Y_{\text{Person 1}})^{2}}$$
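The difference calculation can be illustrated with a minimal Python sketch. The function name `marking_difference` and all coordinate values are hypothetical and chosen only for illustration; the sketch assumes that every marking and every gold standard is a single (x, y) pixel position in a shared coordinate system.

```python
import math

def marking_difference(mark, reference):
    """Euclidean distance in pixels between a marking and a reference point."""
    return math.hypot(mark[0] - reference[0], mark[1] - reference[1])

# Hypothetical pixel coordinates for three files (illustrative values only).
gold = [(1500, 2000), (1480, 2010), (1520, 1990)]
person_1 = [(1503, 2004), (1480, 2015), (1514, 1982)]
ai = [(1500, 2005), (1486, 2018), (1520, 1990)]

# First difference: manual marking vs. gold standard, one value per file.
first_differences = [marking_difference(m, g) for m, g in zip(person_1, gold)]
# Second difference: AI marking vs. gold standard, one value per file.
second_differences = [marking_difference(m, g) for m, g in zip(ai, gold)]
```

The same helper gives the third difference value when it is called with the first and second marking results instead of the gold standard.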

After the first difference value and the second difference value are acquired in step S3, the method further includes:

generating a scatter diagram according to the first difference value and the second difference value by taking the coordinate of the gold standard as a circle center and the difference value as a radius;

or generating a scatter diagram according to a third difference value between the coordinate of the first marking result and the coordinate of the second marking result by taking the coordinate of the first marking result as a circle center;

or generating a scatter diagram according to the third difference value by taking the coordinate of the second marking result as the center of a circle;

in an optional implementation manner, the direction of the coordinate of the first marking result relative to the coordinate of the gold standard and the direction of the coordinate of the second marking result relative to the coordinate of the gold standard are also obtained, and a scatter diagram is generated according to the difference value and the direction; or generating a scatter diagram directly according to the coordinates of the first marking result, the coordinates of the second marking result and the coordinates of the gold standard;
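As a hedged sketch of how the difference value and direction could be prepared for such a scatter diagram, the following Python code converts each marking into a radius and an angle around its gold standard, and summarizes the horizontal and vertical offsets separately. The names `offset_polar` and `axis_spread` and the coordinate values are hypothetical; using `atan2` for the direction is an assumption, since the text only states that a direction is obtained. In image coordinates the y axis usually points downward, so angles should be read accordingly.

```python
import math

def offset_polar(mark, gold):
    """Return (radius, angle in degrees) of a marking relative to its gold standard."""
    dx = mark[0] - gold[0]
    dy = mark[1] - gold[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))

def axis_spread(marks, golds):
    """Mean absolute horizontal and vertical offsets, to expose a directional bias."""
    n = len(marks)
    mean_abs_dx = sum(abs(m[0] - g[0]) for m, g in zip(marks, golds)) / n
    mean_abs_dy = sum(abs(m[1] - g[1]) for m, g in zip(marks, golds)) / n
    return mean_abs_dx, mean_abs_dy

# Hypothetical coordinates: the AI errs mostly along the horizontal axis.
gold = [(1500, 2000), (1480, 2010), (1520, 1990)]
ai = [(1510, 2000), (1472, 2011), (1532, 1989)]

# One (radius, angle) point per file, with the gold standard at the origin.
points = [offset_polar(m, g) for m, g in zip(ai, gold)]
spread = axis_spread(ai, gold)
```

With these illustrative values the horizontal spread is much larger than the vertical spread, which is the kind of spatial trend the scatter diagram is meant to reveal.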

S4, judging whether the first test result is larger than a threshold value, and if so, considering that the second marking result is accurate;

wherein the first test result is the P value of the t-test, and the threshold value may be 0.05; that is, when the P value is greater than 0.05, it is determined that the difference between the second marking result and the gold standard does not differ from the difference between the first marking result and the gold standard, i.e. the marking result of the AI and the manual marking result of the first marker have the same accuracy;
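The decision in step S4 can be sketched in Python using only the standard library. The text does not state whether the t-test is paired or independent; this sketch assumes a paired t-test because both difference values are measured on the same files. The function names, the numerical integration of the t density, and the sample values are all illustrative assumptions rather than the method prescribed by the text.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    coef /= math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided P value via Simpson integration of the t density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * area)

def paired_t_test(a, b):
    """Paired t-test on equal-length samples; returns (t statistic, P value)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, two_sided_p(t, n - 1)

# Hypothetical per-file differences from the gold standard, in pixels.
first_differences = [5.0, 5.0, 10.0, 4.0, 6.0, 7.0]   # first marker
second_differences = [5.0, 10.0, 0.0, 5.0, 6.5, 6.0]  # AI model

THRESHOLD = 0.05
t_stat, p_value = paired_t_test(first_differences, second_differences)
ai_is_accurate = p_value > THRESHOLD  # step S4 decision
```

A P value above the threshold means the test found no significant difference between the two sets of difference values, which the method reads as equal accuracy.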

in an alternative embodiment, the gold standard may be omitted, and a Bland-Altman plot is used to describe the agreement between the first marking result and the second marking result, i.e. the agreement between the second marking result of the AI and the first marking result of the first marker;

and if the second marking result of the AI agrees well with the first marking result of the first marker, the marking result of the AI model and the manual marking result of the first marker are considered to have the same accuracy.
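The quantities behind a Bland-Altman plot can be sketched as follows. The sketch assumes the plot is drawn for one scalar per file (here the x coordinate of each marking) and uses the conventional limits of agreement of the mean difference plus or minus 1.96 standard deviations; the text does not specify either choice, and the function names and values are hypothetical.

```python
from statistics import mean, stdev

def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) of agreement for two paired series."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width

def plot_points(a, b):
    """(mean of the pair, difference of the pair) for each file, as plotted."""
    return [((x + y) / 2, x - y) for x, y in zip(a, b)]

# Hypothetical x coordinates (pixels) of the same five files.
person_1_x = [1503, 1480, 1514, 1499, 1510]
ai_x = [1500, 1486, 1520, 1497, 1508]

bias, lower, upper = bland_altman(person_1_x, ai_x)
points = plot_points(person_1_x, ai_x)
# Files whose difference falls outside the limits of agreement.
outliers = [p for p in points if not lower <= p[1] <= upper]
```

A small bias and narrow limits, with few or no points outside them, would indicate good agreement between the first marker and the AI model.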

The second embodiment of the invention is as follows:

an evaluation method of an analysis result, which is different from the first embodiment in that:

the step S2 further includes:

a second marker marks the test set through the second device to obtain a third marking result; the first marker marks the test set through the first device to obtain a fourth marking result, wherein the generation time of the fourth marking result is different from that of the first marking result; for example, four days after marking the test set to obtain the first marking result, the first marker marks the test set again to obtain the fourth marking result;

the step S4 is followed by:

S5, calculating a first intra-group correlation coefficient (Intra-class Correlation Coefficient, ICC) between the first marking result and the fourth marking result, a second intra-group correlation coefficient between the first marking result and the third marking result, and a third intra-group correlation coefficient between the first marking result and the second marking result;

that is, calculating the consistency of the marks made on the same test set by the first marker at different times, by the first marker and the second marker, and by the first marker and the AI model; in other words, the consistency on the same test set for the same person at different times, for different persons, and for a person and the AI;

in this embodiment, a bootstrap method may be used to calculate the first intra-group correlation coefficient, the second intra-group correlation coefficient, and the third intra-group correlation coefficient, so as to obtain a plurality of first intra-group correlation coefficients, second intra-group correlation coefficients, and third intra-group correlation coefficients, respectively;

specifically, the first marking result and the fourth marking result are resampled by the bootstrap method, an intra-group correlation coefficient is calculated from each resample, and a plurality of first intra-group correlation coefficients are finally obtained; the first marking result and the third marking result are resampled in the same way, an intra-group correlation coefficient is calculated from each resample, and a plurality of second intra-group correlation coefficients are finally obtained; the first marking result and the second marking result are resampled in the same way, an intra-group correlation coefficient is calculated from each resample, and a plurality of third intra-group correlation coefficients are finally obtained;
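The bootstrap step can be sketched in Python as below. The text does not state which form of ICC is used or which scalar it is computed on; this sketch assumes the one-way random-effects ICC(1,1) applied to one scalar per file (for example the x coordinate), and resamples files with replacement. The function names, seed, and data are hypothetical.

```python
import random
from statistics import mean

def icc_one_way(pairs):
    """One-way random-effects ICC(1,1) for n files, each measured k times."""
    n = len(pairs)
    k = len(pairs[0])
    grand = mean(v for row in pairs for v in row)
    row_means = [mean(row) for row in pairs]
    # Between-file and within-file mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(pairs, row_means) for v in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def bootstrap_icc(a, b, replicates=50, seed=0):
    """Resample files with replacement and compute one ICC per resample."""
    rng = random.Random(seed)
    pairs = list(zip(a, b))
    n = len(pairs)
    results = []
    for _ in range(replicates):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        results.append(icc_one_way(sample))
    return results

# Hypothetical x coordinates: first marker, and the same marker four days later.
first_x = [1503, 1480, 1514, 1499, 1510, 1525, 1470, 1491]
fourth_x = [1501, 1483, 1515, 1497, 1512, 1523, 1473, 1490]

# A plurality of first intra-group correlation coefficients.
first_iccs = bootstrap_icc(first_x, fourth_x, replicates=50)
```

Calling `bootstrap_icc` again with the third or second marking result in place of `fourth_x` would give the second and third sets of coefficients, which can then be compared by a t-test.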

in an optional embodiment, at least 50 first intra-group correlation coefficients, at least 50 second intra-group correlation coefficients and at least 50 third intra-group correlation coefficients are generated to ensure the accuracy of the t-test;

judging whether the second test result and the third test result are both larger than a threshold value, and if so, determining that the second marking result has repeatability;

wherein performing a t-test on the first intra-group correlation coefficient and the third intra-group correlation coefficient tests the difference between the repeatability of the marking results of the same person (the first marker) on the same test set at different times and the repeatability between the marking results of that person and the AI model on the same test set; performing a t-test on the second intra-group correlation coefficient and the third intra-group correlation coefficient tests the difference between the repeatability between the marking results of one person (the first marker) and another person (the second marker) on the same test set and the repeatability between the marking results of the person (the first marker) and the AI model on the same test set; if the threshold is 0.05, when the P value of the t-test is greater than 0.05, the repeatability between the marking results of the AI model and of a person on the same test set is considered not to differ from the repeatability between the marking results of different persons on the same test set or from the repeatability between the marking results of the same person on the same test set at different times, i.e. the repeatability of the AI model is the same as that of the manual method;

in an optional embodiment, the reproducibility of the AI model is evaluated by obtaining a fifth marking result of the AI model on the test set, where the fifth marking result is generated at a different time than the second marking result; calculating a fourth intra-group correlation coefficient between the second marking result and the fifth marking result by using the bootstrap method to obtain a plurality of fourth intra-group correlation coefficients, and performing a t-test on the first intra-group correlation coefficient and the fourth intra-group correlation coefficient to obtain a fourth test result; and judging whether the fourth test result and the fourth intra-group correlation coefficient are both larger than a threshold value, and if so, determining that the marking result of the AI model has reproducibility.

Referring to fig. 2, a third embodiment of the present invention is:

an evaluation terminal 1 for analyzing results comprises a processor 2, a memory 3 and a computer program stored on the memory 3 and capable of running on the processor 2, wherein the processor 2 implements the steps of the first embodiment or the second embodiment when executing the computer program.

In summary, the present invention provides an evaluation method and a terminal for analysis results. A first marker marks a test set through a first device at two different times to obtain a first marking result and a fourth marking result; an AI model marks the test set to obtain a second marking result; and a second marker marks the test set through a second device to obtain a third marking result. A gold standard is obtained, and the difference between the first marking result and the gold standard and the difference between the second marking result and the gold standard are calculated, represented in a scatter diagram, and subjected to a t-test. The scatter diagram shows the spatial distribution trend of the marking results, and the t-test determines whether the accuracy of the AI model marking is consistent with that of the manual marking: if the result of the t-test exceeds the threshold value, the marking result of the AI model and the manual marking result are considered to have consistent accuracy. In addition, from the fourth marking result of the first marker and the third marking result of the second marker, the repeatability of marking the same test set can be obtained for the same person at different times (the first marking result and the fourth marking result), for different persons (the first marking result and the third marking result), and for a person and the AI (the first marking result and the second marking result). A large number of intra-group correlation coefficients can be obtained by the bootstrap method so that a t-test can be performed, which shows whether the repeatability between a person and the AI differs from the repeatability of the same person at different times and from the repeatability between different persons. This solves the problem that existing methods evaluate repeatability with single values such as the ICC or Cronbach's alpha coefficient and therefore cannot judge the analysis result systematically and comprehensively; the marking result of the AI model is evaluated systematically and comprehensively for repeatability, so that the AI analysis result is evaluated accurately.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims

1. A method for evaluating an analysis result, comprising the steps of:

s1, acquiring a preset number of files as a test set;

2. The method of claim 1, wherein the first difference and the second difference are calculated in step S3 by:

3. The method for evaluating an analysis result according to claim 1, wherein the step S3 further comprises:

4. The method for evaluating an analysis result according to claim 1, wherein the step S2 further comprises:

the step S4 is followed by:

5. The method of claim 4, wherein the calculating the first, second and third intra-group correlation coefficients comprises:

6. An evaluation terminal for analyzing results, comprising a memory, a processor and a computer program stored on the memory and operable on the processor, characterized in that the processor implements the following steps when executing the computer program:

s1, acquiring a preset number of files as a test set;

7. The terminal for evaluating an analysis result according to claim 6, wherein when the processor performs the calculation of the first difference and the second difference in the step S3:

8. The terminal for evaluating an analysis result according to claim 6, wherein the step S3 further comprises:

9. The terminal for evaluating an analysis result according to claim 6, wherein the step S2 further comprises:

the step S4 is followed by:

10. The terminal for evaluating an analysis result according to claim 9, wherein the calculating the first, second and third intra-group correlation coefficients specifically comprises: