CN116682519A

CN116682519A - Clinical experiment data unit analysis method

Info

Publication number: CN116682519A
Application number: CN202310971463.1A
Authority: CN
Inventors: 刘运
Original assignee: Guangdong Jiena Pharmaceutical Technology Co ltd
Current assignee: Guangdong Jiena Pharmaceutical Technology Co ltd
Priority date: 2023-08-03
Filing date: 2023-08-03
Publication date: 2023-09-01
Anticipated expiration: 2043-08-03
Also published as: CN116682519B

Abstract

The invention discloses a unit analysis method for clinical experiment data. It provides a unified, standardized system of basic units for clinical laboratory examinations, together with standardized unit expressions, for uniformly describing the detection result units of different test centers. By separating the unit dimension, the unit prefix and the numbers embedded in the unit, the invention decomposes the original unit into a series of simpler element discrimination problems. An LLM is introduced into the unit discrimination process with suitably designed prompts, making unit discrimination more accurate and faster.

Description

Clinical experiment data unit analysis method

Technical Field

The invention relates to the field of clinical experimental data analysis, in particular to a unit analysis method of clinical experimental data.

Background

Drug clinical trials produce large amounts of trial data, most of which are entered into databases manually by clinical operators. Clinical trial data must be extremely accurate, so the data is repeatedly checked and corrected. One prevalent class of problem in clinical trial data entry is irregular or incorrect entry of data units.

Laboratory tests typically include routine blood tests, blood biochemistry, routine urine tests and others, about ten specific test categories in all, each containing several test items such as red blood cells and white blood cells. Because hospitals use different detection devices and methods, the units and the upper and lower reference limits given for the same test item usually differ between hospitals. Furthermore, some units are ambiguous for lack of standardization. For example, G/L as a unit has two theoretical readings: gram/liter and 10^9/liter. As another example, some hospitals give a count unit of 10E9, which in scientific notation denotes 10×10^9, i.e., 10^10, but in practice this unit means 10^9, i.e., 1 giga. Other hospitals, however, use a count unit of 10E11 that does mean 1E12, as scientific notation dictates.

In a multi-center clinical trial, the statistical characteristics of the data can be calculated only after all units for the same type of examination have been unified. It is therefore necessary to normalize the units first, convert all units into unified standard units, and convert each detection result into a standard result according to the conversion coefficient. This requirement of data statistics places high demands on data cleaning at the data collection stage. Accurately analyzing the original unit, and judging whether the unit is correct in combination with the examination data, is thus a problem to be solved in the data cleaning stage, and original unit analysis is the first step in solving it. Erroneous units cause no small trouble in clinical trial data analysis, so it is desirable to find and resolve them as early as possible during the data collection phase.

The input of the original unit analysis problem comprises the examination category, the examination item, the original unit, the upper and lower reference limits, and the detection values, all stored as text. The reference limits and detection values may be numbers or character values such as negative and positive. The examination category and examination item are described in natural language, and the same examination item may have different names in different hospitals. The output is the original unit described in a standardized way, meaning the representation obtained by converting the original unit into standard dimensions. These standard dimensions are based on the basic physical dimensions, but are modified and supplemented for common medical data. For example, the unit mg/dL (milligram/deciliter), since 1 mg = 10^-3 g and 1 dL = 10^-1 L, has the standardized description 0.01·g^1·L^-1.

Currently, confirming whether units are correct relies mainly on experienced data management staff, which is time-consuming and prone to omissions. Besides manual discrimination, possible technical solutions include: 1. classifying and archiving the original units encountered so far to build an original unit library, which stores the original units together with their conversion coefficients to the standard units; 2. extracting each part of the original unit by a traditional NLP (Natural Language Processing) method such as regular expressions, and then looking up a table to give a standard representation of the original unit; 3. analyzing and judging the units directly with an LLM (Large Language Model).

Among these schemes, manual discrimination is the most accurate but depends too heavily on experience. Scheme 1 has two defects: first, examination categories and examination items are not standardized, so each examination item can be classified and archived only after manual standardization; second, new units emerge endlessly, and units not recorded in the library are encountered frequently, so manual checking is still required each time. The problem with scheme 2 is that it cannot segment the unit properly and cannot determine the unit category from context. Scheme 3 performs better than the former two, but because detection units are technical terms with high information density, LLM analysis still makes errors. Current LLMs also have two problems: first, an LLM may fabricate answers without any basis, so the machine alone cannot tell whether an output is reliable, and full manual inspection is still needed; second, an LLM is a black box, and when it is found to give incorrect results for a certain class of problems, its behavior cannot be corrected by simple operations.

Disclosure of Invention

The invention provides a clinical laboratory data unit analysis method which can accurately analyze most laboratory examination units.

In order to solve the technical problems, the technical scheme of the invention is as follows:

A unit analysis method for clinical experiment data, comprising the following steps:

S1: acquiring an original data form to be processed, wherein the original data form comprises clinical experiment data;

S2: acquiring an original unit, an inspection classification standardized code and an inspection item standard code according to the original data form to be processed;

S3: classifying the clinical experiment data according to the inspection classification standardized codes, the inspection item standard codes and the original units, and collecting typical values;

S4: inputting the inspection classification standardized codes, the inspection item standard codes, the original units and the typical values into a pre-trained model, the model outputting word segmentation results of the original units;

S5: determining, one by one, the basic unit category or prefix category to which each segment of the word segmentation result belongs, together with any attached number;

S6: combining the prefix and the attached number into a coefficient, and converting the basic unit category into a standard basic unit according to a pre-built conversion coefficient table;

S7: merging like terms in the unit to form a combination of basic units.

Preferably, step S2 is specifically:

extracting all original units and inspection classifications according to the original data form to be processed, and querying a standard coding database to obtain an inspection classification standardized code;

and according to the inspection classification, obtaining all inspection items in the inspection classification, and querying a standard code database to obtain standardized codes of the inspection items.

Preferably, in step S2, when the inspection classification and the inspection item are not in the standard code database, the inspection classification and the inspection item are respectively sent to the LLM model, and the inspection classification standardized code and the inspection item standard code are respectively obtained in cooperation with appropriate prompt information, where the appropriate prompt information includes an inspection classification name, an inspection item name and a corresponding standardized code sorted according to historical data.

Preferably, in step S3, the clinical test data are classified, specifically:

clinical experiment data with identical inspection classification standardized codes, inspection item standard codes and original units are grouped into one set.
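The grouping rule above can be sketched as follows. This is a minimal illustration rather than the patented implementation; the record field names (`category_code`, `item_code`, `unit`, `value`) and the sample rows are assumptions made for the example.

```python
from collections import defaultdict

def group_records(records):
    """Group records that share (category code, item code, original unit)."""
    groups = defaultdict(list)
    for rec in records:
        # The three-part key mirrors the grouping rule of step S3.
        key = (rec["category_code"], rec["item_code"], rec["unit"])
        groups[key].append(rec)
    return dict(groups)

# Hypothetical sample rows; field names are illustrative only.
records = [
    {"category_code": "LBH", "item_code": "RBC", "unit": "10^12/L", "value": "4.5"},
    {"category_code": "LBH", "item_code": "RBC", "unit": "10^12/L", "value": "4.9"},
    {"category_code": "LBH", "item_code": "RBC", "unit": "T/L", "value": "4.7"},
]
groups = group_records(records)
```

Keeping the original unit in the key means that the same item recorded in two different units is analyzed as two separate sets.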

Preferably, typical values are collected in step S3, in particular:

if the reference lower limit, the reference upper limit and the detection values of the clinical experiment data are numerical, the typical value set consists of the reference lower limit, the reference upper limit, the detection value closest to the mean of the detection values, and the detection values closest to the mean plus and minus 3 standard deviations;

if the reference lower limit, the reference upper limit and the detection values of the clinical experiment data are character-type, the values of each of the reference lower limit, the reference upper limit and the detection value are ranked by frequency of occurrence, and the two most frequent values and the least frequent value are taken and combined into the typical value set.
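A minimal sketch of the typical value selection, under stated assumptions: the source does not say whether the standard deviation is the population or the sample one (the population form `pstdev` is assumed here), and the function names are hypothetical. `typical_text` handles the values of a single field; it would be applied to each of the three fields in turn.

```python
from collections import Counter
from statistics import mean, pstdev

def closest(values, target):
    """Return the observed value nearest to target."""
    return min(values, key=lambda v: abs(v - target))

def typical_numeric(lower, upper, values):
    """Reference limits, the value nearest the mean, and the values nearest mean -/+ 3 sigma."""
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    return [lower, upper, closest(values, m),
            closest(values, m - 3 * s), closest(values, m + 3 * s)]

def typical_text(values):
    """Two most frequent values plus the least frequent one, without duplicates."""
    ranked = [v for v, _ in Counter(values).most_common()]
    return list(dict.fromkeys(ranked[:2] + ranked[-1:]))
```

Selecting observed values rather than the computed statistics keeps every typical value in the exact textual form in which it was entered.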

Preferably, in step S5, the basic unit category or the prefix category to which the word segmentation result belongs is determined by lexicon lookup combined with LLM query.

Preferably, the lexicon lookup combined with LLM query is specifically:

each examination item has two independent lexicons, a numerator lexicon and a denominator lexicon, both accumulated from historical data; the two are kept separate because, within the same examination item, the same symbol may have different meanings depending on whether it appears in the numerator or the denominator; when determining the basic unit category or the prefix category to which the word segmentation result belongs, whether to look it up in the numerator lexicon or the denominator lexicon is decided by its position in the original unit;

if a word cannot be found in the lexicon, the LLM is used to determine the unit.
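The lookup-then-fallback logic can be sketched as follows. The lexicon contents, the item codes `WBC` and `ALB`, and the `ask_llm` callable are hypothetical placeholders; a real system would supply lexicons accumulated from historical data and an actual LLM client.

```python
# Hypothetical lexicons, keyed by examination item code and then by position.
# Each entry maps a token to (category, factor).
LEXICONS = {
    "WBC": {
        "numerator": {"G": ("prefix", 1e9), "cells": ("count", 1.0)},
        "denominator": {"L": ("volume", 1.0)},
    },
    "ALB": {
        "numerator": {"G": ("mass", 1.0), "g": ("mass", 1.0)},
        "denominator": {"L": ("volume", 1.0)},
    },
}

def classify_token(item_code, token, position, ask_llm):
    """Look the token up in the item's numerator/denominator lexicon; fall back to the LLM."""
    entry = LEXICONS.get(item_code, {}).get(position, {}).get(token)
    if entry is not None:
        return entry, "lexicon"
    # ask_llm is an assumed callable (item_code, token, position) -> (category, factor).
    return ask_llm(item_code, token, position), "llm"
```

Because the lexicons are per item and per position, the ambiguous symbol `G` can resolve to a 10^9 prefix for a cell count and to gram for a protein concentration. Returning the source alongside the result makes it easy to route LLM answers to manual confirmation.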

Preferably, the standard basic unit in step S6 is written as:

c·U1^e1·U2^e2·…·Un^en

wherein c is the coefficient, U1 to Un are standard basic units, and each exponent e1 to en is a positive integer, a negative integer, or 0.

Preferably, in step S7, after like terms in the unit are merged, the standard basic units are further arranged in a standard order, the standard order being: ratio, mass, volume, length, time, count, amount of substance, biomass, and individual units.

Preferably, when the standard basic units are arranged in the standard order, units that fall in the same position of the standard order are arranged alphabetically.
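The two-level ordering can be sketched with a composite sort key. The category labels follow the standard order listed above; representing each factor as a `(category, symbol, exponent)` tuple and ignoring letter case when comparing symbols are assumptions of this example.

```python
STANDARD_ORDER = ["ratio", "mass", "volume", "length", "time", "count",
                  "amount of substance", "biomass", "individual"]

def sort_units(factors):
    """Sort (category, symbol, exponent) tuples by standard order, then alphabetically."""
    return sorted(
        factors,
        # Case-insensitive symbol comparison first, exact symbol as a tie-breaker.
        key=lambda f: (STANDARD_ORDER.index(f[0]), f[1].casefold(), f[1]),
    )
```

For example, `sort_units([("volume", "L", -1), ("mass", "g", 1)])` places the mass factor before the volume factor, so equivalent units always print the same way.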

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

The invention can rapidly and accurately analyze laboratory examination units in clinical trials, and the whole process requires little manual participation, so that reliable analysis results can be provided for subsequent data cleaning and data statistics. The LLM is introduced to improve accuracy, while each part of the original unit is split off and discriminated separately, so the method is highly interpretable and avoids the black-box problem characteristic of LLMs. The algorithm can be improved continuously to achieve better results, and has very good application prospects and commercial value.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention.

Fig. 2 is a probability graph of the candidate word segmentations produced by the reverse-order N-gram word segmentation model according to an embodiment of the invention.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;

for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;

it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.

Example 1

The embodiment provides a unit analysis method of clinical experimental data, as shown in fig. 1, comprising the following steps:

The embodiment of the invention provides a unified, standardized system of basic units for clinical laboratory examinations and standardized unit expressions, which are used for uniformly describing the detection result units of different test centers. It also provides a series of standard analysis steps that can accurately convert a disorderly original unit into a combination of basic units. By separating elements such as the unit dimension, the unit prefix and the numbers embedded in the unit, the original unit is decomposed into a series of simpler element discrimination problems, and an LLM is introduced into the unit discrimination process with suitably designed prompts, making unit discrimination more accurate and faster.

Example 2

The present embodiment continues to disclose the following on the basis of embodiment 1:

the step S2 specifically comprises the following steps:

The raw table data includes a check category name, a check item name, a raw unit, and raw check data, and in one particular embodiment, the raw table data is as shown in table 1.

TABLE 1

In step S2, when the inspection classification and the inspection item are not in the standard code database, the inspection classification and the inspection item are respectively sent to the LLM model, and the inspection classification standardized code and the inspection item standard code are respectively obtained in cooperation with appropriate prompt information, where the appropriate prompt information includes the inspection classification name, the inspection item name and the corresponding standardized code sorted according to the historical data. The following is an example of a hint information when sending the inspection classification into the LLM model:

You are a specialist in clinical data processing. The laboratory test in different center has different names for their category and test items. For example, in some center, Hematology is called Complete Blood Count, but the test items are the same with Hematology. Now given some category clusters in json format, and one category name, you are going to tell in which cluster it belongs to.

The given categories are:

"""

{ "LBH": [ "blood routine", "Hematology", "therapeutics", "Complete Blood Count", "Clinical Laboratory Tests-therapeutics", "Hematology (UNS)" ],

"LBC": [ "blood Biochemical", "laboratory test-blood Biochemical", "local laboratory-blood Biochemical", "Serum Chemistry", "clinical Biochemical", "Hematology Dose Expansion", … ],

…

}

"""

Now the given category name is "Serum Chemistry- C-reactive protein", which category should it be in? Please first give the category name (such as LBH, LBC, etc) in a separated paragraph, then give the reason.
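A prompt of this shape can be assembled and its reply parsed roughly as follows. This sketch uses an abbreviated version of the prompt text and does not call any particular LLM API; the function names are hypothetical.

```python
import json

PROMPT_HEAD = (
    "You are a specialist in clinical data processing. "
    "Now given some category clusters in json format, and one category name, "
    "you are going to tell in which cluster it belongs to.\n\n"
    "The given categories are:\n"
)

def build_category_prompt(clusters, name):
    """Embed the known clusters (code -> list of names) and the name to classify."""
    body = json.dumps(clusters, ensure_ascii=False, indent=1)
    question = (
        f'Now the given category name is "{name}", which category should it be in? '
        "Please first give the category name in a separated paragraph, then give the reason."
    )
    return f'{PROMPT_HEAD}"""\n{body}\n"""\n\n{question}'

def parse_category_reply(reply, known_codes):
    """Accept the first paragraph only if it is exactly one of the known codes."""
    first_paragraph = reply.strip().split("\n\n", 1)[0].strip()
    return first_paragraph if first_paragraph in known_codes else None
```

Asking for the code in its own paragraph is what makes the reply machine-checkable: anything that is not a known code is rejected instead of being trusted.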

The prompt used when sending an inspection item to the LLM is similar. An example of the correspondence between inspection item names and a standardized inspection item code is:

RBC:

Red Blood Cell Count, Red Blood Cells, red cell count, total number of red blood cells.

In a specific embodiment, the inspection classification standardized codes and the inspection item standard codes are as shown in Table 2:

TABLE 2

In step S3, the clinical experimental data are classified, specifically:

In step S3, typical values are collected, specifically:

if the reference lower limit, the reference upper limit and the detection values of the clinical experiment data are character-type, such as negative/positive or -/+/++, the values of each of the reference lower limit, the reference upper limit and the detection value are ranked by frequency of occurrence, and the two most frequent values and the least frequent value are taken and combined into the typical value set.

In a specific embodiment, the classified data is as follows:

In step S4, the inspection classification standardized code, the inspection item standard code, the original unit and the typical values are input into a pre-trained model, and the model outputs a word segmentation result of the original unit. The model may be any algorithm capable of segmenting the original unit into dimensions, prefixes and the numbers in the unit. The word segmentation process is described below taking one specific model as an example.

When the input of the model is 10^6 cells/cu mm, a reverse-order N-gram word segmentation model is used to obtain the word segmentation probability graph shown in Fig. 2. After Fig. 2 has been computed, additional information such as the classification code, the inspection item code, the typical values and the numerator/denominator position can be used to correct the probabilities in the graph. Finally, an N-shortest-path word segmentation algorithm yields the final word segmentation result, which for this input is:

10^6 (numeric prefix) cells (numerator unit)

cu (cubic modifier) m (prefix) m (denominator unit)

The reverse-order N-gram is an algorithm that starts from the last character of the whole string and works forward, computing step by step the probability that each character belongs to a particular combination. N-gram is a common NLP word segmentation algorithm. The reason for the reverse order is this: natural language is usually read from front to back, so the segmentation of later words is influenced by the words before them; in unit recognition the dependency runs from back to front, because the unit body appears at the end while parts such as prefixes and modifiers appear in front of it. The reverse-order N-gram model may be trained from historical clinical trial data.
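The back-to-front idea can be illustrated with a small reverse-order bigram segmenter. This is a simplified sketch, not the model of the embodiment: it keeps only the single best path instead of running an N-shortest-path search, it applies no correction from item codes or typical values, and the probabilities in `REVERSE_BIGRAM` are invented for illustration rather than trained.

```python
import math
from functools import lru_cache

# Hypothetical table: P(token | the token that follows it); "<end>" marks the string end.
REVERSE_BIGRAM = {
    ("m", "<end>"): 0.5,   # metre as the unit body at the very end
    ("m", "m"): 0.6,       # prefix milli in front of metre
    ("cu", "m"): 0.7,      # cubic modifier in front of (milli)metre
    ("u", "m"): 0.1,       # micro in front of metre
    ("c", "u"): 0.01,
}

def segment(text):
    """Best segmentation of text, scored from the last character backwards."""
    @lru_cache(maxsize=None)
    def best(end, following):
        # Returns (log-probability, tokens) for text[:end], given the token after it.
        if end == 0:
            return 0.0, ()
        candidates = []
        for start in range(end):
            token = text[start:end]
            p = REVERSE_BIGRAM.get((token, following))
            if p is None:
                continue
            score, left = best(start, token)
            candidates.append((score + math.log(p), left + (token,)))
        return max(candidates) if candidates else (-math.inf, ())

    score, tokens = best(len(text), "<end>")
    return list(tokens) if score > -math.inf else None
```

With this table, `segment("cumm")` prefers cu | m | m over c | u | m | m, because each token is scored conditional on what follows it, which is where the unit body sits.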

In a specific embodiment, the original unit word segmentation results are as follows:

In step S5, the basic unit category or the prefix category to which the word segmentation result belongs is determined by lexicon lookup combined with LLM query.

In a specific embodiment, the results of the original unit type recognition, the prefix recognition, and the coefficient recognition are as follows:

The lexicon lookup combined with LLM query is specifically:

each examination item has two independent lexicons, a numerator lexicon and a denominator lexicon, both accumulated from historical data. The two are kept separate because, within the same examination item, the same symbol may have different meanings depending on whether it appears in the numerator or the denominator, so the position of the symbol must be taken into account during discrimination. When determining the basic unit category or the prefix category to which the word segmentation result belongs, whether to look it up in the numerator lexicon or the denominator lexicon is decided by its position in the original unit;

if a word cannot be found in the lexicon, the LLM is used to determine the unit. Taking ChatGPT as an example, one feasible prompt design is as follows:

You are a specialist in clinical data processing. You are recognizing the recorded unit of {check item name} in the {check category name} form. The unit is:

"{unit to be analyzed}"

What does "{unknown prefix + unknown unit}" mean in the unit?

Does it belong to any of the following known unit categories?

A. mass, like gram

B. volume, like litre

C. length, like meter

D. time, like second

E. count

F. other units

Please first reply with the option letter in a separate paragraph, then explain in detail in the following paragraphs.

An example of reply is:

"""

A

{ unknown word } is a mass unit.

This is the reason: ...

"""

Now please give your answer.

After the LLM replies, a follow-up question is asked according to the option it chose:

What is the conversion coefficient between "{unknown prefix + unknown unit}" and {basic unit of the category in the LLM's reply}?

The coefficient in the reply is extracted and used as the overall conversion coefficient of the unknown prefix and the unknown unit. Units and coefficients extracted with the LLM may be recorded in the lexicon only after manual confirmation.
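The reply handling described above might look like the following sketch. It assumes that the follow-up reply states the coefficient as the first number in the text, which a real reply need not do, and the `LexiconStaging` class is a hypothetical stand-in for the manual-confirmation step.

```python
import re

OPTION_TO_CATEGORY = {"A": "mass", "B": "volume", "C": "length",
                      "D": "time", "E": "count", "F": "other"}
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")

def parse_option(reply):
    """Map the first paragraph (expected to be a single option letter) to a category."""
    first = reply.strip().split("\n\n", 1)[0].strip().upper()
    return OPTION_TO_CATEGORY.get(first)

def extract_coefficient(reply):
    """Assumption: the first number in the follow-up reply is the coefficient."""
    match = NUMBER.search(reply)
    return float(match.group()) if match else None

class LexiconStaging:
    """LLM-derived entries stay pending until a person confirms them."""
    def __init__(self):
        self.pending = {}
        self.confirmed = {}

    def propose(self, token, category, coefficient):
        self.pending[token] = (category, coefficient)

    def confirm(self, token):
        self.confirmed[token] = self.pending.pop(token)
```

Keeping proposed and confirmed entries in separate stores means an unverified LLM answer can never be served from the lexicon by accident.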

In step S6, the standard basic unit is written as:

c·U1^e1·U2^e2·…·Un^en

wherein c is the coefficient, U1 to Un are standard basic units, and each exponent e1 to en is a positive integer, a negative integer, or 0.
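As a worked illustration of step S6, the sketch below folds the prefix and the attached number of each part into one coefficient and maps each recognised unit to a standard basic unit. The prefix table, the conversion table and the tuple layout are assumptions for the example, not the tables used in the embodiment.

```python
# Hypothetical prefix table (power-of-ten factors).
PREFIX_FACTOR = {"": 1.0, "k": 1e3, "d": 1e-1, "m": 1e-3, "\u03bc": 1e-6}

# Hypothetical conversion table: recognised unit -> (standard basic unit, factor).
CONVERSION_TABLE = {"g": ("g", 1.0), "L": ("L", 1.0), "l": ("L", 1.0)}

def to_standard(parts):
    """parts: list of (attached_number, prefix, unit, exponent) tuples."""
    coefficient = 1.0
    factors = []
    for number, prefix, unit, exponent in parts:
        base, factor = CONVERSION_TABLE[unit]
        # A denominator part has a negative exponent, so its factor is inverted.
        coefficient *= (number * PREFIX_FACTOR[prefix] * factor) ** exponent
        factors.append((base, exponent))
    return coefficient, factors

# g/dL: numerator "g" (exponent 1), denominator "dL" (exponent -1).
coefficient, factors = to_standard([(1, "", "g", 1), (1, "d", "L", -1)])
```

Here g/dL becomes a coefficient of about 10 with the factors g^1 and L^-1, that is, 10·g^1·L^-1.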

In a specific embodiment, the conversion of the original units to standard units is as follows:

Example 3

The present embodiment continues to disclose the following on the basis of embodiments 1 and 2:

In step S7, like terms in the unit are merged. For example, if the original unit is a volume ratio, L/L, the result after step S6 is 1·L^1·L^-1. Here L appears as two standard units, which need to be merged into 1·L^0. Note that the unit L cannot simply be removed at this point; it must be retained with exponent 0, the whole forming a ratio unit. The reason for retaining L is that the original unit of a ratio still affects the unit conversion coefficient. For example, when the volume ratio of alcohol to water is 1:1, the mass ratio is about 0.8:1, so the same alcohol-to-water ratio still satisfies the conversion relationship: 0.8·L^0 = 1·g^0.
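Merging like terms while keeping zero exponents can be sketched in a few lines. Representing a unit as a list of `(symbol, exponent)` pairs is an assumption of this example.

```python
def merge_like_terms(factors):
    """Sum the exponents of identical symbols; zero exponents are kept, not dropped."""
    merged = {}
    for symbol, exponent in factors:
        merged[symbol] = merged.get(symbol, 0) + exponent
    return merged
```

For L/L, `merge_like_terms([("L", 1), ("L", -1)])` yields `{"L": 0}`, which records that the ratio is a volume ratio rather than a mass ratio.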

For ease of comparison, after like terms in the unit are merged in step S7, the standard basic units are further arranged in a standard order, the standard order being: ratio, mass, volume, length, time, count, amount of substance, biomass, and individual units, as shown in Table 2.

TABLE 2

The above basic units are combined together to form the vast majority of laboratory test units.

In addition to the basic units described above, each standard unit may be preceded by a prefix, e.g., m, d, k, μ, etc., to form derived units that differ by powers of 10. However, in the data actually entered in hospitals, these prefixes often do not follow the standard prefix rules of the International System of Units, and upper and lower case are often mixed. For example, "micro" appears as μ, u, mic, micro, mc and other abbreviations, which are combined directly with the basic unit, so that the resulting unit often takes an experienced person to understand.
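The alias problem for "micro" can be handled with a small lookup table, sketched below. The table covers only the spellings listed above plus the two Unicode forms of the mu character, and the function name is hypothetical. Lower-casing is safe only because every key here means micro; it could not separate M (mega) from m (milli), which needs context.

```python
MICRO = "\u03bc"  # Greek small letter mu, used here as the canonical form

# Spellings of "micro" seen in entered data; "\u00b5" is the separate MICRO SIGN character.
PREFIX_ALIASES = {
    "\u03bc": MICRO, "\u00b5": MICRO, "u": MICRO,
    "mic": MICRO, "micro": MICRO, "mc": MICRO,
}

def normalize_prefix(raw):
    """Return the canonical prefix for a known alias, or None if unrecognised."""
    return PREFIX_ALIASES.get(raw.strip().lower())
```

Unrecognised spellings return `None` so that they can be passed on to the LLM query instead of being guessed.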

Examples: the following are the detection units of a certain enzyme that appears in a certain test:

Other: IU/L

Other: MKAT/L

Unite Internationale/litre

millimole/litre

Some of them contain irrelevant text (Other:), some use abbreviations, and some use full names; in MKAT the M is capitalized, yet it stands not for the prefix "mega" but for the prefix "milli". Experienced staff can identify these units correctly, but they are not easy to resolve by machine analysis.

When the standard basic units are arranged in the standard order, units that fall in the same position of the standard order are arranged alphabetically.

In one embodiment, after like terms are merged and the order is adjusted, the result is as follows:

the same or similar reference numerals correspond to the same or similar components;

the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;

It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.

Claims

1. A method for analyzing clinical laboratory data units, comprising the steps of:

2. The method for analyzing clinical laboratory data units according to claim 1, wherein step S2 is specifically:

3. The method according to claim 2, wherein in step S2, when the inspection classification and inspection item are not in the standard code database, the inspection classification and inspection item are respectively sent to LLM model, and inspection classification standardized code and inspection item standard code are respectively obtained in combination with appropriate prompt information, wherein the appropriate prompt information includes inspection classification name, inspection item name and corresponding standardized code sorted according to historical data.

4. The method for analyzing clinical laboratory data units according to claim 1, wherein the step S3 classifies the clinical laboratory data, specifically:

5. The method according to claim 4, wherein the step S3 of collecting typical values is:

6. The method according to claim 1, wherein in step S5, the basic unit category or the prefix category to which the word segmentation result belongs is determined by lexicon lookup combined with LLM query.

7. The method for analyzing clinical laboratory data units according to claim 6, wherein the lexicon lookup combined with LLM query is specifically:

8. The method according to claim 7, wherein the standard basic unit in step S6 is written as:

c·U1^e1·U2^e2·…·Un^en

wherein c is the coefficient, U1 to Un are standard basic units, and each exponent e1 to en is a positive integer, a negative integer, or 0.

9. The method according to claim 1, wherein in step S7, after like terms in the unit are merged, the standard basic units are further arranged in a standard order, the standard order being: ratio, mass, volume, length, time, count, amount of substance, biomass, and individual units.

10. The method according to claim 9, wherein when the standard basic units are arranged in the standard order, units that fall in the same position of the standard order are arranged alphabetically.