CN114743681B

CN114743681B - Case grouping screening method and system based on natural language processing

Info

Publication number: CN114743681B
Application number: CN202111564591.1A
Authority: CN
Inventors: 杨�远; 刘昊; 曹润卿; 史俊才; 钟炎萤; 陈华达
Original assignee: Health Data Beijing Technology Co ltd
Current assignee: Health Data Beijing Technology Co ltd
Priority date: 2021-12-20
Filing date: 2021-12-20
Publication date: 2024-01-30
Anticipated expiration: 2041-12-20
Also published as: CN114743681A

Abstract

The invention discloses a case group-entry (enrollment) screening method and system based on natural language processing. The method comprises the following steps: performing primary recognition on original case data with an NLP model to obtain a text label set; constructing a PLSA classification model and performing association mapping among the original case data, the text label set and type nodes; determining a group-entry label set from a group-entry rule text with an NLP model; matching the group-entry label set against the text label set and determining the specific type node corresponding to the group-entry label set; and extracting the original case data mapped to that specific type node as the group-entry case data. In the embodiment of the invention, natural language processing yields the text label set and the group-entry label set, PLSA yields a probability-distribution-based association mapping between the text label set and each type node, and matching the group-entry label set to a specific type node extracts the required group-entry case data. Group-entry screening of original case data that includes unstructured data is thereby completed without manual intervention and with high accuracy.

Description

Case grouping screening method and system based on natural language processing

Technical Field

The invention relates to data processing, and in particular to a case group-entry screening method and system based on natural language processing.

Background

Case data is generated at every stage of a patient's diagnosis and treatment and is essential information for diagnosis, follow-up, scientific research and similar processes. It comprises basic patient information, medical history information, auxiliary examination information, operation information, physician order information, image information and the like produced at each stage.

In case data, apart from inherent items in the patient information such as names and certificate numbers, other information such as medical history information and physician order information contains long passages of manually entered text, and the image information consists of images rather than text. This information is semi-structured or unstructured, and keyword-based detection can hardly extract all of the necessary effective information from it. Case data therefore has to be checked and recorded manually when it is transferred, a process that depends heavily on manual work, wastes time and labor, and places extremely high demands on the professional competence of the person handling it.

In addition, scientific research on specific cases often involves huge volumes of case data from multiple institutions and multiple time periods. Before the research, the data must be classified and identified so that case data meeting the research requirements can be screened out for group entry. The group-entry rules mainly concern unstructured data such as medical history information and operation information and are expressed semantically, so they cannot be identified through keywords.

Disclosure of Invention

The embodiment of the invention discloses a case group-entry screening method and system based on natural language processing. Original case data and a group-entry rule text are processed by natural language processing to obtain a text label set and a group-entry label set; the original case data and the text label set are then processed by a PLSA classification model to obtain a probability-distribution-based association mapping between both of them and the various type nodes; and the required group-entry case data is obtained by matching the group-entry label set to a specific type node. Group-entry screening of original case data that includes unstructured data is thereby completed without manual intervention and with high accuracy.

The first aspect of the embodiment of the invention discloses a case group-entry screening method based on natural language processing, which comprises the following steps:

performing primary recognition on original case data with an NLP model to obtain a text label set;

constructing a PLSA classification model based on the original case data and the text label set, and performing association mapping among the original case data, the text label set and a plurality of type nodes;

determining a group-entry label set from a group-entry rule text with an NLP model;

matching the group-entry label set against the text label set, and determining the specific type node corresponding to the group-entry label set in the latent semantic space;

and extracting the original case data associated with the specific type node to obtain the group-entry case data.

As an optional implementation manner, in the first aspect of the embodiment of the present invention, the group-entry rule text includes at least a preference rule, an exclusion rule and a remark rule;

and among the original case data, the group-entry case data is the original case data that satisfies the preference rule while not satisfying the exclusion rule, or that satisfies the remark rule.

As an optional implementation manner, in the first aspect of the embodiment of the present invention, after extracting the original case data associated with the specific type node to obtain the group-entry case data, the method further includes:

synchronizing the group-entry case data with the DataX tool, and placing the structured data in the group-entry case data into a standard data table;

and splitting the semi-structured data and the unstructured data contained in each item of group-entry case data according to their matching values relative to the standard data table, and placing the split data into the standard data table.

As an optional implementation manner, in the first aspect of the embodiment of the present invention, the method further includes:

performing value range verification on each standard data table constructed from the group-entry case data, and eliminating standard data tables that contain out-of-range data;

performing logic verification on each standard data table that has passed value range verification, and eliminating standard data tables that contain defective data contrary to medical logic;

storing image data related to the group-entry case data in an image library, and establishing an association relation between the group-entry case data and the image data through a master patient index;

and, when any group-entry case data is retrieved and called, synchronously calling the corresponding image data based on the association relation.

The second aspect of the embodiment of the invention discloses a case group-entry screening system based on natural language processing, which comprises:

the label identification unit is used for performing primary recognition on original case data with an NLP model to obtain a text label set;

the model building unit is used for constructing a PLSA classification model based on the original case data and the text label set and performing association mapping among the original case data, the text label set and a plurality of type nodes;

the label determining unit is used for determining a group-entry label set from a group-entry rule text with an NLP model;

the label matching unit is used for matching the group-entry label set against the text label set and determining the specific type node corresponding to the group-entry label set in the latent semantic space;

and the data extraction unit is used for extracting the original case data associated with the specific type node to obtain the group-entry case data.

As an optional implementation manner, in the second aspect of the embodiment of the present invention, the group-entry rule text includes at least a preference rule, an exclusion rule and a remark rule.

As an optional implementation manner, in the second aspect of the embodiment of the present invention, the system further includes:

the data synchronization unit is used for synchronizing the group-entry case data with the DataX tool after the data extraction unit has extracted the original case data associated with the specific type node to obtain the group-entry case data, and for placing the structured data in the group-entry case data into a standard data table;

the data matching unit is used for splitting the semi-structured data and the unstructured data contained in each item of group-entry case data according to their matching values relative to the standard data table and placing the split data into the standard data table.

As an optional implementation manner, in the second aspect of the embodiment of the present invention, the system further includes:

the first eliminating unit is used for performing value range verification on each standard data table constructed from the group-entry case data and eliminating standard data tables that contain out-of-range data;

the second eliminating unit is used for performing logic verification on each standard data table that has passed value range verification and eliminating standard data tables that contain defective data contrary to medical logic;

the image association unit is used for storing image data related to the group-entry case data in an image library and establishing an association relation between the group-entry case data and the image data through a master patient index;

and the image calling unit is used for synchronously calling the corresponding image data based on the association relation when any group-entry case data is retrieved and called.

The third aspect of the embodiment of the invention discloses a case group-entry screening system based on natural language processing, which comprises:

a memory storing executable program code;

a processor coupled to the memory;

the processor invokes the executable program code stored in the memory to execute the case grouping screening method based on natural language processing disclosed in the first aspect of the embodiment of the present invention.

A fourth aspect of the embodiment of the present invention discloses a computer-readable storage medium storing a computer program, where the computer program causes a computer to execute a case entry group screening method based on natural language processing disclosed in the first aspect of the embodiment of the present invention.

A fifth aspect of the embodiments of the present invention discloses a computer program product which, when run on a computer, causes the computer to perform part or all of the steps of any one of the methods of the first aspect.

A sixth aspect of the embodiments of the present invention discloses an application publishing platform for publishing a computer program product, wherein the computer program product, when run on a computer, causes the computer to perform part or all of the steps of any one of the methods of the first aspect.

Compared with the prior art, the embodiment of the invention has the following beneficial effects:

in the embodiment of the invention, the original case data and the group-entry rule text are processed by natural language processing to obtain the text label set and the group-entry label set; the PLSA classification model then processes the original case data and the text label set to obtain a probability-distribution-based association mapping between them and the various type nodes; and the required group-entry case data is obtained by matching the group-entry label set to a specific type node. Group-entry screening of original case data that includes unstructured data is thereby completed without manual intervention and with high accuracy.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a case group-entry screening method based on natural language processing according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a case group-entry screening system based on natural language processing according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of another case group-entry screening system based on natural language processing according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be noted that the terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present invention are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

Example 1

Referring to FIG. 1, FIG. 1 is a flow chart of a case group-entry screening method based on natural language processing according to an embodiment of the present invention. As shown in FIG. 1, the case group-entry screening method based on natural language processing may include the following steps.

101. Perform primary recognition on the original case data using an NLP model to obtain a text label set.

In this embodiment, an NLP model extracts text labels for the keywords in each item of original case data, and these text labels serve as the basis for subsequent recognition and matching.

102. Construct a PLSA classification model based on the original case data and the text label set, and perform association mapping among the original case data, the text label set and a plurality of type nodes.

In this embodiment, a PLSA classification model is established that associates the original case data with the type nodes. Each item of case data is represented by a probability distribution over the text labels, and each type node is represented by a probability distribution over the items of case data, forming a two-layer probability distribution. This yields the probabilistic relationship between the original case data and the type nodes, and the type node corresponding to each item of case data is determined from the strongest probabilistic association.
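A minimal sketch of how such a model might be fitted is given below. It assumes a case-by-label count matrix and uses the textbook asymmetric PLSA parameterization, P(type | case) and P(label | type), trained by expectation-maximization; this may differ from the exact factorization the embodiment intends. The function name `fit_plsa`, the toy matrix and the choice of two type nodes are illustrative assumptions and do not come from the source.

```python
import numpy as np

def fit_plsa(counts, n_types, n_iter=100, seed=0):
    """Fit PLSA by EM on a (cases x labels) count matrix.

    Returns P(type | case) with shape (cases, types) and
    P(label | type) with shape (types, labels).
    """
    rng = np.random.default_rng(seed)
    n_cases, n_labels = counts.shape
    # Random, row-normalized starting distributions.
    p_z_d = rng.random((n_cases, n_types))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_types, n_labels))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(type | case, label) is proportional to
        # P(type | case) * P(label | type).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]   # (case, type, label)
        resp = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from expected counts.
        weighted = counts[:, None, :] * resp            # (case, type, label)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

# Toy data (hypothetical): 4 cases, 4 text labels, label counts per case.
counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
], dtype=float)

p_type_given_case, p_label_given_type = fit_plsa(counts, n_types=2)
# Each case is mapped to the type node with the strongest association.
case_to_type = p_type_given_case.argmax(axis=1)
```

EM only finds a local optimum, so in practice several random seeds would usually be tried and the fit with the highest likelihood kept.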

103. Determine the group-entry label set from the group-entry rule text using an NLP model.

In this embodiment, the group-entry rule text is determined by the research project. Most group-entry rule texts are long sentences or span multiple paragraphs, and when the research field is relatively specialized they are mainly unstructured data.

As an optional implementation manner, the group-entry rule text includes at least a preference rule, an exclusion rule and a remark rule; among the original case data, the group-entry case data is the original case data that satisfies the preference rule while not satisfying the exclusion rule, or that satisfies the remark rule.

Specifically, taking a breast cancer study as an example, the preference rule may be: a pathologically diagnosed malignant tumor of the left or right breast, defined as any of the following: carcinoma, sarcoma, malignant or borderline phyllodes tumor, interstitial, CDCIS, Paget's disease. (Whether or not other secondary tumors are present at the time of initial diagnosis does not affect group entry.)

The exclusion rules may be:

a. patients without histologically confirmed malignant breast tumor lesions;

b. breast cancer patients who did not receive breast or axillary surgery at this hospital;

c. patients whose histologically confirmed diagnosis is classical lobular carcinoma in situ, benign breast tumor, mastitis, papilloma or benign lobular tumor, with no malignant focus;

d. patients who received breast surgery at an outside hospital, obtained negative margins, and underwent axillary surgery at this hospital;

e. patients whose primary lesion surgery was not performed at this hospital and who developed recurrence or metastasis after surgery;

f. patients in whom another malignant tumor has metastasized to the breast or axilla.

The remark rules may be:

a. a core needle biopsy does not classify a patient as surgically treated;

b. patients undergoing surgery for primary stage IV breast cancer;

c. patients who have undergone excisional biopsy of the tumor (including minimally invasive surgery) at the hospital, who have undergone open surgery at the hospital, or who have undergone axillary surgery, and for whom the clinical department of the hospital has issued a consultation report based on unstained slides ("white pieces") of the tumor's pathological tissue.

The preference rule sets relatively broad criteria and is used for preliminary screening. The exclusion rule supplements the preference rule by clarifying the situations in which a case must not enter the group. The remark rule further supplements the preference rule and the exclusion rule by clarifying the special situations in which a case should still enter the group.
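The way the three rules combine can be written as a single Boolean expression. The sketch below assumes that each case has already been evaluated against each rule; the `RuleFlags` class, its field names and the sample cases are hypothetical and serve only to illustrate the combination logic stated above.

```python
from dataclasses import dataclass

@dataclass
class RuleFlags:
    """Outcome of checking one case against the three rule texts (hypothetical structure)."""
    case_id: str
    meets_preference: bool
    meets_exclusion: bool
    meets_remark: bool

def is_group_entry(flags: RuleFlags) -> bool:
    # A case enters the group if it satisfies the preference rule and not the
    # exclusion rule, or if it satisfies the remark rule.
    return (flags.meets_preference and not flags.meets_exclusion) or flags.meets_remark

cases = [
    RuleFlags("A", meets_preference=True, meets_exclusion=False, meets_remark=False),
    RuleFlags("B", meets_preference=True, meets_exclusion=True, meets_remark=False),
    RuleFlags("C", meets_preference=True, meets_exclusion=True, meets_remark=True),
    RuleFlags("D", meets_preference=False, meets_exclusion=False, meets_remark=False),
]
selected = [c.case_id for c in cases if is_group_entry(c)]
```

Case C shows the role of the remark rule: it is excluded by the exclusion rule but re-admitted as a special situation.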

104. Match the group-entry label set against the text label set, and determine the specific type node corresponding to the group-entry label set in the latent semantic space.

In this embodiment, once the group-entry label set has defined the group-entry requirements and the PLSA classification model has built a complete association mapping for the original case data, the group-entry label set is matched against the text label set to determine the specific type node that is consistent with the group-entry requirements.

105. Extract the original case data associated with the specific type node to obtain the group-entry case data.

In this embodiment, the original case data mapped to the specific type node is the group-entry case data that meets the requirements of the group-entry rule text, and it is extracted according to that specific type node.
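Steps 104 and 105 might be sketched as follows. The source does not specify the matching measure, so this example assumes one plausible choice: each type node is scored by the total probability it assigns to the group-entry labels, and the highest-scoring node is selected. The function names, the label vocabulary and the probability values are all hypothetical.

```python
def match_type_node(entry_labels, label_vocab, p_label_given_type):
    """Score each type node by the probability mass it puts on the group-entry labels."""
    idx = [label_vocab.index(label) for label in entry_labels if label in label_vocab]
    scores = [sum(row[i] for i in idx) for row in p_label_given_type]
    best = max(range(len(scores)), key=lambda t: scores[t])
    return best, scores

def extract_cases(case_ids, case_to_type, type_node):
    """Return the cases whose associated type node is the selected one."""
    return [cid for cid, t in zip(case_ids, case_to_type) if t == type_node]

# Hypothetical outputs of the earlier steps.
label_vocab = ["breast", "malignant", "fracture", "femur"]
p_label_given_type = [
    [0.45, 0.45, 0.05, 0.05],   # type node 0
    [0.05, 0.05, 0.45, 0.45],   # type node 1
]
case_ids = ["case-1", "case-2", "case-3", "case-4"]
case_to_type = [0, 0, 1, 1]

entry_labels = ["breast", "malignant"]   # group-entry label set
node, scores = match_type_node(entry_labels, label_vocab, p_label_given_type)
group_entry_cases = extract_cases(case_ids, case_to_type, node)
```

Other measures, such as a similarity between label distributions, would fit the same interface; the point is only that matching selects a type node and extraction follows the existing case-to-node mapping.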

In this embodiment, after the required group-entry case data has been screened out, it is entered into the group.

As an optional implementation manner, the group-entry case data is synchronized with the DataX tool, and the structured data in it is placed into a standard data table; the semi-structured data and the unstructured data contained in each item of group-entry case data are split according to their matching values relative to the standard data table and placed into the standard data table.

Thus, the structured data in the group-entry case data is classified directly into the standard data table, while the semi-structured and unstructured data are split and entered according to their matching values against the parameters of each field in the standard data table, such as data format and field requirements.
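The source does not define how the matching value is computed. As one hedged illustration, the sketch below splits a free-text note into fragments and scores each fragment against hypothetical field specifications, giving half the score for matching the field's data format and half for containing a cue word. The field names, regular expressions, cue words and threshold are all assumptions made for this example.

```python
import re

# Hypothetical standard-table field specs: a format regex plus cue words.
FIELD_SPECS = {
    "surgery_date": {"pattern": r"\d{4}-\d{2}-\d{2}", "cues": ("surgery", "operation")},
    "tumor_size_cm": {"pattern": r"\d+(?:\.\d+)?\s*cm", "cues": ("tumor", "mass")},
}

def matching_value(fragment, spec):
    """Score in [0, 1]: 0.5 for matching the field format, 0.5 for a cue word."""
    text = fragment.lower()
    fmt = 0.5 if re.search(spec["pattern"], text) else 0.0
    cue = 0.5 if any(c in text for c in spec["cues"]) else 0.0
    return fmt + cue

def split_into_table(note, threshold=1.0):
    """Assign each ';'-separated fragment to its best-matching field, if any."""
    row = {}
    for fragment in (f.strip() for f in note.split(";")):
        if not fragment:
            continue
        field, score = max(
            ((name, matching_value(fragment, spec)) for name, spec in FIELD_SPECS.items()),
            key=lambda pair: pair[1],
        )
        match = re.search(FIELD_SPECS[field]["pattern"], fragment)
        if score >= threshold and match:
            row[field] = match.group(0)
    return row

note = "Surgery performed on 2021-03-15; tumor measured 2.5 cm; patient tolerated procedure well"
row = split_into_table(note)
```

Fragments that match no field well enough, such as the last one here, are simply left out of the table row.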

In this embodiment, the group-entry case data is also verified and cleaned, and erroneous data is rejected, to obtain accurate and usable group-entry case data.

As an optional implementation manner, value range verification is performed on each standard data table constructed from the group-entry case data, and standard data tables containing out-of-range data are eliminated; logic verification is then performed on each standard data table that has passed value range verification, and standard data tables containing defective data contrary to medical logic are eliminated.

Here, value range verification eliminates group-entry case data with out-of-range values (such as a negative age), and logic verification eliminates group-entry case data with medical logic errors (such as a surgery date earlier than the pre-treatment pathology date), so that invalid data does not negatively affect the study.
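The two checks can be sketched with the two examples given in the text: a negative age fails value range verification, and a surgery date earlier than the pathology date fails logic verification. The field names and the upper age bound are assumptions for illustration; a real system would hold one range per field and many logic rules.

```python
from datetime import date

MAX_AGE = 120  # assumed upper bound; the source only names "negative age" as an example

def value_range_ok(row):
    """Value range verification: age must lie within the allowed range."""
    return 0 <= row["age"] <= MAX_AGE

def medical_logic_ok(row):
    """Logic verification: surgery must not precede the pre-treatment pathology date."""
    return row["surgery_date"] >= row["pathology_date"]

def clean_tables(rows):
    # Value range verification runs first; logic verification runs on what remains.
    in_range = [r for r in rows if value_range_ok(r)]
    return [r for r in in_range if medical_logic_ok(r)]

rows = [
    {"id": "T1", "age": 52, "pathology_date": date(2021, 3, 1), "surgery_date": date(2021, 3, 15)},
    {"id": "T2", "age": -4, "pathology_date": date(2021, 4, 1), "surgery_date": date(2021, 4, 9)},
    {"id": "T3", "age": 47, "pathology_date": date(2021, 5, 10), "surgery_date": date(2021, 5, 2)},
]
kept = [r["id"] for r in clean_tables(rows)]
```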

In this embodiment, the image data is stored separately from the group-entry case data and is associated with the text-format group-entry case data.

As an optional implementation manner, the image data related to the group-entry case data is stored in an image library, and an association relation between the group-entry case data and the image data is established through a master patient index; when any group-entry case data is retrieved and called, the corresponding image data is synchronously called based on the association relation.

Therefore, the image data needs no special processing, which avoids any effect on its accuracy.
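A minimal sketch of this association, assuming both stores are keyed by the same patient index, is shown below. The identifiers, file names and dictionary layout are hypothetical; a real image library would more likely be a separate imaging archive queried by that key.

```python
# Text-format group-entry case data, keyed by a hypothetical patient index ID.
case_table = {
    "MPI-0001": {"diagnosis": "malignant tumor of left breast", "age": 52},
    "MPI-0002": {"diagnosis": "malignant tumor of right breast", "age": 47},
}

# Image library stored independently, keyed by the same patient index ID.
image_library = {
    "MPI-0001": ["mammogram_2021-03-01.dcm", "ultrasound_2021-03-02.dcm"],
}

def retrieve_case(mpi_id, case_table, image_library):
    """Return the case record together with any images linked by the patient index."""
    record = case_table[mpi_id]
    images = image_library.get(mpi_id, [])   # a case may have no images
    return record, images

record, images = retrieve_case("MPI-0001", case_table, image_library)
```

Because the images are only referenced by key, they are never transformed during screening, which is the property the paragraph above relies on.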

In summary, the original case data and the group-entry rule text are processed by natural language processing to obtain a text label set and a group-entry label set; the original case data and the text label set are processed by a PLSA classification model to obtain a probability-distribution-based association mapping between them and the various type nodes; and the required group-entry case data is obtained by matching the group-entry label set to a specific type node. Group-entry screening of original case data that includes unstructured data is thereby completed without manual intervention and with high accuracy.

Example 2

Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a case group-entry screening system based on natural language processing according to an embodiment of the present invention. As shown in FIG. 2, the case group-entry screening system based on natural language processing may include:

a label identification unit 201, configured to perform primary recognition on original case data using an NLP model to obtain a text label set;

a model building unit 202, configured to construct a PLSA classification model based on the original case data and the text label set, and to perform association mapping among the original case data, the text label set and a plurality of type nodes;

a label determining unit 203, configured to determine a group-entry label set from the group-entry rule text using an NLP model;

the group-entry rule text includes at least a preference rule, an exclusion rule and a remark rule;

a label matching unit 204, configured to match the group-entry label set against the text label set and determine the specific type node corresponding to the group-entry label set in the latent semantic space;

a data extraction unit 205, configured to extract the original case data associated with the specific type node to obtain the group-entry case data;

a data synchronization unit 206, configured to synchronize the group-entry case data with the DataX tool after the data extraction unit has extracted the original case data associated with the specific type node, and to place the structured data in the group-entry case data into the standard data table;

a data matching unit 207, configured to split the semi-structured data and the unstructured data contained in each item of group-entry case data according to their matching values relative to the standard data table, and to place the split data into the standard data table;

a first eliminating unit 208, configured to perform value range verification on each standard data table constructed from the group-entry case data, and to eliminate standard data tables that contain out-of-range data;

a second eliminating unit 209, configured to perform logic verification on each standard data table that has passed value range verification, and to eliminate standard data tables that contain defective data contrary to medical logic;

an image association unit 210, configured to store image data related to the group-entry case data in an image library, and to establish an association relation between the group-entry case data and the image data through a master patient index;

an image calling unit 211, configured to synchronously call the corresponding image data based on the association relation when any group-entry case data is retrieved and called.

As an optional implementation manner, the data synchronization unit 206 synchronizes the group-entry case data with the DataX tool and places the structured data in the group-entry case data into the standard data table; the data matching unit 207 splits the semi-structured data and the unstructured data contained in each item of group-entry case data according to their matching values relative to the standard data table, and places the split data into the standard data table.

As an optional implementation manner, the first eliminating unit 208 performs value range verification on each standard data table constructed from the group-entry case data, and eliminates standard data tables that contain out-of-range data;

the second eliminating unit 209 performs logic verification on each standard data table that has passed value range verification, and eliminates standard data tables that contain defective data contrary to medical logic.

As an optional implementation manner, the image association unit 210 stores the image data related to the group-entry case data in the image library, and establishes an association relation between the group-entry case data and the image data through the master patient index;

when any group-entry case data is retrieved and called, the image calling unit 211 synchronously calls the corresponding image data based on the association relation.

Example 3

Referring to FIG. 3, FIG. 3 is a schematic structural diagram of another case group-entry screening system based on natural language processing according to an embodiment of the present invention. As shown in FIG. 3, the case group-entry screening system based on natural language processing may include:

a memory 301 storing executable program code;

a processor 302 coupled with the memory 301;

wherein the processor 302 invokes the executable program code stored in the memory 301 to perform the case group-entry screening method based on natural language processing shown in FIG. 1.

The embodiment of the invention also discloses a computer readable storage medium storing a computer program, wherein the computer program causes a computer to execute the case group-entry screening method based on natural language processing shown in FIG. 1.

The embodiments of the present invention also disclose a computer program product, wherein the computer program product, when run on a computer, causes the computer to perform some or all of the steps of the method as in the method embodiments above.

Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the above embodiments may be implemented by a program that instructs associated hardware. The program may be stored in a computer readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disk memory, magnetic disk memory, tape memory, or any other computer-readable medium that can be used to carry or store data.

The case group-entry screening method and system based on natural language processing disclosed in the embodiments of the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method and core idea of the present invention. Meanwhile, those skilled in the art may, in accordance with the ideas of the present invention, make changes to the specific implementations and the scope of application. In view of the above, the content of this description should not be construed as limiting the present invention.

Claims

1. A method for case-group-entering screening based on natural language processing, the method comprising:

matching the set of the group-entering labels with the set of the text labels, and determining a specific type node corresponding to the set of the group-entering labels in a latent semantic space;

extracting original case data of the associated mapping of the specific type node to obtain group-entering case data;

2. The method of claim 1, wherein after extracting the original case data mapped by the specific type node to obtain the group-entering case data, the method further comprises:

3. The natural language processing based case grouping screening method of claim 2, further comprising:

4. The natural language processing based case grouping screening method of claim 1, further comprising:

5. A natural language processing based case entry group screening system, the system comprising:

the label matching unit is used for matching the group-entering label set with the text label set and determining a specific type node corresponding to the group-entering label set in a latent semantic space;

the data extraction unit is used for extracting the original case data which are mapped by the specific type of nodes in an associated mode to obtain group-entering case data;

6. The natural language processing based case entry group screening system of claim 5, further comprising:

7. The natural language processing based case entry group screening system of claim 6, further comprising:

8. The natural language processing based case entry group screening system of claim 5, further comprising: