CN114743681A

CN114743681A - Case grouping screening method and system based on natural language processing

Info

Publication number: CN114743681A
Application number: CN202111564591.1A
Authority: CN
Inventors: 杨�远; 刘昊; 曹润卿; 史俊才; 钟炎萤; 陈华达
Original assignee: Health Data Beijing Technology Co ltd
Current assignee: Health Data Beijing Technology Co ltd
Priority date: 2021-12-20
Filing date: 2021-12-20
Publication date: 2022-07-12
Anticipated expiration: 2041-12-20
Also published as: CN114743681B

Abstract

The invention discloses a case grouping screening method and system based on natural language processing. The method comprises the following steps: performing primary identification on original case data by adopting an NLP model to obtain a text label set; constructing a PLSA (probabilistic latent semantic analysis) classification model, and performing association mapping on the original case data, the text label set and type nodes; determining a grouping label set of a grouping rule text by adopting an NLP model; matching the grouping label set with the text label set, and determining a specific type node corresponding to the grouping label set; and extracting the grouped case data that is associated and mapped with the specific type node. In the embodiment of the invention, the text label set and the grouping label set are obtained by natural language processing, the probability-distribution-based association mapping between the original case data and text label set and the various type nodes is obtained by PLSA, and the required grouped case data is extracted by matching the grouping label set to the specific type node. Grouping screening of original case data that includes unstructured data is thereby completed without manual intervention and with high accuracy.

Description

Case grouping screening method and system based on natural language processing

Technical Field

The invention relates to data processing, in particular to a case grouping and screening method and system based on natural language processing.

Background

Case data is generated at each stage of a patient's diagnosis and treatment and is essential information for diagnosis and treatment, follow-up visits, scientific research and the like. For each stage it comprises basic patient information, medical history information, auxiliary examination information, operation information, medical advice information, image information and the like.

In the case data, only inherent items in the patient information, such as the name and the certificate number, are structured data. Other information, such as medical history information and medical advice information, contains large amounts of manually entered free text, and the image information further includes images that differ from textual expression. This information is semi-structured or unstructured, and the necessary effective information is difficult to extract completely by keyword-based detection. In processes such as hospital transfer, the case data therefore has to be examined and entered manually; this depends heavily on manual work, is time-consuming and labor-intensive, and places high demands on the professional competence of the processor.

In addition, scientific research on a specific kind of case usually involves large volumes of case data from multiple institutions and multiple time periods. Before the research, the data must be classified and identified so that case data meeting the research requirements can be screened out and grouped. The information that the grouping rules depend on mainly resides in unstructured data such as medical history information and operation information, is expressed semantically, and cannot be identified through keywords.

Disclosure of Invention

The embodiment of the invention discloses a case grouping screening method and system based on natural language processing.

The embodiment of the invention discloses a case grouping and screening method based on natural language processing in a first aspect, which comprises the following steps:

performing primary identification on original case data by adopting an NLP model to obtain a text label set;

constructing a PLSA classification model based on the original case data and the text label set, and performing association mapping on the original case data, the text label set and a plurality of type nodes;

determining a grouping label set by adopting an NLP model based on the grouping rule text;

matching the grouping label set with the text label set, and determining a specific type node corresponding to the grouping label set in the latent semantic space;

and extracting the original case data which are mapped and associated with the specific type nodes to obtain grouped case data.

As an optional implementation manner, in the first aspect of the embodiment of the present invention, the grouping rule text at least includes a preference rule, an exclusion rule, and a remark rule;

and, among the original case data, the original case data that conforms to the preference rule and does not conform to the exclusion rule, or that conforms to the remark rule, is the grouped case data.

As an optional implementation manner, in the first aspect of the embodiment of the present invention, after the extracting the original case data mapped in association with the specific type node to obtain the grouped case data, the method further includes:

synchronizing the grouped case data by adopting a Datax tool, and placing the structured data in the grouped case data into a standard data table;

and splitting the semi-structured data and the unstructured data according to the matching value of the semi-structured data and the unstructured data contained in each grouped case data relative to the standard data table, and respectively placing the split semi-structured data and unstructured data into the standard data table.

As an optional implementation manner, in the first aspect of this embodiment of the present invention, the method further includes:

performing value domain verification on each standard data table constructed based on the grouped case data, and removing the standard data tables containing out-of-range data;

and performing logic verification on each standard data table that has passed value domain verification, and removing the standard data tables containing defective data that violates medical logic.

As an optional implementation manner, in the first aspect of the embodiment of the present invention, the method further includes:

storing the image data related to the grouped case data in an image library, and establishing an association relationship between the grouped case data and the image data through a patient main index;

and when any grouped case data is retrieved and called, synchronously calling the corresponding image data based on the association relationship.

The second aspect of the embodiment of the invention discloses a case grouping and screening system based on natural language processing, which comprises:

the label identification unit is used for carrying out primary identification on the original case data by adopting an NLP model to obtain a text label set;

the model construction unit is used for constructing a PLSA classification model based on the original case data and the text label set, and performing association mapping on the original case data, the text label set and a plurality of type nodes;

the label determining unit is used for determining a grouping label set by adopting an NLP model based on the grouping rule text;

the label matching unit is used for matching the grouping label set with the text label set and determining a specific type node corresponding to the grouping label set in the latent semantic space;

and the data extraction unit is used for extracting the original case data associated and mapped with the specific type node to obtain the grouped case data.

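As an optional implementation manner, in the second aspect of the embodiment of the present invention, the grouping rule text at least includes a preference rule, an exclusion rule, and a remark rule; among the original case data, the original case data that conforms to the preference rule and does not conform to the exclusion rule, or that conforms to the remark rule, is the grouped case data.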

As an optional implementation manner, in the second aspect of the embodiment of the present invention, the system further includes:

the data synchronization unit is used for synchronizing the grouped case data by adopting a Datax tool after the data extraction unit extracts the original case data associated and mapped with the specific type node to obtain the grouped case data, and placing the structured data in the grouped case data into a standard data table;

and the data matching unit is used for splitting the semi-structured data and the unstructured data according to the matching value of the semi-structured data and the unstructured data contained in each grouped case data relative to the standard data table and respectively placing the split semi-structured data and the unstructured data into the standard data table.

As an optional implementation manner, in the second aspect of the embodiment of the present invention, the system further includes:

the first eliminating unit is used for performing value domain verification on each standard data table constructed based on the grouped case data and eliminating the standard data tables containing out-of-range data;

and the second eliminating unit is used for performing logic verification on each standard data table that has passed value domain verification and eliminating the standard data tables containing defective data that violates medical logic.

As an optional implementation manner, in the second aspect of the embodiment of the present invention, the system further includes:

the image association unit is used for storing the image data related to the grouped case data in an image library and establishing an association relationship between the grouped case data and the image data through a patient main index;

and the image calling unit is used for synchronously calling the corresponding image data based on the association relationship when any grouped case data is retrieved and called.

The third aspect of the embodiments of the present invention discloses a case grouping and screening system based on natural language processing, which includes:

a memory storing executable program code;

a processor coupled with the memory;

the processor calls the executable program code stored in the memory to execute a case grouping and screening method based on natural language processing disclosed in the first aspect of the embodiment of the invention.

A fourth aspect of the embodiments of the present invention discloses a computer-readable storage medium storing a computer program, wherein the computer program enables a computer to execute the method for screening grouping of cases based on natural language processing disclosed in the first aspect of the embodiments of the present invention.

A fifth aspect of embodiments of the present invention discloses a computer program product, which, when run on a computer, causes the computer to perform some or all of the steps of any one of the methods of the first aspect.

A sixth aspect of the present embodiment discloses an application publishing platform, where the application publishing platform is configured to publish a computer program product, where when the computer program product runs on a computer, the computer is caused to execute part or all of the steps of any one of the methods in the first aspect.

Compared with the prior art, the embodiment of the invention has the following beneficial effects:

in the embodiment of the invention, the original case data and the grouping rule text are processed by adopting a natural language processing mode to obtain the text label set and the grouping label set, the original case data and the text label set are further processed by adopting a PLSA classification model to obtain the association mapping between the original case data and the text label set and various types of nodes based on probability distribution, the required grouping case data are extracted by matching the grouping label set and the specific type of nodes, the grouping screening of the original case data comprising unstructured data is completed, the process does not need manual intervention, and the accuracy is high.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic flow chart of a case grouping screening method based on natural language processing according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a case grouping and screening system based on natural language processing according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of another system for screening case grouping based on natural language processing according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second", "third", "fourth", etc. in the description and claims of the present invention are used for distinguishing different objects, and are not used for describing a specific order. The terms "comprises," "comprising," and "having," and any variations thereof, of embodiments of the present invention are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The embodiment of the invention discloses a case grouping screening method and system based on natural language processing, wherein original case data and a grouping rule text are processed by adopting a natural language processing mode to obtain a text label set and a grouping label set, the original case data and the text label set are further processed by adopting a PLSA classification model to obtain probability distribution-based association mapping between the original case data and the text label set and various types of nodes, the required grouping case data are extracted by matching the grouping label set and specific types of nodes, grouping screening of the original case data comprising unstructured data is completed, manual intervention is not needed in the process, and the accuracy is high.

Example one

Referring to fig. 1, fig. 1 is a schematic flow chart of a case grouping screening method based on natural language processing according to an embodiment of the present invention. As shown in fig. 1, the case grouping screening method based on natural language processing may include the following steps.

101. Perform initial identification on the original case data by adopting an NLP model to obtain a text label set.

In this embodiment, an NLP model is used to extract the text labels of the keywords in each original case data as the basic identification and matching basis.
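The label extraction of step 101 can be illustrated with a minimal sketch. The embodiment does not specify the NLP model, so a fixed phrase lexicon stands in for it here; the lexicon contents, the label names and the function `extract_labels` are hypothetical and only show the shape of the input and output.

```python
import re
from typing import Dict, List

# Hypothetical lexicon mapping surface phrases to normalized text labels.
# It is a stand-in for the unspecified NLP model, not the model itself.
LEXICON: Dict[str, str] = {
    "invasive ductal carcinoma": "breast_malignancy",
    "paget's disease": "breast_malignancy",
    "mastitis": "benign_lesion",
    "modified radical mastectomy": "breast_surgery",
    "axillary dissection": "axillary_surgery",
}

def extract_labels(text: str) -> List[str]:
    """Return the labels whose phrases occur in the text, in order of appearance."""
    lowered = text.lower()
    hits = []
    for phrase, label in LEXICON.items():
        for match in re.finditer(re.escape(phrase), lowered):
            hits.append((match.start(), label))
    return [label for _, label in sorted(hits)]

record = ("Pathology: invasive ductal carcinoma, left breast. "
          "Modified radical mastectomy performed.")
labels = extract_labels(record)
print(labels)
```

The resulting list of labels for each original case data is what the later steps treat as that case's text label set.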

102. Construct a PLSA classification model based on the original case data and the text label set, and perform association mapping on the original case data, the text label set and a plurality of type nodes.

In this embodiment, a PLSA classification model associating the original case data with the type nodes is established. Each case data is represented by a probability distribution over its text labels, and each type node is represented by a probability distribution over the case data, forming a two-layer probability structure. This yields the probability relationship between the original case data and the type nodes, and the type node corresponding to each case data is determined as the one with the strongest probabilistic association.
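A minimal sketch of fitting such a model is given below. It assumes the common asymmetric PLSA formulation, in which the type nodes play the role of latent topics z, the case data are documents d and the text labels are words w, and it fits P(z|d) and P(w|z) by expectation-maximization with NumPy. The embodiment does not state which parameterization or training procedure is used, so the function `fit_plsa`, the toy count matrix and the number of type nodes are illustrative assumptions.

```python
import numpy as np

def fit_plsa(counts, n_types, n_iter=100, seed=0):
    """Fit asymmetric PLSA by EM on a (n_cases, n_labels) count matrix.

    Returns p_z_d with shape (n_cases, n_types) and
    p_w_z with shape (n_types, n_labels).
    """
    rng = np.random.default_rng(seed)
    n_cases, n_labels = counts.shape
    # Random, row-normalized initial distributions.
    p_z_d = rng.random((n_cases, n_types))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_types, n_labels))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    eps = 1e-12
    for _ in range(n_iter):
        # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]   # (cases, types, labels)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate both distributions from expected counts.
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z

# Toy label counts: rows are case data, columns are text labels.
counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
], dtype=float)
p_z_d, p_w_z = fit_plsa(counts, n_types=2)
# Each case is mapped to the type node with the strongest association.
type_of_case = p_z_d.argmax(axis=1)
print(type_of_case)
```

On a block-structured matrix like this one, the two groups of cases would be expected to land on different type nodes, but EM only converges to a local optimum and the numbering of the type nodes is arbitrary.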

103. Determine a grouping label set by adopting an NLP model based on the grouping rule text.

In this embodiment, the grouping rule text is determined by the research project. Most grouping rule text consists of long sentences or spans multiple paragraphs, and the more specialized the research field, the more it takes the form of unstructured data.

As an optional implementation manner, the grouping rule text at least comprises a preference rule, an exclusion rule and a remark rule; among the original case data, the original case data that meets the preference rule and does not meet the exclusion rule, or that meets the remark rule, is the grouped case data.

Specifically, taking a breast cancer study as an example, the preference rule may be: pathologically confirmed malignant tumors of the left or right breast, defined as: carcinomas, sarcomas, malignant or borderline phyllodes tumors, mesenchyme, CDCIS, Paget's disease (grouping is not affected by the presence or absence of other secondary tumors at the time of initial diagnosis).

Its exclusion rules may be:

a. patients with no histologically confirmed malignant breast tumor lesions;

b. breast cancer patients who have not received surgical treatment for the breast and axilla at the home hospital;

c. patients histologically diagnosed with typical lobular carcinoma in situ, benign breast tumor, mastitis, papilloma or benign phyllodes tumor, with no malignant lesion;

d. patients who received breast surgery at an outside hospital with a negative margin and underwent axillary surgery at the same hospital;

e. patients whose primary lesion surgery was not carried out at the hospital and who developed relapse or metastasis after the operation;

f. patients with other malignancies that have metastasized to the breast or axilla.

The remark rules may be:

a. coarse needle biopsy does not locate the patient for surgical treatment;

b. patients with primary stage IV breast cancer who underwent surgery;

c. patients who have undergone resection biopsy (including minimally invasive surgery) of a tumor at a hospital, have trimmed their margins during open surgery at the hospital, or have completed an axillary operation, have a consultation report from the hospital's pathology department based on a white film of the pathological tissue of the tumor at the hospital.

Of the above, the preference rule sets a broad standard and is suitable for primary general screening; the exclusion rule supplements the preference rule by specifying the conditions under which a case is not selected; and the remark rule supplements the preference rule and the exclusion rule by specifying the special conditions under which a case is selected.
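The way the three rules combine can be written as a single Boolean expression. The sketch below assumes that whether a case meets each rule has already been decided upstream; the function name `is_grouped` and the toy flags are hypothetical.

```python
def is_grouped(meets_preference, meets_exclusion, meets_remark):
    """A case is grouped if it is preferred and not excluded, or admitted by a remark rule."""
    return (meets_preference and not meets_exclusion) or meets_remark

# Flags per case: (meets preference, meets exclusion, meets remark).
cases = {
    "case-001": (True, False, False),   # preferred, not excluded
    "case-002": (True, True, False),    # preferred but excluded
    "case-003": (True, True, True),     # excluded, yet admitted by a remark rule
    "case-004": (False, False, False),  # never preferred
}
grouped_ids = [cid for cid, flags in cases.items() if is_grouped(*flags)]
print(grouped_ids)
```

Here the remark rule takes effect even when an exclusion rule also applies, which matches its described role of defining special conditions under which a case is still selected.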

104. Match the grouping label set with the text label set, and determine a specific type node corresponding to the grouping label set in the latent semantic space.

In this embodiment, under the condition that the grouping label set defines the grouping requirement and the PLSA classification model constructs complete association mapping for the original case data, the grouping label set is matched with the text label set to determine a specific type node consistent with the grouping requirement.

105. Extract the original case data associated and mapped with the specific type node to obtain grouped case data.

In this embodiment, the original case data mapped and associated with the specific type node is the grouped case data that meets the requirements of the grouping rule text, and it is extracted according to the specific type node.
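Steps 104 and 105 can be sketched as follows, assuming a fitted model supplies P(label | type node) and P(type node | case). The embodiment does not define the matching score, so summing the label probabilities of the grouping labels per type node is an assumption, and all arrays, identifiers and function names are hypothetical.

```python
import numpy as np

label_vocab = ["breast_malignancy", "breast_surgery", "benign_lesion", "mastitis"]
# Hypothetical fitted outputs: P(label | type node) and P(type node | case).
p_w_z = np.array([
    [0.50, 0.45, 0.03, 0.02],   # type node 0
    [0.05, 0.05, 0.50, 0.40],   # type node 1
])
p_z_d = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
])
case_ids = ["case-001", "case-002", "case-003"]

def match_type_node(grouping_labels, label_vocab, p_w_z):
    """Pick the type node that gives the grouping labels the most probability mass."""
    idx = [label_vocab.index(name) for name in grouping_labels if name in label_vocab]
    scores = p_w_z[:, idx].sum(axis=1)
    return int(scores.argmax())

def extract_cases(case_ids, p_z_d, node):
    """Return the cases whose most probable type node is the given node."""
    assigned = p_z_d.argmax(axis=1)
    return [cid for cid, z in zip(case_ids, assigned) if z == node]

node = match_type_node({"breast_malignancy", "breast_surgery"}, label_vocab, p_w_z)
grouped = extract_cases(case_ids, p_z_d, node)
print(node, grouped)
```

A grouping label set that shares no label with the vocabulary scores zero for every type node in this sketch, so a real system would need to handle that case explicitly.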

In this embodiment, after the required grouped case data is screened out, it is further organized into standard data tables.

As an optional implementation manner, the Datax tool is used for synchronizing the grouped case data, and the structured data in the grouped case data is placed into the standard data table; the semi-structured data and the unstructured data are split according to the matching value of the semi-structured data and the unstructured data contained in each grouped case data relative to the standard data table, and the split semi-structured data and unstructured data are respectively placed into the standard data table.

Thus, the structured data in the grouped case data is placed directly into the standard data table, while the semi-structured data and unstructured data are split and placed according to their matching values against parameters such as the field data formats and field requirements of the standard data table.
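One possible reading of the matching value is sketched below: each fragment of semi-structured text is scored against every field of the standard data table by combining name similarity with a format check, and is placed in the best-scoring field. The field names, the regular expressions, the threshold and the use of `difflib.SequenceMatcher` are all assumptions made for illustration; the embodiment does not define how the matching value is computed.

```python
import re
from difflib import SequenceMatcher

# Hypothetical standard data table: field name -> required value format.
STANDARD_FIELDS = {
    "tumor_size_cm": r"\d+(\.\d+)?",
    "er_status": r"positive|negative",
    "surgery_date": r"\d{4}-\d{2}-\d{2}",
}

def match_value(key, value, field, pattern):
    """Name similarity in [0, 1], or 0 if the value violates the field format."""
    if re.fullmatch(pattern, value) is None:
        return 0.0
    normalized = key.lower().replace(" ", "_")
    return SequenceMatcher(None, normalized, field).ratio()

def split_into_table(text, threshold=0.5):
    """Split 'key: value' fragments and place each into its best-matching field."""
    row = {}
    for fragment in text.split(";"):
        if ":" not in fragment:
            continue
        key, value = (part.strip() for part in fragment.split(":", 1))
        value = re.sub(r"\s*cm$", "", value.lower())   # drop a trailing unit
        best_field, best_score = None, 0.0
        for field, pattern in STANDARD_FIELDS.items():
            score = match_value(key, value, field, pattern)
            if score > best_score:
                best_field, best_score = field, score
        if best_field is not None and best_score >= threshold:
            row[best_field] = value
    return row

text = "Tumor size: 2.3 cm; ER status: Positive; Surgery date: 2021-03-04; Note: stable"
row = split_into_table(text)
print(row)
```

Fragments that match no field format, such as the free-text note in the example, are left out of the row rather than forced into a field.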

In the embodiment, the grouped case data is also checked and cleaned, and errors are removed, so that accurate and available grouped case data is obtained.

As an optional implementation manner, value domain verification is performed on each standard data table constructed based on the grouped case data, and the standard data tables containing out-of-range data are removed; logic verification is then performed on each standard data table that has passed value domain verification, and the standard data tables containing defective data that violates medical logic are removed.

Here, the value domain verification eliminates grouped case data containing out-of-range data (such as a negative age), and the logic verification eliminates grouped case data containing medical logic errors (such as an operation date earlier than the pre-treatment pathology date), so that invalid data does not adversely affect the research.
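The two checks can be sketched as successive filters, using the two examples given above (a negative age, and an operation date earlier than the pre-treatment pathology date). The field names and the upper age bound of 120 are hypothetical.

```python
from datetime import date

def passes_value_domain(row):
    """Value domain check: age must lie in an assumed plausible range."""
    return 0 <= row["age"] <= 120

def passes_medical_logic(row):
    """Logic check: the operation must not precede the pre-treatment pathology."""
    return row["surgery_date"] >= row["pathology_date"]

rows = [
    {"id": "case-001", "age": 54,
     "pathology_date": date(2021, 3, 1), "surgery_date": date(2021, 3, 10)},
    {"id": "case-002", "age": -3,
     "pathology_date": date(2021, 3, 1), "surgery_date": date(2021, 3, 10)},
    {"id": "case-003", "age": 61,
     "pathology_date": date(2021, 5, 2), "surgery_date": date(2021, 4, 20)},
]

# Value domain verification first, then logic verification on the survivors.
after_domain = [r for r in rows if passes_value_domain(r)]
clean = [r for r in after_domain if passes_medical_logic(r)]
clean_ids = [r["id"] for r in clean]
print(clean_ids)
```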

In this embodiment, the image data is stored independently with respect to the grouped case data, and an association relationship is established with the grouped case data in a text format.

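As an optional implementation manner, the image data related to the grouped case data is stored in an image library, and an association relationship is established between the grouped case data and the image data through a patient main index; when any grouped case data is retrieved and called, the corresponding image data is synchronously called based on the association relationship.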

Therefore, special processing on the image data is not needed, and the influence on the accuracy of the image data is avoided.
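The association through the patient main index can be sketched as a keyed lookup: the image library stores image references under the index, and retrieving a grouped case returns its images alongside it. The class `ImageLibrary`, the function `retrieve_case`, the index values and the file paths are hypothetical.

```python
from collections import defaultdict

class ImageLibrary:
    """Stores image references independently of the textual case data."""

    def __init__(self):
        self._by_patient = defaultdict(list)

    def store(self, patient_index, image_ref):
        self._by_patient[patient_index].append(image_ref)

    def images_for(self, patient_index):
        return list(self._by_patient.get(patient_index, []))

def retrieve_case(case_id, cases, library):
    """Return a grouped case together with the images linked by its patient index."""
    case = cases[case_id]
    return case, library.images_for(case["patient_index"])

library = ImageLibrary()
library.store("P-1001", "images/P-1001/mammogram-01.dcm")
library.store("P-1001", "images/P-1001/ultrasound-01.dcm")

cases = {
    "case-001": {"patient_index": "P-1001", "diagnosis": "breast_malignancy"},
    "case-002": {"patient_index": "P-1002", "diagnosis": "breast_malignancy"},
}
case, images = retrieve_case("case-001", cases, library)
print(case["patient_index"], images)
```

Because only references are stored and looked up, the image files themselves are never transformed, which is consistent with the stated aim of avoiding any effect on image accuracy.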

In summary, the original case data and the grouping rule text are processed by adopting a natural language processing mode to obtain a text label set and a grouping label set, the original case data and the text label set are further processed by adopting a PLSA classification model to obtain probability distribution-based association mapping between the original case data and the text label set and between the original case data and each type of node, the required grouping case data are extracted by matching the grouping label set and the specific type of node, grouping screening of the original case data including unstructured data is completed, manual intervention is not needed in the process, and the accuracy is high.

Example two

Referring to fig. 2, fig. 2 is a schematic structural diagram of a case grouping and screening system based on natural language processing according to an embodiment of the present invention. As shown in fig. 2, the system for screening case grouping based on natural language processing may include:

a label identification unit 201, configured to perform initial identification on original case data by using an NLP model, so as to obtain a text label set;

the model construction unit 202 is configured to construct a PLSA classification model based on the original case data and the text label set, and perform association mapping on the original case data, the text label set, and a plurality of type nodes;

the label determining unit 203 is configured to determine a grouping label set by using an NLP model based on the grouping rule text;

wherein the grouping rule text at least comprises a preference rule, an exclusion rule and a remark rule;

among the original case data, the original case data that meets the preference rule and does not meet the exclusion rule, or that meets the remark rule, is the grouped case data;

the label matching unit 204 is configured to match the grouping label set with the text label set, and determine a specific type node corresponding to the grouping label set in the latent semantic space;

the data extraction unit 205 is configured to extract original case data mapped in association with a specific type node to obtain grouped case data;

the data synchronization unit 206 is configured to, after the data extraction unit extracts the original case data mapped in association with the specific type node to obtain the grouped case data, synchronize the grouped case data by using a Datax tool, and place the structured data in the grouped case data into a standard data table;

the data matching unit 207 is used for splitting the semi-structured data and the unstructured data according to a matching value of the semi-structured data and the unstructured data contained in each grouped case data relative to the standard data table and respectively placing the split semi-structured data and the unstructured data into the standard data table;

the first eliminating unit 208 is used for performing value domain verification on each standard data table constructed based on the grouped case data and eliminating the standard data table with the overrun data;

a second eliminating unit 209, configured to perform logic verification on each standard data table subjected to value range verification, and eliminate standard data tables having defective data that violates medical logic;

the image association unit 210 is configured to store image data related to the grouped case data in an image library, and establish an association relationship between the grouped case data and the image data through a patient main index;

the image calling unit 211 is configured to, when any grouped case data is retrieved and called, synchronously call the corresponding image data based on the association relationship.

As an optional implementation manner, the data synchronization unit 206 synchronizes the grouped case data by using a Datax tool, and places the structured data in the grouped case data into a standard data table; the data matching unit 207 splits the semi-structured data and the unstructured data according to the matching value of the semi-structured data and the unstructured data included in each grouped case data relative to the standard data table, and places the split semi-structured data and unstructured data into the standard data table respectively.

As an optional implementation manner, the first eliminating unit 208 performs value domain verification on each standard data table constructed based on the grouped case data, and eliminates the standard data tables containing out-of-range data;

the second eliminating unit 209 performs logic verification on each standard data table that has passed value domain verification, and eliminates the standard data tables containing defective data that violates medical logic.

As an optional implementation manner, the image association unit 210 stores the image data related to the grouped case data in an image library, and establishes an association relationship between the grouped case data and the image data through a patient main index;

when any grouped case data is retrieved and called, the image calling unit 211 synchronously calls the corresponding image data based on the association relationship.

Example three

Referring to fig. 3, fig. 3 is a schematic structural diagram of another case grouping screening system based on natural language processing according to an embodiment of the present invention. As shown in fig. 3, the system for screening case grouping based on natural language processing may include:

a memory 301 storing executable program code;

a processor 302 coupled to the memory 301;

the processor 302 calls the executable program code stored in the memory 301 to execute a case grouping screening method based on natural language processing of fig. 1.

The embodiment of the invention discloses a computer-readable storage medium which stores a computer program, wherein the computer program enables a computer to execute the case grouping screening method based on natural language processing in the figure 1.

Embodiments of the present invention also disclose a computer program product, wherein, when the computer program product is run on a computer, the computer is caused to execute part or all of the steps of the method as in the above method embodiments.

It will be understood by those skilled in the art that all or part of the steps in the methods of the embodiments described above may be implemented by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium, where the storage medium includes Read-Only Memory (ROM), Random Access Memory (RAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic disk memory, tape memory, or any other computer-readable medium that can be used to carry or store data.

The method and the system for grouping and screening cases based on natural language processing disclosed by the embodiment of the invention are described in detail, a specific example is applied in the method to explain the principle and the implementation mode of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method for case grouping screening based on natural language processing, the method comprising:

based on the grouping rule text, determining a grouping label set by adopting an NLP model;

and extracting the original case data associated and mapped with the specific type node to obtain grouped case data.

2. The method as claimed in claim 1, wherein the grouping rule text at least includes a preference rule, an exclusion rule and a remark rule;

3. The method as claimed in claim 1, wherein after the extracting of the original case data mapped to the specific type of node to obtain the grouped case data, the method further comprises:

4. The method as claimed in claim 3, further comprising:

5. The method as claimed in claim 1, further comprising:

storing the image data related to the grouped case data in an image library, and establishing an association relationship between the grouped case data and the image data through a patient main index;

when any group case data is retrieved and called, corresponding image data is synchronously called based on the association relation.

6. A system for case grouping screening based on natural language processing, the system comprising:

the label identification unit is used for carrying out primary identification on the original case data by adopting an NLP (natural language processing) model to obtain a text label set;

7. The system of claim 6, wherein the grouping rules text comprises at least preference rules, exclusion rules, and remark rules;

8. The system of claim 6, further comprising:

and the data matching unit is used for splitting the semi-structured data and the unstructured data according to the matching value of the semi-structured data and the unstructured data contained in each grouping case data relative to the standard data table and respectively placing the split semi-structured data and the unstructured data into the standard data table.

9. The system of claim 8, further comprising:

and the second eliminating unit is used for carrying out logic verification on each standard data table subjected to value range verification and eliminating the standard data tables with defective data violating medical logic.

10. The system of claim 6, further comprising: