CN114743681A - Case grouping screening method and system based on natural language processing
- Publication number
- CN114743681A; CN202111564591.1A
- Authority
- CN
- China
- Prior art keywords
- data
- case data
- grouping
- label set
- grouped
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Public Health (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Epidemiology (AREA)
- Pathology (AREA)
- Biomedical Technology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a case grouping screening method and system based on natural language processing, wherein the method comprises the following steps: performing initial identification on original case data by adopting an NLP model to obtain a text label set; constructing a PLSA (probabilistic latent semantic analysis) classification model, and performing association mapping on the original case data, the text label set and a plurality of type nodes; determining a grouping label set from a grouping rule text by adopting an NLP model; matching the grouping label set with the text label set, and determining a specific type node corresponding to the grouping label set; and extracting the original case data associated and mapped with the specific type node as grouped case data. In the embodiment of the invention, the text label set and the grouping label set are obtained by natural language processing, a probability-distribution-based association mapping between the original case data, the text label set and the type nodes is obtained by PLSA, and the required grouped case data is extracted by matching the grouping label set to the specific type node, so that grouping screening of original case data that includes unstructured data is completed without manual intervention and with high accuracy.
Description
Technical Field
The invention relates to data processing, in particular to a case grouping and screening method and system based on natural language processing.
Background
Case data are generated at every stage of a patient's diagnosis and treatment and are necessary for diagnosis and treatment, follow-up, scientific research and the like. They comprise basic patient information, medical history information, auxiliary examination information, operation information, medical advice information, image information and the like from each stage.
In case data, only inherent items of patient information such as name and certificate number are structured data. Other information, such as medical history information and medical advice information, consists largely of manually entered free text, and image information is not text at all; this information is semi-structured or unstructured. Keyword-based detection can hardly extract all of the necessary information from it, so case data must be manually reviewed and entered in processes such as hospital transfer. This relies heavily on manual work, is time-consuming and labor-intensive, and places high demands on the professional competence of the person processing the data.
In addition, scientific research on a specific type of case usually involves large volumes of case data from multiple institutions and multiple time periods. Before the research, the data must be classified and identified so that case data meeting the research requirements can be screened out and grouped. The grouping rules mainly concern unstructured data such as medical history information and operation information and are expressed semantically, so they cannot be applied by keyword matching.
Disclosure of Invention
The embodiment of the invention discloses a case grouping screening method and system based on natural language processing.
A first aspect of the embodiment of the invention discloses a case grouping and screening method based on natural language processing, which comprises the following steps:
performing initial identification on original case data by adopting an NLP model to obtain a text label set;
constructing a PLSA classification model based on the original case data and the text label set, and performing association mapping on the original case data, the text label set and a plurality of type nodes;
determining a grouping label set by adopting an NLP model based on the grouping rule text;
matching the grouping label set with the text label set, and determining a specific type node corresponding to the grouping label set in the latent semantic space;
and extracting the original case data which are associated and mapped with the specific type node to obtain grouped case data.
As an optional implementation manner, in the first aspect of the embodiment of the present invention, the grouping rule text at least includes a preferred rule, an exclusion rule, and a remark rule;
and in the original case data, the original case data which conforms to the preferred rule and does not conform to the exclusion rule, or the original case data which conforms to the remark rule is grouped case data.
As an optional implementation manner, in the first aspect of the embodiment of the present invention, after the extracting the original case data mapped in association with the specific type node to obtain the grouped case data, the method further includes:
synchronizing the grouped case data by adopting the DataX tool, and placing the structured data in the grouped case data into a standard data table;
and splitting the semi-structured data and the unstructured data according to the matching value of the semi-structured data and the unstructured data contained in each grouped case data relative to the standard data table, and respectively placing the split semi-structured data and the unstructured data into the standard data table.
As an optional implementation manner, in the first aspect of this embodiment of the present invention, the method further includes:
performing value range verification on each standard data table constructed based on the grouped case data, and removing the standard data tables with out-of-range data;
and performing logic verification on each standard data table subjected to value range verification, and removing the standard data tables with defective data violating medical logic.
As an optional implementation manner, in the first aspect of this embodiment of the present invention, the method further includes:
storing the image data related to the grouped case data in an image library, and establishing an association relation between the grouped case data and the image data through a patient master index;
when any of the grouped case data is retrieved and called, the corresponding image data is synchronously called based on the association relation.
The second aspect of the embodiment of the invention discloses a case grouping and screening system based on natural language processing, which comprises:
the label identification unit is used for carrying out primary identification on the original case data by adopting an NLP model to obtain a text label set;
the model construction unit is used for constructing a PLSA classification model based on the original case data and the text label set, and performing association mapping on the original case data, the text label set and a plurality of type nodes;
the label determining unit is used for determining a grouping label set by adopting an NLP model based on the grouping rule text;
the label matching unit is used for matching the grouping label set with the text label set and determining a specific type node corresponding to the grouping label set in the latent semantic space;
and the data extraction unit is used for extracting the original case data associated and mapped with the specific type node to obtain the grouped case data.
As an optional implementation manner, in the second aspect of the embodiment of the present invention, the grouping rule text at least includes a preferred rule, an exclusion rule, and a remark rule;
and in the original case data, the original case data which conforms to the preferred rule and does not conform to the exclusion rule, or the original case data which conforms to the remark rule is grouped case data.
As an optional implementation manner, in the second aspect of the embodiment of the present invention, the system further includes:
the data synchronization unit is used for synchronizing the grouped case data by adopting the DataX tool after the data extraction unit extracts the original case data associated and mapped with the specific type node to obtain the grouped case data, and placing the structured data in the grouped case data into a standard data table;
and the data matching unit is used for splitting the semi-structured data and the unstructured data according to the matching value of the semi-structured data and the unstructured data contained in each grouped case data relative to the standard data table and respectively placing the split semi-structured data and the unstructured data into the standard data table.
As an optional implementation manner, in the second aspect of the embodiment of the present invention, the system further includes:
the first eliminating unit is used for carrying out value range verification on each standard data table constructed based on the grouped case data and removing the standard data tables with out-of-range data;
and the second eliminating unit is used for carrying out logic verification on each standard data table subjected to value range verification and eliminating the standard data tables with defective data which violates medical logic.
As an optional implementation manner, in the second aspect of the embodiment of the present invention, the system further includes:
the image association unit is used for storing the image data related to the grouped case data in an image library and establishing an association relation between the grouped case data and the image data through a patient master index;
and the image calling unit is used for synchronously calling the corresponding image data based on the association relation when any grouped case data is retrieved and called.
The third aspect of the embodiments of the present invention discloses a case grouping and screening system based on natural language processing, which includes:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute a case grouping and screening method based on natural language processing disclosed in the first aspect of the embodiment of the invention.
A fourth aspect of the embodiments of the present invention discloses a computer-readable storage medium storing a computer program, wherein the computer program enables a computer to execute the case grouping and screening method based on natural language processing disclosed in the first aspect of the embodiments of the present invention.
A fifth aspect of embodiments of the present invention discloses a computer program product, which, when run on a computer, causes the computer to perform some or all of the steps of any one of the methods of the first aspect.
A sixth aspect of the present embodiment discloses an application publishing platform, where the application publishing platform is configured to publish a computer program product, where when the computer program product runs on a computer, the computer is caused to execute part or all of the steps of any one of the methods in the first aspect.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
In the embodiment of the invention, the original case data and the grouping rule text are processed by natural language processing to obtain the text label set and the grouping label set. The original case data and the text label set are then processed by a PLSA classification model to obtain a probability-distribution-based association mapping between the original case data, the text label set and the type nodes. The required grouped case data is extracted by matching the grouping label set to the specific type node. Grouping screening of original case data that includes unstructured data is thereby completed without manual intervention and with high accuracy.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a case grouping screening method based on natural language processing according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a case grouping and screening system based on natural language processing according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of another case grouping and screening system based on natural language processing according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", "third", "fourth", etc. in the description and claims of the present invention are used for distinguishing different objects, and are not used for describing a specific order. The terms "comprises," "comprising," and "having," and any variations thereof, of embodiments of the present invention are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the invention discloses a case grouping screening method and system based on natural language processing. The original case data and the grouping rule text are processed by natural language processing to obtain a text label set and a grouping label set. The original case data and the text label set are then processed by a PLSA classification model to obtain a probability-distribution-based association mapping between the original case data, the text label set and the type nodes. The required grouped case data is extracted by matching the grouping label set to the specific type node, so that grouping screening of original case data that includes unstructured data is completed without manual intervention and with high accuracy.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a case grouping screening method based on natural language processing according to an embodiment of the present invention. As shown in fig. 1, the case grouping screening method based on natural language processing may include the following steps.
101. Perform initial identification on the original case data by adopting an NLP model to obtain a text label set.
In this embodiment, an NLP model is used to extract text labels for the keywords in each piece of original case data; these labels serve as the basis for subsequent identification and matching.
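As a rough illustration of step 101, the sketch below derives a text label set from free-text case data. The patent does not specify the NLP model, so a plain lexicon lookup is substituted here as a stand-in; the lexicon entries, the label names and the function `extract_text_labels` are hypothetical.

```python
# Hypothetical lexicon mapping surface terms to text labels (assumption:
# a real system would use a trained NLP model instead of this lookup).
LABEL_LEXICON = {
    "carcinoma": "malignant_tumor",
    "sarcoma": "malignant_tumor",
    "mastitis": "benign_condition",
    "papilloma": "benign_condition",
    "axillary": "axillary_surgery",
}


def extract_text_labels(case_text: str) -> set[str]:
    """Return the set of labels whose lexicon term occurs in the case text."""
    text = case_text.lower()
    return {label for term, label in LABEL_LEXICON.items() if term in text}


labels = extract_text_labels(
    "Pathology: invasive carcinoma, left breast. Axillary dissection performed."
)
```

A lookup of this kind ignores negation and context, which is exactly the limitation of keyword methods that the description attributes to the prior art; it is used here only to show the input and output of the step.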
102. Construct a PLSA classification model based on the original case data and the text label set, and perform association mapping on the original case data, the text label set and a plurality of type nodes.
In this embodiment, a PLSA classification model associating the original case data with the type nodes is established. Each piece of case data is represented by a probability distribution over its text labels, and each type node is represented by a probability distribution over the case data, forming a two-layer probability structure. This yields the probability relation between the original case data and the type nodes, and the type node corresponding to each piece of case data is determined as the one with the strongest probabilistic association.
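The two-layer probability structure can be illustrated with a minimal sketch of PLSA fitted by expectation-maximization. It assumes the text labels of each case have already been counted into a case-by-label matrix, and it uses the common parameterization with P(type | case) and P(label | type), which may differ in detail from the model in the patent. The function name `fit_plsa`, the toy counts, the number of type nodes and the iteration count are illustrative assumptions.

```python
import numpy as np


def fit_plsa(counts, n_types, n_iter=50, seed=0):
    """Fit PLSA by EM on a (cases x labels) count matrix.

    Returns P(type | case), shape (n_cases, n_types), and
    P(label | type), shape (n_types, n_labels).
    """
    rng = np.random.default_rng(seed)
    n_cases, n_labels = counts.shape
    p_type_given_case = rng.random((n_cases, n_types))
    p_type_given_case /= p_type_given_case.sum(axis=1, keepdims=True)
    p_label_given_type = rng.random((n_types, n_labels))
    p_label_given_type /= p_label_given_type.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P(type | case, label), shape (cases, types, labels).
        joint = p_type_given_case[:, :, None] * p_label_given_type[None, :, :]
        posterior = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from the expected counts.
        expected = counts[:, None, :] * posterior
        p_label_given_type = expected.sum(axis=0)
        p_label_given_type /= p_label_given_type.sum(axis=1, keepdims=True) + 1e-12
        p_type_given_case = expected.sum(axis=2)
        p_type_given_case /= p_type_given_case.sum(axis=1, keepdims=True) + 1e-12
    return p_type_given_case, p_label_given_type


# Toy data (assumption): 4 cases, 4 text labels, 2 type nodes.
counts = np.array(
    [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 4, 1], [0, 0, 1, 4]], dtype=float
)
p_type_given_case, p_label_given_type = fit_plsa(counts, n_types=2)
# Each case is assigned to the type node with the strongest association.
assigned_type = p_type_given_case.argmax(axis=1)
```

EM only finds a local optimum, so in practice the fit would typically be repeated from several random initializations.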
103. Determine a grouping label set by adopting an NLP model based on the grouping rule text.
In this embodiment, the grouping rule text is determined by the research project. It mostly consists of long sentences or multi-paragraph text, and the more specialized the research field, the more it takes the form of unstructured data.
As an optional implementation manner, the grouping rule text at least comprises a preferred rule, an exclusion rule and a remark rule; among the original case data, the original case data which meets the preferred rule and does not meet the exclusion rule, or which meets the remark rule, is the grouped case data.
Specifically, taking a breast cancer study as an example, the preferred rule may be: pathologically confirmed malignant tumor of the left or right breast, defined as: carcinoma, sarcoma, malignant or borderline phyllodes tumor, mesenchyme, CDCIS, or Paget's disease. (Grouping is not affected by the presence or absence of other secondary tumors at the time of initial diagnosis.)
Its exclusion rules may be:
a. patients without a histologically confirmed malignant breast tumor lesion;
b. breast cancer patients who have not received surgical treatment of the breast and axilla at this hospital;
c. patients histologically diagnosed with typical lobular carcinoma in situ, benign breast tumor, mastitis, papilloma or benign phyllodes tumor, with no malignant lesion;
d. patients who received breast surgery at an outside hospital with negative margins and underwent axillary surgery at the same hospital;
e. patients whose primary lesion was not operated on at this hospital and who developed recurrence or metastasis after surgery;
f. patients in whom another malignancy has metastasized to the breast or axilla.
The remark rules may be:
a. patients who underwent core needle biopsy but did not receive surgical treatment at this hospital;
b. patients with primary stage IV breast cancer who underwent surgery;
c. patients who underwent excisional biopsy of the tumor (including minimally invasive surgery) at an outside hospital, who then had the margins re-excised in open surgery or completed axillary surgery at this hospital, and for whom this hospital's pathology department issued a consultation report based on unstained slides of the tumor tissue from the outside hospital.
Of these, the preferred rule is the broadest and is suited to initial general screening. The exclusion rule supplements the preferred rule by specifying conditions under which a case is not grouped. The remark rule in turn supplements the preferred rule and the exclusion rule by specifying special conditions under which a case is grouped.
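The way the three rule types combine can be written as a single boolean expression: a case is grouped if it meets the preferred rule and no exclusion rule, or if it meets a remark rule. The sketch below assumes that rule matching has already been reduced to three flags per case; the `RuleMatch` type and `is_grouped` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RuleMatch:
    """Outcome of checking one case against the grouping rule text."""
    preferred: bool  # meets the preferred rule
    excluded: bool   # meets at least one exclusion rule
    remark: bool     # meets at least one remark rule


def is_grouped(match: RuleMatch) -> bool:
    # (preferred AND NOT excluded) OR remark
    return (match.preferred and not match.excluded) or match.remark
```

For example, a case that meets the preferred rule and an exclusion rule is still grouped when it also meets a remark rule, since the remark rule overrides the exclusion.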
104. Match the grouping label set with the text label set, and determine a specific type node corresponding to the grouping label set in the latent semantic space.
In this embodiment, once the grouping label set has defined the grouping requirement and the PLSA classification model has constructed a complete association mapping for the original case data, the grouping label set is matched with the text label set to determine the specific type node consistent with the grouping requirement.
105. Extract the original case data associated and mapped with the specific type node to obtain grouped case data.
In this embodiment, the original case data associated and mapped with the specific type node is the grouped case data meeting the requirements of the grouping rule text, and it is extracted according to the specific type node.
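Steps 104 and 105 can be sketched as follows, assuming a fitted model is available as two matrices, P(label | type) and P(type | case). The scoring rule used here (summing the probability mass of the grouping labels under each type node) is an assumption chosen for illustration; the patent does not state how the match is computed. All names and numbers are hypothetical.

```python
import numpy as np


def select_type_node(p_label_given_type, vocabulary, grouping_labels):
    """Pick the type node that gives the grouping labels the most probability."""
    index = {label: i for i, label in enumerate(vocabulary)}
    cols = [index[label] for label in grouping_labels if label in index]
    if not cols:
        raise ValueError("no grouping label appears in the text label vocabulary")
    scores = p_label_given_type[:, cols].sum(axis=1)
    return int(scores.argmax()), scores


def extract_cases(case_ids, p_type_given_case, node):
    """Return the cases whose most probable type node is the selected one."""
    assigned = p_type_given_case.argmax(axis=1)
    return [cid for cid, t in zip(case_ids, assigned) if t == node]


# Toy model output (assumption): 2 type nodes, 4 text labels, 3 cases.
vocabulary = ["carcinoma", "surgery", "mastitis", "papilloma"]
p_label_given_type = np.array(
    [[0.55, 0.40, 0.03, 0.02], [0.05, 0.10, 0.45, 0.40]]
)
p_type_given_case = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])

node, scores = select_type_node(
    p_label_given_type, vocabulary, {"carcinoma", "surgery"}
)
grouped = extract_cases(["c1", "c2", "c3"], p_type_given_case, node)
```

With these toy numbers the grouping labels select type node 0, and the cases most strongly associated with that node are `c1` and `c3`.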
In this embodiment, after the required grouped case data has been screened out, it is placed into standard data tables. As an optional implementation manner, the DataX tool is used to synchronize the grouped case data, and the structured data in the grouped case data is placed into the standard data table; the semi-structured data and the unstructured data contained in each grouped case data are split according to their matching value relative to the standard data table, and the split data are placed into the standard data table respectively.
In this way, the structured data in the grouped case data is placed directly into the corresponding positions of the standard data table, while the semi-structured and unstructured data are split and placed according to how well they match parameters of the standard data table, such as the data format and requirements of each field.
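One possible reading of the matching value is sketched below: free text is split into fragments, each fragment is scored against every field of the standard data table, and the fragment's value is placed in the best-scoring field. The field definitions, the scoring formula (keyword hit plus format hit, averaged) and the threshold are assumptions for illustration and are not taken from the patent.

```python
import re

# Hypothetical standard data table fields: trigger keywords plus the
# required data format of the value.
STANDARD_FIELDS = {
    "surgery_date": {
        "keywords": ("surgery", "operation"),
        "pattern": r"\d{4}-\d{2}-\d{2}",
    },
    "tumor_size_cm": {
        "keywords": ("tumor size", "mass"),
        "pattern": r"\d+(?:\.\d+)?\s*cm",
    },
}


def matching_value(fragment, spec):
    """Score a fragment against one field; return (score, extracted value)."""
    text = fragment.lower()
    keyword_hit = any(keyword in text for keyword in spec["keywords"])
    found = re.search(spec["pattern"], text)
    score = (int(keyword_hit) + int(found is not None)) / 2
    value = found.group(0) if found else None
    return score, value


def split_into_table(free_text, threshold=1.0):
    """Split free text and place each fragment in its best-matching field."""
    row = {}
    for fragment in re.split(r"[;\n]", free_text):
        best = None
        for name, spec in STANDARD_FIELDS.items():
            score, value = matching_value(fragment, spec)
            if score >= threshold and (best is None or score > best[0]):
                best = (score, name, value)
        if best is not None:
            row[best[1]] = best[2]
    return row


row = split_into_table(
    "Surgery performed on 2021-03-15; tumor size 2.5 cm; no family history"
)
```

Fragments that match no field above the threshold, such as the family history remark here, are left out of the row rather than forced into a field.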
In the embodiment, the grouped case data is also checked and cleaned, and errors are removed, so that accurate and available grouped case data is obtained.
As an optional implementation manner, value range verification is performed on each standard data table constructed based on the grouped case data, and the standard data tables with out-of-range data are removed;
logic verification is then performed on each standard data table that passed value range verification, and the standard data tables with defective data violating medical logic are removed.
Here, value range verification eliminates grouped case data with out-of-range data (such as a negative age), and logic verification eliminates grouped case data with medical logic errors (such as a surgery date earlier than the pre-treatment pathology date), so that invalid data does not adversely affect the research.
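The two verification passes can be sketched with the two examples given in the text, a negative age and a surgery date earlier than the pre-treatment pathology date. The sketch treats each standard data table as a single dictionary per case; the field names and the upper age bound of 130 are assumptions.

```python
from datetime import date


def passes_value_range(row) -> bool:
    """Value range verification: the age must lie in a plausible interval."""
    return 0 <= row["age"] <= 130  # upper bound is an assumed limit


def passes_medical_logic(row) -> bool:
    """Logic verification: surgery cannot precede pre-treatment pathology."""
    return row["surgery_date"] >= row["pathology_date"]


def clean(rows):
    """Apply value range verification first, then logic verification."""
    in_range = [row for row in rows if passes_value_range(row)]
    return [row for row in in_range if passes_medical_logic(row)]


rows = [
    {"case_id": "c1", "age": 52,
     "pathology_date": date(2021, 3, 1), "surgery_date": date(2021, 3, 15)},
    {"case_id": "c2", "age": -4,
     "pathology_date": date(2021, 3, 1), "surgery_date": date(2021, 3, 15)},
    {"case_id": "c3", "age": 47,
     "pathology_date": date(2021, 5, 10), "surgery_date": date(2021, 4, 2)},
]
kept = [row["case_id"] for row in clean(rows)]
```

Here `c2` fails value range verification and `c3` fails logic verification, so only `c1` is kept.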
In this embodiment, the image data is stored independently with respect to the grouped case data, and an association relationship is established with the grouped case data in a text format.
As an optional implementation manner, the image data related to the grouped case data is stored in an image library, and an association relation is established between the grouped case data and the image data through a patient master index;
when any of the grouped case data is retrieved and called, the corresponding image data is synchronously called based on the association relation.
Therefore, special processing on the image data is not needed, and the influence on the accuracy of the image data is avoided.
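The association through the patient master index amounts to a keyed lookup: images are stored separately, and retrieving a case also returns the images filed under the same index. The sketch below uses an in-memory dictionary in place of a real image library; the class, field names and file paths are hypothetical.

```python
from collections import defaultdict


class ImageLibrary:
    """In-memory stand-in for an image library keyed by patient master index."""

    def __init__(self):
        self._by_patient = defaultdict(list)

    def store(self, patient_index, image_path):
        self._by_patient[patient_index].append(image_path)

    def images_for(self, patient_index):
        # Return a copy so callers cannot modify the library's own list.
        return list(self._by_patient.get(patient_index, []))


def retrieve_case(case, library):
    """Return the case together with the images linked by its patient index."""
    return case, library.images_for(case["patient_index"])


library = ImageLibrary()
library.store("P001", "images/P001/mammogram_01.dcm")
library.store("P001", "images/P001/ultrasound_01.dcm")

case = {"case_id": "c1", "patient_index": "P001"}
_, images = retrieve_case(case, library)
```

Because the images are only referenced by key, the image files themselves are never transformed, which is consistent with the stated aim of leaving the image data unprocessed.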
In summary, the original case data and the grouping rule text are processed by natural language processing to obtain a text label set and a grouping label set. The original case data and the text label set are then processed by a PLSA classification model to obtain a probability-distribution-based association mapping between the original case data, the text label set and the type nodes. The required grouped case data is extracted by matching the grouping label set to the specific type node, so that grouping screening of original case data that includes unstructured data is completed without manual intervention and with high accuracy.
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a case grouping and screening system based on natural language processing according to an embodiment of the present invention. As shown in fig. 2, the system for screening case grouping based on natural language processing may include:
a tag identification unit 201, configured to perform initial identification on original case data by using an NLP model, so as to obtain a text tag set;
the model construction unit 202 is configured to construct a PLSA classification model based on the original case data and the text label set, and perform associated mapping on the original case data, the text label set, and a plurality of type nodes;
the tag determining unit 203 is configured to determine an entry tag set by using an NLP model based on the entry rule text;
wherein the grouping rule text at least comprises an optimization rule, an exclusion rule and a remark rule;
in the original case data, the original case data which meets the optimization rule and does not meet the exclusion rule, or the original case data which meets the remark rule is the grouped case data.
A tag matching unit 204, configured to match the grouped tag set with the text tag set, and determine a specific type node corresponding to the grouped tag set in an implied semantic space;
the data extraction unit 205 is configured to extract original case data mapped in association with a specific type node to obtain grouped case data;
the data synchronization unit 206 is configured to, after the data extraction unit 205 extracts the original case data mapped in association with the specific type node to obtain the grouped case data, synchronize the grouped case data by using the DataX tool, and place the structured data in the grouped case data into a standard data table;
the data matching unit 207 is configured to split the semi-structured data and the unstructured data contained in each grouped case data according to their matching values relative to the standard data table, and place the split data into the standard data table respectively;
the first eliminating unit 208 is configured to perform value range verification on each standard data table constructed based on the grouped case data, and eliminate the standard data tables containing out-of-range data;
a second eliminating unit 209, configured to perform logic verification on each standard data table subjected to value range verification, and eliminate standard data tables having defective data that violates medical logic;
the image association unit 210 is configured to store image data related to the grouped case data in an image library, and establish an association relationship between the grouped case data and the image data through a patient master index;
the image retrieving unit 211 is configured to, when any one of the grouped case data is retrieved and called, synchronously retrieve the corresponding image data based on the association relationship.
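The way the three rule types combine for the tag determining unit can be illustrated with a short sketch. It assumes, purely for illustration, that each rule has already been reduced to a set of labels, that a case "meets" a preference or remark rule when all of that rule's labels are present, and that it "meets" the exclusion rule when any exclusion label is present. The class `RuleLabels`, the function `is_grouped`, and the example labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleLabels:
    """Label sets derived from the grouping rule text (hypothetical shape)."""
    preference: frozenset
    exclusion: frozenset
    remark: frozenset


def is_grouped(case_labels, rules):
    """Return True if a case qualifies as grouped case data.

    A case qualifies when it meets the preference rule and does not
    meet the exclusion rule, or when it meets the remark rule.
    """
    labels = set(case_labels)
    meets_preference = rules.preference <= labels
    meets_exclusion = bool(rules.exclusion & labels)
    # An empty remark rule is treated as "never met" (an assumption).
    meets_remark = bool(rules.remark) and rules.remark <= labels
    return (meets_preference and not meets_exclusion) or meets_remark
```

Under these assumptions, a case labeled with a preference label and an exclusion label is rejected unless it also carries every remark label, which matches the "or the original case data which meets the remark rule" clause.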
As an optional implementation manner, the data synchronization unit 206 synchronizes the grouped case data by using the DataX tool, and places the structured data in the grouped case data into a standard data table; the data matching unit 207 splits the semi-structured data and the unstructured data contained in each grouped case data according to their matching values relative to the standard data table, and places the split data into the standard data table respectively.
In this way, structured data in the grouped case data are placed directly into the corresponding fields of the standard data table, while semi-structured data and unstructured data are split and placed according to how well they match parameters of the standard data table, such as the data format and requirements of each field.
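One possible reading of the "matching value" is a score that compares each text fragment against the format requirements of every field in the standard data table. The sketch below is an assumption-laden illustration of that idea, not the method of the invention: the field names, aliases, regular expressions, scoring weights and threshold are all hypothetical, and the input is assumed to be `key: value` fragments separated by semicolons or newlines.

```python
import re

# Hypothetical standard data table: field name -> accepted keys and value format.
STANDARD_FIELDS = {
    "age": {"aliases": ("age",), "pattern": r"\d{1,3}"},
    "surgery_date": {
        "aliases": ("surgery date", "operation date"),
        "pattern": r"\d{4}-\d{2}-\d{2}",
    },
    "diagnosis": {"aliases": ("diagnosis",), "pattern": r"[A-Za-z ]+"},
}


def matching_value(key, value, spec):
    """Score a fragment against one field: 0.5 for the key, 0.5 for the format."""
    score = 0.0
    if key.strip().lower() in spec["aliases"]:
        score += 0.5
    if re.fullmatch(spec["pattern"], value.strip()):
        score += 0.5
    return score


def split_into_table(text, fields=STANDARD_FIELDS, threshold=1.0):
    """Split semi-structured text and place each fragment in its best field."""
    row = {}
    for fragment in re.split(r"[;\n]", text):
        if ":" not in fragment:
            continue  # not a key/value fragment; left for other processing
        key, value = fragment.split(":", 1)
        best_field, best_score = max(
            ((name, matching_value(key, value, spec)) for name, spec in fields.items()),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            row[best_field] = value.strip()
    return row
```

Fragments that do not reach the threshold for any field are simply left out of the row in this sketch; a real system would need a policy for such leftovers.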
As an optional implementation manner, the first eliminating unit 208 performs value range verification on each standard data table constructed based on the grouped case data, and eliminates the standard data tables containing out-of-range data;
the second eliminating unit 209 performs logic verification on each standard data table for which value range verification is completed, and eliminates the standard data tables containing defective data that violates medical logic.
Here, the value range verification eliminates grouped case data containing out-of-range data (such as a negative age), and the logic verification eliminates grouped case data containing medical logic errors (such as an operation date earlier than the pre-treatment pathology date), so that invalid data do not negatively affect the research.
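The two verification passes can be sketched as a pair of predicates applied in sequence. The sketch assumes each standard data table is represented as a dictionary with the hypothetical keys `age`, `pathology_date` and `surgery_date`; the upper age bound of 150 is likewise an assumption, since the text only gives a negative age as an example of out-of-range data.

```python
from datetime import date


def passes_value_range(row):
    """Value range verification: reject missing or out-of-range ages."""
    age = row.get("age")
    return age is not None and 0 <= age <= 150


def passes_medical_logic(row):
    """Logic verification: surgery must not precede pre-treatment pathology."""
    pathology = row.get("pathology_date")
    surgery = row.get("surgery_date")
    if pathology is None or surgery is None:
        return True  # nothing to compare; treated as consistent (an assumption)
    return surgery >= pathology


def screen_tables(rows):
    """Run value range verification first, then logic verification."""
    in_range = [row for row in rows if passes_value_range(row)]
    return [row for row in in_range if passes_medical_logic(row)]
```

Running the value range pass first mirrors the order given for the first and second eliminating units: the logic pass only sees tables that already passed the range check.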
In this embodiment, the image data are stored separately from the grouped case data, and an association relationship is established between the image data and the text-format grouped case data.
As an optional implementation manner, the image association unit 210 stores the image data related to the grouped case data in an image library, and establishes an association relationship between the grouped case data and the image data through a patient master index;
when any one of the grouped case data is retrieved and called, the image retrieving unit 211 synchronously retrieves the corresponding image data based on the association relationship.
Therefore, special processing on the image data is not needed, and the influence on the accuracy of the image data is avoided.
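The association through a patient master index can be pictured as a simple keyed lookup in which images are stored untouched and only their locations are indexed. The following sketch is illustrative only; the class `ImageLibrary`, the function `retrieve_case`, the index values and the file paths are all hypothetical.

```python
from collections import defaultdict


class ImageLibrary:
    """Stores image locations separately from case data, keyed by patient index."""

    def __init__(self):
        self._by_patient = defaultdict(list)

    def store(self, patient_index, image_path):
        """Register an image under the patient master index without altering it."""
        self._by_patient[patient_index].append(image_path)

    def images_for(self, patient_index):
        """Return a copy of the image locations associated with one patient."""
        return list(self._by_patient.get(patient_index, []))


def retrieve_case(cases, library, patient_index):
    """Retrieve one grouped case together with its associated images."""
    return cases[patient_index], library.images_for(patient_index)
```

Because only the index links the two stores, the image files themselves are never transformed, which is consistent with the point that the accuracy of the image data is not affected.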
In summary, the original case data and the grouping rule text are processed by natural language processing to obtain a text label set and a grouping label set. The original case data and the text label set are then processed with a PLSA classification model to obtain probability-distribution-based association mappings between the original case data and the text label set, and between the original case data and each type node. The required grouped case data are extracted by matching the grouping label set to a specific type node. This completes the grouping screening of original case data that include unstructured data; the process requires no manual intervention and achieves high accuracy.
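The matching and extraction steps summarized above can be sketched once the two probability distributions of a fitted PLSA model are available. The selection criteria here are assumptions made for illustration: the specific type node is taken to be the node that gives the most probability mass to the grouping labels, and a case is extracted when its probability for that node reaches a hypothetical threshold `min_prob`. The function names and example labels are likewise hypothetical.

```python
import numpy as np


def select_type_node(p_label_given_node, label_index, grouping_labels):
    """Pick the type node that best explains the grouping label set.

    p_label_given_node has shape (nodes, labels); label_index maps each
    text label to its column.
    """
    cols = [label_index[label] for label in grouping_labels if label in label_index]
    if not cols:
        raise ValueError("no grouping label appears in the text label set")
    scores = p_label_given_node[:, cols].sum(axis=1)
    return int(np.argmax(scores))


def extract_cases(p_node_given_case, node, min_prob=0.5):
    """Return the indices of cases mapped to the selected type node."""
    return [i for i, p in enumerate(p_node_given_case[:, node]) if p >= min_prob]
```

Other matching criteria, such as requiring every grouping label to exceed a per-label probability, would fit the same description equally well.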
Example three
Referring to fig. 3, fig. 3 is a schematic structural diagram of another case grouping screening system based on natural language processing according to an embodiment of the present invention. As shown in fig. 3, the case grouping screening system based on natural language processing may include:
a memory 301 storing executable program code;
a processor 302 coupled to the memory 301;
the processor 302 calls the executable program code stored in the memory 301 to execute the case grouping screening method based on natural language processing shown in fig. 1.
An embodiment of the invention also discloses a computer-readable storage medium which stores a computer program, wherein the computer program causes a computer to execute the case grouping screening method based on natural language processing shown in fig. 1.
Embodiments of the present invention also disclose a computer program product, wherein, when the computer program product is run on a computer, the computer is caused to execute part or all of the steps of the method as in the above method embodiments.
It will be understood by those skilled in the art that all or part of the steps in the methods of the embodiments described above may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, where the storage medium includes Read-Only Memory (ROM), Random Access Memory (RAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, magnetic disk storage, tape storage, or any other computer-readable medium that can be used to carry or store data.
The case grouping screening method and system based on natural language processing disclosed in the embodiments of the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help in understanding the method and core idea of the invention. Meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (10)
1. A method for case grouping screening based on natural language processing, the method comprising:
performing primary identification on original case data by adopting an NLP model to obtain a text label set;
constructing a PLSA classification model based on the original case data and the text label set, and performing association mapping on the original case data, the text label set and a plurality of type nodes;
based on the grouping rule text, determining a grouping label set by adopting an NLP model;
matching the grouping label set with the text label set, and determining a specific type node corresponding to the grouping label set in the latent semantic space;
and extracting the original case data associated and mapped with the specific type node to obtain grouped case data.
2. The method as claimed in claim 1, wherein the grouping rule text at least includes a preference rule, an exclusion rule and a remark rule;
and in the original case data, the original case data which conforms to the preference rule and does not conform to the exclusion rule, or the original case data which conforms to the remark rule, is grouped case data.
3. The method as claimed in claim 1, wherein after the extracting of the original case data mapped to the specific type of node to obtain the grouped case data, the method further comprises:
synchronizing the grouped case data by adopting the DataX tool, and placing the structured data in the grouped case data into a standard data table;
and splitting the semi-structured data and the unstructured data according to the matching value of the semi-structured data and the unstructured data contained in each grouped case data relative to the standard data table, and respectively placing the split semi-structured data and the unstructured data into the standard data table.
4. The method as claimed in claim 3, further comprising:
performing value range verification on each standard data table constructed based on the grouped case data, and removing the standard data tables containing out-of-range data;
and performing logic verification on each standard data table subjected to value range verification, and removing the standard data tables with the defect data violating the medical logic.
5. The method as claimed in claim 1, further comprising:
storing the image data related to the grouped case data in an image library, and establishing an association relationship between the grouped case data and the image data through a patient master index;
when any grouped case data is retrieved and called, the corresponding image data is synchronously called based on the association relationship.
6. A system for case grouping screening based on natural language processing, the system comprising:
the label identification unit is used for carrying out primary identification on the original case data by adopting an NLP model to obtain a text label set;
the model construction unit is used for constructing a PLSA classification model based on the original case data and the text label set, and performing association mapping on the original case data, the text label set and a plurality of type nodes;
the tag determining unit is used for determining a grouping label set by adopting an NLP model based on the grouping rule text;
the label matching unit is used for matching the grouping label set with the text label set and determining a specific type node corresponding to the grouping label set in the latent semantic space;
and the data extraction unit is used for extracting the original case data associated and mapped with the specific type node to obtain the grouped case data.
7. The system of claim 6, wherein the grouping rules text comprises at least preference rules, exclusion rules, and remark rules;
and in the original case data, the original case data which conforms to the preference rule and does not conform to the exclusion rule, or the original case data which conforms to the remark rule, is grouped case data.
8. The system of claim 6, further comprising:
the data synchronization unit is used for synchronizing the grouped case data by adopting the DataX tool after the data extraction unit extracts the original case data associated and mapped with the specific type node to obtain the grouped case data, and placing the structured data in the grouped case data into a standard data table;
and the data matching unit is used for splitting the semi-structured data and the unstructured data according to the matching value of the semi-structured data and the unstructured data contained in each grouped case data relative to the standard data table and respectively placing the split semi-structured data and the unstructured data into the standard data table.
9. The system of claim 8, wherein the system further comprises:
the first removing unit is used for carrying out value range verification on each standard data table constructed based on the grouped case data and removing the standard data tables containing out-of-range data;
and the second eliminating unit is used for carrying out logic verification on each standard data table subjected to value range verification and eliminating the standard data tables with defective data violating medical logic.
10. The system of claim 6, wherein the system further comprises:
the image association unit is used for storing the image data related to the grouped case data in an image library and establishing an association relationship between the grouped case data and the image data through a patient master index;
and the image calling unit is used for synchronously calling the corresponding image data based on the association relationship when any grouped case data is retrieved and called.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111564591.1A CN114743681B (en) | 2021-12-20 | 2021-12-20 | Case grouping screening method and system based on natural language processing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114743681A (en) | 2022-07-12 |
CN114743681B (en) | 2024-01-30 |
Family
ID=82274760
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111564591.1A Active CN114743681B (en) | 2021-12-20 | 2021-12-20 | Case grouping screening method and system based on natural language processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114743681B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180032678A1 (en) * | 2016-07-29 | 2018-02-01 | International Business Machines Corporation | Medical recording system |
CN109947858A (en) * | 2017-07-26 | 2019-06-28 | 腾讯科技(深圳)有限公司 | A kind of method and device of data processing |
CN110197723A (en) * | 2019-07-03 | 2019-09-03 | 四川大学华西医院 | Clinical somatization classification diagnosis system under psychosomatic medicine theoretical frame |
CN110413994A (en) * | 2019-06-28 | 2019-11-05 | 宁波深擎信息科技有限公司 | Hot topic generation method, device, computer equipment and storage medium |
CN110570943A (en) * | 2019-09-04 | 2019-12-13 | 医渡云(北京)技术有限公司 | method and device for intelligently recommending MDT (minimization of drive test) grouping, electronic equipment and storage medium |
CN111414393A (en) * | 2020-03-26 | 2020-07-14 | 湖南科创信息技术股份有限公司 | Semantic similar case retrieval method and equipment based on medical knowledge graph |
CN112948471A (en) * | 2019-11-26 | 2021-06-11 | 广州知汇云科技有限公司 | Clinical medical text post-structured processing platform and method |
Non-Patent Citations (2)
Title |
---|
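Lin Liu et al.: "An overview of topic modeling and its current applications in bioinformatics", SpringerPlus, pages 1 - 22 * |
Wu Dong (吴东): "Research and application of electronic medical record retrieval based on a latent semantic correlation algorithm", China Master's Theses Full-text Database, Information Science and Technology, no. 5, pages 138 - 1343 * |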
Also Published As
Publication number | Publication date |
---|---|
CN114743681B (en) | 2024-01-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||