CN112289398B - Pathological report analysis method and device, storage medium and terminal

Pathological report analysis method and device, storage medium and terminal

Info

Publication number
CN112289398B
Authority
CN
China
Prior art keywords
word
report
dictionary
pathological
analyzed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010825906.2A
Other languages
Chinese (zh)
Other versions
CN112289398A (en)
Inventor
秦晓宏
刘焕春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Clinbrain Information Technology Co Ltd
Original Assignee
Shanghai Clinbrain Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Clinbrain Information Technology Co Ltd
Priority to CN202010825906.2A
Publication of CN112289398A
Application granted
Publication of CN112289398B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Machine Translation (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

A pathology report analysis method and device, a storage medium, and a terminal. The method comprises: obtaining a pathology report to be analyzed; segmenting the pathology report to be analyzed based on a set word stock to obtain a segmented pathology report, wherein the set word stock is obtained by combining a plurality of specified dictionaries into an intermediate word stock, traversing all words in the intermediate word stock, judging whether each word can be segmented again, and deleting the words that can be segmented again from the intermediate word stock; vectorizing the segmented pathology report to obtain a word vector set corresponding to the pathology report to be analyzed; analyzing the word vector set with a pre-trained pathology report analysis model to obtain a pathology report analysis result; and outputting the pathology report analysis result. The scheme can improve the accuracy of pathology report analysis results.

Description

Pathological report analysis method and device, storage medium and terminal
Technical Field
The embodiment of the invention relates to the field of text analysis, in particular to a pathology report analysis method and device, a storage medium and a terminal.
Background
With the development of science and technology, the medical industry is also entering an era of informatization, digitization and structuring. To facilitate the storage, statistics, analysis and study of medical data, unstructured text needs to be converted into structured text.
A pathology report is one kind of medical data. After an operation, tissue taken from the patient's body undergoes a series of technical treatments; a pathologist then writes a diagnostic description of the pathological tissue according to the staining results and records disease-related information. The report consists of one or more paragraphs of text and supports subsequent clinical treatment.
At present, pathology reports are structured in several ways. In the first approach, pathology text is structured with regular expressions. In the second approach, the pathology report is analyzed using Bidirectional Encoder Representations from Transformers (BERT) together with a Long Short-Term Memory (LSTM) artificial neural network.
However, the regular-expression approach depends on preset templates. Once the structure of the pathology text changes, the corresponding template must be updated promptly before data in the new structure can be structured; if it is not, the accuracy of structuring the pathology report drops. In the second approach, the accuracy of the structured result is also low.
In summary, the prior art structures pathology reports with low accuracy.
Disclosure of Invention
The technical problem solved by the embodiment of the invention is that the accuracy of the analysis result of the pathology report is lower.
In order to solve the above technical problems, an embodiment of the present invention provides a pathology report parsing method, including: obtaining a pathological report to be analyzed; based on a set word stock, segmenting the pathological report to be analyzed to obtain a segmented pathological report to be analyzed, wherein the set word stock is obtained in the following way: combining a plurality of specified dictionaries to obtain an intermediate word stock, traversing all words in the intermediate word stock, judging whether each word can be subjected to word segmentation again, deleting words capable of being subjected to word segmentation again from the intermediate word stock, and obtaining the set word stock; vectorizing the segmented pathology report to be analyzed to obtain a word vector set corresponding to the pathology report to be analyzed; performing pathological report analysis on the word vector set corresponding to the pathological report to be analyzed by adopting a pre-trained pathological report analysis model to obtain a pathological report analysis result; and outputting the pathological report analysis result.
Optionally, judging whether each word can be segmented again and deleting the words that can be segmented again from the intermediate word stock includes: when a word can be segmented again, judging whether that word belongs to a specific dictionary, wherein the specific dictionary is drawn from the specified dictionaries and includes at least one of the following: a disease diagnosis name dictionary, a body part dictionary, a symptom dictionary, and a surgery dictionary; when the word does not belong to the specific dictionary, deleting it from the intermediate word stock; and when the word belongs to the specific dictionary, not deleting it.
Optionally, the specified dictionary includes at least one of: disease diagnosis name dictionary, body part dictionary, symptom dictionary, surgery dictionary, jieba dictionary.
Optionally, segmenting the pathology report to be analyzed based on the set word stock includes: segmenting the pathology report to be analyzed using a forward maximum matching algorithm combined with new word discovery.
Optionally, the pathology report analysis model is obtained by training in the following manner: acquiring a training sample set that includes a plurality of annotated samples; segmenting the annotated samples based on the set word stock to obtain segmented annotated samples; vectorizing the segmented annotated samples to obtain word vector sets corresponding to the annotated samples; inputting the word vector sets corresponding to the annotated samples into spaCy; and training the parameters in spaCy with those word vector sets until a set convergence condition is met, thereby obtaining the pathology report analysis model.
Optionally, the annotated samples include labeled samples and expanded annotated samples, and an expanded annotated sample is obtained as follows: acquiring replacement data corresponding to a label used in a labeled sample; and replacing the annotated data of that label in the labeled sample with the replacement data to obtain the expanded annotated sample.
Optionally, outputting the pathology report analysis result includes: obtaining the data corresponding to each label, structuring the labels and their corresponding data to obtain structured data, and outputting the structured data as the pathology report analysis result.
Optionally, the pathology report analysis method further includes: modifying the stop word list in spaCy.
Optionally, modifying the stop word list of spaCy includes at least one of: deleting a first class of stop words from the stop word list, wherein the first class of stop words includes at least one of the following: Latin numerals, Greek letters, asterisks; and adding a second class of stop words to the stop word list, wherein the second class of stop words includes at least one of the following: "review" and "observe at any time".
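As a minimal sketch of the stop word adjustment described above, the following treats the stop word list as a plain Python set. spaCy exposes its default stop words as a set on each language class (`Defaults.stop_words`), but the function name, the starting list and the sample entries below are illustrative assumptions, not the configuration actually used in the embodiment.

```python
def adjust_stop_words(stop_words, to_remove, to_add):
    """Return a new stop word set with `to_remove` dropped and `to_add` included."""
    return (set(stop_words) - set(to_remove)) | set(to_add)


# Hypothetical starting list; a real one would come from the NLP toolkit in use.
base = {"the", "of", "*", "α", "Ⅱ"}
# First class (examples only): an asterisk, a Greek letter and a numeral
# that should no longer be filtered out as stop words.
first_class = {"*", "α", "Ⅱ"}
# Second class (examples only): phrases to be filtered out as stop words.
second_class = {"review", "observe at any time"}

adjusted = adjust_stop_words(base, first_class, second_class)
```

Returning a new set leaves the original list untouched, which makes it easy to compare the list before and after the adjustment.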
The embodiment of the invention also provides a pathology report analysis device, comprising: an acquisition unit for obtaining a pathology report to be analyzed; a word segmentation unit for segmenting the pathology report to be analyzed based on a set word stock to obtain a segmented pathology report, wherein the set word stock is obtained by combining a plurality of specified dictionaries into an intermediate word stock, traversing all words in the intermediate word stock, judging whether each word can be segmented again, and deleting the words that can be segmented again from the intermediate word stock; a vectorization unit for vectorizing the segmented pathology report to obtain a word vector set corresponding to the pathology report to be analyzed; an analysis unit for analyzing the word vector set with a pre-trained pathology report analysis model to obtain a pathology report analysis result; and an output unit for outputting the pathology report analysis result.
The embodiment of the invention also provides a storage medium, on which a computer program is stored, which when being executed by a processor, performs the steps of any of the above-mentioned pathology report parsing methods.
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of any one of the pathology report analysis methods when running the computer program.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
The pathology report to be analyzed is segmented based on a set word stock, the segmented report is vectorized to obtain the corresponding word vector set, and a pre-trained pathology report analysis model analyzes that word vector set to obtain the pathology report analysis result. The set word stock is obtained by combining a plurality of specified dictionaries into an intermediate word stock, traversing all words in the intermediate word stock, judging whether each word can be segmented again, and deleting the words that can be segmented again from the intermediate word stock. The words remaining in the set word stock are therefore finer-grained, which makes segmentation of the pathology report more accurate and finer-grained and, in turn, improves the accuracy of the pathology report analysis result.
Drawings
FIG. 1 is a flow chart of a pathology report parsing method according to an embodiment of the present invention;
FIG. 2 is a training flow diagram of a pathology report parsing model in an embodiment of the present invention;
FIG. 3 is a schematic diagram showing the effect of a result of analyzing a pathology report on a visual interface according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a pathology report analysis apparatus according to an embodiment of the present invention.
Detailed Description
At present, pathology reports are structured in several ways. In the first approach, pathology text is structured with regular expressions. In the second approach, the pathology report is analyzed using BERT together with a Long Short-Term Memory (LSTM) artificial neural network.
However, the regular-expression approach depends on preset templates. Once the structure of the pathology text changes, the corresponding template must be updated promptly before data in the new structure can be structured; if it is not, the accuracy of structuring the pathology report drops. In the second approach, the accuracy of the structured result is also low.
In summary, the prior art structures pathology reports with low accuracy.
In order to solve the above problems, in the embodiment of the present invention, the pathology report to be analyzed is segmented based on a set word stock, the segmented report is vectorized to obtain the corresponding word vector set, and a pre-trained pathology report analysis model analyzes that word vector set to obtain the pathology report analysis result. The set word stock is obtained by combining a plurality of specified dictionaries into an intermediate word stock, traversing all words in the intermediate word stock, judging whether each word can be segmented again, and deleting the words that can be segmented again from the intermediate word stock. The words remaining in the set word stock are therefore finer-grained, which makes segmentation of the pathology report more accurate and finer-grained and, in turn, improves the accuracy of the pathology report analysis result.
In order to make the above objects, features and advantages of the embodiments of the present invention more comprehensible, the following detailed description of the embodiments of the present invention refers to the accompanying drawings.
An embodiment of the present invention provides a pathology report analysis method. FIG. 1 shows a flowchart of the method, which may specifically include the following steps:
Step S11, obtaining a pathology report to be analyzed.
Step S12, segmenting the pathology report to be analyzed based on the set word stock to obtain the segmented pathology report to be analyzed.
When an existing dictionary is used for word segmentation, the segmentation granularity is coarse and some words fail to be separated, which affects the accuracy of the subsequent pathology report analysis. For example, for the phrase "lymph node metastasis is seen in the liver" appearing in a pathology report, the prior-art segmentation result is: intrahepatic / seen / lymph node / metastasis. Because "intrahepatic" is kept as a single word, the organ "liver" cannot be identified, and the site of the lymph node metastasis therefore cannot be determined accurately.
To address these problems, the set word stock used in the embodiment of the invention can be obtained as follows: combining a plurality of specified dictionaries to obtain an intermediate word stock, traversing all words in the intermediate word stock, judging whether each word can be segmented again, and deleting the words that can be segmented again from the intermediate word stock to obtain the set word stock. Deleting the words that can be segmented again makes the granularity of the set word stock finer.
In a specific implementation, the specified dictionaries can include at least one of: a disease diagnosis name dictionary, a body part dictionary, a symptom dictionary, a surgery dictionary, and the jieba dictionary. It will be appreciated that, depending on actual needs, the specified dictionaries may also include other types of dictionaries, which are not enumerated here.
In the embodiment of the invention, all the specified dictionaries, such as the disease diagnosis name dictionary, body part dictionary, symptom dictionary, surgery dictionary and jieba dictionary, can be merged to obtain the intermediate word stock. Alternatively, the jieba dictionary can be used as the base and expanded with the other specified dictionaries, such as the disease diagnosis name dictionary, body part dictionary, symptom dictionary and surgery dictionary, to obtain the intermediate word stock.
To delete re-segmentable words from the intermediate word stock more accurately, in the embodiment of the present invention, when a word can be segmented again, it is judged whether the word belongs to a specific dictionary. The specific dictionary is drawn from the specified dictionaries and includes at least one of the following: the disease diagnosis name dictionary, body part dictionary, symptom dictionary, and surgery dictionary.
When the word that can be segmented again does not belong to the specific dictionary, it is deleted from the intermediate word stock. Accordingly, when it belongs to the specific dictionary, it is not deleted, which avoids erroneously deleting words in the specific dictionary.
A word that can be segmented again may also be called a parent word.
For example, the word "intrahepatic" in the intermediate word stock can be segmented again into "liver" and "inside", that is, "intrahepatic" is a parent word, so the parent word "intrahepatic" can be deleted from the intermediate word stock.
As another example, although the word "liver cirrhosis" in the intermediate word stock can be segmented again into "liver" and "cirrhosis", that is, "liver cirrhosis" is a parent word, it is not deleted because it belongs to the disease diagnosis name dictionary.
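The construction of the set word stock can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary names and the tiny dictionaries are hypothetical, the Chinese strings are assumed renderings of the translated examples (肝内 "intrahepatic", 肝 "liver", 内 "inside", 肝硬化 "liver cirrhosis"), and "can be segmented again" is interpreted as "can be covered completely by shorter entries of the intermediate word stock".

```python
def can_resegment(word, lexicon):
    """Return True if `word` splits completely into shorter lexicon entries."""
    n = len(word)
    # reachable[i] is True when word[:i] can be covered by shorter entries.
    reachable = [True] + [False] * n
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if reachable[start] and piece != word and piece in lexicon:
                reachable[end] = True
                break
    return reachable[n]


def build_set_lexicon(specified, protected_names):
    """Merge dictionaries, then drop parent words unless a protected dictionary lists them."""
    intermediate = set().union(*specified.values())
    protected = set().union(*(specified[name] for name in protected_names))
    return {
        word for word in intermediate
        if word in protected or not can_resegment(word, intermediate)
    }


# Toy data (assumed): a general dictionary plus a disease diagnosis name dictionary.
specified = {
    "general": {"肝内", "肝", "内", "可见"},
    "disease": {"肝硬化", "硬化"},
}
set_lexicon = build_set_lexicon(specified, protected_names=["disease"])
```

In this toy run the parent word 肝内 is removed because it splits into 肝 and 内, while 肝硬化 is kept because the protected disease dictionary lists it.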
In the embodiment of the invention, the pathology report to be analyzed can be segmented based on the set word stock as follows: the report is segmented using a forward maximum matching algorithm combined with new word discovery.
The idea of the forward maximum matching algorithm is to match runs of consecutive characters in the pathology report to be analyzed, from left to right, against the words in the set word stock; when a run matches a word, that word is split off. A word is not split off at the first match, however: to guarantee the longest match, matching ends only when the next scan yields neither a word in the set word stock nor a prefix of such a word.
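The forward maximum matching idea can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it reaches the longest match by trying the longest candidate first rather than by extending a prefix scan, characters with no lexicon entry are emitted one at a time, and the function name and toy lexicons (assumed Chinese renderings of the translated example phrase) are hypothetical.

```python
def forward_max_match(text, lexicon):
    """Segment `text` left to right, always taking the longest lexicon entry."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Fine-grained lexicon (parent word 肝内 already removed) versus a coarse one.
fine = {"肝", "内", "可见", "淋巴", "淋巴结", "转移"}
coarse = fine | {"肝内"}

fine_tokens = forward_max_match("肝内可见淋巴结转移", fine)
# fine_tokens == ["肝", "内", "可见", "淋巴结", "转移"]
coarse_tokens = forward_max_match("肝内可见淋巴结转移", coarse)
# coarse_tokens == ["肝内", "可见", "淋巴结", "转移"]
```

The two runs show the effect the text describes: with the parent word in the lexicon the organ 肝 is hidden inside 肝内, and once the parent word is removed it appears as its own token.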
In the embodiment of the invention, to expand the set word stock and minimize out-of-vocabulary words during segmentation, the corpus in the business library can be segmented, counted and screened so as to expand the dictionaries related to symptoms, diseases, body parts and the like. The business library generally includes corpora such as macroscopic findings, microscopic findings, pathological diagnoses and immunohistochemistry.
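The text does not say which statistics or screening rules are used for dictionary expansion. One simple possibility, shown here purely as an assumption, is to count how often two adjacent tokens co-occur in the segmented corpus and to propose frequent pairs that are not yet in the word stock as candidates for manual screening. The function name, threshold and toy corpus are hypothetical.

```python
from collections import Counter


def candidate_new_words(segmented_reports, lexicon, min_count=2):
    """Propose frequent adjacent-token merges that the lexicon does not contain yet."""
    counts = Counter()
    for tokens in segmented_reports:
        for left, right in zip(tokens, tokens[1:]):
            counts[left + right] += 1
    return sorted(
        word for word, count in counts.items()
        if count >= min_count and word not in lexicon
    )


# Toy segmented corpus (assumed data): each report is a list of tokens.
reports = [
    ["胃", "窦", "溃疡"],
    ["胃", "窦", "粘膜"],
    ["胃", "体"],
]
candidates = candidate_new_words(reports, lexicon={"胃", "窦", "体"})
```

Candidates produced this way would still need screening before being added to a symptom, disease or body part dictionary.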
Step S13, vectorizing the segmented pathology report to be analyzed to obtain the word vector set corresponding to the pathology report to be analyzed.
In a specific implementation, a pre-trained word vector conversion tool can be used to vectorize the segmented pathology report to be analyzed, thereby obtaining the word vector set corresponding to the pathology report to be analyzed. To improve the accuracy of word vector conversion, the word vector conversion tool can be trained on data from a business library in the medical field.
In a specific implementation, after segmentation with the set word stock, words that appear in an encyclopedia corpus are initialized with their encyclopedia word vectors, and words that do not are initialized from a truncated normal distribution. The word vectors are then trained in an unsupervised manner with word2vec to obtain the word vector set finally used.
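The initialization step can be sketched as follows, reading the initialization of uncovered words as sampling from a truncated normal distribution (an interpretation of the translated text). The function names, the standard deviation, the truncation bound and the toy vectors are assumptions, and the subsequent unsupervised word2vec training is not shown.

```python
import random


def truncated_normal(rng, mean=0.0, std=0.1, bound=2.0):
    """Draw from N(mean, std) restricted to mean ± bound * std by rejection sampling."""
    while True:
        value = rng.gauss(mean, std)
        if abs(value - mean) <= bound * std:
            return value


def init_vectors(vocabulary, pretrained, dim, seed=0):
    """Reuse pretrained vectors where available; sample the rest."""
    rng = random.Random(seed)
    vectors = {}
    for word in vocabulary:
        if word in pretrained:
            vectors[word] = list(pretrained[word])
        else:
            vectors[word] = [truncated_normal(rng) for _ in range(dim)]
    return vectors


# Toy data (assumed): only one of the two words has a pretrained vector.
pretrained = {"肝": [0.1, 0.2, 0.3]}
vectors = init_vectors(["肝", "内"], pretrained, dim=3)
```

Truncating the distribution keeps randomly initialized vectors within a bounded range, so no word starts training with an extreme value.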
Step S14, analyzing the word vector set corresponding to the pathology report to be analyzed with a pre-trained pathology report analysis model to obtain a pathology report analysis result.
In a specific implementation, after the word vector set corresponding to the pathology report to be analyzed is obtained, the pre-trained pathology report analysis model can be used to analyze that word vector set, thereby obtaining the pathology report analysis result.
Step S15, outputting the pathology report analysis result.
In a specific implementation, the output format of the analysis result of the pathology report may be set according to the actual requirement.
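One possible output format, sketched under assumptions, groups the data extracted for each label into a dictionary and serializes it as JSON. The function name, the label names and the extracted values below are hypothetical; labels with no extracted data are kept with an empty list so that the structure stays the same across reports.

```python
import json


def structure_result(entities, labels):
    """Group extracted (label, text) pairs under each expected label."""
    result = {label: [] for label in labels}
    for label, text in entities:
        result.setdefault(label, []).append(text)
    return result


# Hypothetical model output: (label, extracted text) pairs.
entities = [
    ("tumor size", "4.0×3.0×1.0 cm"),
    ("distance from upper resection margin", "6.0 cm"),
]
labels = ["tumor size", "distance from upper resection margin", "metastatic site"]

structured = structure_result(entities, labels)
output = json.dumps(structured, ensure_ascii=False, indent=2)
```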
As can be seen from the above, the pathology report to be analyzed is segmented based on the set word stock, the segmented report is vectorized to obtain the corresponding word vector set, and a pre-trained pathology report analysis model analyzes that word vector set to obtain the pathology report analysis result. The set word stock is obtained by combining a plurality of specified dictionaries into an intermediate word stock, traversing all words in the intermediate word stock, judging whether each word can be segmented again, and deleting the words that can be segmented again from the intermediate word stock. The words remaining in the set word stock are therefore finer-grained, which makes segmentation of the pathology report more accurate and finer-grained and, in turn, improves the accuracy of the pathology report analysis result.
In a specific implementation, the pathology report analysis model may be trained in the following manner. Referring to FIG. 2, the training may specifically include the following steps:
Step S21, obtaining a training sample set.
In a specific implementation, different pathologists write reports in different ways and may name the same body part differently, so the content of pathology reports differs and the field names of the defined labels differ when samples are annotated. Therefore, when a sample is annotated, the labels can be determined according to the type of the pathology report to be analyzed, and the data corresponding to each label is marked in the sample. Pathology reports may be classified by lesion site or by metastatic site, and it will be understood that they may also be classified in other ways. Examples include abdominal pathology reports, liver pathology reports, pancreas pathology reports, and pathology reports for other sites.
In the embodiment of the invention, after the data annotation of each sample is completed, the annotated data can be converted into the standard JSONL format to serve as an annotated sample.
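A JSONL file holds one JSON object per line. The sketch below shows one way an annotated sample could be serialized; the field names (`text`, `spans`, `start`, `end`, `label`) and the label name are assumptions, since the text does not specify the schema.

```python
import json


def to_jsonl(samples):
    """Serialize annotated samples, one JSON object per line."""
    return "\n".join(json.dumps(sample, ensure_ascii=False) for sample in samples)


def from_jsonl(text):
    """Parse JSONL text back into a list of samples, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical annotated sample: character offsets mark the labeled data.
report = "Lymph node metastasis is seen in the liver."
start = report.index("liver")
samples = [{
    "text": report,
    "spans": [{"start": start, "end": start + len("liver"), "label": "metastatic_site"}],
}]

jsonl_text = to_jsonl(samples)
```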
In a specific implementation, the acquired training sample set may include a number of annotated samples.
In the embodiment of the invention, samples can be expanded on the basis of labeled samples to obtain expanded annotated samples, so that the annotated samples include both labeled samples and expanded annotated samples.
Specifically, an expanded annotated sample can be obtained as follows: acquiring the replacement data corresponding to a label used in the labeled sample, and replacing the annotated data of that label in the labeled sample with the replacement data to obtain the expanded annotated sample.
For example, for the labeled sample "lymph node metastasis is seen in the liver", the label is metastatic site and the data annotated for it is "liver". When samples are expanded from this labeled sample, the replacement data corresponding to the label metastatic site may include lung, breast, spleen and so on. Replacing "liver" in the labeled sample with lung, breast and spleen respectively yields the following expanded annotated samples: "lymph node metastasis is seen in the lung"; "lymph node metastasis is seen in the breast"; and "lymph node metastasis is seen in the spleen".
As another example, the labeled sample is a radical gastrectomy specimen for gastric cancer: "Stomach: lesser curvature 13.0 cm long, greater curvature 22.0 cm long, interval 6.0 cm; in the gastric body, 6.0 cm from the upper resection margin and 6.0 cm from the lower resection margin, there is an ulcerative tumor 4.0×3.0×1.0 cm in size, with a gray-white, tough cut surface. A small amount of duodenum is attached below, 2.0 cm long and 3.0 cm in diameter, with smooth mucosa. Attached greater omentum tissue, 24.0×23.0×1.3 cm, with no obviously enlarged nodules palpated. 13 nodules are found on the lesser curvature side, 0.2-1.2 cm in diameter, and 2 nodules on the greater curvature side, 0.3-0.5 cm in diameter. Upper resection margin: one piece of irregular tissue, 2.2×2.0×0.3 cm." The labels used to annotate the sample are: distance from the upper resection margin, distance from the lower resection margin, invasion site, tumor size, and metastatic site. The annotated data for the label distance from the upper resection margin is 6.0 cm, for distance from the lower resection margin is 6.0 cm, and for tumor size is 4.0×3.0×1.0 cm. Because the sample contains no data for the labels invasion site and metastatic site, these are not annotated.
The replacement data corresponding to the label tumor size may include 4.5×2.0×2.0 cm, 5.0×3.0×3.0 cm and so on. Replacing 4.0×3.0×1.0 cm in the labeled sample with 4.5×2.0×2.0 cm and 5.0×3.0×3.0 cm respectively yields two expanded annotated samples. Expanded annotated sample one: "Stomach: lesser curvature 13.0 cm long …… an ulcerative tumor 4.5×2.0×2.0 cm in size, with a gray-white cut surface …… 2.2×2.0×0.3 cm." Expanded annotated sample two: "Stomach: lesser curvature 13.0 cm long …… an ulcerative tumor 5.0×3.0×3.0 cm in size, with a gray-white cut surface …… 2.2×2.0×0.3 cm."
It can be understood that sample expansion may be applied to the data corresponding to one label in the labeled sample, to the data corresponding to several labels, or to the data corresponding to all labels, and may be configured according to actual requirements. The examples above are only schematic illustrations intended to help those skilled in the art better understand and implement the sample expansion scheme; they do not limit the scope of protection.
By expanding samples from the labeled samples with the replacement data corresponding to each label, more labeling samples can be obtained from a small number of labeled samples, and the forms of expression in the samples become more diverse. This saves considerable manpower and material resources and can also improve the training effect of the pathology report analysis model.
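The replacement-based expansion described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `expand_labeled_sample`, the dictionary layout, and the first-occurrence string replacement are all assumptions made for the example (a real annotation would normally carry character offsets).

```python
from typing import Dict, List


def expand_labeled_sample(
    text: str,
    labeled_data: Dict[str, str],
    replacements: Dict[str, List[str]],
) -> List[str]:
    """Return expanded samples, one per replacement value of each label."""
    expanded = []
    for label, original_value in labeled_data.items():
        for new_value in replacements.get(label, []):
            # Assumption: the annotated span is the first occurrence of the
            # labeled value; real annotations would store exact offsets.
            expanded.append(text.replace(original_value, new_value, 1))
    return expanded


sample = "lymph node metastasis is seen in the liver"
labeled = {"metastatic site": "liver"}
candidates = {"metastatic site": ["lung", "breast", "spleen"]}

expanded_samples = expand_labeled_sample(sample, labeled, candidates)
for expanded_sample in expanded_samples:
    print(expanded_sample)
```

Labels that have no labeled data in the sample (such as the invasion site in the gastrectomy example) simply produce no expanded samples here.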
Step S22, segmenting the labeling sample based on the set word stock to obtain a segmented labeling sample.
Step S23, vectorizing the segmented labeling sample to obtain a word vector set corresponding to the labeling sample.
Step S24, inputting the word vector set corresponding to the labeling sample into spaCy.
Step S25, training the parameters in spaCy with the word vector set corresponding to the labeling sample until a set convergence condition is met, so as to obtain the pathology report analysis model.
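Steps S22 and S23 can be illustrated with a small sketch. It assumes forward maximum matching for segmentation (the algorithm named in claim 3) and, purely for illustration, maps each word to an integer id instead of a trained word vector; the tiny lexicon and the names `forward_max_match` and `vectorize` are hypothetical.

```python
from typing import Dict, List, Set


def forward_max_match(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy left-to-right segmentation taking the longest lexicon word."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def vectorize(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to integer ids, growing the vocabulary as needed."""
    return [vocab.setdefault(token, len(vocab)) for token in tokens]


# Hypothetical lexicon: liver, lymph node, metastasis, lymph node metastasis.
lexicon = {"肝", "淋巴结", "转移", "淋巴结转移"}
tokens = forward_max_match("肝内见淋巴结转移", lexicon)
vocab: Dict[str, int] = {}
ids = vectorize(tokens, vocab)
print(tokens, ids)
```

In a real system the integer ids would typically index an embedding table, so that each word is represented by a dense vector before being passed to the model.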
In an implementation, spaCy processes text in a modular way. When natural language processing (Natural Language Processing, NLP) is invoked on a text, spaCy first tokenizes the text to produce a Doc object, and the Doc is then processed in turn by several different components, together referred to as the processing pipeline. The default spaCy processing pipeline consists, in order, of a part-of-speech tagger (tagger), a dependency parser (parser), a named entity recognizer (ner), etc. Each pipeline component returns the processed Doc, which is then passed on to the next component.
Training the pathology report analysis model therefore amounts to training the tagger, the parser and the ner components. Since spaCy itself does not currently provide an open Chinese model, a Chinese model can be trained from a Chinese corpus.
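The hand-off of a document from one component to the next can be mimicked in plain Python. The sketch below is a toy model of the idea only: the `Doc` dataclass, the rule-based `tagger` and `ner` functions, and the small gazetteer are all hypothetical stand-ins and are not spaCy's classes or components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Doc:
    """Toy stand-in for a processed document (not spaCy's Doc class)."""
    tokens: List[str]
    annotations: Dict[str, list] = field(default_factory=dict)


def tokenizer(text: str) -> Doc:
    # Whitespace tokenization keeps the example short.
    return Doc(tokens=text.split())


def tagger(doc: Doc) -> Doc:
    # Placeholder tags: numeric tokens are "NUM", everything else is "X".
    doc.annotations["tags"] = [
        "NUM" if token[0].isdigit() else "X" for token in doc.tokens
    ]
    return doc


def ner(doc: Doc) -> Doc:
    # Placeholder entity rule based on a tiny hypothetical gazetteer.
    sites = {"liver", "lung", "spleen"}
    doc.annotations["ents"] = [
        (token, "SITE") for token in doc.tokens if token in sites
    ]
    return doc


PIPELINE: List[Tuple[str, Callable[[Doc], Doc]]] = [
    ("tagger", tagger),
    ("ner", ner),
]


def process(text: str) -> Doc:
    doc = tokenizer(text)
    for _name, component in PIPELINE:
        # Each component returns the Doc, which is passed to the next one.
        doc = component(doc)
    return doc


doc = process("metastasis seen in liver 2 nodules")
print(doc.annotations)
```

Because every component has the same signature, components can be added, removed or reordered without changing the driver loop.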
spaCy's models are statistical, and every decision they make is a prediction. The predictions are based on the samples the model has seen during training. Training the pathology report analysis model first requires training samples (text samples) together with the labels the model is expected to predict. The labels may be part-of-speech tags, named entities or other information. The model is then given the unlabeled text and makes predictions. Since the correct answers are known, the deviation between the model's output and the expected output can be fed back to the model. According to this deviation, the parameters in spaCy are adjusted so that the actual output of the model tends toward the expected output; once the set convergence condition is met, the pathology report analysis model is obtained. When the pathology report analysis model analyzes a pathology report, the required analysis result can be obtained by combining the named entity recognition result with the contextual semantics of the pathology report.
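The predict, compare and adjust cycle can be shown with a deliberately tiny error-driven learner. This is a perceptron-style toy written for illustration, not spaCy's model or training API; the single-feature scoring, the update of plus or minus 1.0, and the "one error-free pass" convergence condition are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (token, expected label)
Weights = Dict[Tuple[str, str], float]


def predict(weights: Weights, token: str, labels: List[str]) -> str:
    # Score each label from a single feature: the token itself.
    return max(labels, key=lambda label: weights[(token, label)])


def train(
    samples: List[Sample], labels: List[str], max_epochs: int = 20
) -> Tuple[Weights, int]:
    weights: Weights = defaultdict(float)
    for epoch in range(max_epochs):
        errors = 0
        for token, gold in samples:
            guess = predict(weights, token, labels)
            if guess != gold:
                # Feed the deviation back: reward the expected label and
                # penalise the wrongly predicted one.
                weights[(token, gold)] += 1.0
                weights[(token, guess)] -= 1.0
                errors += 1
        if errors == 0:  # toy convergence condition: an error-free pass
            return weights, epoch + 1
    return weights, max_epochs


labels = ["O", "SITE", "SIZE"]
samples = [("liver", "SITE"), ("4.0x3.0x1.0cm", "SIZE"), ("seen", "O")]
weights, epochs_used = train(samples, labels)
print(predict(weights, "liver", labels), epochs_used)
```

A real model would use context features and gradient-based updates, but the control flow is the same: predictions are compared with the expected labels and the parameters are adjusted until the stopping condition holds.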
In an implementation, spaCy also involves language data (Language data). Complete language support requires creating a Language subclass, declaring custom language data such as the stop word list and tokenizer exceptions, and testing the new tokenizer. Once the language settings are complete, a vocabulary can be created, including word frequencies, Brown clusters and word vectors.
Every language is different and usually has many exceptions and special cases, especially among the most common words. Some of these exceptions are shared across languages, while others are entirely language-specific, often so specific that they need to be hard-coded. The spacy.lang module contains the language-specific data, organized in simple Python files so that the data is easy to update and extend.
The stop word list in spaCy is a list of characters or words that are usually filtered out automatically before or after data processing. Words in the stop word list usually carry little meaning in ordinary language and are usually treated like blank characters. In the medical field, however, some stop words of ordinary language have special significance. For example, the Roman numerals I, II, III, IV, etc. are commonly used to indicate the stage of a cancer; these numerals are meaningful characters and cannot be removed at will. Therefore, to avoid deleting medically meaningful words as stop words, the stop word list in spaCy can be modified in the embodiment of the present invention.
In an implementation, modifying the stop word list in spaCy may include deleting a first type of stop word from the stop word list, where the first type of stop word may include at least one of: Roman numerals, Greek letters, asterisks, etc. The first type of stop word refers to words that are meaningless in ordinary usage but have diagnostic significance in the medical field.
Modifying the stop word list in spaCy may also include adding a second type of stop word to the stop word list, where the second type of stop word may include at least one of: "review at any time", "wait for observation", etc. The second type of stop word refers to words that have no diagnostic significance.
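The two modifications can be expressed as set operations. In the sketch below a plain Python set stands in for spaCy's stop word list, and the function name `adapt_stop_words` and the sample entries are hypothetical.

```python
from typing import Iterable, List, Set


def adapt_stop_words(
    stop_words: Set[str],
    diagnostic_terms: Iterable[str],
    non_diagnostic_phrases: Iterable[str],
) -> Set[str]:
    """Return a copy of the stop word set adapted to medical text."""
    adapted = set(stop_words)
    # First type: remove entries that carry diagnostic meaning.
    adapted.difference_update(diagnostic_terms)
    # Second type: add phrases that carry no diagnostic meaning.
    adapted.update(non_diagnostic_phrases)
    return adapted


generic_stop_words = {"the", "of", "I", "II", "III", "IV", "*", "α", "β"}
medical_stop_words = adapt_stop_words(
    generic_stop_words,
    diagnostic_terms={"I", "II", "III", "IV", "α", "β", "*"},
    non_diagnostic_phrases={"review at any time", "wait for observation"},
)

tokens: List[str] = ["stage", "III", "of", "the", "tumor", "review at any time"]
kept = [token for token in tokens if token not in medical_stop_words]
print(kept)
```

Working on a copy leaves the generic list untouched, so the same base list can be adapted differently for other domains.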
Further, when the pathology report analysis result is output in step S15, the data corresponding to each label may be obtained according to the label, the labels and their corresponding data may be structured to obtain structured data, and the structured data may be output as the pathology report analysis result.
In some embodiments, the pathology report parsing results may be output in a tabular manner, as shown in table 1:
TABLE 1
Text                 Label
Pancreatic cancer    Lesion
33.7×32.8 mm         Size
Liver                Distant metastasis
Fundus of stomach    Invasion
Abdominal aorta      Lymph node metastasis
It will be appreciated that other contents may be displayed on the basis of the display contents illustrated in table 1, and in particular, may be configured by a user according to the needs.
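One possible way to structure label and data pairs and render them as a two-column table is sketched here. The record layout (`text` and `label` fields) and the function names are assumptions for illustration; the source does not prescribe a data format.

```python
from typing import Dict, List, Tuple


def structure_results(entities: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn (text, label) pairs into records with named fields."""
    return [{"text": text, "label": label} for text, label in entities]


def render_table(rows: List[Dict[str, str]]) -> str:
    """Render the records as an aligned two-column plain-text table."""
    width = max([len("Text")] + [len(row["text"]) for row in rows])
    lines = ["Text".ljust(width) + "  Label"]
    lines += [row["text"].ljust(width) + "  " + row["label"] for row in rows]
    return "\n".join(lines)


entities = [
    ("Pancreatic cancer", "Lesion"),
    ("33.7×32.8 mm", "Size"),
    ("Liver", "Distant metastasis"),
    ("Fundus of stomach", "Invasion"),
    ("Abdominal aorta", "Lymph node metastasis"),
]
rows = structure_results(entities)
table = render_table(rows)
print(table)
```

Keeping the structured records separate from the rendering step makes it straightforward to add further display columns configured by the user.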
In another embodiment, the pathology report analysis result may be output as text, with each label separated from its corresponding data by a space. Referring to fig. 3, which shows a schematic diagram of the display effect of the pathology report analysis result on the visual interface in the embodiment of the present invention: the label data "pancreatic cancer" and the label "lesion" are separated by a space; the label data "33.7×32.8 mm" and the label "size" are separated by a space; the label data "liver" and the label "distant metastasis" are separated by a space; the label data "fundus of stomach" and the label "invasion" are separated by a space; and the label data "abdominal aorta" and the label "lymph node metastasis" are separated by a space. Different labels can be distinguished by background color, boxes, blank spaces or punctuation marks. In addition, the pathology report analysis result may include other contents, such as a prediction of the possibility of metastasis and a prediction of the possibility of invasion. In fig. 3, the possibility of distant metastasis is predicted to be high, the possibility of invasion is predicted to be high, and the possibility of lymph node metastasis is predicted to be high.
The output pathology report analysis result can help research physicians carry out statistics, classification and the like, and can help clinicians provide more effective follow-up treatment for patients. Structuring the pathology report quickly provides doctors with key information and improves their working efficiency.
In a specific implementation, the structured pathology report analysis result can be stored in a relational database for researchers and clinicians to consult, which facilitates their research, screening, diagnosis and treatment work.
In an implementation, the processing result of the pathology report analysis model can be displayed on the visual interface of a front-end platform, and the analysis result of the pathology report analysis model is demonstrated in real time through the visual interface, so that a user can intuitively assess the pathology report analysis result and the quality of the model.
The visual interface may provide an input port for the pathology report to be analyzed and a display area for the pathology report analysis result. It may further provide a label configuration entry, at which a plurality of labels can be set; through the label configuration entry a user can add a new label, delete or modify an existing label, search for labels, and so on. In addition, the user can configure the display content of the pathology report analysis result.
The pathology report analysis model can be used to analyze stock data or incremental data, either of which can serve as the pathology report to be analyzed. Stock data is generally configured to run once, that is, all stock data is analyzed in a single pass; incremental data is generally configured to run in real time or periodically.
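Storing the structured result in a relational database might look like the following sketch, which uses Python's built-in `sqlite3` module with an in-memory database. The table name `parse_result`, its columns, and the report identifier are hypothetical; the source does not specify a schema or a database product.

```python
import sqlite3
from typing import List, Tuple


def store_results(
    conn: sqlite3.Connection,
    report_id: str,
    entities: List[Tuple[str, str]],
) -> None:
    """Persist (text, label) pairs of one report as rows of a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parse_result ("
        "report_id TEXT NOT NULL, label TEXT NOT NULL, text TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO parse_result (report_id, label, text) VALUES (?, ?, ?)",
        [(report_id, label, text) for text, label in entities],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_results(
    conn,
    "report-001",
    [("Pancreatic cancer", "Lesion"), ("Liver", "Distant metastasis")],
)

# Example query a researcher might run: all distant metastasis sites.
metastasis_rows = conn.execute(
    "SELECT report_id, text FROM parse_result WHERE label = ?",
    ("Distant metastasis",),
).fetchall()
print(metastasis_rows)
```

With one row per label and data pair, statistics such as the frequency of each metastatic site reduce to ordinary SQL aggregation.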
In order to facilitate better understanding and implementation of the embodiments of the present invention by those skilled in the art, the present invention further provides a pathology report parsing apparatus.
Referring to fig. 4, which shows a schematic structural diagram of a pathology report analysis apparatus according to an embodiment of the present invention, a pathology report analysis apparatus 40 may include:
an acquisition unit 41, configured to acquire a pathology report to be analyzed;
a word segmentation unit 42, configured to segment the pathology report to be analyzed based on a set word stock, so as to obtain a segmented pathology report to be analyzed, where the set word stock is obtained in the following manner: combining a plurality of specified dictionaries to obtain an intermediate word stock, traversing all words in the intermediate word stock, judging whether each word can be segmented again, and deleting the words that can be segmented again from the intermediate word stock to obtain the set word stock;
a vectorization unit 43, configured to vectorize the segmented pathology report to be analyzed, so as to obtain a word vector set corresponding to the pathology report to be analyzed;
an analysis unit 44, configured to perform pathology report analysis on the word vector set corresponding to the pathology report to be analyzed by using a pre-trained pathology report analysis model, so as to obtain a pathology report analysis result;
and an output unit 45, configured to output the pathology report analysis result.
In a specific implementation, for the working principle and workflow of the pathology report analysis device 40, reference may be made to the description of the pathology report analysis method provided in any of the above embodiments of the present invention, which is not repeated here.
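The construction of the set word stock used by the word segmentation unit can be sketched as follows. The sketch assumes that "can be segmented again" means the word can be split into two or more other words of the intermediate word stock, and it keeps such words when they come from a protected dictionary, as recited in claim 1. The dictionary names, the sample entries and the function names are hypothetical.

```python
from typing import Dict, Set


def can_resegment(word: str, lexicon: Set[str]) -> bool:
    """True if `word` splits into two or more other words of the lexicon."""
    n = len(word)
    # reachable[i] is True when word[:i] can be covered by lexicon words.
    reachable = [True] + [False] * n
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if reachable[start] and piece != word and piece in lexicon:
                reachable[end] = True
                break
    return reachable[n]


def build_set_lexicon(
    dictionaries: Dict[str, Set[str]], protected: Set[str]
) -> Set[str]:
    """Merge dictionaries, then drop re-segmentable, unprotected words."""
    intermediate: Set[str] = set().union(*dictionaries.values())
    keep_always: Set[str] = set().union(
        *(dictionaries[name] for name in protected)
    )
    return {
        word
        for word in intermediate
        if word in keep_always or not can_resegment(word, intermediate)
    }


dictionaries = {
    "body_part": {"肝", "淋巴结"},        # liver, lymph node
    "disease": {"淋巴结转移"},            # lymph node metastasis
    "general": {"转移", "肝转移"},        # metastasis, liver metastasis
}
set_lexicon = build_set_lexicon(dictionaries, protected={"body_part", "disease"})
print(sorted(set_lexicon))
```

Here "肝转移" is removed because it splits into "肝" and "转移" and comes only from the general dictionary, while "淋巴结转移" is kept because it belongs to a protected dictionary even though it can be split.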
The embodiment of the present invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the pathology report parsing method in any of the above embodiments.
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the pathology report analysis method in any embodiment when running the computer program.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program to instruct related hardware, the program may be stored in any computer readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, etc.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention should be assessed accordingly to that of the appended claims.

Claims (11)

1. A pathology report parsing method, comprising:
obtaining a pathological report to be analyzed;
Based on a set word stock, segmenting the pathological report to be analyzed to obtain a segmented pathological report to be analyzed, wherein the set word stock is obtained in the following way: combining a plurality of specified dictionaries to obtain an intermediate word stock, traversing all words in the intermediate word stock, judging whether each word can be subjected to word segmentation again, deleting words capable of being subjected to word segmentation again from the intermediate word stock, and obtaining the set word stock;
vectorizing the segmented pathology report to be analyzed to obtain a word vector set corresponding to the pathology report to be analyzed;
Performing pathological report analysis on the word vector set corresponding to the pathological report to be analyzed by adopting a pre-trained pathological report analysis model to obtain a pathological report analysis result;
Outputting the pathological report analysis result;
The step of judging whether each word can be subjected to word segmentation again and deleting the words capable of word segmentation again from the intermediate word stock comprises the following steps: when a word can be subjected to word segmentation again, judging whether the word capable of word segmentation again belongs to a specific dictionary, wherein the specific dictionary is from the specified dictionaries, and the specific dictionary comprises at least one of the following: disease diagnosis name dictionary, body part dictionary, symptom dictionary, and operation dictionary; deleting the word capable of word segmentation again from the intermediate word stock when the word does not belong to the specific dictionary; and when the word capable of word segmentation again belongs to the specific dictionary, not deleting the word.
2. The pathology report parsing method of claim 1, wherein the specified dictionary is selected from the group consisting of: disease diagnosis name dictionary, body part dictionary, symptom dictionary, surgery dictionary, jieba dictionary.
3. The pathology report parsing method according to claim 1, wherein the word segmentation of the pathology report to be parsed based on the set word stock comprises:
and performing word segmentation on the pathological report to be analyzed by adopting a forward maximum matching algorithm and new word discovery.
4. The pathology report parsing method of claim 1, wherein the pathology report parsing model is trained as follows:
Acquiring a training sample set, wherein the training sample set comprises a plurality of labeling samples;
Segmenting the labeling sample based on the set word stock to obtain a segmented labeling sample;
Vectorizing the word-segmented labeling sample to obtain a word vector set corresponding to the labeling sample;
inputting a word vector set corresponding to the labeling sample to spaCy;
and training the parameters in spaCy by adopting the word vector set corresponding to the labeling sample until the parameters in spaCy meet the set convergence condition, so as to obtain the pathological report analysis model.
5. The pathology report parsing method of claim 4, wherein the labeling samples comprise labeled samples and extended labeling samples, wherein the extended labeling samples are obtained by:
acquiring replacement data corresponding to a label adopted by the marked sample;
and replacing the labeling data with the corresponding label in the labeled sample by adopting the replacement data to obtain the expanded labeling sample.
6. The pathology report parsing method of claim 5, wherein the outputting the pathology report parsing result comprises:
And obtaining corresponding data according to the tag, carrying out structuring treatment on the tag and the data corresponding to the tag to obtain structured data, and outputting the structured data as the pathological report analysis result.
7. The pathology report parsing method of claim 4, further comprising: and modifying the stop word list in spaCy.
8. The pathology report parsing method of claim 7, wherein said modifying the stop vocabulary of spaCy comprises at least one of:
Deleting a first type of stop words from the stop word list, wherein the first type of stop words comprise at least one of the following: Roman numerals, Greek letters, asterisks;
adding a second type of stop words to the stop word list, wherein the second type of stop words comprise at least one of the following: review at any time, wait for observation.
9. A pathology report analysis apparatus, comprising:
the acquisition unit is used for acquiring a pathology report to be analyzed;
The word segmentation unit is used for segmenting the pathological report to be analyzed based on a set word stock to obtain the segmented pathological report to be analyzed, wherein the set word stock is obtained in the following way: combining a plurality of specified dictionaries to obtain an intermediate word stock, traversing all words in the intermediate word stock, judging whether each word can be subjected to word segmentation again, deleting words capable of being subjected to word segmentation again from the intermediate word stock, and obtaining the set word stock; the step of judging whether each word can be subjected to word segmentation again and deleting the words capable of word segmentation again from the intermediate word stock comprises the following steps: when a word can be subjected to word segmentation again, judging whether the word capable of word segmentation again belongs to a specific dictionary, wherein the specific dictionary is from the specified dictionaries, and the specific dictionary comprises at least one of the following: disease diagnosis name dictionary, body part dictionary, symptom dictionary, and operation dictionary; deleting the word capable of word segmentation again from the intermediate word stock when the word does not belong to the specific dictionary; when the word capable of word segmentation again belongs to the specific dictionary, the word is not deleted;
The vectorization unit is used for vectorizing the segmented pathology report to be analyzed to obtain a word vector set corresponding to the pathology report to be analyzed;
the analysis unit is used for carrying out pathological report analysis on the word vector set corresponding to the pathological report to be analyzed by adopting a pre-trained pathological report analysis model to obtain pathological report analysis results;
And the output unit is used for outputting the pathological report analysis result.
10. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the pathology report parsing method according to any one of claims 1 to 8.
11. A terminal comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, characterized in that the processor executes the steps of the pathology report parsing method according to any one of claims 1 to 8 when the computer program is executed.
CN202010825906.2A 2020-08-17 2020-08-17 Pathological report analysis method and device, storage medium and terminal Active CN112289398B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010825906.2A CN112289398B (en) 2020-08-17 2020-08-17 Pathological report analysis method and device, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010825906.2A CN112289398B (en) 2020-08-17 2020-08-17 Pathological report analysis method and device, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN112289398A CN112289398A (en) 2021-01-29
CN112289398B true CN112289398B (en) 2024-05-31

Family

ID=74420737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010825906.2A Active CN112289398B (en) 2020-08-17 2020-08-17 Pathological report analysis method and device, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN112289398B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5862259A (en) * 1996-03-27 1999-01-19 Caere Corporation Pattern recognition employing arbitrary segmentation and compound probabilistic evaluation
JP2005025555A (en) * 2003-07-03 2005-01-27 Ricoh Co Ltd Thesaurus construction system, thesaurus construction method, program for executing the method, and storage medium with the program stored thereon
WO2018149326A1 (en) * 2017-02-16 2018-08-23 阿里巴巴集团控股有限公司 Natural language question answering method and apparatus, and server
CN108538395A (en) * 2018-04-02 2018-09-14 上海市儿童医院 A kind of construction method of general medical disease that calls for specialized treatment data system
CN108628824A (en) * 2018-04-08 2018-10-09 上海熙业信息科技有限公司 A kind of entity recognition method based on Chinese electronic health record
CN109918672A (en) * 2019-03-13 2019-06-21 东华大学 A kind of structuring processing method of the Thyroid ultrasound report based on tree construction
CN110457682A (en) * 2019-07-11 2019-11-15 新华三大数据技术有限公司 Electronic health record part-of-speech tagging method, model training method and relevant apparatus
CN110534170A (en) * 2019-08-30 2019-12-03 志诺维思(北京)基因科技有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN111274806A (en) * 2020-01-20 2020-06-12 医惠科技有限公司 Method and device for recognizing word segmentation and part of speech and method and device for analyzing electronic medical record
US10740561B1 (en) * 2019-04-25 2020-08-11 Alibaba Group Holding Limited Identifying entities in electronic medical records

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030105638A1 (en) * 2001-11-27 2003-06-05 Taira Rick K. Method and system for creating computer-understandable structured medical data from natural language reports

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Pathology Image Analysis Using Segmentation Deep Learning Algorithms";Wang, Shidan,et al.;《AMERICAN JOURNAL OF PATHOLOGY》;第189卷(第9期);1686-1698 *
"海量文本疾病主题自动提取研究";王明令,等;《数字技术与应用》;第37卷(第5期);74-75 *

Also Published As

Publication number Publication date
CN112289398A (en) 2021-01-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant