WO2024042348A1 - Method, apparatus, medium and electronic device for structuring English medical text - Google Patents

Method, apparatus, medium and electronic device for structuring English medical text

Info

Publication number
WO2024042348A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
structured
model
medical record
type
Prior art date
Application number
PCT/IB2022/057919
Other languages
English (en)
French (fr)
Inventor
郭锋
金薇
张语宸
俞素娥
陈伟权
曹晶露
Original Assignee
Evyd科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Evyd科技有限公司 filed Critical Evyd科技有限公司
Priority to PCT/IB2022/057919
Publication of WO2024042348A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Definitions

  • the present disclosure relates to the field of text processing technology, and in particular, to a method for structuring English medical text, a device for structuring English medical text, a computer-readable storage medium and an electronic device.
  • BACKGROUND OF THE INVENTION The related technology of text structuring converts text expressed in natural language into structured data that can be retrieved, analyzed and computed, and extracts the information of interest in the corresponding scenario, which can greatly reduce manual workload and improve efficiency. For example, in the medical field, extracting and structuring information from a large number of complex unstructured medical records or surgical records is a very useful and difficult technology, and is also indispensable for text processing and knowledge extraction.
  • typically, entity recognition models, relation extraction models, reading comprehension models and text classification models can be used to achieve text structuring.
  • however, these supervised learning methods require a large amount of manually labeled data, and obtaining labeled data is extremely difficult and costly in labor and time.
  • moreover, these supervised learning models are largely studied on cases from Western countries and do not fit well with the English usage habits and case-writing habits of Southeast Asia, which reduces the accuracy of structuring and degrades the text structuring effect. In view of this, there is an urgent need in this field to develop a new method and device for structuring English medical text.
  • the purpose of the present disclosure is to provide a method for structuring English medical text, a device for structuring English medical text, a computer-readable storage medium and an electronic device, so as to overcome, at least to a certain extent, the technical problems of poor accuracy and high cost caused by the limitations of the related technologies. Additional features and advantages of the disclosure will be apparent from the following detailed description, or, in part, may be learned by practice of the disclosure. According to one aspect of the present disclosure, a method for structuring English medical text is provided.
  • the method includes: obtaining English medical record text, and performing text preprocessing on the English medical record text to obtain text to be structured; determining the type of the text to be structured, and determining the corresponding structured model according to the type of the text to be structured; and using the structured model to perform text structuring on the text to be structured to obtain the target structured text of the medical record text.
  • the text preprocessing of the medical record text to obtain the text to be structured includes: using a spelling correction model to perform spelling correction on the medical record text to obtain corrected text; and identifying target keywords in the corrected text and merging the corrected text according to the target keywords to obtain the text to be structured.
  • determining the type of the text to be structured and determining the corresponding structured model according to the type of the text to be structured includes: determining the type of the text to be structured according to the target field of the text to be structured; when the type of the text to be structured is the first type, determining that a named entity recognition model is used for structuring; when the type of the text to be structured is the second type, determining that a text classification model is used for structuring; and when the type of the text to be structured is the third type, determining that a question-and-answer model is used for structuring.
  • determining the type of the text to be structured according to the target field of the text to be structured includes: identifying the target field of the text to be structured; when the target field is named entity information, determining that the text to be structured is of the first type; when the target field is text classification information, determining that the text to be structured is of the second type; and when the target field is question-and-answer information, determining that the text to be structured is of the third type.
  • when it is determined that the named entity recognition model is used for structuring, using the structured model to perform text structuring on the text to be structured includes: obtaining the target format text corresponding to the English medical record text and the verification rules corresponding to the format of the text to be structured, and using the verification rules to parse the target format text to obtain a first structured result, the verification rules including custom rules formulated for the target format text; using the named entity recognition model to perform named entity recognition on the text to be structured to obtain a second structured result of the medical record text; and combining the first structured result and the second structured result to determine the target structured text.
  • the named entity recognition model is obtained through the following training steps: obtaining a medical record sample and the annotation fields corresponding to the medical record sample, and using pre-trained word vectors to perform text mapping on the medical record sample to obtain word embedding vectors; using word embeddings to perform context mapping on the medical record sample to obtain string embedding vectors, and vertically splicing the word embedding vectors and the string embedding vectors to obtain sample vectors; using the annotation fields and the sample vectors to train multiple named entity recognition models to be trained, and performing model scoring on the multiple trained named entity recognition models to obtain multiple corresponding first model scores; and determining one trained named entity recognition model among the multiple trained named entity recognition models according to the multiple first model scores.
  • the text classification model is trained through the following steps: obtaining medical record samples, and training multiple initial text classification models using the text classification fields in the medical record samples; optimizing the multiple initial text classification models with an optimization algorithm to obtain a second score corresponding to each initial text classification model; and determining the text classification model according to the multiple second scores.
  • when it is determined that the question-and-answer model is used for structuring, using the structured model to perform text structuring on the text to be structured includes: obtaining the target question corresponding to the text to be structured; and, based on the target question, using the question-and-answer model to perform an answer search on the text to be structured to obtain the target structured text of the medical record text.
  • a device for structuring English medical text is provided.
  • the device includes: a text acquisition module configured to acquire English medical record text and perform text preprocessing on the English medical record text to obtain the text to be structured; a model selection module configured to determine the type of the text to be structured and determine the corresponding structured model according to the type of the text to be structured; and a result generation module configured to use the structured model to perform text structuring on the text to be structured to obtain the target structured text of the medical record text.
  • an electronic device is provided, including a processor and a memory, wherein computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, the method for structuring English medical text in any of the above exemplary embodiments is implemented.
  • a computer-readable storage medium is provided, a computer program is stored thereon, and when the computer program is executed by a processor, the method for structuring English medical text in any of the above exemplary embodiments is implemented.
  • the English medical text structuring method, English medical text structuring device, computer storage medium and electronic equipment in the exemplary embodiments of the present disclosure at least have the following advantages and positive effects:
  • text preprocessing of the English medical record text can automatically identify and correct spelling errors, ensuring the accuracy and validity of the text and greatly reducing the poor structuring results or omissions caused by spelling errors.
  • further, the corresponding structured model is determined according to the type of the text to be structured for text structuring, which improves the accuracy of text structuring, reduces development and iteration time, and significantly improves the precision and recall of text structuring.
  • Figure 1 schematically shows a flow chart of a method for structuring English medical text in an exemplary embodiment of the present disclosure
  • Figure 2 schematically shows a flow chart of a method for text preprocessing of English medical record text in an exemplary embodiment of the present disclosure;
  • Figure 3 schematically shows a flow chart of a method for determining a structured model according to the text to be structured in an exemplary embodiment of the present disclosure;
  • Figure 4 schematically shows a flow chart of a method for determining the type of the text to be structured according to the target field in an exemplary embodiment of the present disclosure;
  • Figure 5 schematically shows a flow chart of a method for performing text structuring on the text to be structured using a structured model in an exemplary embodiment of the present disclosure;
  • Figure 6 schematically shows a flow chart of a method for training a named entity recognition model in an exemplary embodiment of the present disclosure;
  • Figure 7 schematically shows a flow chart of a method for training a text classification model in an exemplary embodiment of the present disclosure;
  • Figure 8 schematically shows a flow chart of another method for performing text structuring on the text to be structured using a structured model in an exemplary embodiment of the present disclosure;
  • Figure 9 schematically shows a structural diagram of a device for structuring English medical text in an exemplary embodiment of the present disclosure;
  • Figure 10 schematically shows an electronic device for implementing a method for structuring English medical text in an exemplary embodiment of the present disclosure;
  • Figure 11 schematically shows a computer-readable storage medium for implementing a method for structuring English medical text in an exemplary embodiment of the present disclosure.
  • Figure 1 shows a flow chart of the method for structuring English medical text.
  • the method for structuring English medical text at least includes the following steps:
  • in step S110, the English medical record text is obtained, and text preprocessing is performed on the English medical record text to obtain the text to be structured.
  • step S120 the type of the text to be structured is determined, and the corresponding structured model is determined according to the type of the text to be structured.
  • step S130 a structured model is used to structure the text to be structured, and the target structured text of the medical record text is obtained.
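  • As a rough illustration of this three-step flow (S110-S130), the Python sketch below wires the steps together; the helper functions and the keyword heuristic inside determine_type are hypothetical stand-ins, not part of the disclosure.

```python
# Minimal sketch of the S110-S130 flow; the three models are stubbed out so the
# control flow itself runs end to end.

def preprocess(raw_text: str) -> str:
    # S110: cleaning, spelling correction and paragraph merging would go here.
    return raw_text.strip()

def determine_type(text: str) -> str:
    # S120: in the disclosure the type follows from the target field; a simple
    # keyword heuristic stands in for that decision here.
    lowered = text.lower()
    if "symptom" in lowered:
        return "named_entity"
    if "end-stage renal disease" in lowered:
        return "classification"
    return "question_answering"

def structure_medical_record(raw_text: str) -> dict:
    text = preprocess(raw_text)                      # S110
    kind = determine_type(text)                      # S120
    model = {                                        # S130: dispatch to a model
        "named_entity": lambda t: {"entities": []},
        "classification": lambda t: {"label": None},
        "question_answering": lambda t: {"answer": None},
    }[kind]
    return {"type": kind, "result": model(text)}

print(structure_medical_record("Patient reports symptom of flank pain."))
```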
  • step S110 the English medical record text is obtained, and text preprocessing is performed on the English medical record text to obtain text to be structured.
  • real-world research refers to research in which research data comes from real medical environments and reflects actual diagnosis and treatment processes and patient health conditions under specific conditions.
  • each field that needs to be structured is unstructured text, such as positive symptoms, emergency and other text categories.
  • the English medical record text may include medical treatment information, symptom information, diagnosis information, etc., which is not specifically limited in this exemplary embodiment.
  • text preprocessing can be performed on the English medical record text to obtain the corresponding text to be structured. Prior to this, regular expressions can be used to clean the English medical record text to obtain effective text.
  • a regular expression (Regular Expression, abbreviated as regex, regexp or RE) is a concept in computer science. Regular expressions are often used to retrieve and replace text that matches a certain pattern (rule).
  • a regular expression is a logical formula that operates on strings, built from ordinary characters (such as the letters a to z) and special characters (metacharacters): some predefined specific characters, and combinations of these characters, form a "rule string", and this "rule string" expresses a filtering logic for strings.
  • a regular expression is a text pattern that describes one or more strings to match when searching text. Specifically, regular expressions can be used to remove irrelevant text from the English medical record text to obtain effective text.
  • for example, regular expressions can be used to identify irrelevant content in the English medical record text, such as extra spaces, special symbols such as \t, numbers and dates, and remove it from the English medical record text to obtain the effective text, along the lines of the illustrative sketch below.
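```python
import re

# Illustrative cleaning pass: drop stand-alone dates and collapse tabs/extra
# spaces; the real rules in the disclosure are more elaborate and are not
# reproduced here.
DATE_RE = re.compile(r"\b\d{1,2}[/\-.]\d{1,2}[/\-.]\d{2,4}\b")
NOISE_RE = re.compile(r"\t+|\s{2,}")

def clean(text: str) -> str:
    text = DATE_RE.sub(" ", text)    # remove stand-alone dates
    text = NOISE_RE.sub(" ", text)   # collapse tabs and repeated spaces
    return text.strip()

print(clean("Seen on 12/08/2022\t\tany fever?  no"))
```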
  • Figure 2 shows a schematic flow chart of a method for text preprocessing of English medical record text.
  • the method may at least include the following steps:
  • in step S210, a spelling correction model is used to perform spelling correction on the medical record text to obtain the corrected text.
  • a spelling correction model is used to perform spelling correction on the valid text obtained based on the English medical record text to obtain the corrected text.
  • the spelling correction model can be a BERT (Bidirectional Encoder Representations from Transformers) pre-trained model.
  • pre-training refers to a process of training a neural network model on a large data set so that the model learns the common features in that data set. The purpose of pre-training is to provide high-quality model parameters for the subsequent training of the neural network model on a specific data set.
  • a state-of-the-art BERT pre-trained model is used to obtain a universal semantic representation and realize the conversion from natural language to machine language. Specifically, correct text samples and incorrect text samples are obtained and used to fine-tune the pre-trained spelling correction model, obtaining the spelling correction model after transfer learning. An incorrect text sample is a sample that contains spelling errors, and a correct text sample is the corrected counterpart that contains no spelling errors.
  • when the spelling correction model is a BERT pre-trained model, transfer learning can start from the general BERT pre-trained model, which already has a certain understanding of natural language, and the correct text samples and incorrect text samples are used to fine-tune the BERT pre-trained model.
  • Transfer learning is to transfer the parameters of a trained model (pre-trained model) to a new model to help the new model train.
  • the model parameters that have been learned (which can also be understood as the knowledge learned by the model) can be shared with the new model in some way, thereby accelerating and optimizing the learning efficiency of the model, without having to learn from scratch like most networks.
  • there are the following three methods to implement transfer learning namely Transfer Learning, Extract Feature Vector and Fine-tuning.
  • Transfer Learning can freeze all convolutional layers of the pre-trained model and only train your own customized fully connected layers.
  • Extract Feature Vector can first calculate the feature vectors of the convolutional layer of the pre-trained model for all training and test data, and then discard the pre-trained model and only train your own customized simplified version of the fully connected network.
  • Fine-tuning is a process of further training a pre-trained neural network model using a specific data set.
  • the amount of data in the data set used in the fine-tuning phase is smaller than that in the pre-training phase, and the fine-tuning phase uses supervised learning, that is, the training samples in the data set used in the fine-tuning phase contain annotation information.
  • Fine-tuning can freeze some of the convolutional layers of the pre-trained model (usually most of the layers close to the input, because these layers retain a lot of low-level information), or even freeze no layers at all, and train the remaining convolutional layers (usually the partial convolutional layers close to the output) and the fully connected layers.
  • when using correct text samples and incorrect text samples to fine-tune the BERT pre-trained model, the last layer (softmax layer) of the BERT pre-trained model can be truncated and replaced with a new softmax layer trained on the correct and incorrect text samples, so as to obtain the BERT pre-trained model after transfer learning, that is, the spelling correction model.
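  • The sketch below illustrates the head-replacement idea in PyTorch/transformers: the pre-trained BERT encoder is kept (and optionally frozen) while a fresh token-level softmax head is trained on pairs of incorrect and correct text. The checkpoint name is a generic stand-in, not the model used in the disclosure.

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # stand-in checkpoint
encoder = BertModel.from_pretrained("bert-base-uncased")

class SpellCorrector(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        # the new "softmax layer": predicts a corrected token for every position
        self.head = nn.Linear(encoder.config.hidden_size, encoder.config.vocab_size)

    def forward(self, **inputs):
        hidden = self.encoder(**inputs).last_hidden_state
        return self.head(hidden)          # logits over the vocabulary

model = SpellCorrector(encoder)
for p in model.encoder.parameters():      # optionally freeze the encoder at first
    p.requires_grad = False

batch = tokenizer("Patient compains of feaver", return_tensors="pt")
with torch.no_grad():
    logits = model(**batch)
print(logits.shape)                       # (1, sequence_length, vocab_size); head is still untrained
```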
  • step S220 target keywords in the corrected text are identified, and the corrected texts are merged according to the target keywords to obtain text to be structured. After obtaining the corrected text, regular expressions are used to identify the target keywords in the corrected text, and the corrected texts can be merged based on the target keywords to obtain the text to be structured.
  • since the text to be structured obtained from the medical record text may contain many paragraphs and be very long, keywords can be used to obtain the corresponding text to be structured, reducing the difficulty of model learning.
  • a sentence splitter can be used to split the corrected text into sentences to identify each sentence.
  • regular expressions can also be used to obtain the target keywords.
  • the regular expression for identifying the keyword "kidney disease" in the corrected text can be e renal(peritoneal dialysis
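  • A simplified sketch of the keyword-driven merging step follows; the sentence splitter and the kidney-disease pattern are illustrative assumptions and do not reproduce the expression quoted above.

```python
import re

# Split the corrected text into sentences, keep those matching a kidney-disease
# keyword, and merge them into one block of text to be structured.
KIDNEY_RE = re.compile(r"\b(renal|kidney|peritoneal dialysis)\b", re.IGNORECASE)

def merge_by_keyword(corrected_text: str) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", corrected_text)   # naive sentence splitter
    kept = [s for s in sentences if KIDNEY_RE.search(s)]
    return " ".join(kept)

sample = "On peritoneal dialysis since 2020. Denies chest pain. Known renal impairment."
print(merge_by_keyword(sample))
```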
  • the type of the text to be structured is determined, and the corresponding structured model is determined according to the type of the text to be structured.
  • in this exemplary embodiment, the text structuring method includes three NLP (Natural Language Processing) models to achieve the structuring of different fields, namely a named entity recognition model, a text classification model and a question-and-answer model. Other models may also be included, which is not specifically limited in this exemplary embodiment.
  • Figure 3 shows a schematic flowchart of a method for determining a structured model according to text to be structured.
  • the method may at least include the following steps: In step S310, the type of the text to be structured is determined according to the target field of the text to be structured.
  • Figure 4 shows a schematic flow chart of a method for determining the type of text to be structured according to the target field.
  • the method may at least include the following steps: In step S410, the target field of the text to be structured is identified, and when the target field is named entity information, it is determined that the text to be structured is of the first type. For example, when the target field is the first field, it can be determined that the type of the text to be structured is the first type, wherein the first field may be named entity information.
  • the named entity information may be information representing three major categories and seven sub-categories. Among them, the three major categories can include entity categories, time categories and number categories, and the seven sub-categories can include person names, organization names, place names, time, date, currency and percentage.
  • the named entity information may be positive symptoms, negative symptoms, or disease information, which is not specifically limited in this exemplary embodiment.
  • when the target field is text classification information, it is determined that the text to be structured is of the second type. For example, when the target field is the second field, the type of the text to be structured can be determined to be the second type, wherein the second field may be text classification information.
  • the text classification information is information that is automatically classified and tagged according to a certain classification system or standard.
  • the text classification information may be information representing classification problems such as whether there is kidney disease or end-stage renal disease, which is not particularly limited in this exemplary embodiment.
  • when the target field is question-and-answer information, it is determined that the text to be structured is of the third type. For example, when the target field is the third field, the type of the text to be structured can be determined to be the third type, wherein the third field may be question-and-answer information, that is, information for answering questions raised by the user in natural language with accurate and concise natural language.
  • the question and answer information may include field information that is not classified, such as disease start time, which is not specifically limited in this exemplary embodiment.
  • a corresponding structured model can be determined according to the type of the text to be structured for structuring processing.
  • in step S320, when the type of the text to be structured is the first type, it is determined that the named entity recognition model is used for structuring.
  • the first type may be entity information such as identification of symptoms, or other information, which is not specifically limited in this exemplary embodiment.
  • in step S330, when the type of the text to be structured is the second type, it is determined that the text classification model is used for structuring.
  • the second type may be information used to identify field types, or may be other information, which is not specifically limited in this exemplary embodiment.
  • in step S340, when the type of the text to be structured is the third type, it is determined that the question-and-answer model is used for structuring.
  • the third type may be non-entity and non-type field information, or may be other information, which is not specifically limited in this exemplary embodiment.
  • in this exemplary embodiment, the corresponding structured model is determined based on the determined type of the text to be structured, and the text structuring method is thereby selected accurately, so that the structuring processing is more consistent with the text to be structured and the text structuring effect is ensured.
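  • A compact way to express this routing is a field-to-type lookup, as in the sketch below; the field names are illustrative examples rather than a list taken from the disclosure.

```python
# Hypothetical routing table for Figures 3 and 4: target field -> type -> model.
FIELD_TO_TYPE = {
    "positive_symptoms": "named_entity",              # first type  -> NER model
    "has_end_stage_renal_disease": "classification",  # second type -> text classification model
    "disease_start_time": "question_answering",       # third type  -> question-and-answer model
}

def pick_model(target_field: str) -> str:
    return FIELD_TO_TYPE.get(target_field, "question_answering")

print(pick_model("has_end_stage_renal_disease"))
```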
  • a structured model is used to structure the text to be structured, and the target structured text of the medical record text is obtained.
  • the text to be structured can be text structured to obtain the target structured text of the medical record text.
  • Figure 5 shows a schematic flowchart of a method for text structuring of text to be structured using a structured model.
  • the method may at least include the following steps: In step S510 , obtain the target format text corresponding to the English medical record text and the verification rules corresponding to the text format to be structured, and use the verification rules to parse the target format text to obtain the first structured result.
  • the verification rules include custom rules formulated for the target format text.
  • Named Entity Recognition also known as "proper name recognition” refers to the identification of entities with specific meanings in text, mainly including person names, place names, organization names, proper nouns, etc.
  • Use the trained named entity recognition model to structure the text to be structured to obtain the target structured text of the English medical record text.
  • Figure 6 shows a schematic flowchart of a method for training a named entity recognition model. As shown in Figure 6, the method at least includes the following steps: In step S610, obtain the medical record sample and the corresponding medical record sample annotation fields, and use pre-trained word vectors to text-map medical record samples to obtain word embedding vectors.
  • the medical record sample can be obtained by text cleaning, spelling correction and paragraph merging of real-world medical records, and is used to train the named entity recognition model. After obtaining real-world medical records, and before becoming medical record samples, randomly selected medical records can be annotated to obtain annotated fields.
  • the annotation fields include positive symptoms, negative symptoms, etc., which are not specifically limited in this exemplary embodiment.
  • for example, PubMed word embeddings can be used to perform text mapping on each word in the medical record text to obtain a fixed word embedding vector.
  • word embedding is used to contextually map the medical record sample to obtain a string embedding vector, and the word embedding vector and the string embedding vector are vertically spliced to obtain a sample vector.
  • Flair Embeddings is currently one of the best word embeddings for processing NER.
  • Flair Embeddings are also called contextual string embeddings.
  • Two features of Flair Embeddings are that the words are understood as characters (without any notion of words), and that the embeddings are contextualized by their surrounding text, and words can have different meanings in different sentences. Therefore, Flair Embeddings can be used to contextually map the medical record text to obtain the corresponding string embedding vector. Depending on the context, the same medical record text may produce different string embedding vectors. After obtaining the word embedding vector and the string embedding vector, the word embedding vector and the string embedding vector can be vertically spliced to obtain a sample vector.
  • the sample vector processed by vertical splicing is a 200X200-dimensional vector.
  • sample vectors with clearer meanings can be obtained, which provides data support for training named entity recognition models.
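  • In the Flair library this word-plus-string stacking can be written as below; the specific embedding identifiers ('glove' standing in for the PubMed word vectors, 'pubmed-forward'/'pubmed-backward' for the contextual string embeddings) are assumptions about which checkpoints would be used.

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# Concatenate a fixed word embedding with forward/backward contextual string
# embeddings, producing one stacked vector per token.
stacked = StackedEmbeddings([
    WordEmbeddings("glove"),            # stand-in for the fixed PubMed word vectors
    FlairEmbeddings("pubmed-forward"),  # contextual string embedding, left-to-right
    FlairEmbeddings("pubmed-backward"), # contextual string embedding, right-to-left
])

sentence = Sentence("Patient denies fever but reports flank pain .")
stacked.embed(sentence)
for token in sentence:
    print(token.text, token.embedding.shape)   # one concatenated vector per token
```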
  • multiple named entity recognition models to be trained are trained using annotation fields and sample vectors, and model scoring is performed on the multiple trained named entity recognition models to obtain corresponding multiple first model scores.
  • the annotation field and sample vector can be used to train the named entity recognition model to be trained.
  • the named entity recognition model to be trained can be composed of a Bi-LSTM (Bi-directional Long Short-Term Memory) network combined with a Conditional Random Field (CRF), or it can be another model; this exemplary embodiment does not impose special limitations on this.
  • the F1 score is an indicator used in statistics to measure the accuracy of a binary classification model. It takes into account both the precision and the recall of the classification model.
  • the F1 score can be seen as the harmonic mean of the model's precision and recall, that is, formula (1): F1 = 2 × precision × recall / (precision + recall). Its maximum value is 1 and its minimum value is 0.
  • a trained named entity recognition model is determined among a plurality of trained named entity recognition models according to the plurality of first model scores.
  • multiple named entity recognition models can be trained, and the first model scores of the multiple trained named entity recognition models can be calculated according to formula (1). Compare multiple first model scores to determine the trained named entity recognition model with the highest first model score as the trained named entity recognition model, and save the trained named entity recognition model.
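  • The selection step can be as simple as scoring each trained candidate on a held-out set with formula (1) and keeping the best, as in this toy sketch; the candidate predictions are made up for illustration and would in practice come from trained Bi-LSTM-CRF taggers.

```python
from sklearn.metrics import f1_score

# Toy model selection: compute the F1 score of each candidate on a validation
# set and keep the candidate with the highest score.
y_true = ["B-SYM", "O", "B-SYM", "O", "O", "B-SYM"]
candidates = {
    "run_1": ["B-SYM", "O", "O", "O", "O", "B-SYM"],
    "run_2": ["B-SYM", "O", "B-SYM", "O", "B-SYM", "B-SYM"],
}

scores = {
    name: f1_score(y_true, pred, pos_label="B-SYM", average="binary")
    for name, pred in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, "-> keep", best)
```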
  • in this way, the trained named entity recognition model that performs best on the validation set can be obtained, providing a model basis for named entity recognition of the text to be structured. After obtaining the trained named entity recognition model, it can be used to structure the text to be structured.
  • custom rules can be formulated to assist the prediction of the named entity recognition model.
  • the custom rules are verification rules, which are rules for verifying the format of the target format text.
  • the verification rules may include that the format of the target format text is that the first row and the first column are the main complaint, and the first column of all rows except the first row is the symptom, etc. This exemplary embodiment does not make a special limitation on this.
  • the verification rule can be used to parse the target table related to the symptom from the target format text.
  • the target table also includes the status of the symptom, that is, the first structured result. Among them, the status of the symptom is yes, which means the symptom is positive, and no, which means the symptom is negative.
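  • A sketch of parsing such a target-format table with the custom rule (first row/first column is the chief complaint, the first column of every other row is a symptom with a yes/no status) is shown below; the table contents are invented for illustration.

```python
# Hypothetical target-format table parsed with the custom verification rule.
table = [
    ["Chief complaint", "shortness of breath"],
    ["fever", "no"],
    ["leg swelling", "yes"],
]

def parse_target_format(rows):
    chief_complaint = rows[0][1]
    symptoms = {row[0]: (row[1].strip().lower() == "yes") for row in rows[1:]}
    return {"chief_complaint": chief_complaint, "symptoms": symptoms}

first_result = parse_target_format(table)   # the "first structured result"
print(first_result)
```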
  • the named entity recognition model is used to perform named entity recognition on the text to be structured to obtain a second structured result of the medical record text. By inputting the text to be structured into the trained named entity recognition model, a second structured result can be obtained after the named entity recognition is performed on the text to be structured.
  • in step S530, the target structured text is determined by combining the first structured result and the second structured result.
  • the first structured result after parsing the target format file using the verification rule is combined with the second structured result output by the named entity recognition model to obtain the target structured result.
  • for example, if the first structured result obtained by parsing the target format file with the verification rules is A, B, C, and the second structured result output by the named entity recognition model is B, D, then D, which exists only in the second structured result, is added to the first structured result to obtain the target structured result A, B, C, D, as in the small sketch below.
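```python
# The A,B,C + B,D -> A,B,C,D example: anything found by the NER model but
# missing from the rule-parsed result is added to it.
first_result = {"A", "B", "C"}    # from the verification rules
second_result = {"B", "D"}        # from the named entity recognition model
target_result = first_result | second_result
print(sorted(target_result))      # ['A', 'B', 'C', 'D']
```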
  • in this exemplary embodiment, named entity recognition of the text to be structured can be realized by the trained named entity recognition model, while the results are verified through the verification rules, which greatly improves the accuracy of named entity recognition.
  • the text to be structured is input into the text classification model, so that the text classification model performs text structuring on the text to be structured.
  • Text classification is the automatic classification and labeling of text sets (or other entities and objects) according to a certain classification system or standard. It finds the relationship model between document features and document categories based on a collection of annotated training documents, and then uses this learned relationship model to judge the category of new documents. Text classification gradually changes from knowledge-based methods to methods based on statistics and machine learning. Text classification generally includes processes such as text expression, classifier selection and training, and evaluation and feedback of classification results. Among them, text expression can be subdivided into steps such as text preprocessing, indexing and statistics, and feature extraction.
  • the overall functional modules of the text classification system are preprocessing, indexing, statistics, feature extraction, classifier and evaluation.
  • among them, preprocessing formats the original corpus into a uniform format to facilitate subsequent unified processing; indexing decomposes the document into basic processing units while reducing the cost of subsequent processing; and statistics covers word frequency statistics and the statistics relating items (words, concepts) to categories.
  • Figure 7 shows a schematic flow chart of a method for training a text classification model.
  • the method at least includes the following steps: In step S710, medical record samples are obtained, and the text classification fields in the medical record samples are used to train multiple initial text classification models.
  • the medical record sample can be a sample obtained by text cleaning, spelling correction and paragraph merging of real-world medical records, and is used to train the text classification model.
  • the pre-trained text classification model can be composed of a linear classifier (Linear Classifier) added to the pre-trained BERT model.
  • step S720 an optimization algorithm is used to optimize multiple initial text classification models to obtain a second score corresponding to each initial text classification model.
  • the hyperparameters of the fine-tuned text classification model were adjusted using medical record samples in the training and validation sets.
  • the hyperparameters of the text classification model include the learning rate of the training neural network, the weight attenuation coefficient, the proportion of slow-heat learning, etc.
  • the medical record samples in the training set are used to continuously adjust the weight of the fine-tuned text classification model and optimize the performance of the model during iterations and traversal of the data set.
  • the pre-trained text classification model can also use Bio-Clinical BERT, a newer training model and word vector better suited to biomedicine. Further, model scoring is performed on the fine-tuned text classification model according to formula (1) to obtain the F1 score of the text classification model, which is used as the second model score.
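  • The sketch below shows how these hyperparameters map onto a Hugging Face fine-tuning setup; the checkpoint name, label count and hyperparameter values are assumptions chosen to illustrate the knobs mentioned above, not the values used in the disclosure.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"   # assumed Bio-Clinical BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)  # yes / no / null

args = TrainingArguments(
    output_dir="esrd-classifier",
    learning_rate=2e-5,              # learning rate of the training network
    weight_decay=0.01,               # weight attenuation (decay) coefficient
    warmup_ratio=0.1,                # proportion of warm-up ("slow-heat") learning
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
# A transformers.Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# would then iterate over the annotated medical-record samples.
```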
  • a text classification model is determined according to a plurality of second scores.
  • the second model scores of multiple text classification models can be calculated. Compare multiple second model scores to determine the text classification model with the highest second model score as the text classification model after transfer learning, and save the text classification model after transfer learning.
  • the optimization algorithm may be, for example, XGBoost, an optimized distributed gradient boosting library.
  • only a small amount of medical record text is needed to achieve very good results in the process of training the text classification model, making the text classification model much more efficient than existing rule sets based on a small amount of data. The development time and iteration time are reduced, and the accuracy rate is significantly improved.
  • the text to be structured can be input into the text classification model after transfer learning, so that the text classification model after transfer learning outputs the classification result of the text to be structured, and the classification result serves as the target structured text of the text to be structured.
  • when it is determined that the question-and-answer model is used for structuring, a question-answering approach that searches the target text for answers can be used.
  • the target question may be the onset time of kidney disease, etc., which is not particularly limited in this exemplary embodiment.
  • QA is an advanced form of information retrieval system. It can use accurate and concise natural language to answer questions raised by users in natural language.
  • Question answering system is a research direction that has attracted much attention and has broad development prospects in the field of artificial intelligence and natural language processing.
  • through transfer learning, the machine can learn a question-and-answer system adapted to this scenario.
  • medical record samples, question samples and answer samples corresponding to the question samples are obtained, and the pre-trained question and answer model is fine-tuned using the medical record samples, question samples and answer samples to obtain a fine-tuned question and answer model.
  • the pre-trained question and answer model can use Bio BERT, or other models can be selected according to actual conditions and needs. This exemplary embodiment does not impose special limitations on this.
  • since the pre-trained Bio BERT model is generated by massive training, only a small number of question samples and answer samples are needed to fine-tune it for the specific scenario of structuring the target paragraphs formed from the case text, so that the more general concepts in the pre-trained Bio BERT can be better adapted to the specific scenario.
  • the question sample can be a sample composed of several or more than a dozen questions related to the disease in the case text
  • the answer sample can be a sample composed of the corresponding answers marked in the medical record sample based on the question sample. .
  • the pre-trained Bio BERT model can learn how to search for the corresponding answer to the target question from the original text after fine-tuning the sample pairs composed of medical record samples, question samples and answer samples.
  • the third model score is obtained by performing model scoring on the fine-tuned question and answer model, and the question and answer model after transfer learning is determined in the fine-tuned question and answer model based on the third model score.
  • specifically, model scoring is performed on the fine-tuned question-and-answer model according to formula (1) to obtain its F1 score, which is used as the third model score. Since multiple question-and-answer models may be trained or fine-tuned, the third model scores of the multiple question-and-answer models can be calculated according to formula (1), the multiple third model scores are compared, the question-and-answer model with the highest third model score is determined as the question-and-answer model after transfer learning, and the question-and-answer model after transfer learning is saved.
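  • At inference time the answer search is an extractive question-answering call, as sketched below; the checkpoint is a generic SQuAD-tuned model standing in for the fine-tuned Bio BERT described above, and the record text is invented.

```python
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

record = "Known case of chronic kidney disease since March 2018, on peritoneal dialysis."
result = qa(question="When did the kidney disease start?", context=record)
print(result["answer"], result["score"])   # the answer is a span copied from the record
```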
  • FIG. 8 shows a schematic flowchart of another method for text structuring of text to be structured using a structured model. As shown in FIG. 8 , the method at least includes the following steps: In step S810 , obtain the target question corresponding to the text to be structured. It is worth noting that the answer corresponding to the target question must belong to a certain part of the text to be structured.
  • in step S820, based on the target question, a question-and-answer model is used to perform an answer search on the text to be structured to obtain the target structured text of the medical record text.
  • specifically, the target question and the text to be structured can be input into the question-and-answer system after transfer learning, so that the question-and-answer system after transfer learning searches the target paragraph for the answer corresponding to the target question, and the answer is determined as the target structured text of the medical record text.
  • when structuring is achieved through named entity recognition, the structured output for fields such as positive symptoms and negative symptoms is a list for each field.
  • when structuring is achieved through text classification, for example for the field "whether the patient has end-stage renal disease", a text can only have one output: yes/no/null (null means it cannot be inferred); in this case the classification model can achieve the desired effect. The last situation is special.
  • in the last situation, a question-and-answer system is used for fields such as the start time of kidney disease: the input is the text and a question, the output is a string taken from the original text, and the model learns from the samples how to search the original text for the answer to the user's question.
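  • Putting the three output forms together, one structured record might look like the illustrative dictionary below; the field names and values are made up.

```python
# Illustrative shape of the target structured text for one record.
target_structured_text = {
    "positive_symptoms": ["flank pain", "leg swelling"],   # NER output: a list per field
    "negative_symptoms": ["fever", "chest pain"],          # NER output: a list per field
    "has_end_stage_renal_disease": "yes",                  # classification output: yes / no / null
    "kidney_disease_start_time": "March 2018",             # QA output: a string from the original text
}
print(target_structured_text)
```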
  • in summary, text preprocessing of the English medical record text can automatically identify and correct spelling errors, ensuring the accuracy and validity of the text and greatly reducing the poor structuring results or omissions caused by spelling errors.
  • further, the corresponding structured model is used to structure the text to be structured, which reduces development and iteration time and significantly improves the accuracy and the precision and recall of text structuring.
  • a device for structuring English medical text is also provided.
  • Figure 9 shows a schematic structural diagram of an English medical text structuring device.
  • the English medical text structuring device 900 may include: a text acquisition module 910, a model selection module 920 and a result generation module 930.
  • the text acquisition module 910 is configured to obtain English medical record text, and perform text preprocessing on the English medical record text to obtain the text to be structured
  • the model selection module 920 is configured to determine the type of the text to be structured, And determine the corresponding structured model according to the type of the text to be structured
  • the result generation module 930 is configured to use the structured model to perform text structuring on the text to be structured, and obtain the target of the medical record text Structured text.
  • the text preprocessing of the medical record text to obtain the text to be structured includes: using a spelling correction model to perform spelling correction on the medical record text to obtain corrected text; and identifying target keywords in the corrected text and merging the corrected text according to the target keywords to obtain the text to be structured.
  • determining the type of the text to be structured and determining the corresponding structured model according to the type of the text to be structured includes: determining the type of the text to be structured according to the target field of the text to be structured; when the type is the first type, determining that a named entity recognition model is used for structuring; when the type is the second type, determining that a text classification model is used for structuring; and when the type is the third type, determining that a question-and-answer model is used for structuring.
  • determining the type of the text to be structured according to the target field of the text to be structured includes: identifying the target field of the text to be structured, when the When the target field is named entity information, it is determined that the text to be structured is of the first type; when the target field is text classification information, it is determined that the text to be structured is of the second type; when the target field is question and answer information, It is determined that the text to be structured is the third type.
  • when it is determined that the named entity recognition model is used for structuring, using the structured model to structure the text to be structured includes: obtaining the target format text corresponding to the medical record text and the verification rules corresponding to the format of the text to be structured, and using the verification rules to parse the target format text to obtain the first structured result, the verification rules including custom rules formulated for the target format text; using the named entity recognition model to perform named entity recognition on the text to be structured to obtain the second structured result of the medical record text; and combining the first structured result and the second structured result to determine the target structured text.
  • the named entity recognition model is obtained through the following training steps: obtaining a medical record sample and the annotation fields corresponding to the medical record sample, and using pre-trained word vectors to perform text mapping on the medical record sample to obtain word embedding vectors; using word embeddings to perform context mapping on the medical record sample to obtain string embedding vectors, and vertically splicing the word embedding vectors and the string embedding vectors to obtain sample vectors; using the annotation fields and the sample vectors to train multiple named entity recognition models to be trained, and performing model scoring on the multiple trained named entity recognition models to obtain multiple corresponding first model scores; and determining one trained named entity recognition model among the multiple trained named entity recognition models according to the multiple first model scores.
  • the text classification model is trained through the following steps: Obtain medical record samples, and use the text classification fields in the medical record samples to train multiple initial text classification models; Optimize using an optimization algorithm A plurality of the initial text classification models are used to obtain a second score corresponding to each of the initial text classification models; and the text classification model is determined based on a plurality of the second scores.
  • when it is determined that the question-and-answer model is used for structuring, using the structured model to structure the text to be structured includes: obtaining the target question corresponding to the text to be structured; and, based on the target question, using the question-and-answer model to perform an answer search on the text to be structured to obtain the target structured text of the medical record text.
  • the specific details of the above-mentioned English medical text structuring device 900 have been described in detail in the corresponding English medical text structuring method, so they will not be described again here. It should be noted that although several modules or units of the English medical text structuring device 900 are mentioned in the above detailed description, this division is not mandatory. In fact, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into multiple modules or units to be embodied.
  • an electronic device capable of implementing the above method is also provided.
  • the electronic device 1000 according to this embodiment of the present disclosure is described below with reference to FIG. 10.
  • the electronic device 1000 shown in FIG. 10 is only an example and should not bring any limitations to the functions and scope of use of the embodiment of the present disclosure.
  • electronic device 1000 is embodied in the form of a general computing device.
  • the components of the electronic device 1000 may include, but are not limited to: the above-mentioned at least one processing unit 1010, the above-mentioned at least one storage unit 1020, a bus 1030 connecting different system components (including the storage unit 1020 and the processing unit 1010), and the display unit 1040.
  • the storage unit 1020 stores program code, and the program code can be executed by the processing unit 1010, so that the processing unit 1010 executes various exemplary methods according to the present disclosure described in the "Example Method" section of this specification.
  • the storage unit 1020 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 1021 and/or a cache storage unit 1022, and may further include a read-only storage unit (ROM) 1023.
  • the storage unit 1020 may also include a program/utility 1024 having a set of (at least one) program modules 1025, such program modules 1025 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
  • the bus 1030 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 1000 may also communicate with one or more external devices 1200 (such as a keyboard, a pointing device, a Bluetooth device, etc.), and may also communicate with one or more devices that enable a user to interact with the electronic device 1000, and/or with Any device that enables the electronic device 1000 to communicate with one or more other computing devices (eg, router, modem, etc.). This communication may occur through input/output (I/O) interface 1050.
  • I/O input/output
  • the electronic device 1000 can also communicate with one or more networks (eg, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 1060.
  • the network adapter 1060 communicates with the other modules of the electronic device 1000 via the bus 1030.
  • other hardware and/or software modules may be used in conjunction with the electronic device 1000, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives And data backup storage system, etc.
  • the technical solution according to the embodiment of the present disclosure can be embodied in the form of a software product.
  • the software product can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on the network, and includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
  • a computer-readable storage medium is also provided, on which a program product capable of implementing the method described above in this specification is stored.
  • various aspects of the present disclosure can also be implemented in the form of a program product, which includes program code.
  • a program product 1100 for implementing the above method according to an embodiment of the present disclosure is described, which can adopt a portable compact disk read-only memory (CD-ROM) and include program code, and can be used on a terminal device, For example, run on a personal computer.
  • the program product of the present disclosure is not limited thereto.
  • a readable storage medium may be any tangible medium containing or storing a program that may be used by or in combination with an instruction execution system, apparatus or device.
  • the program product may take the form of any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying readable program code therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a readable signal medium may also be any readable medium other than a readable storage medium that may send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted using any appropriate medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
  • program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
  • the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Machine Translation (AREA)

Abstract

A method and apparatus for structuring English medical text, a storage medium and an electronic device. The method includes: obtaining English medical record text and performing text preprocessing on the English medical record text to obtain text to be structured; determining the type of the text to be structured, and determining a corresponding structured model according to the type of the text to be structured; and using the structured model to perform text structuring on the text to be structured to obtain the target structured text of the medical record text. The accuracy and validity of the text are ensured, and poor structuring results or omissions caused by spelling errors are reduced; performing text structuring on the text to be structured reduces development and iteration time, and significantly improves the accuracy and the precision and recall of text structuring.

Description

英文 医疗文本 结构化 的方法 、 装置、 介质及电子设备 技术领域 本公开 涉及文本处理技术领域, 尤其涉及一种英文医疗文本结构化的方法、 英文医 疗文本结构化的装置、 计算机可读存储介质及电子设备。 背景技术 文本结 构化的相关技术是将自然语言表达的文本转化为可检索、 可分析、 可计算的 结构化数据, 提取相应场景中感兴趣的信息, 这能够极大的减少人工的工作量, 提高效 率。 例如, 在医疗领域, 如何从大量的复杂无结构化的就诊或手术记录中进行信息抽取 与结构化是非常有用和有难度的技术, 也是文本处理与知识提取不可或缺的技术。 通常 , 可以分别利用实体识别模型、 关系抽取模型、 阅读理解模型、 文本分类模型 实现文本的结构化效果。 但是, 这几种监督学习的方法需要大量的人工标注数据, 要获 取标注数据是极其困难且浪费人力 成本、 时间成本的。 而且, 这几种监督学习的模型大 量使用西方国家的病例作为研 究, 并未能很好的契合东南亚地区的英语使用习惯和病例 书写习惯, 降低了结构化的准确率, 劣化了文本结构化的效果。 鉴于此 , 本领域亟需开发一种新的英文医疗文本结构化的方法及装置。 需要说 明的是, 在上述背景技术部分公开的信息仅用于加强对本公开的背景的理解, 因此可以包括不构成对本领域普通技术人员已知的现有技术的信息。 发 明内容 本公开 的目的在于提供一种英文医疗文本结构化的方法、 英文医疗文本结构化的装 置、 计算机可读存储介质及电子设备, 进而至少在一定程度上克服由于相关技术的限制 而导致的准确率差和成本高的技术问题。 本公开 的其他特性和优点将通过下面的详细描述变得显然, 或部分地通过本公开的 实践而习得。 根据本公开 的一个方面, 提供一种英文医疗文本结构化的方法, 所述方法包括: 获取英文病历文本 , 并对所述英文病历文本进行文本预处理得到待结构化文本; 确定所述待 结构化文本的类型, 并根据所述待结构化文本的类型确定对应的结构化 模型; 利用所述 结构化模型对所述待结构化文本进行文本结构化, 得到所述病历文本的目 标结构化文本。 在本 公开的一种示例性实施例中, 所述对所述病历文本进行文本预处理得到待结构 化文本, 包括: 利用拼写校正模型对所述病历文本进行拼写校正得到校正文本 ; 识别所述 校正文本中的目标关键词, 并根据所述目标关键词对所述校正文本进行合 并得到待结构化文本。 在本 公开的一种示例性实施例中, 所述确定所述待结构化文本的类型, 并根据所述 待结构化文本的类型确定对应的结构化模型, 包括: 根据所述待结构化文本的 目标字段确定所述待结构化文本的类型; 当所述待结构化文本类型为第一类型时, 确定采用命名实体识别模型进行结构化; 当所述待结构化文本类型为第二类型时, 确定采用文本分类模型进行结构化; 当所述待结构化文本类型为第三类型时, 确定采用问答模型进行结构化。 在本 公开的一种示例性实施例中, 所述根据所述待结构化文本的目标字段确定所述 待结构化文本的类型, 包括: 识别所述 待结构化文本的目标字段, 当所述目标字段为命名实体信息, 确定所述待 结构化文本为第一类型; 当所述目标字段为文本分类信息, 确定所述待结构化文本为第二类型; 当所述目标字段为问答信息, 确定所述待结构化文本为第三类型。 在本 公开的一种示例性实施例中, 确定采用命名实体识别模型进行结构化时, 所述 利用所述结构化模型对所述待结构化文本进行文本结构化, 包括: 获取与所 述英文病历文本对应的目标格式文本以及与所述待结构化文本格式 对应的 校验规则, 并利用所述校验规则对所述目标格式文本进行解析得到第一校验结构化结果, 所述校验规则包括针对所述目标格式文本制定的自定义规则; 利用命 名实体识别模型对所述待结构化文本进行命名实体识别得到所述病历 文本的 第二结构化结果; 根据所述第一结构化结果和所述第二结构化结果进行结合, 确定目标结构化文本。 在本公开 的一种示例性实施例中, 所述命名实体识别模型通过如下训练步骤得到: 获取病 历样本以及与所述病历样本对应的标注字段, 并利用预训练的词向量对所述 病历样本进行文本映射得到单词嵌入向量; 利用词 嵌入对所述病历样本进行语境映射得到字符串嵌入向量, 并对所述单词嵌入 向量和所述字符串嵌入向量进行纵向拼接处理得到样本向量; 利用所述 标注字段和所述样本向量训练多个待训练的命名实体识别模 型, 并对多个 训练得到的命名实体识别模型进行模型评分得到对应的多个第一模型分值; 根据 多个所述第一模型分值在多个所述训练得到的命名实体识别模型 中确定一个训 练好的命名实体识别模型。 在本公开 的一种示例性实施例中, 所述文本分类模型通过如下步骤训练得到: 获取 病历样本, 并利用所述病历样本中的文本分类字段训 练多个初始文本分类模 型;
' 利用优化算法优化多个所述初始文本分类模型, 得到每一所述初始文本分类模型对 应的第二分值; 根据多个所述第二分值确定所述文本分类模型 。 在本 公开的一种示例性实施例中, 确定采用问答模型进行结构化时, 所述利用所述 结构化模型对所述待结构化文本进行文本结构化, 包括: 获取与所述待结构化文本对应的 目标问题; 基于所述 目标问题, 利用问答模型对所述待结构化文本进行答案搜索得到所述病历 文本的目标结构化文本。 根据本公开 的一个方面, 提供一种英文医疗文本结构化的装置, 所述装置包括: 文本获 取模块, 被配置为获取英文病历文本, 并对所述英文病历文本进行文本预处 理得到待结构化文本; 模型选 择模块, 被配置为确定所述待结构化文本的类型, 并根据所述待结构化文本 的类型确定对应的结构化模型; 结果生 成模块, 被配置为利用所述结构化模型对所述待结构化文本进行文本结构化, 得到所述病历文本的目标结构化文本。 根据本 公开的一个方面, 提供一种电子设备, 包括: 处理器和存储器; 其中, 存储 器上存储有计算机可读指令 , 所述计算机可读指令被所述处理器执行时实现上述任意示 例性实施例的英文医疗文本结构化的方法。 根据本 公开的一个方面, 提供一种计算机可读存储介质, 其上存储有计算机程序, 所述计算机程序被处理器执行 时实现上述任意示例性实施例中的英文医疗文本结构化的 方法。 由上述技术方案可知, 本公开示例性实施例中的英文医疗文本结构化的方法、 英文 医疗文本结构化的装置、 计算机存储介质及电子设备至少具备以下优点和积极效果: 在本 公开的示例性实施例提供的方法及装置中, 对英文病历文本进行文本预处理, 能够自动识别, 并改正拼写错误, 保证了文本的准确性和有效性, 大大减少了因拼写错 误导致的文本结构化效果不佳或遗漏 的情况发生。 进一步的, 根据待结构化文本的类型 确定对应的结构化模型进行文本结构化 , 提高了文本结构化的准确性, 减少了研发时长 和迭代时间, 显著提升了文本结构化的准召率。 应 当理解的是, 以上的一般描述和后文的细节描述仅是示例性和解释性的, 并不能 限制本公开。 附图说明 图 1 示意性示出本公开示例性实施例中一种英文医疗文本结构化的方法的流程示意 图; 图 2示意性示出本公开示例性实施例中对英文病历文本进行文本预处理的方法的流 ^§^^意图 . i 3 '示意性示出本公开示例性实施例中根据待结构化文本确定结构化模型的方法的 流程示意图; 图 4示意性示出本公开示例性实施例中根据目标字段确定待结构化文本的类型的方 法的流程示意图; 图 5 示意性示出本公开示例性实施例中一种利用结构化模型对待结构化文本进行文 本结构化的方法的流程示意图; 图 6 示意性示出本公开示例性实施例中训练命名实体识别模 型的方法的流程示意 图; 图 7示意性示出本公开示例性实施例中训练文本分类模型的方法的流程示意图; 图 8示意性示出本公开示例性实施例中另一种利用结构化模型对待结构化文本进行 文本结构化的方法的流程示意图; 图 9示意性示出本公开示例性实施例中一种英文医疗文本结构化的装置的结构示意 图; 图 10示意性示出本公开示例性实施例中一种用于实现英文医疗文本结构化的方法的 电子设备; 图 11示意性示出本公开示例性实施例中一种用于实现英文医疗文本结构化的方法的 计算机可读存储介质。 具体实施方式 现在将参考 附图更全面地描述示例实施方式。 然而, 示例实施方式能够以多种形式 实施, 且不应被理解为限于在此阐述的范例; 相反, 提供这些实施方式使得本公开将更 加全面和完整, 并将示例实施方式的构思全面地传达给本领域的技术人员。 所描述的特 征、 结构或特性可以以任何合适的方式结合在一个或更多实施方式中。 在下面的描述中, 提供许多具体细节从而给出对本公开的实施方式的充分理解。 然而, 本领域技术人员将 意识到, 可以实践本公开的技术方案而省略所述特定细节中的一个或更多, 或者可以采 用其它的方法、 组元、 装置、 步骤等。 在其它情况下, 不详细示出或描述公知技术方案 以避免喧宾夺主而使得本公开的各方面变得模糊。 本说 明书中使用用语 “一个”、 “一”、 “该 ”和 “所述”用以表示存在一个或多 个要素 /组成部分 /等; 用语 “包括 ”和 “具有”用以表示开放式的包括在内的意思并且是 指除了列出的要素 /组成部分 /等之外还可存在另外的要素 /组成部分 /等; 用语 “第一 ”和 “第二”等仅作为标记使用 , 不是对其对象的数量限制。 此外 , 附图仅为本公开的示意性图解, 并非一定是按比例绘制。 图中相同的附图标 记表示相同或类似的部分, 因而将省略对它们的重复描述。 附图中所示的一些方框图是 功能实体, 不一定必须与物理或逻辑上独立的实体相对应。 文本结构化 的相关技术是将自然语言表达的文本转化为可检索、 可分析、 可计算的 结构化数据, 提取相应场景中感兴趣的信息, 这能够极大的减少人工的工作量, 提高效 率。 例如, 在医疗领域, 如何从大量的复杂无结构化的就诊或手术记录等医疗文本中进 行信息抽取与结构化是非常有用和有难度的技术, 也是文本处理与知识提取不可或缺的 技术。 针对相关技术 中存在的问题, 本公开提出了一种英文医疗文本结构化的方法。 图 1 示出了英文医疗文本结构化的方法的流程图, 如图 1所示, 英文医疗文本结构化的方法 至少包括以下步骤: 在步骤 S110中, 获取英文病历文本, 并对英文病历文本进行文本预处理得到待结构 化文本。 在步骤 S120中, 确定待结构化文本的类型, 并根据待结构化文本的类型确定对应的 结构化模型。 在步骤 S130中, 利用结构化模型对待结构化文本进行文本结构化, 得到病历文本的 目标结构化文本。 在本 公开的示例性实施例中, 对英文病历文本进行文本预处理, 大大减少了因拼写 错误导致的文本结构化效果不佳或遗漏的情况发生。 进一步的, 根据待结构化文本的类 型确定对应的结构化模型进行文本结构化, 提高了文本结构化的准确性, 减少了研发时 长和迭代时间, 显著提升了文本结构化的准召率。 下面对英文医疗文本结构化的方法的各个步骤进行详细说明。 在步骤 S110中, 获取英文病历文本, 并对英文病历文本进行文本预处理得到待结构 化文本。 在本 公开的示例性实施例中, 真实世界研究是指研究数据来自真实的医疗环境, 反 映实际诊疗过程和具体条件下的患者健康状况的研究。 举例 而言, 在真实世界的英文病历文本中, 每个需要结构化的字段为非结构化文本, 例如阳性症状、 急诊等文本类别。 其 中, 英文病历文本可以包括就诊信息、 症状信息和诊断信息等, 本示例性实施例 对此不做特殊限定。 在获取到英文病历文本之后 , 能够对英文病历文本进行文本预处理得到对应的待结 构化文本。 在此之前 , 可以利用正则表达式对英文病历文本进行文本清理得到有效文本。 正则 表达式, 又称规则表达式 (Regular Expression, 简写为 regex regexp或 RE) , 是计算机 科学的一个概念。 正则表达式通常被用来检索、 替换那些符合某个模式 (规则)的文本。 正则表达式是对字符串 (包括普通字符, 例如 a到 z之间的字母)和特殊字符 (元字符) 操作的一种逻辑公式, 就是用事先定义好的一些特定字符以及这些特定字符的组合, 组 成一个 “规则字符串” 。 这个 “规则字符串”用来表达对字符串的一种过滤逻辑。 正则 表达式是一种文本模式, 该模式描述在搜索文本时要匹配一个或多个字符串。 具体 的, 可以利用正则表达式对英文病历文本进行文本剔除得到有效文本。 在对英文病历文本进行 的常规处理中, 可以利用正则表达式识别英文病历文本中的 空格、 \t等特殊符号、 数字和日期等不相关的文本, 进一步从英文病历文本中进行剔除得 到有效文本。 举例而言, 识别英文病历文本中的日期的正则表达式可以是 date regex = r"(?:(?:31(V|-|\.)(?:0?[13578]|l[02]))\l|(?:(?:29|30)(V|-|\.)(?:0?[l,3-9]|l[0-2])\2))(?:(?:l[6-9]|[2-9] \d)?\d{2})|(?:29(V|-|\.)0?2\3(?:(?:(?: l[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?: 
16|[2 468][048]|[3579][26])00))))|(?:0?[l-9]|l\d|2[0-8])(V|-|\.)(?:(?:0?[l-9])|(?: l[0-2]))\4(?:(?: l[6-9]|[2 -9]\d)?\d{2})" o 在对英文病例文本进行 的针对性处理中, 也可以利用正则表达式识别英文病历文本 中的开头的无用信息, 并从英文病历文本中剔除。 举例而言, 识别英文病历文本中开头 无用信息的正则表达式可以是 header regex = r"Note Type.*(Clini Note OP|Performed By|Clini Notes OP|Clini Notes IP|Date of Discharge |Progress Notes)" o 除此之外, 还可以是利用正则表达式对英文病历文本进行文本替换得到有效文本。 在对英文病历文本进行 的针对性处理中, 还可以利用正则表达式识别英文病历文本 中的特定情况。 例如, 识别英文病例文本中的 “? ”, 进一步将该 “? ”替换为 “: ” 得到有效文本。 举例而言, 识别英文病历文本中的 “? ”等特定符号的正则表达式可以 是 r"(?<=Any)(. *?)(?=no|yes|nil)" □ 当英文病历文本为 any fever?no时, 正则表达式可以识 别其中的问号, 并进行音换得到 any fever: no, 以使句子的切分更加合理。 利用正则表达式对英文病历文本进行文本清理处理 , 能够将去除无用信息的干扰, 也能够为待结构化文本的生成提供更为合理准确的文本支持。 进一步 的, 在得到有效文本之后, 可以对有效文本中拼写错误的部分进行校正得到 校正文本以及对校正文本进行合并, 以得到待结构化文本。 在可选 的实施例中, 图 2示出了对英文病历文本进行文本预处理的方法的流程示意 图, 如图 2所示, 该方法至少可以包括以下步骤: 在步骤 S210中, 利用拼写校正模型对 病历文本进行拼写校正得到校正文本。 具体 的, 利用拼写校正模型对根据英文病历文本得到的有效文本进行拼写校正得到 校正文本。 其中, 拼写校正模型可以是 BERT (Bidirectional Encoder Representations from Transformers)预训练模型。 其中, 预训练 (pre-training) 是指一种通过使用大型数 据集对神经网络模型进行训练, 使神经网络模型学习到数据集中的通用特征的过程。 预 训练的目的是为后续神经网络模型在特定数据集上训练提供优质的模型参数。 使用前沿 水平的 BERT预训练模型来获取通用语义表示, 实现从自然语言到机器语言的转化。 具体 的, 获取正确文本样本和错误文本样本, 并利用正确文本样本和微调预训练的 拼写校正模型得到迁移学习后的拼写校正模型。 其中, 错误文本样本是包含拼写错误的 样本, 正确文本样本可以是对错误文本样本进行订正后的, 不含拼写错误的样本。 当拼写校正模型是 BERT预训练模型, 可以以 BERT预训练模型对自然语言有一定理 解的通用模型为基础, 利用迁移学习, 且采用正确文本样本和错误文本样本对 BERT预 训练模型进行微调。 其中, 迁移学习 (Transfer learning)顾名思义就是把已训练好的模型 (预训练模型)参数迁移到新的模型来帮助新模型训练。 考虑到大部分数据或任务都是存 在相关性的, 所以通过迁移学习可以将已经学到的模型参数 (也可理解为模型学到的知识) 通过某种方式来分享给新模型, 从而加快并优化模型的学习效率, 不用像大多数网络那 样从零学习。 其 中, 实现迁移学习有以下三种手段, 分别是 Transfer Learning (迁移学习)、 Extract Feature Vector (提取特征向量)和 Fine-tuning (微调) 。 其中, Transfer Learning可以冻 结预训练模型的全部卷积层, 只训练自己定制的全连接层。 Extract Feature Vector可以先 计算出预训练模型的卷积层对所有训练和测试数据的特征向量, 然后抛开预训练模型, 只训练自己定制的简配版全连接网络。 Fine-tuning是一种使用特定数据集对预训练神经 网络模型进行进一步训练的过程。 通常情况下, 微调阶段所使用数据集的数据量小于预 训练阶段所使用数据集的数据量, 且微调阶段采用监督式学习的方式, 即微调阶段所使 用数据集中的训练样本包含标注信息。 微调可 以冻结预训练模型的部分卷积层 (通常是靠近输入的多数卷积层, 因为这些层 保留了大量底层信息) , 甚至不冻结任何网络层, 训练剩下的卷积层 (通常是靠近输出的 部分卷积层)和全连接层。 在利用正确文本样本和错误文本样本对 BERT预训练模型进行微调时, 可以是截断 BERT 预训练模型的最后一层 (softmax层) , 并用正确文本样本和错误文本样本训练出 的新的 softmax层替换它, 以得到迁移学习后的 BERT预训练模型, 亦即拼写校正模型。 在得到迁移学 习后的拼写校正模型之后, 可以将有效文本输入至该迁移学习后的拼 写校正模型中, 以使该迁移学习后的拼写校正模型输出对应的校正文本。 利用深度学 习模型学习语义改正拼写错误, 能够通过学习现有的真实的就诊记录、 入院记录、 出院记录、 临床记录等文本, 自动识别并改正有效文本中的拼写错误, 大大 减少了后续因拼写错误导致的文本结构化的遗漏情况发生。 在步骤 S220中, 识别校正文本中的目标关键词, 并根据目标关键词对校正文本进行 合并得到待结构化文本。 在得 到校正文本之后, 利用正则表达式识别校正文本中的目标关键词, 根据目标关 键词可以对校正文本进行合 并处理, 以得到待结构化文本。 由于根据病历文本获取到的 待结构化文本会包含多个段落 , 内容过长, 因此, 可以利用关键词获取对应的待结构化 文本, 以降低模型学习的难度。 在对 校正文本进行识别之前, 可以利用句子拆分器将校正文本切分成句子, 从而对 每个句子进行识别 。 在对每个句子进行识别时, 也可以是利用正则表达式实现, 以得到 目标关键词 。 举例而言, 对校正文本进行“肾病” 的关键词识别的正则表达式可 以是 e renal(peritoneal dialysis|on pd|on
[The remainder of this keyword regular expression appears in the original publication as image imgf000008_0001 and is not reproduced here.]
并对关键语句进行合并得到待结 构化文本。 在确定目标关键词之后, 可以在校正文本中取目标关键词前后各一句或者两 句的句子作为关键语句, 并将关键语句合并形成待结构化文本。 利 用目标关键词获取对应的待结构化文本, 缩短了校正文本的句子长度, 降低了后 续模型学习的难度, 并且还能提高模型的训练速度。 在步骤 S120中, 确定待结构化文本的类型, 并根据待结构化文本的类型确定对应的 结构化模型。 具 体的, 对待结构化文本进行文本结构化的方式包括三种 NLP (Natural Language Processing, 自然语言处理)模型以达到不同字段的结构化效果, 分别是命名实体识别模 型、 文本分类模型和问答模型, 也可以包括其他模型, 本示例性实施例对此不做特殊限 定。 在本 公开的示例性实施例中, 图 3 示出了根据待结构化文本确定结构化模型的方法 的流程示意图, 如图 3所示, 该方法至少可以包括以下步骤: 在步骤 S310中, 根据待结 构化文本的目标字段确定待结构化文本的类型。 在 可选的实施例中, 图 4示出了根据目标字段确定待结构化文本的类型的方法的流 程示意图, 如图 4所示, 该方法至少可以包括以下步骤: 在步骤 S410中, 识别待结构化 文本的目标字段, 当目标字段为命名实体信息, 确定待结构化文本为第一类型。 举例而言 , 当目标字段为第一字段时, 可以确定待结构化文本的类型为第一类型。 其 中, 第一字段可以是命名实体信息。 该命名实体信息可以是表征三大类和七小类 的信息。 其中, 三大类可以包括包括实体类、 时间类和数字类, 七小类可以包括人名、 机构名、 地名、 时间、 日期、 货币和百分比。 例如, 命名实体信息可以是阳性症状、 阴 性症状或者疾病等信息, 本示例性实施例对此不做特殊限定。 在步骤 S420中, 当目标字段为文本分类信息, 确定待结构化文本为第二类型。 举例而言 , 当目标字段为第二字段时, 可以确定待结构化文本的类型为第二类型。 其 中, 第二字段可以是文本分类信息。 该文本分类信息是按照一定分类体系或标准 进行自动分类标记的信息 。 例如, 该文本分类信息可以是是否有肾病、 是否为末期肾病 等表征分类问题的信息, 本示例性实施例对此不做特殊限定。 在步骤 S430中, 当目标字段为问答信息, 确定待结构化文本为第三类型。 举例而言 , 当目标字段为第三字段时, 可以确定待结构化文本的类型为第三类型。 其 中, 第三字段可以是问答信息, 该问答信息可以用准确、 简洁的自然语言回答用 户用自然语言提出的 问题。 例如, 问答信息可以包括疾病开始时间等不是分类性质的字 段信息, 本示例性实施例对此不做特殊限定。 进一步 的, 在确定待结构化文本的类型之后, 可以根据该待结构化文本的类型确定 对应的结构化模型进行结构化处理。 在步骤 S320中, 当结构化文本类型为第一类型时, 确定采用命名实体识别模型进行 结构化。 其 中, 第一类型可以是识别症状等实体信息, 也可以是其他信息, 本示例性实施例 对此不做特殊限定。 在步骤 S330中, 当结构化文本类型为第二类型时, 确定采用文本分类模型进行结构 化。 其 中, 第二类型可以是用于识别字段类型的信息, 也可以是其他信息, 本示例性实 施例对此不做特殊限定。 在步骤 S340中, 当结构化文本类型为第三类型时, 确定采用问答模型进行结构化。 其 中, 第三类型可以是非实体以及非类型字段的信息, 也可以是其他信息, 本示例 性实施例对此不做特殊限定。 在本示例 性实施例中, 根据确定出的待结构化文本的类型确定对应的结构化模型, 准确选定了文本结构化的方式, 使得文本结构化的处理方式更加与待结构化文本贴合, 保证了文本结构化的效果。 在步骤 S130中, 利用结构化模型对待结构化文本进行文本结构化, 得到病历文本的 目标结构化文本。 在本 公开的示例性实施例中, 确定结构化模型之后, 可以对待结构化文本进行文本 结构化得到病历文本的目标结构化文本。 在可选 的实施例中, 图 5 示出了一种利用结构化模型对待结构化文本进行文本结构 化的方法的流程示意图, 如图 5所示, 该方法至少可以包括以下步骤: 在步骤 S510中, 获取与英文病历文本对应 的目标格式文本以及与待结构化文本格式对应的校验规则, 并 利用校验规则对 目标格式文本进行解析得到第一结构化结果, 校验规则包括针对目标格 式文本制定的自定义规则。 命名实体识别 (Named Entity Recognition, NER) , 又称作 “专名识别”, 是指识别 文本中具有特定意义的实体, 主要包括人名、 地名、 机构名、 专有名词等。 利用训 练好的命名实体识别模型对待结构化文本进行文本结构化得到英文病历 文本 的目标结构化文本。 从医疗 文本抓取医疗实体, 例如阳性症状、 阴性症状、 疾病等是具有普适意义的, 也能为后续的关系识别和知识图谱构建奠定基础。 在利用 训练好的命名实体识别模型对待结构化文本进行文本结构化之前 , 可以先对 待训练的命名实体识别模型进行训练。 在可选 的实施例中, 图 6示出了训练命名实体识别模型的方法的流程示意图, 如图 6 所示, 该方法至少包括以下步骤: 在步骤 S610中, 获取病历样本以及与病历样本对应的 标注字段, 并利用预训练的词向量对病历样本进行文本映射得到单词嵌入向量。 该病历样 本就可以是对真实世界的病历进行文本清理、 拼写校正和段落合并得到的, 且用于训练命名实体识别模型的样本。 在获取 到真实世界的病历之后, 且在成为病历样本之前, 可以对随机抽取的病历进 行标注得到标注字段 。 该标注字段为阳性症状、 阴性症状等, 本示例性实施例对此不做 特殊限定。 对于词 向量的部分, 可以采用单词嵌入 (Word Embeddings)和 Flair 下文的字符串 嵌入 (Flair Embeddings)的堆叠。 在 pubmed embeddings中, 可以对查询病历文本中的每 个词进行文本映射得到固定的单词嵌入向量。 在步骤 S620中, 利用词嵌入对病历样本进行语境映射得到字符串嵌入向量, 并对单 词嵌入向量和字符串嵌入向量进行纵向拼接处理得到样本向量。
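As an aside before the embedding details continue below, the regular-expression clean-up and keyword-window merging described earlier can be sketched as follows; the patterns, keyword list, window size and sample text are simplified assumptions for illustration only, and the BERT-based spelling correction step is omitted:

```python
# Simplified illustration of the preprocessing steps; the patterns and the
# one-sentence keyword window are assumptions, far simpler than the original regexes.
import re

DATE_RE = re.compile(r"\b\d{1,2}[/\-.]\d{1,2}[/\-.]\d{2,4}\b")      # crude date pattern
HEADER_RE = re.compile(r"^Note Type.*?(Progress Notes|Date of Discharge)", re.I)
KEYWORD_RE = re.compile(r"\brenal\b|\bperitoneal dialysis\b|\bon pd\b", re.I)

def clean(raw_text: str) -> str:
    text = HEADER_RE.sub("", raw_text)                                   # drop the useless header
    text = DATE_RE.sub("", text)                                         # drop dates
    text = re.sub(r"\?(?=\s*(?:no|yes|nil)\b)", ":", text, flags=re.I)   # "any fever?no" -> "any fever: no"
    return re.sub(r"\s+", " ", text).strip()                             # collapse whitespace and tabs

def split_sentences(text: str) -> list[str]:
    # Stand-in for the sentence splitter mentioned in the description.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def merge_keyword_windows(corrected_text: str, window: int = 1) -> str:
    """Keep each keyword sentence plus `window` sentences on either side, then merge."""
    sentences = split_sentences(corrected_text)
    keep = set()
    for i, sentence in enumerate(sentences):
        if KEYWORD_RE.search(sentence):
            keep.update(range(max(0, i - window), min(len(sentences), i + window + 1)))
    return " ".join(sentences[i] for i in sorted(keep))

sample = ("Note Type: Progress Notes Vitals stable on 12/03/2021. Any fever?no. "
          "Patient is on PD for end stage renal disease. Plan: continue peritoneal dialysis.")
print(merge_keyword_windows(clean(sample)))
```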
Flair Embeddings则是目前处理 NER效果最好的词嵌入之一。 Flair Embeddings又被 称为上下文字符 串的嵌入。 Flair Embeddings 的两个特点分别是这些单词被理解为字符 (没有任何单词的概念), 以及嵌入是通过其周围文本进行语境化的, 单词在不同的句子 中可以具有不同的含义。 因此, 利用 Flair Embeddings可以联系语境将病历文本进行语境映射得到对应的字符 串嵌入向量。 根据语境的不同, 同样的病历文本可能会产生不同的字符串嵌入向量。 在得到 单词嵌入向量和字符串嵌入向量之后, 可以将该单词嵌入向量和字符串嵌入 向量进行纵向拼接处理, 以得到样本向量。 举例而言, 当单词嵌入向量为 100X200维, 字符串嵌入向量为 100X200维时, 经过纵向拼接处理的样本向量为 200X200维的向量。 通过 对单词嵌入向量和字符串嵌入向量的纵向拼接处理, 能够得到表征含义更为清 楚的样本向量, 为训练命名实体识别模型提供了数据支持。 在步骤 S630中, 利用标注字段和样本向量训练多个待训练的命名实体识别模型, 并 对多个训练得到的命名实体识别模型进行模型评分得到对应的多个第一模型分值。 在得 到样本向量之后, 可以利用标注字段和样本向量训练待训练的命名实体识别模 型。 举 例而言, 待训练的命名实体识 别模型可以是 由 Bi-LSTM ( Bi-directional Long Short-Term Memory, 双向长短期记忆)结合条件随机场(CRF, Conditional Random Field) 构成的, 也可以是其他模型, 本示例性实施例对此不做特殊限定。
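The stacked word and contextual string embeddings and the Bi-LSTM-CRF tagger described above can be sketched with the open-source flair library roughly as follows; the embedding keys, corpus layout and hyperparameters are assumptions rather than settings from the disclosure, and the exact flair API differs somewhat between library versions:

```python
# Illustrative flair-based NER training sketch; embedding choices, file names and
# hyperparameters are assumptions, not values taken from the disclosure.
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style annotated medical-record samples (hypothetical files).
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_label_dictionary(label_type="ner")  # make_tag_dictionary in older flair

# Word embeddings stacked (concatenated) with contextual string embeddings.
embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),             # stand-in for the PubMed word vectors mentioned above
    FlairEmbeddings("pubmed-forward"),   # contextual string embeddings
    FlairEmbeddings("pubmed-backward"),
])

# Bi-LSTM sequence tagger with a CRF decoding layer.
tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=tag_dictionary, tag_type="ner", use_crf=True)

ModelTrainer(tagger, corpus).train("models/medical-ner", learning_rate=0.1,
                                   mini_batch_size=32, max_epochs=50)
```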
公式 (1): F1 = 2 × (精确率 × 召回率) / (精确率 + 召回率)
其 中, Fi分数 (L Score) , 是统计学中用来衡量二分类模型精确度的一种指标。 它 同时兼顾了分类模型 的精确率和召回率。 F]分数可以看作是模型精确率和召回率的一种 调和平均, 它的最大值是 1 , 最小值是 0。 在步骤 S640中, 根据多个第一模型分值在多个训练得到的命名实体识别模型中确定 一个训练好的命名实体识别模型。 由于在待训练的命名实体识别模型可以有多个, 因此, 可以训练得到多个命名实体 识别模型, 并依据公式(1)计算得到多个训练得到的命名实体识别模型的第一模型分值。 将 多个第一模型分值进行比较, 以将第一模型分值最高的训练得到的命名实体识别 模型确定为训练好的命名实体识别模型, 并对该训练好的命名实体识别模型进行保存。 在本 示例性实施例中, 通过对待训练的命名实体识别模型的训练和评分, 能够得到 在验证集上表现最好 的训练好的命名实体识别模型, 为待结构化文本的命名实体识别提 供模型基础。 在得 到训练好的命名实体识别模型之后, 可以利用该训练好的命名实体识别模型对 待结构化文本进行文本结构化。 由于真实世界的病历文本可能是 HTML格式的, 因此非结构化文本也可以是根据 HTML 格式的文件获得的, 只是失去了 HTML格式的结构, 例如表格等信息。 因此, 可以获取到生成非结构化文本的 HTML格式的目标格式文件。 针对 该目标格式文件, 可以制定自定义规则来辅助命名实体识别模型的预测。 其中, 自定义规则即为校验规则, 该校验规则是校验目标格式文本的格式的规则。 其 中, 校验规则可以包括目标格式文本的格式是第一行第一列为主诉, 除第一行的 所有行的第一列为症状等, 本示例性实施例对此不做特殊限定。 进一步 的, 可以利用该校验规则从目标格式文本中解析出与症状相关的目标表格。 并且, 该目标表格中还包括症状的状态, 亦即第一结构化结果。 其 中, 症状的状态是由 yes代表该症状为阳性, no代表该症状为阴性。 在步骤 S520中, 利用命名实体识别模型对待结构化文本进行命名实体识别得到病历 文本的第二结构化结果。 将待结构化文本输入至训练好的命名实体识别模型 中, 可以得到对待结构化文本进 行命名实体识别后的第二结构化结果。 在步骤 S530中, 根据第一结构化结果和第二结构化结果进行结合, 确定得到结构化 文本。 将利用校验规则对 目标格式文件进行解析之后的第一结构化结果与命名实体识别模 型输出的第二结构化结果进行结合以得到目标结构化结果。 举例而 言, 当利用校验规则对目标格式文件进行解析之后的第一结构化结果为 ABC, 而命名实体识别模型输出的第二结构化结果为 BD时, 可以将第一结构化结果中不存在的 第二结构化结果的 D添加到第一结构化结果中, 以得到目标结构化结果为 ABCDo 在本示例性实施例 中, 利用训练好的命名实体识别模型能够实现对待结构化文本的 命名实体识别, 同时通过校验规则进行结果校验, 大大提升了命名实体识别的准召率。 在可选 的实施例中, 将待结构化文本输入至文本分类模型中, 以使文本分类模型对 待结构化文本进行文本结构化。 文本分类是对文本集 (或其他实体、 物件)按照一定的分类体系或标准进行自动分类 标记。 它根据一个已经被标注的训练文档集合找到文档特征和文档类别之间的关系模型, 然后利用这种学习得到的关系模型对新的文档进行类别判断。 文本分类从基于知识的方 法逐渐转变为基于统计和机器学习的方法。 文本分类一般包括 了文本的表达、 分类器的选择与训练、 分类结果的评价与反馈 等过程。 其中, 文本的表达又可细分为文本预处理、 索引和统计、 特征抽取等步骤。 文 本分类系统的总体功能模块为预处理、 索引、 统计、 特征抽取、 分类器和评价。 其 中, 预处理是将原始语料格式化为同一格式, 便于后续的统一处理; 索引是将文 档分解为基本处理单元, 同时降低后续处理的开销; 统计是词频统计, 项 (单词、 概念) 与分类的相关概率; 特征抽取是从文档中抽取出反映文档主题的特征; 分类器是分类器 的训练; 评价是分类器的测试结果分析。 一些字段如是否有 肾病、 是否是末期肾病是分类问题, 因此需要训练好的分类器实 现实现。 然而获取大量标注数据难度较大 , 而 BERT只需要少量的标注数据即可达到非常不 错的效果, 因此采取通过微调的方式训练 BERT模型训练文本分类器。 在可选 的实施例中, 图 7示出了训练文本分类模型的方法的流程示意图, 如图 7所示, 该方法至少包括以下步骤: 在步骤 S710中, 获取病历样本, 并利用病历样本中的文本分 类字段训练多个初始文本分类模型。 其 中, 该病历样本就可以是对真实世界的病历进行文本清理、 拼写校正和段落合并 得到的, 且用于训练命名实体识别模型的样本。 预训练 的文本分类模型可以是在预训练 BERT模型之上加入线性分类器 (Linear Classifier) 构成。 利用病历文本从 头训练线性分类器, 并微调 BERT 的参数, 使得整个文本分类模型, 亦即 BERT模型和 Linear Classifier结构能一起最大化当前下游任务的目标, 以得到微调 后的文本分类模型。 除此之外 , 文本分类模型也可以是由其他结构或模型构成, 本示例性实施例对此不 做特殊限定。 在步骤 S720中, 利用优化算法优化多个初始文本分类模型, 得到每一初始文本分类 模型对应的第二分值。 利用训练集和验证集 中的病历样本调节微调后的文本分类模型的超参数。 其中, 文 本分类模型的超参数包括训练神经网络的学习速率、 权值衰减系数、 慢热学习的比例 等。 利用训练集 中的病历样本在一次次的迭代和遍历数据集的过程中不断调整微调后的 文本分类模型的权重以及优化模型的表现。 其 中, 预训练的文本分类模型还可以采用更为新颖, 且更适用于生物医学的与训练 模型和词向量 Bio-Clinical BERT。 进一步 的, 按照公式 (1) 对微调后的文本分类模型进行模型评分得到文本分类模型 的 Fi分数, 以作为第二模型分值。 在步骤 S730中, 根据多个第二分值确定文本分类模型。 由于训练的文本分类模型可以有多个, 例如树结构的随机森林和 XGBoost(一个优化 的分布式梯度增强库), 或者是其他深度学习模型, 因此, 可以训练或者微调, 并进一步 优化得到多个文本分类模型, 那么, 根据公式 (1) 可以计算得到多个文本分类模型的第 二模型分值。 将多个第二模型分值进行 比较, 以将第二模型分值最高的得到的文本分类模型确定 为迁移学习后的文本分类模型, 并对该迁移学习后的文本分类模型进行保存。 在本示例性实施例 中, 在训练文本分类模型的过程中只需要少量的病历文本即可达 到非常好的效果, 使得文本分类模型相比于现有的基于少量数据的规则集而言, 大大减 少了研发时长和迭代时间, 并显著提升了准确率。 在得到迁移学习后的文本分类模型之后, 可以将待结构化文本输入至迁移学习后的 文本分类模型中, 以使迁移学习后的文本分类模型输出该待结构化文本的分类结果, 以 将该分类结果作为非结构化文本的目标结构化文本。 对于一些不是分类性质 的字段, 例如疾病开始时间等, 可以采取目标文本进行答案 搜索的问答系统方式。 举例而言 , 目标问题可以是肾病开始时间等, 本示例性实施例对此不做特殊限定。 其 中, 问答系统 (Question Answering System, QA) 是信息检索系统的一种高级形 式, 它能用准确、 简洁的自然语言回答用户用自然语言提出的问题。 其研究兴起的主要 原因是人们对快速、 准确地获取信息的需求。 问答系统是人工智能和自然语言处理领域 中一个倍受关注并具有广泛发展前景的研究方向。 通过多个 由原文的病历样本、 问题样本与答案样本组成的样本对, 可以使机器学习 出一个迁移学习后的问答系统。 具体 的, 获取病历样本、 问题样本以及与问题样本对应的答案样本, 并利用病历样 本、 问题样本和答案样本微调预训练的问答模型得到微调后的问答模型。 其 中, 预训练的问答模型可以采用 Bio BERT, 也可以根据实际情况和需求选择其他 模型, 本示例性实施例对此不做特殊限定。 由于预训练的 Bio BERT模型是由海量训练产生的, 因此, 在针对病例文本形成的目 标段落的结构化的特定场景下, 只需要少量的问题样本和答案样本对预训练的 Bio BERT 模型进行微调, 
以使该预训练的 Bio BERT中比较通用的概念能够更好的针对特定场景深 入发挥。 值得说 明的是, 问题样本可以是病例文本中与疾病相关的几种或者十几种问题构成 的样本, 那么, 答案样本可以是根据问题样本在病历样本中标注出的对应答案, 从而构 成的样本。 在进行数据预处理 的过程中, 需要注意的是, 在每一个病历样本、 问题样本与答案 样本组成的样本对里, 都要保证答案样本是来自病历样本中的部分内容, 因此要保证答 案样本和原文的病历样本一致。 预训练 的 Bio BERT模型在经过病历样本、 问题样本与答案样本组成的样本对的微调, 能够学习出如何从原文中搜索出目标问题的对应答案的能力。 对微调后 的问答模型进行模型评分得到第三模型分值, 并根据第三模型分值在微调 后的问答模型中确定得到迁移学习后的问答模型。 进一步 的, 按照公式(1)对微调后的问答模型进行模型评分得到问答模型的 Fi分数, 以作为第三模型分值。 由于训练或者微调的问答模型可以有多个, 因此, 可以训练或者微调, 并进一步优 化得到多个问答模型, 那么, 根据公式(1)可以计算得到多个问答模型的第三模型分值。 将多个第三模型分值进行 比较, 以将第三模型分值最高的得到的问答模型确定为迁 移学习后的问答模型, 并对该迁移学习后的问答模型进行保存。 在微调 问答模型的过程中, 只需要少量的样本即可达到很好的效果, 大大减少了研 发时长和迭代时间, 并且显著提高了文本结构化的准确率。 在可选 的实施例中, 图 8示出了另一种利用结构化模型对待结构化文本进行文本结 构化的方法的流程示意图, 如图 8所示, 该方法至少包括以下步骤: 在步骤 S810中, 获 取与待结构化文本对应的目标问题。 值得说 明的是, 该目标问题对应的答案必须是属于待结构化文本的某一部分。 在步骤 S820中, 基于目标问题, 利用问答模型对待结构化文本进行答案搜索得到病 历文本的目标结构化文本。 在得到迁移学习后的 问答模型之后, 可以将目标问题和待结构化文本输入至迁移学 习后的问答系统中, 以使迁移学习后的问答系统从目标段落中搜索出目标问题的对应答 案, 以确定该答案为病历文本文本的目标结构化文本。 基于此, 在模型训练初期, 通过 对字段定义的研究, 确定字段适合的结构化模型。 例如 , 在通过命名实体识别实现结构化时, 一个文本中可以出现多个阳性症状和多 个阴性症状, 可以通过一个实体识别模型一次性识别所有的实体, 结构化的输出是阳性 症状、 阴性症状各为一个列表。 在通过文本分类实现结构化时, 例如字段为 “病患是否为末期肾病”的情况, 一个 文本只能有一个输出: yes/no/null(null表示无法推测), 这种情况下分类模型可以达到效 果 O 最后一种情况 比较特殊, 是通过问答系统实现结构化时, 例如肾病开始时间, 输出 的是原文中的一段字符串, 输入的是文本和一个问题, 模型通过学习样本, 做到从原文 搜索出用户问题的答案。 在本公开 的应用场景中, 对英文病历文本进行文本预处理, 能够自动识别, 并改正 拼写错误, 保证了文本的准确性和有效性, 大大减少了因拼写错误导致的文本结构化效 果不佳或遗漏的情况发生。 进一步的, 利用对应的结构化模型对待结构化文本进行文本 结构化, 减少了研发时长和迭代时间, 显著提升了文本结构化的准确性和准召率。 此外 , 在本公开的示例性实施例中, 还提供一种英文医疗文本结构化的装置。 图 9 示出了英文医疗文本结构化的装置的结构示意图, 如图 9所示, 英文医疗文本结构化的 装置 900可以包括: 文本获取模块 910、 模型选择模块 920和结果生成模块 930。 其中: 文本获取模块 910, 被配置为获取英文病历文本, 并对所述英文病历文本进行文本预 处理得到待结构化文本; 模型选择模块 920, 被配置为确定所述待结构化文本的类型, 并根据所述待结构化文 本的类型确定对应的结构化模型; 结果生成模块 930, 被配置为利用所述结构化模型对所述待结构化文本进行文本结构 化, 得到所述病历文本的目标结构化文本。 在本公开 的一种示例性实施例中, 所述对所述病历文本进行文本预处理得到待结构 化文本, 包括: 利用拼写校正模型对所述病历文本进行拼写校正得到校正文本; 识别所 述校正文本中的目标关键词, 并根据所述目标关键词对所述校正文本进行合并得到待结 构化文本。 在本公开 的一种示例性实施例中, 所述确定所述待结构化文本的类型, 并根据所述 待结构化文本的类型确定对应的结构化模型, 包括: 根据所述待结构化文本的目标字段 确定所述待结构化文本的类型; 当所述待结构化文本类型为第一类型时, 确定采用命名 实体识别模型进行结构化; 当所述待结构化文本类型为第二类型时, 确定采用文本分类 模型进行结构化; 当所述待结构化文本类型为第三类型时, 确定采用问答模型进行结构 化。 在本公开 的一种示例性实施例中, 所述根据所述待结构化文本的目标字段确定所述 待结构化文本的类型, 包括: 识别所述待结构化文本的目标字段, 当所述目标字段为命 名实体信息, 确定所述待结构化文本为第一类型; 当所述目标字段为文本分类信息, 确 定所述待结构化文本为第二类型; 当所述目标字段为问答信息, 确定所述待结构化文本 为第三类型。 在本公开 的一种示例性实施例中, 确定采用命名实体识别模型进行结构化时, 所述 利用所述结构化模型对所述待结构化文本进行文本结构化, 包括: 获取与所述病历文本 对应的目标格式文本以及与所述待结构化文本格式对应的校验规则, 并利用所述校验规 则对所述目标格式文本进行解析得到第一结构化结果, 所述校验规则包括针对所述目标 格式文本制定的自定义规则; 利用命名实体识别模型对所述待结构化文本进行命名实体 识别得到所述病历文本的第二结构化结果; 根据所述第一结构化结果和所述第二结构化 结果进行结合, 确定目标结构化文本。 在本公开 的一种示例性实施例中, 所述命名实体识别模型通过如下训练步骤得到: 获取病历样本以及与所述病历样本对应的标注字段, 并利用预训练的词向量对所述病历 样本进行文本映射得到单词嵌入向量; 利用词嵌入对所述病历样本进行语境映射得到字 符串嵌入向量, 并对所述单词嵌入向量和所述字符串嵌入向量进行纵向拼接处理得到样 本向量; 利用所述标注字段和所述样本向量训练多个待训练的命名实体识别模型, 并对 多个训练得到的命名实体识别模型进行模型评分得到对应的多个第一模型分值; 根据多 个所述第一模型分值在多个所述训练得到的命名实体识别模型中确定一个训练好的命名 实体识别模型。 在本公开 的一种示例性实施例中, 所述文本分类模型通过如下步骤训练得到: 获取 病历样本, 并利用所述病历样本中的文本分类字段训练多个初始文本分类模型; 利用优 化算法优化多个所述初始文本分类模型, 得到每一所述初始文本分类模型对应的第二分 值; 根据多个所述第二分值确定所述文本分类模型。 在本公开 的一种示例性实施例中, 确定采用问答模型进行结构化时, 所述利用所述 结构化模型对所述待结构化文本进行文本结构化, 包括: 获取与所述待结构化文本对应 的目标问题; 基于所述目标问题, 利用问答模型对所述待结构化文本进行答案搜索得到 所述病历文本的目标结构化文本。 上述英文医疗文本结构化的装置 900的具体细节已经在对应的英文医疗文本结构化 的方法中进行了详细的描述, 因此此处不再赘述。 应 当注意, 尽管在上文详细描述中提及了英文医疗文本结构化的装置 900的若干模 块或者单元, 但是这种划分并非强制性的。 实际上, 根据本公开的实施方式, 上文描述 的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。 反之, 上 文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体 化。 此外 , 在本公开的示例性实施例中, 还提供了一种能够实现上述方法的电子设备。 下面参照 图 10来描述根据本公开的这种实施例的电子设备 1000 o 图 10显示的电子 设备 1000仅仅是一个示例, 不应对本公开实施例的功能和使用范围带来任何限制。 如 图 10所示, 电子设备 1000以通用计算设备的形式表现。 电子设备 1000的组件可 以包括但不限于: 上述至少一个处理单元 1010、 上述至少一个存储单元 1020、 连接不同 系统组件 (包括存储单元 1020和处理单元 1010) 
的总线 1030、 显示单元 1040。 其 中, 所述存储单元存储有程序代码, 所述程序代码可以被所述处理单元 1010执行, 使得所述处理单元 1010执行本说明书上述 “示例性方法 ”部分中描述的根据本公开各种示 例性实施例的步骤。 存储单元 1020可以包括易失性存储单元形式的可读介质, 例如随机存取存储单元 (RAM) 1021和 /或高速缓存存储单元 1022, 还可以进一步包括只读存储单元 (ROM) 1023 o 存储单元 1020还可以包括具有一组(至少一个)程序模块 1025的程序 /实用工具 1024, 这样的程序模块 1025包括但不限于: 操作系统、 一个或者多个应用程序、 其它程序模块 以及程序数据, 这些示例中的每一个或某种组合中可能包括网络环境的实现。 总线 1030可以为表示几类总线结构中的一种或多种, 包括存储单元总线或者存储单 元控制器、 外围总线、 图形加速端口、 处理单元或者使用多种总线结构中的任意总线结 构的局域总线。 电子设备 1000也可以与一个或多个外部设备 1200 (例如键盘、 指向设备、 蓝牙设备 等)通信, 还可与一个或者多个使得用户能与该电子设备 1000交互的设备通信, 和 /或与 使得该电子设备 1000能与一个或多个其它计算设备进行通信的任何设备 (例如路由器、 调制解调器等等)通信。 这种通信可以通过输入 /输出(I/O)接口 1050进行。 并且, 电子 设备 1000还可以通过网络适配器 1060与一个或者多个网络(例如局域网(LAN) , 广域 网 (WAN)和 /或公共网络, 例如因特网)通信。 如图所示, 网络适配器 1040通过总线 1030与电子设备 1000的其它模块通信。 应当明白, 尽管图中未示出, 可以结合电子设备 1000使用其它硬件和 /或软件模块, 包括但不限于: 微代码、 设备驱动器、 冗余处理单元、 外部磁盘驱动阵列、 RAID系统、 磁带驱动器以及数据备份存储系统等。 通过 以上的实施例的描述, 本领域的技术人员易于理解, 这里描述的示例实施例可 以通过软件实现, 也可以通过软件结合必要的硬件的方式来实现。 因此, 根据本公开实 施例的技术方案可以以软件产品的形式体现出来, 该软件产品可以存储在一个非易失性 存储介质 (可以是 CD-ROM, U盘, 移动硬盘等)中或网络上, 包括若干指令以使得一台 计算设备 (可以是个人计算机、 服务器、 终端装置、 或者网络设备等)执行根据本公开实 施例的方法。 在本公开 的示例性实施例中, 还提供了一种计算机可读存储介质, 其上存储有能够 实现本说明书上述方法的程序产品。 在一些可能的实施例中, 本公开的各个方面还可以 实现为一种程序产品的形式, 其包括程序代码, 当所述程序产品在终端设备上运行时, 所述程序代码用于使所述终端设备执行本说明书上述 “示例性方法 ”部分中描述的根据本 公开各种示例性实施例的步骤。 参考 图 11所示, 描述了根据本公开的实施例的用于实现上述方法的程序产品 1100, 其可以采用便携式紧凑盘只读存储器(CD-ROM)并包括程序代码, 并可以在终端设备, 例 如个人电脑上运行。 然而, 本公开的程序产品不限于此, 在本文件中, 可读存储介质可 以是任何包含或存储程序的有形介质, 该程序可以被指令执行系统、 装置或者器件使用 或者与其结合使用。 所述程序产 品可以采用一个或多个可读介质的任意组合。 可读介质可以是可读信号 介质或者可读存储介质 。 可读存储介质例如可以为但不限于电、 磁、 光、 电磁、 红外线、 或半导体的系统、 装置或器件, 或者任意以上的组合。 可读存储介质的更具体的例子(非 穷举的列表)包括: 具有一个或多个导线的电连接、 便携式盘、 硬盘、 随机存取存储器 (RAM)、 只读存储器(ROM)、 可擦式可编程只读存储器(EPROM或闪存)、 光纤、 便携式紧凑盘只读存储器(CD-ROM)、 光存储器件、 磁存储器件、 或者上述的任意合适的 组合。 计算机可读信号介质可 以包括在基带中或者作为载波一部分传播的数据信号, 其中 承载了可读程序代码 。 这种传播的数据信号可以采用多种形式, 包括但不限于电磁信号、 光信号或上述的任意合适的组合。 可读信号介质还可以是可读存储介质以外的任何可读 介质, 该可读介质可以发送、 传播或者传输用于由指令执行系统、 装置或者器件使用或 者与其结合使用的程序。 可读介质 上包含的程序代码可以用任何适当的介质传输, 包括但不限于无线、 有线、 光缆、 RF等等, 或者上述的任意合适的组合。 可 以以一种或多种程序设计语言的任意组合来编写用于执行本公开操作 的程序代码, 所述程序设计语言包括面向对象的程序设计语言一诸如 Java、 C++等, 还包括常规的过程 式程序设计语言一诸如 “C”语言或类似的程序设计语言。 程序代码可以完全地在用户计算 设备上执行、 部分地在用户设备上执行、 作为一个独立的软件包执行、 部分在用户计算 设备上部分在远程计算设备上执行、 或者完全在远程计算设备或服务器上执行。 在涉及 远程计算设备的情形中, 远程计算设备可以通过任意种类的网络, 包括局域网(LAN)或 广域网 (WAN) , 连接到用户计算设备, 或者, 可以连接到外部计算设备 (例如利用因 特网服务提供商来通过因特网连接) 。 本领域技术人员在考虑说 明书及实践这里公开的公开后, 将容易想到本公开的其他 实施例。 本申请旨在涵盖本公开的任何变型、 用途或者适应性变化, 这些变型、 用途或 者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识 或惯用技术手段。 说明书和实施例仅被视为示例性的, 本公开的真正范围和精神由权利 要求指出。
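As a final illustration, the BERT-based text classification and question-answering paths described in the detailed description can be sketched with the public transformers library as follows; the checkpoint names, label count and example question are assumptions chosen for illustration, not models or parameters of the disclosure:

```python
# Illustrative transformers-based sketch of the classification and QA structuring
# paths; model checkpoints, labels and the question are assumptions only.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

record = "Patient has been on peritoneal dialysis since 2019 for end stage renal disease."

# Text classification path (e.g. "is the patient end-stage renal disease: yes/no/null").
cls_name = "emilyalsentzer/Bio_ClinicalBERT"           # public clinical BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(cls_name)
classifier = AutoModelForSequenceClassification.from_pretrained(cls_name, num_labels=3)
inputs = tokenizer(record, return_tensors="pt", truncation=True)
logits = classifier(**inputs).logits                   # fine-tuning on labelled records would precede real use
print(logits.argmax(-1).item())

# Question-answering path (e.g. extracting the disease start time as a span of the record).
qa = pipeline("question-answering", model="dmis-lab/biobert-base-cased-v1.1-squad")
print(qa(question="When did the kidney disease start?", context=record))
```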

Claims

Claims
1. A method for structuring English medical text, the method comprising: acquiring English medical record text, and performing text preprocessing on the English medical record text to obtain text to be structured; determining a type of the text to be structured, and determining a corresponding structuring model according to the type of the text to be structured; and performing text structuring on the text to be structured by using the structuring model, to obtain target structured text of the medical record text.
2. The method according to claim 1, wherein performing text preprocessing on the medical record text to obtain the text to be structured comprises: performing spelling correction on the medical record text by using a spelling correction model to obtain corrected text; and identifying target keywords in the corrected text, and merging the corrected text according to the target keywords to obtain the text to be structured.
3. The method according to claim 1, wherein determining the type of the text to be structured and determining the corresponding structuring model according to the type of the text to be structured comprises: determining the type of the text to be structured according to a target field of the text to be structured; when the type of the text to be structured is a first type, determining that a named entity recognition model is used for structuring; when the type of the text to be structured is a second type, determining that a text classification model is used for structuring; and when the type of the text to be structured is a third type, determining that a question answering model is used for structuring.
4. The method according to claim 3, wherein determining the type of the text to be structured according to the target field of the text to be structured comprises: identifying the target field of the text to be structured, and when the target field is named entity information, determining that the text to be structured is of the first type; when the target field is text classification information, determining that the text to be structured is of the second type; and when the target field is question answering information, determining that the text to be structured is of the third type.
5. The method according to claim 3, wherein, when it is determined that the named entity recognition model is used for structuring, performing text structuring on the text to be structured by using the structuring model comprises: acquiring target-format text corresponding to the English medical record text and a check rule corresponding to the format of the text to be structured, and parsing the target-format text by using the check rule to obtain a first structured result, the check rule comprising a custom rule formulated for the target-format text; performing named entity recognition on the text to be structured by using the named entity recognition model to obtain a second structured result of the medical record text; and combining the first structured result and the second structured result to determine the target structured text.
6. The method according to claim 3, wherein the named entity recognition model is obtained through the following training steps: acquiring medical record samples and annotated fields corresponding to the medical record samples, and performing text mapping on the medical record samples by using pre-trained word vectors to obtain word embedding vectors; performing context mapping on the medical record samples by using word embeddings to obtain string embedding vectors, and performing vertical concatenation on the word embedding vectors and the string embedding vectors to obtain sample vectors; training a plurality of named entity recognition models to be trained by using the annotated fields and the sample vectors, and performing model scoring on the plurality of trained named entity recognition models to obtain a corresponding plurality of first model scores; and determining one trained named entity recognition model from the plurality of trained named entity recognition models according to the plurality of first model scores.
7. The method according to claim 3, wherein the text classification model is obtained through the following training steps: acquiring medical record samples, and training a plurality of initial text classification models by using text classification fields in the medical record samples; optimizing the plurality of initial text classification models by using an optimization algorithm to obtain a second score corresponding to each initial text classification model; and determining the text classification model according to the plurality of second scores.
8. The method according to claim 3, wherein, when it is determined that the question answering model is used for structuring, performing text structuring on the text to be structured by using the structuring model comprises: acquiring a target question corresponding to the text to be structured; and based on the target question, performing an answer search on the text to be structured by using the question answering model to obtain the target structured text of the medical record text.
9. An apparatus for structuring English medical text, comprising: a text acquisition module configured to acquire English medical record text, and perform text preprocessing on the English medical record text to obtain text to be structured; a model selection module configured to determine a type of the text to be structured, and determine a corresponding structuring model according to the type of the text to be structured; and a result generation module configured to perform text structuring on the text to be structured by using the structuring model, to obtain target structured text of the medical record text.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a transmitter, implements the method for structuring English medical text according to any one of claims 1-8.
11. An electronic device, comprising: a transmitter; and a memory for storing executable instructions of the transmitter; wherein the transmitter is configured to perform, by executing the executable instructions, the method for structuring English medical text according to any one of claims 1-8.
PCT/IB2022/057919 2022-08-24 2022-08-24 英文医疗文本结构化的方法、装置、介质及电子设备 WO2024042348A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2022/057919 WO2024042348A1 (zh) 2022-08-24 2022-08-24 英文医疗文本结构化的方法、装置、介质及电子设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2022/057919 WO2024042348A1 (zh) 2022-08-24 2022-08-24 英文医疗文本结构化的方法、装置、介质及电子设备

Publications (1)

Publication Number Publication Date
WO2024042348A1 true WO2024042348A1 (zh) 2024-02-29

Family

ID=90012645

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2022/057919 WO2024042348A1 (zh) 2022-08-24 2022-08-24 英文医疗文本结构化的方法、装置、介质及电子设备

Country Status (1)

Country Link
WO (1) WO2024042348A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117995347A (zh) * 2024-04-07 2024-05-07 北京惠每云科技有限公司 病历内涵质控方法、装置、电子设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032648A (zh) * 2019-03-19 2019-07-19 微医云(杭州)控股有限公司 一种基于医学领域实体的病历结构化解析方法
CN112582073A (zh) * 2020-12-30 2021-03-30 天津新开心生活科技有限公司 医疗信息获取方法、装置、电子设备和介质
US20210174923A1 (en) * 2019-12-06 2021-06-10 Ankon Technologies Co., Ltd Method, device and medium for structuring capsule endoscopy report text
CN113160999A (zh) * 2021-04-25 2021-07-23 厦门拜特信息科技有限公司 用于医疗决策的数据结构化分析系统与数据处理方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032648A (zh) * 2019-03-19 2019-07-19 微医云(杭州)控股有限公司 一种基于医学领域实体的病历结构化解析方法
US20210174923A1 (en) * 2019-12-06 2021-06-10 Ankon Technologies Co., Ltd Method, device and medium for structuring capsule endoscopy report text
CN112582073A (zh) * 2020-12-30 2021-03-30 天津新开心生活科技有限公司 医疗信息获取方法、装置、电子设备和介质
CN113160999A (zh) * 2021-04-25 2021-07-23 厦门拜特信息科技有限公司 用于医疗决策的数据结构化分析系统与数据处理方法

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117995347A (zh) * 2024-04-07 2024-05-07 北京惠每云科技有限公司 病历内涵质控方法、装置、电子设备及存储介质

Similar Documents

Publication Publication Date Title
US11132370B2 (en) Generating answer variants based on tables of a corpus
Arora et al. Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis
US9373075B2 (en) Applying a genetic algorithm to compositional semantics sentiment analysis to improve performance and accelerate domain adaptation
US9606990B2 (en) Cognitive system with ingestion of natural language documents with embedded code
US10366107B2 (en) Categorizing questions in a question answering system
US10339453B2 (en) Automatically generating test/training questions and answers through pattern based analysis and natural language processing techniques on the given corpus for quick domain adaptation
US10671929B2 (en) Question correction and evaluation mechanism for a question answering system
JP6095621B2 (ja) 回答候補間の関係を識別および表示する機構、方法、コンピュータ・プログラム、ならびに装置
US9373086B1 (en) Crowdsource reasoning process to facilitate question answering
US10331659B2 (en) Automatic detection and cleansing of erroneous concepts in an aggregated knowledge base
US10956463B2 (en) System and method for generating improved search queries from natural language questions
US20160232222A1 (en) Generating Usage Report in a Question Answering System Based on Question Categorization
US9720962B2 (en) Answering superlative questions with a question and answer system
US10956824B2 (en) Performance of time intensive question processing in a cognitive system
WO2013088287A1 (en) Generation of natural language processing model for information domain
WO2023029506A1 (zh) 病情分析方法、装置、电子设备及存储介质
US20170371955A1 (en) System and method for precise domain question and answer generation for use as ground truth
US10628749B2 (en) Automatically assessing question answering system performance across possible confidence values
US20170351754A1 (en) Automated Timeline Completion Using Event Progression Knowledge Base
US20170371956A1 (en) System and method for precise domain question and answer generation for use as ground truth
US20170140290A1 (en) Automated Similarity Comparison of Model Answers Versus Question Answering System Output
Liu et al. Augmented LSTM framework to construct medical self-diagnosis android
US10552461B2 (en) System and method for scoring the geographic relevance of answers in a deep question answering system based on geographic context of a candidate answer
US20180082187A1 (en) System and Method for Scoring the Geographic Relevance of Answers in a Deep Question Answering System Based on Geographic Context of an Input Question
Wang et al. Novel medical question and answer system: graph convolutional neural network based with knowledge graph optimization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22956384

Country of ref document: EP

Kind code of ref document: A1