WO2021164301A1 - Medical text structuring method and apparatus, computer device and storage medium - Google Patents

Medical text structuring method and apparatus, computer device and storage medium

Info

Publication number
WO2021164301A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
code file
medical
sentence
characteristic
Application number
PCT/CN2020/124215
Other languages
French (fr)
Chinese (zh)
Inventor
朱威
何义龙
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021164301A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to the field of intelligent decision-making in artificial intelligence, and in particular to a method, device, computer equipment, and storage medium for structuring medical text.
  • At present, a single medical source text contains a large amount of medical knowledge text, and this text involves many kinds of medical knowledge from the medical field.
  • When this medical knowledge text needs to be displayed in an interface, it has to be edited manually.
  • The editing is meant to make the text structured and easy to view, but the inventors realized that the text format of medical knowledge text in a source text is usually uneven, and most of it is presented in an unstructured form; as a result, manual editing is error-prone, editing efficiency is low, and editing takes a great deal of time.
  • In particular, when newly published medical knowledge text (such as new product instructions in the medical field) needs to be shown to users, the text must follow a specific structured format, for example correct segmentation and reasonable indentation. Producing structured medical text suitable for external display through manual editing is time-consuming and labor-intensive, so those skilled in the art urgently need a new technical solution to the above problems.
  • A medical text structuring method, including:
  • obtaining the second characteristic sentence output by the preset article semantic recognition model;
  • the second characteristic sentence includes a preset number of positions to be segmented, determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
  • A medical text structuring device, including:
  • a grabbing module, used to grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
  • a splitting module, used to identify all punctuation marks in the unstructured medical knowledge text and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
  • a first acquiring module, configured to obtain a semantic feature vector corresponding to each first characteristic sentence after the first characteristic sentences are input into the preset language recognition model;
  • a second acquiring module, configured to obtain the second characteristic sentence output by the preset article semantic recognition model after all the semantic feature vectors are input into it; the second characteristic sentence includes a preset number of positions to be segmented, determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
  • an insertion module, used to call up the first code file of the medical source text to be processed, query the second characteristic sentence in the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence, obtaining a second code file;
  • a display module, used to run the second code file so as to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
  • A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • obtaining the second characteristic sentence output by the preset article semantic recognition model;
  • the second characteristic sentence includes a preset number of positions to be segmented, determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
  • One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • obtaining the second characteristic sentence output by the preset article semantic recognition model;
  • the second characteristic sentence includes a preset number of positions to be segmented, determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
  • The above medical text structuring method, device, computer device and storage medium use models and segmentation symbols to replace the previous manual editing of the unstructured medical knowledge text in the medical source text to be processed, avoiding the high error rate and large time cost of manual editing and improving the efficiency of converting unstructured medical knowledge text into structured medical knowledge text.
  • FIG. 1 is a schematic diagram of an application environment of the medical text structuring method in an embodiment of the present application;
  • FIG. 2 is a flowchart of the medical text structuring method in an embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of the medical text structuring device in an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a computer device in an embodiment of the present application.
  • the medical text structuring method provided in this application can be applied in the application environment as shown in Fig. 1, in which the client communicates with the server through the network.
  • the client can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • In one embodiment, as shown in FIG. 2, a medical text structuring method is provided.
  • The method is described by taking its application to the server in FIG. 1 as an example and includes the following steps:
  • S10: Grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed.
  • Understandably, the medical source text to be processed may refer to a webpage containing medical-domain text, and the unstructured medical knowledge text may include, but is not limited to, the formulations of various medical drugs on the webpage, descriptions of their various therapeutic functions, product instructions for medical drugs, and the like.
  • Unstructured medical knowledge text refers to text without a fixed format, where the fixed format includes but is not limited to paragraph format, character format, indentation format and spacing format. In this embodiment the text in the medical source text to be processed is uploaded by different users, and the formats used by different users during editing are inconsistent, so when different users upload whole passages to the same medical source text, the passage finally displayed suffers from inconsistent text formatting.
  • In addition, the various input components or display components of the medical source text to be processed may be incompatible with the text, and copying a whole passage from a display component of one medical source text into an input component of another may turn previously structured medical knowledge text into unstructured medical knowledge text.
  • Specifically, in this embodiment the unstructured medical knowledge text is grabbed from the medical source text to be processed by recognizing the text: after all the text in the display interface of the medical source text is recognized, the text selected by the user can be taken as the unstructured medical knowledge text, or an NLP model can be used, and when the NLP model recognizes that the text in the medical source text to be processed has several inconsistent formats or no format at all, that text is grabbed as the unstructured medical knowledge text.
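  • The patent does not tie this grabbing step to any particular implementation; as a rough sketch only (the class and function names below are hypothetical), the visible text of a page's code file could be collected into one unstructured block with the Python standard library:

      from html.parser import HTMLParser

      class TextGrabber(HTMLParser):
          """Collects the visible text of a page so it can be treated as one
          unstructured block of medical knowledge text."""
          def __init__(self):
              super().__init__()
              self.chunks = []

          def handle_data(self, data):
              text = data.strip()
              if text:
                  self.chunks.append(text)

      def grab_unstructured_text(first_code_file_html: str) -> str:
          # Concatenating every text node deliberately ignores any original
          # paragraph structure, which is what makes the result "unstructured".
          parser = TextGrabber()
          parser.feed(first_code_file_html)
          return "".join(parser.chunks)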
  • S20: Identify all punctuation marks in the unstructured medical knowledge text, and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks.
  • Understandably, the punctuation marks in the unstructured medical knowledge text can be recognized by a punctuation-recognition component or by an NLP model, and the sentences in the not-yet-structured medical knowledge text are then segmented at the recognized punctuation marks to obtain the plurality of first characteristic sentences, where the punctuation marks used for splitting are symbols that end a complete sentence, such as periods, exclamation marks or question marks.
  • In this embodiment the text is split into multiple first characteristic sentences, each representing the features of one complete sentence. These features provide the connection relationships of a complete sentence for the subsequent semantic recognition process and avoid mixed recognition across sentence boundaries.
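  • A minimal sketch of this splitting step, assuming a simple regular expression over sentence-ending punctuation is enough (the patent equally allows a dedicated punctuation-recognition component or an NLP model):

      import re

      # Sentence-ending punctuation (Chinese and Western full stops, question marks
      # and exclamation marks), i.e. the symbols that can split a complete sentence.
      SENTENCE_END = re.compile(r"(?<=[。！？.!?])")

      def split_into_first_sentences(unstructured_text: str) -> list[str]:
          # Every non-empty piece becomes one "first characteristic sentence".
          return [s.strip() for s in SENTENCE_END.split(unstructured_text) if s.strip()]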
  • S30: After the first characteristic sentences are input into the preset language recognition model, obtain a semantic feature vector corresponding to each first characteristic sentence.
  • Understandably, the preset language recognition model can be a bert model, where the bert model captures descriptions of the first characteristic sentence at the sentence level and at the level of each of its words; the goal of the bert model is to be trained on a large-scale unlabeled corpus so as to obtain a representation of the rich semantic information contained in the first characteristic sentence.
  • The core of the bert model is the Transformer module, which is built with the Attention mechanism, and the Transformer modules so created can be assembled into the above bert model. This embodiment uses the word-to-sentence relationships in the bert model to obtain the semantic feature vector corresponding to each first characteristic sentence.
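  • As an illustration of how a bert model could map one first characteristic sentence to one semantic feature vector, a sketch using the Hugging Face transformers library is given below; the bert-base-chinese checkpoint and the mean pooling over token states are assumptions, not requirements of the patent:

      import torch
      from transformers import BertModel, BertTokenizer

      tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
      bert = BertModel.from_pretrained("bert-base-chinese")

      def sentence_to_semantic_vector(first_sentence: str) -> torch.Tensor:
          inputs = tokenizer(first_sentence, return_tensors="pt", truncation=True)
          with torch.no_grad():
              outputs = bert(**inputs)
          # One vector per sentence, built from the word-level (token-level) states.
          return outputs.last_hidden_state.mean(dim=1).squeeze(0)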
  • S40: After all the semantic feature vectors are input into the preset article semantic recognition model, obtain the second characteristic sentence output by the model; the second characteristic sentence includes a preset number of positions to be segmented, determined according to the contextual relationships of the unstructured medical knowledge text.
  • Understandably, the preset article semantic recognition model is an LSTM model, whose goal is to remember information over long spans so as to identify each complete sentence in the input text. The core processing of the LSTM model is carried out by three gates, namely the forget gate, the input gate and the output gate. When the LSTM model is combined with the context of the input text, a complete second characteristic sentence can be determined, and each second characteristic sentence forms two positions to be segmented.
  • S50: Call up the first code file of the medical source text to be processed, query the second characteristic sentence in the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file.
  • Understandably, the first code file is the background code file corresponding to the medical source text to be processed, and it can be called up through a scripting language.
  • Since the second characteristic sentence is converted from the unstructured medical knowledge text in the medical source text to be processed, the second characteristic sentence also has a text display position in the first code file of the medical source text (the text display position contains multiple second characteristic sentences). Specifically, the code corresponding to the text display position can be looked up in the first code file, and the characters corresponding to the second characteristic sentence are then identified at that text display position in the code in order to locate the second characteristic sentence.
  • A segmentation symbol can be understood as an html symbol. Two segmentation symbols are inserted at the two positions to be segmented corresponding to the text display position (that is, the text display position contains at least two positions to be segmented), so that the two segmentation symbols delimit one second characteristic sentence; the segmentation symbols include div symbols, the h1 to h6 heading symbols, and so on.
  • S60: Run the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text. Understandably, the second code file is the background code file that includes the segmentation symbols; to display the specific medical source text to be processed at this point, the second code file needs to be run.
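  • The lookup-and-insert mechanics are left to the scripting language; as an illustration only, the segmentation symbols could be inserted around each second characteristic sentence roughly as follows (h1 to h6 tags could be substituted for div where a heading is wanted, and real HTML would call for a proper parser rather than string replacement):

      def insert_segmentation_symbols(first_code_file: str,
                                      second_sentences: list[str]) -> str:
          """Wraps each second characteristic sentence found in the first code file
          in a div element, so that the opening and closing tags mark its two
          positions to be segmented."""
          second_code_file = first_code_file
          for sentence in second_sentences:
              # Assumption: the sentence occurs verbatim at its text display position.
              second_code_file = second_code_file.replace(
                  sentence, "<div>" + sentence + "</div>", 1)
          return second_code_file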
  • Further, after grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed, the method further includes:
  • detecting the unstructured medical knowledge text through a preset natural language processing model, marking the erroneous words in the unstructured text, and obtaining a marking result;
  • calling up the first code file of the medical source text to be processed, correcting the erroneous words in the first code file according to the marking result to obtain a third code file, and, after running the third code file, obtaining the corrected unstructured medical knowledge text.
  • Understandably, the preset natural language processing model can be an NLP model whose semantic recognition capability is used to mark repeated or misspelled words in the unstructured medical knowledge text, and the marking result is then used to correct the incorrectly written words in the first code file, where the correction includes deleting repeated words and fixing typos.
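  • A sketch of how the marking result could be applied to the first code file to obtain the third code file; the dictionary shape of the marking result is an assumption, since the patent only states that marked repeated words and typos are deleted or corrected:

      def correct_marked_words(first_code_file: str,
                               marking_result: dict[str, str]) -> str:
          """Keys are the marked (repeated or misspelled) words, values are their
          corrections; an empty string deletes a repeated word."""
          third_code_file = first_code_file
          for wrong, corrected in marking_result.items():
              third_code_file = third_code_file.replace(wrong, corrected)
          return third_code_file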
  • Further, the first characteristic sentences are stored in a blockchain, and the preset language recognition model is a bert model.
  • After the first characteristic sentences are input into the preset language recognition model, obtaining a semantic feature vector corresponding to each first characteristic sentence includes the steps detailed below, through which the semantic feature vector corresponding to the first characteristic sentence is obtained.
  • Understandably, this embodiment mainly uses the Attention mechanism in the bert model so that the model can focus its attention on the input first characteristic sentence. The Attention mechanism in this embodiment involves Query vectors, Key vectors and Value values; both the Query vectors and the Key vectors are derived from word vectors, and each word vector has a corresponding Value value. The essence of Attention can be described as a mapping from a query (Query) to a series of key-value (Key-Value) pairs.
  • Specifically, in this embodiment the first characteristic sentence is first input into the bert model, each word in the first characteristic sentence is looked up through the bert model and converted into a one-dimensional word vector, then one word vector of the first characteristic sentence is taken as the target Query vector and the other word vectors of the first characteristic sentence are taken as Key vectors, and the similarity between the Query vector and each Key vector is computed to obtain weight coefficients. Commonly used similarity functions include but are not limited to the dot product, concatenation and a perceptron. The weight coefficients are then normalized with a preset softmax function, and the normalized weight coefficients are used to perform a weighted summation over the Value values corresponding to the Query vector and the Key vectors, yielding the first enhanced semantic feature vector corresponding to the Query vector output by the Attention mechanism.
  • Finally, each Transformer Encoder built from the Attention mechanism performs data processing on the first enhanced semantic feature vector, where the data processing includes a residual connection (adding the word vector and the first enhanced semantic feature vector directly as the output), normalization of a layer of neural network nodes to zero mean and unit variance, and a linear transformation (the first enhanced semantic feature vector is linearly transformed to strengthen the expressive power of the bert model). After the second enhanced semantic feature vectors corresponding to all the word vectors are combined, the semantic feature vector corresponding to the first characteristic sentence is obtained.
  • This embodiment uses the bert model as the preset language recognition model for two purposes: first, it can learn the relationships between the first characteristic sentences, that is, connect the context; second, it obtains good sentence-level semantic representations (the second enhanced semantic feature vectors).
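  • A compact numerical sketch of the Query/Key/Value computation described above, using the dot product as the similarity function; the array shapes are assumptions, and a real bert model additionally scales the scores and uses multiple attention heads:

      import numpy as np

      def attention_vector(query: np.ndarray, keys: np.ndarray,
                           values: np.ndarray) -> np.ndarray:
          """Single-query attention: similarity scores between the Query vector and
          every Key vector are softmax-normalized into weight coefficients, which
          then weight the corresponding Value values."""
          scores = keys @ query                    # dot-product similarity
          weights = np.exp(scores - scores.max())
          weights /= weights.sum()                 # softmax normalization
          return weights @ values                  # first enhanced semantic feature vector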
  • The first characteristic sentences may also be stored in a node of a blockchain.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • A blockchain can include a blockchain underlying platform, a platform product service layer and an application service layer.
  • The decentralized, fully distributed DNS service provided by a blockchain can realize domain-name query and resolution through peer-to-peer data transmission between the nodes of the network, which can be used to ensure that the operating system and firmware of important infrastructure are not tampered with; it can also monitor the status and integrity of software, detect malicious tampering, and ensure that transmitted data have not been tampered with. Storing the first characteristic sentences in the blockchain therefore guarantees their privacy and security.
  • Further, the method also includes:
  • calling up a corresponding cascading style sheet according to a preset style format, and adding the cascading style sheet to the second code file.
  • Understandably, this embodiment mainly adds corresponding cascading style sheet settings, such as color, font size and borders, to the second code file to control the specific presentation format in the medical source text to be processed, for example the color and font-size properties and the box settings in CSS.
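  • As a small sketch of this step (the style rules and the injection point are assumptions), the preset cascading style sheet could be added to the second code file like this:

      PRESET_STYLE = (
          "<style>\n"
          "  div { color: #333333; font-size: 14px; border: 1px solid #dddddd; }\n"
          "</style>\n"
      )

      def add_style_sheet(second_code_file: str, style: str = PRESET_STYLE) -> str:
          # Inject the preset style before </head>; a <link> to an external CSS
          # file would work the same way.
          if "</head>" in second_code_file:
              return second_code_file.replace("</head>", style + "</head>", 1)
          return style + second_code_file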
  • Further, the preset article semantic recognition model is an LSTM model, and obtaining the second characteristic sentence output by the preset article semantic recognition model includes:
  • selecting the information to be discarded through the forget gate in the LSTM model;
  • selecting the required information from the semantic feature vectors through the input gate in the LSTM model and the discarded information;
  • outputting the second characteristic sentence through the output gate in the LSTM model and the required information.
  • Understandably, each gate designed in the LSTM model provides the ability to remove information from or add information to the cell state (which can be regarded as a semantic feature vector). Each gate consists of a sigmoid neural network layer and a pointwise multiplication operation; the sigmoid layer outputs a value between 0 and 1 describing how much of each component is allowed through, where 0 means nothing passes and 1 means everything passes.
  • The forget gate determines the information to be discarded from the cell state, where the discarded information is the subject corresponding to the previous semantic feature vector. The input gate updates the information stored in the cell state: specifically, the discarded information is removed from the semantic feature vector through the input gate, and the required information to be stored is determined from the remaining semantic feature vector. The output gate then determines the second characteristic sentence to be output, and the second characteristic sentence is output according to the required information determined by the input gate.
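  • For readers unfamiliar with these gates, a minimal numpy sketch of one LSTM step over a semantic feature vector is given below; the parameter layout is an assumption, and in practice a library LSTM (with a boundary classifier over its outputs) would be used rather than hand-written gates:

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def lstm_step(x, h_prev, c_prev, W, U, b):
          """One LSTM step: W, U and b each hold the parameters of the forget (f),
          input (i) and output (o) gates and of the candidate cell state (g)."""
          f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])  # forget gate: what to discard
          i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])  # input gate: what to store
          g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])  # candidate information
          c = f * c_prev + i * g                              # updated cell state
          o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])  # output gate
          h = o * np.tanh(c)                                  # output used downstream to
          return h, c                                         # decide the second sentence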
  • The above provides a medical text structuring method that uses models and segmentation symbols to replace the previous manual editing of the unstructured medical knowledge text in the medical source text to be processed, avoiding the high error rate and large time cost of manual editing and improving the efficiency of converting unstructured medical knowledge text into structured medical knowledge text.
  • This method can be applied to smart medical care to promote the construction of smart cities.
  • In one embodiment, a medical text structuring device is provided, and the device corresponds one-to-one to the medical text structuring method in the above embodiment.
  • the medical text structuring device includes a grabbing module 11, a splitting module 12, a first acquiring module 13, a second acquiring module 14, an inserting module 15 and a display module 16.
  • the detailed description of each functional module is as follows:
  • the grabbing module 11 is used to grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
  • the splitting module 12 is used to identify all punctuation marks in the unstructured medical knowledge text and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
  • the first acquiring module 13 is configured to obtain a semantic feature vector corresponding to each first characteristic sentence after the first characteristic sentences are input into the preset language recognition model;
  • the second acquiring module 14 is configured to obtain the second characteristic sentence output by the preset article semantic recognition model after all the semantic feature vectors are input into it; the second characteristic sentence includes a preset number of positions to be segmented, determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
  • the insertion module 15 is used to call up the first code file of the medical source text to be processed, query the second characteristic sentence in the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence, obtaining a second code file;
  • the display module 16 is configured to run the second code file so as to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
  • Further, the medical text structuring device also includes:
  • a marking module, used to detect the unstructured medical knowledge text through a preset natural language processing model, mark the erroneous words in the unstructured text, and obtain a marking result;
  • a running module, used to call up the first code file of the medical source text to be processed, correct the erroneous words in the first code file according to the marking result to obtain a third code file, and, after the third code file is run, obtain the corrected unstructured medical knowledge text.
  • Further, the preset language recognition model is a bert model, and the first acquiring module includes:
  • an input sub-module, used to query the word vector of each word in the first characteristic sentence through the bert model after the first characteristic sentence is input into the bert model;
  • a selection sub-module, configured to select one of the word vectors in the first characteristic sentence as the Query vector through the Attention mechanism in the bert model, and to use the other word vectors of the first characteristic sentence as the Key vectors;
  • a weighting operation sub-module, used to compute the similarity between the Query vector and each Key vector to obtain weight coefficients, and to perform a weighted operation on the Value values corresponding to the Query vector and the Key vectors using the weight coefficients, obtaining the first enhanced semantic feature vector corresponding to the Query vector output by the Attention mechanism;
  • a linear conversion sub-module, configured to perform a linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the bert model to obtain a second enhanced semantic feature vector;
  • a combination sub-module, used to combine the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
  • Further, the medical text structuring device also includes:
  • an adding module, used to call up a corresponding cascading style sheet according to a preset style format and add the cascading style sheet to the second code file.
  • Further, the preset article semantic recognition model is an LSTM model, and the second acquiring module includes:
  • a first selection sub-module, configured to select the information to be discarded through the forget gate in the LSTM model;
  • a second selection sub-module, configured to select the required information from the semantic feature vectors through the input gate in the LSTM model and the discarded information;
  • an output sub-module, configured to output the second characteristic sentence through the output gate in the LSTM model and the required information.
  • Each module in the above-mentioned medical text structuring device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • The above modules may be embedded in or independent of the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 4.
  • The computer device includes a processor, a memory, a network interface and a database connected through a system bus, where the processor of the computer device is used to provide computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and internal memory; the non-volatile storage medium stores an operating system, computer-readable instructions and a database.
  • The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium.
  • the database of the computer equipment is used to store multiple pieces of historical test data, and each piece of historical test data corresponds to a test problem record.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a medical text structuring method.
  • In one embodiment, a computer device is provided, which includes a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, where the processor, when executing the computer-readable instructions, implements the medical text structuring method described in the above embodiments.
  • In one embodiment, one or more readable storage media storing computer-readable instructions are provided.
  • The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media; the readable storage media store computer-readable instructions, and when the computer-readable instructions are executed by one or more processors, the one or more processors implement the medical text structuring method described in the above embodiments.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The present application relates to artificial intelligence technology applicable to the field of medical text processing, and in particular discloses a medical text structuring method and apparatus, a computer device, and a storage medium. The method comprises: acquiring unstructured medical knowledge text; splitting the unstructured text into a plurality of first characteristic sentences; inputting the first characteristic sentences into a preset language recognition model to obtain semantic feature vectors; inputting the semantic feature vectors into a preset article semantic recognition model to obtain the output second characteristic sentences; calling up a first code file of the medical source text to be processed; inserting segmentation symbols at the positions in the first code file corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file; and running the second code file so as to display, on the medical source text to be processed, structured medical knowledge text corresponding to the unstructured medical knowledge text. The present method improves the efficiency of conversion into structured medical knowledge text.

Description

Medical text structuring method, device, computer equipment and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 8, 2020, with application number 202010935255.2 and the invention title "Medical text structuring method, device, computer equipment and storage medium", the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the field of intelligent decision-making in artificial intelligence, and in particular to a medical text structuring method, device, computer equipment and storage medium.
Background
At present, a single medical source text contains a large amount of medical knowledge text, and this text involves many kinds of medical knowledge from the medical field. When this medical knowledge text needs to be displayed in an interface, it has to be edited manually to make it structured and easy to view. However, the inventors realized that the text format of medical knowledge text in a source text is usually uneven, and most of it is presented in an unstructured form; as a result, manual editing is error-prone, editing efficiency is low, and editing takes a great deal of time. In particular, when newly published medical knowledge text (such as new product instructions in the medical field) needs to be shown to users, the text must follow a specific structured format, for example correct segmentation and reasonable indentation. Producing structured medical text suitable for external display through manual editing is time-consuming and labor-intensive, so those skilled in the art urgently need a new technical solution to the above problems.
Summary of the Invention
Based on this, it is necessary to address the above technical problems by providing a medical text structuring method, device, computer equipment and storage medium, so as to avoid the problems of the high error rate and large time cost of manual editing and to improve the efficiency of converting unstructured medical knowledge text into structured medical knowledge text.
A medical text structuring method, including:
grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
after inputting the first characteristic sentences into a preset language recognition model, obtaining a semantic feature vector corresponding to each of the first characteristic sentences;
after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining the second characteristic sentence output by the preset article semantic recognition model, where the second characteristic sentence includes a preset number of positions to be segmented determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
calling up the first code file of the medical source text to be processed, querying the second characteristic sentence in the first code file, and inserting segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file;
running the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
A medical text structuring device, including:
a grabbing module, used to grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
a splitting module, used to identify all punctuation marks in the unstructured medical knowledge text and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
a first acquiring module, configured to obtain a semantic feature vector corresponding to each first characteristic sentence after the first characteristic sentences are input into the preset language recognition model;
a second acquiring module, configured to obtain the second characteristic sentence output by the preset article semantic recognition model after all the semantic feature vectors are input into it, where the second characteristic sentence includes a preset number of positions to be segmented determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
an insertion module, used to call up the first code file of the medical source text to be processed, query the second characteristic sentence in the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file;
a display module, used to run the second code file so as to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
after inputting the first characteristic sentences into a preset language recognition model, obtaining a semantic feature vector corresponding to each of the first characteristic sentences;
after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining the second characteristic sentence output by the preset article semantic recognition model, where the second characteristic sentence includes a preset number of positions to be segmented determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
calling up the first code file of the medical source text to be processed, querying the second characteristic sentence in the first code file, and inserting segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file;
running the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
after inputting the first characteristic sentences into a preset language recognition model, obtaining a semantic feature vector corresponding to each of the first characteristic sentences;
after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining the second characteristic sentence output by the preset article semantic recognition model, where the second characteristic sentence includes a preset number of positions to be segmented determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
calling up the first code file of the medical source text to be processed, querying the second characteristic sentence in the first code file, and inserting segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file;
running the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
The above medical text structuring method, device, computer equipment and storage medium use models and segmentation symbols to replace the previous manual editing of the unstructured medical knowledge text in the medical source text to be processed, avoiding the high error rate and large time cost of manual editing and improving the efficiency of converting unstructured medical knowledge text into structured medical knowledge text.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
FIG. 1 is a schematic diagram of an application environment of the medical text structuring method in an embodiment of the present application;
FIG. 2 is a flowchart of the medical text structuring method in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of the medical text structuring device in an embodiment of the present application;
FIG. 4 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The medical text structuring method provided in this application can be applied in the application environment shown in FIG. 1, in which a client communicates with a server through a network. The client may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a medical text structuring method is provided. The method is described by taking its application to the server in FIG. 1 as an example and includes the following steps:
S10: Grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed.
Understandably, the medical source text to be processed may refer to a webpage containing medical-domain text, and the unstructured medical knowledge text may include, but is not limited to, the formulations of various medical drugs on the webpage, descriptions of their various therapeutic functions, product instructions for medical drugs, and the like. Unstructured medical knowledge text refers to text without a fixed format, where the fixed format includes but is not limited to paragraph format, character format, indentation format and spacing format. Since in this embodiment the text in the medical source text to be processed is uploaded by different users, and the formats used by different users during editing are inconsistent, when different users upload whole passages to the same medical source text, the passage finally displayed suffers from inconsistent text formatting; moreover, the various input components or display components of the medical source text to be processed may be incompatible with the text, and copying a whole passage from a display component of one medical source text into an input component of another may turn previously structured medical knowledge text into unstructured medical knowledge text. Specifically, in this embodiment the unstructured medical knowledge text is grabbed from the medical source text to be processed by recognizing the text: after all the text in the display interface of the medical source text is recognized, the text selected by the user can be taken as the unstructured medical knowledge text, or an NLP model can be used, and when the NLP model recognizes that the text in the medical source text to be processed has several inconsistent formats or no format at all, that text is grabbed as the unstructured medical knowledge text.
S20: Identify all punctuation marks in the unstructured medical knowledge text, and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks.
Understandably, the punctuation marks in the unstructured medical knowledge text can be recognized by a punctuation-recognition component or by an NLP model, and the sentences in the not-yet-structured medical knowledge text are then segmented at the recognized punctuation marks to obtain the plurality of first characteristic sentences, where the punctuation marks used for splitting are symbols that end a complete sentence, such as periods, exclamation marks or question marks. In this embodiment the text is split into multiple first characteristic sentences, each representing the features of one complete sentence; these features provide the connection relationships of a complete sentence for the subsequent semantic recognition process and avoid mixed recognition across sentence boundaries.
S30: After the first characteristic sentences are input into the preset language recognition model, obtain a semantic feature vector corresponding to each first characteristic sentence.
Understandably, the preset language recognition model can be a bert model, where the bert model captures descriptions of the first characteristic sentence at the sentence level and at the level of each of its words; the goal of the bert model is to be trained on a large-scale unlabeled corpus so as to obtain a representation of the rich semantic information contained in the first characteristic sentence. The core of the bert model is the Transformer module, which is built with the Attention mechanism, and the Transformer modules so created can be assembled into the above bert model. This embodiment uses the word-to-sentence relationships in the bert model to obtain the semantic feature vector corresponding to each first characteristic sentence.
S40: After all the semantic feature vectors are input into the preset article semantic recognition model, obtain the second characteristic sentence output by the preset article semantic recognition model; the second characteristic sentence includes a preset number of positions to be segmented determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text.
Understandably, the preset article semantic recognition model is an LSTM model, whose goal is to remember information over long spans so as to identify each complete sentence in the input text. The core processing of the LSTM model is carried out by three gates, namely the forget gate, the input gate and the output gate; in addition, when the LSTM model is combined with the context of the input text, a complete second characteristic sentence can be determined, and each second characteristic sentence forms two positions to be segmented.
S50: Call up the first code file of the medical source text to be processed, query the second characteristic sentence in the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file.
Understandably, the first code file is the background code file corresponding to the medical source text to be processed, and it can be called up through a scripting language. Since the second characteristic sentence is converted from the unstructured medical knowledge text in the medical source text to be processed, the second characteristic sentence also has a text display position in the first code file of the medical source text (the text display position contains multiple second characteristic sentences); specifically, the code corresponding to the text display position can be looked up in the first code file, and the characters corresponding to the second characteristic sentence are then identified at that text display position in the code in order to locate the second characteristic sentence. A segmentation symbol can be understood as an html symbol: two segmentation symbols are inserted at the two positions to be segmented corresponding to the text display position (that is, the text display position contains at least two positions to be segmented), so that the two segmentation symbols delimit one second characteristic sentence, where the segmentation symbols include div symbols, the h1 to h6 heading symbols, and so on.
S60: Run the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
Understandably, the second code file is the background code file that includes the segmentation symbols; to display the specific medical source text to be processed at this point, the second code file needs to be run.
Further, after grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed, the method further includes:
detecting the unstructured medical knowledge text through a preset natural language processing model, marking the erroneous words in the unstructured text, and obtaining a marking result;
calling up the first code file of the medical source text to be processed, correcting the erroneous words in the first code file according to the marking result to obtain a third code file, and, after running the third code file, obtaining the corrected unstructured medical knowledge text.
Understandably, the preset natural language processing model can be an NLP model whose semantic recognition capability is used to mark repeated or misspelled words in the unstructured medical knowledge text, and the marking result is then used to correct the incorrectly written words in the first code file, where the correction includes deleting repeated words and fixing typos.
Further, the first characteristic sentences are stored in a blockchain, and the preset language recognition model is a bert model.
After the first characteristic sentences are input into the preset language recognition model, obtaining a semantic feature vector corresponding to each first characteristic sentence includes:
after the first characteristic sentence is input into the bert model, querying the word vector of each word in the first characteristic sentence through the bert model;
selecting one of the word vectors in the first characteristic sentence as the Query vector through the Attention mechanism in the bert model, and using the other word vectors of the first characteristic sentence as the Key vectors;
computing the similarity between the Query vector and each Key vector to obtain weight coefficients, and performing a weighted operation on the Value values corresponding to the Query vector and the Key vectors using the weight coefficients, to obtain the first enhanced semantic feature vector corresponding to the Query vector output by the Attention mechanism;
performing a linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the bert model to obtain a second enhanced semantic feature vector;
combining the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
Understandably, this embodiment mainly uses the Attention mechanism in the BERT model to let the model focus its attention on the input first characteristic sentence. The Attention mechanism in this embodiment involves a Query vector, Key vectors and Value values, where both the Query vector and the Key vectors come from the word vectors, and each word vector has a corresponding Value; the essence of Attention can be described as a mapping from a Query to a series of Key-Value pairs. Specifically, in this embodiment the first characteristic sentence is first input into the BERT model, each word in the first characteristic sentence is queried through the BERT model, and each queried word is converted into a one-dimensional word vector. One word vector of the first characteristic sentence is then taken as the target Query vector, and the other word vectors in the first characteristic sentence are taken as Key vectors. Next, a similarity calculation is performed between the Query vector and each Key vector to obtain weight coefficients, where commonly used similarity functions include but are not limited to dot product, concatenation and perceptron; a preset softmax function is used to normalize the obtained weight coefficients, and a weighted sum is computed over the normalized weight coefficients and the Value values corresponding to the Query vector and the Key vectors, yielding the first enhanced semantic feature vector corresponding to the Query vector that the Attention mechanism finally outputs. Finally, each Transformer Encoder built on the Attention mechanism performs data processing on the first enhanced semantic feature vector, where the data processing includes a residual connection (adding the word vector and the first enhanced semantic feature vector directly, as the final output), zero-mean unit-variance normalization of the nodes of a given neural network layer, and linear transformation (linearly transforming the first enhanced semantic feature vector to enhance the expressive ability of the BERT model). After the second enhanced semantic feature vectors corresponding to all the word vectors are combined, the semantic feature vector corresponding to the first characteristic sentence is obtained. This embodiment uses the BERT model as the preset language recognition model, which achieves two purposes: 1. the relationships between the first characteristic sentences can be learned, that is, the context is taken into account; 2. sentence-level semantic representations (the second enhanced semantic feature vectors) are obtained well.
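To make the weighting described above concrete, the sketch below computes a single attention step with NumPy; it is a simplified illustration under stated assumptions (a single head, dot-product similarity, toy random data), not the BERT model of this embodiment.

```python
# Simplified single-head attention step, assuming scaled dot-product similarity
# and softmax normalization as described above. Shapes and values are toy data.

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(query: np.ndarray, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Weight the Value vectors by Query-Key similarity (scaled dot product)."""
    scores = keys @ query / np.sqrt(query.shape[0])   # similarity of the Query to each Key
    weights = softmax(scores)                          # normalized weight coefficients
    return weights @ values                            # first enhanced semantic feature vector

dim = 4
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(5, dim))               # one row per word in the sentence
query = word_vectors[0]                                # one word vector taken as the Query
keys = word_vectors[1:]                                # remaining word vectors as Keys
values = keys                                          # toy assumption: Value == Key
print(attention_step(query, keys, values))
```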
In addition, it should be emphasized that, in order to further ensure the privacy and security of the above first characteristic sentences, the first characteristic sentences may also be stored in a node of a blockchain. The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated and linked using cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like. The decentralized, fully distributed DNS service provided by the blockchain can realize domain name query and resolution through point-to-point data transmission between the nodes in the network; it can be used to ensure that the operating system and firmware of an important piece of infrastructure have not been tampered with, to monitor the status and integrity of software, to detect improper tampering, and to ensure that transmitted data has not been tampered with. Storing the first characteristic sentences in a blockchain can therefore ensure their privacy and security.
Further, after inserting the segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentences to obtain the second code file, the method further includes:
Calling up the corresponding cascading style sheet according to a preset style format, and adding the cascading style sheet to the second code file.
Understandably, this embodiment mainly adds cascading style sheet rules such as color, font size and box styling to the second code file, so that a specific format is presented on the medical source text to be processed, for example the color, font-size and box properties in CSS.
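As one possible illustration, and assuming a simple preset style format rather than any specific style used by the embodiment, the sketch below nests a CSS style block ahead of the structured text in the second code file; the dictionary of preset styles and the function name are hypothetical.

```python
# Hypothetical sketch: attach a preset style format to the second code file by
# prepending a <style> block (cascading style sheet) to the structured text.

PRESET_STYLES = {
    "default": "div { color: #333; font-size: 14px; box-sizing: border-box; }",
}

def add_style_sheet(second_code_file: str, style_name: str = "default") -> str:
    """Return the code file with the selected cascading style sheet prepended."""
    css = PRESET_STYLES.get(style_name, "")
    return f"<style>{css}</style>\n{second_code_file}" if css else second_code_file

print(add_style_sheet("<div>发热、咳嗽</div><div>建议多饮水</div>"))
```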
Further, the preset article semantic recognition model is an LSTM model.
After inputting all the semantic feature vectors into the preset article semantic recognition model, the method further includes:
Selecting the information to be discarded through the forget gate in the LSTM model;
Selecting the required information from the semantic feature vectors through the input gate in the LSTM model and the discarded information;
Outputting the second characteristic sentence through the output gate in the LSTM model and the required information.
Understandably, the LSTM model is a kind of gated RNN. The key to the LSTM model is the cell state; accordingly, the gates designed in the LSTM model provide the ability to remove information from, or add information to, the cell state (which here can be regarded as the semantic feature vector). Each gate contains a sigmoid neural network layer and a pointwise multiplication operation; the sigmoid layer outputs values between 0 and 1 that describe how much of each component is allowed through, where 0 means nothing passes and 1 means everything passes. The forget gate decides which information in the cell state is discarded; the discarded information is the subject corresponding to the previous semantic feature vector. The input gate updates the information stored in the cell state: it first discards the information to be discarded from the semantic feature vector and then determines, from the remaining semantic feature vector, the required information to be updated. The output gate decides the second characteristic sentence to output, outputting it according to the required information determined by the input gate.
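For reference, the following sketch runs one step of a standard LSTM cell with the three gates described above; the weight matrices are random placeholders and the whole fragment is only an assumption-level illustration of the gate mechanics, not the trained model of this embodiment.

```python
# Sketch of a single LSTM step with forget, input and output gates (standard
# LSTM cell equations). Weights are random placeholders; x stands in for one
# semantic feature vector.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate: what to discard
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate: what to keep
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate cell content
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate: what to emit
    c = f * c_prev + i * g                               # updated cell state
    h = o * np.tanh(c)                                   # output used downstream
    return h, c

dim = 8
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(dim, dim)) for k in "figo"}
U = {k: rng.normal(size=(dim, dim)) for k in "figo"}
b = {k: np.zeros(dim) for k in "figo"}
h, c = lstm_step(rng.normal(size=dim), np.zeros(dim), np.zeros(dim), W, U, b)
```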
In summary, the above provides a medical text structuring method that uses models and segmentation symbols to replace the previous manual editing of unstructured medical knowledge text in the medical source text to be processed, avoiding the high error rate and large time cost of manual editing and improving the efficiency of converting unstructured medical knowledge text into structured medical knowledge text. The method can be applied to smart healthcare, thereby promoting the construction of smart cities.
It should be understood that the size of the sequence number of each step in the foregoing embodiments does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
In one embodiment, a medical text structuring device is provided, and the medical text structuring device corresponds one-to-one to the medical text structuring method in the above embodiments. As shown in FIG. 3, the medical text structuring device includes a grabbing module 11, a splitting module 12, a first acquisition module 13, a second acquisition module 14, an insertion module 15 and a display module 16. The functional modules are described in detail as follows:
The grabbing module 11 is used to grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
The splitting module 12 is used to identify all punctuation marks in the unstructured medical knowledge text and split the unstructured text into multiple first characteristic sentences according to the punctuation marks;
The first acquisition module 13 is used to obtain, after the first characteristic sentences are input into the preset language recognition model, one semantic feature vector corresponding to each first characteristic sentence;
The second acquisition module 14 is used to obtain, after all the semantic feature vectors are input into the preset article semantic recognition model, the second characteristic sentences output by the preset article semantic recognition model, where the second characteristic sentences contain a preset number of positions to be segmented determined by the preset article semantic recognition model according to the context relationships of the unstructured medical knowledge text;
The insertion module 15 is used to call up the first code file of the medical source text to be processed, query the second characteristic sentences from the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file;
The display module 16 is used to run the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
Further, the medical text structuring device further includes:
A marking module, used to detect the unstructured medical knowledge text through a preset natural language processing model, mark the erroneous words in the unstructured medical knowledge text, and obtain a marking result;
A running module, used to call up the first code file of the medical source text to be processed, correct the erroneous words in the first code file according to the marking result to obtain a third code file, and run the third code file to obtain the corrected unstructured medical knowledge text.
Further, the preset language recognition model is a BERT model, and the first acquisition module includes:
An input sub-module, used to query, after the first characteristic sentence is input into the BERT model, the word vector of each word in the first characteristic sentence through the BERT model;
A selection sub-module, used to select one of the word vectors of the first characteristic sentence as the Query vector through the Attention mechanism in the BERT model, and use the other word vectors of the first characteristic sentence as Key vectors;
A weighting sub-module, used to calculate the similarity between the Query vector and each Key vector to obtain weight coefficients, and perform a weighted operation on the Value values corresponding to the Query vector and the Key vectors with the weight coefficients, so that the Attention mechanism outputs a first enhanced semantic feature vector corresponding to the Query vector;
A linear transformation sub-module, used to perform linear transformation on the first enhanced semantic feature vector through the multiple stacked Transformer Encoders in the BERT model to obtain second enhanced semantic feature vectors;
A combination sub-module, used to combine the second enhanced semantic feature vectors corresponding to the word vectors of all the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
Further, the medical text structuring device further includes:
An adding module, used to call up the corresponding cascading style sheet according to a preset style format and add the cascading style sheet to the second code file.
Further, the preset article semantic recognition model is an LSTM model, and the second acquisition module includes:
A first selection sub-module, used to select the information to be discarded through the forget gate in the LSTM model;
A second selection sub-module, used to select the required information from the semantic feature vectors through the input gate in the LSTM model and the discarded information;
An output sub-module, used to output the second characteristic sentence through the output gate in the LSTM model and the required information.
For the specific limitations of the medical text structuring device, reference may be made to the limitations of the medical text structuring method above, which will not be repeated here. Each module in the above medical text structuring device may be implemented in whole or in part by software, hardware or a combination thereof. The above modules may be embedded in, or independent of, the processor of a computer device in the form of hardware, or stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 4. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory; the readable storage medium stores an operating system, computer-readable instructions and a database, and the internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the readable storage medium. The database of the computer device is used to store multiple pieces of historical test data, each of which corresponds to a test problem record. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions, when executed by the processor, implement a medical text structuring method.
In one embodiment, a computer device is provided, including a memory, a processor and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the medical text structuring method described in the above embodiments when executing the computer-readable instructions.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media; the readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the medical text structuring method described in the above embodiments.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware, and that the computer-readable instructions can be stored in a non-volatile computer-readable storage medium or a volatile readable storage medium; when executed, the computer-readable instructions may include the processes of the above method embodiments. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the division of the above functional units and modules is used as an example; in practical applications, the above functions can be allocated to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and should all be included within the scope of protection of the present application.

Claims (20)

  1. A medical text structuring method, comprising:
    grabbing an entire paragraph of unstructured medical knowledge text in a medical source text to be processed;
    identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
    after inputting the first characteristic sentences into a preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences;
    after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining second characteristic sentences output by the preset article semantic recognition model, wherein the second characteristic sentences contain a preset number of positions to be segmented determined by the preset article semantic recognition model according to context relationships of the unstructured medical knowledge text;
    calling up a first code file of the medical source text to be processed, querying the second characteristic sentences from the first code file, and inserting segmentation symbols in the first code file at positions corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file;
    running the second code file to display, on the medical source text to be processed, structured medical knowledge text corresponding to the unstructured medical knowledge text.
  2. The medical text structuring method according to claim 1, wherein, after grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed, the method further comprises:
    detecting the unstructured medical knowledge text through a preset natural language processing model, marking erroneous words in the unstructured medical knowledge text, and obtaining a marking result;
    calling up the first code file of the medical source text to be processed, correcting the erroneous words in the first code file according to the marking result to obtain a third code file, and running the third code file to obtain corrected unstructured medical knowledge text.
  3. The medical text structuring method according to claim 1, wherein the preset language recognition model is a BERT model;
    after inputting the first characteristic sentences into the preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences comprises:
    after inputting the first characteristic sentence into the BERT model, querying a word vector of each word in the first characteristic sentence through the BERT model;
    selecting one of the word vectors of the first characteristic sentence as a Query vector through an Attention mechanism in the BERT model, and using the other word vectors of the first characteristic sentence as Key vectors;
    calculating similarity between the Query vector and each of the Key vectors to obtain weight coefficients, and performing a weighted operation on Value values corresponding to the Query vector and the Key vectors with the weight coefficients, so that the Attention mechanism outputs a first enhanced semantic feature vector corresponding to the Query vector;
    performing linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the BERT model to obtain second enhanced semantic feature vectors;
    combining the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
  4. The medical text structuring method according to claim 1, wherein, after inserting the segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentences to obtain the second code file, the method further comprises:
    calling up a corresponding cascading style sheet according to a preset style format, and nesting the cascading style sheet into the second code file.
  5. The medical text structuring method according to claim 1, wherein the preset article semantic recognition model is an LSTM model;
    after inputting all the semantic feature vectors into the preset article semantic recognition model, the method further comprises:
    selecting information to be discarded through a forget gate in the LSTM model;
    selecting required information from the semantic feature vectors through an input gate in the LSTM model and the discarded information;
    outputting the second characteristic sentence through an output gate in the LSTM model and the required information.
  6. A medical text structuring device, comprising:
    a grabbing module, used to grab an entire paragraph of unstructured medical knowledge text in a medical source text to be processed;
    a splitting module, used to identify all punctuation marks in the unstructured medical knowledge text and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
    a first acquisition module, used to obtain, after the first characteristic sentences are input into a preset language recognition model, one semantic feature vector corresponding to each of the first characteristic sentences;
    a second acquisition module, used to obtain, after all the semantic feature vectors are input into a preset article semantic recognition model, second characteristic sentences output by the preset article semantic recognition model, wherein the second characteristic sentences contain a preset number of positions to be segmented determined by the preset article semantic recognition model according to context relationships of the unstructured medical knowledge text;
    an insertion module, used to call up a first code file of the medical source text to be processed, query the second characteristic sentences from the first code file, and insert segmentation symbols in the first code file at positions corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file;
    a display module, used to run the second code file to display, on the medical source text to be processed, structured medical knowledge text corresponding to the unstructured medical knowledge text.
  7. The medical text structuring device according to claim 6, wherein the medical text structuring device further comprises:
    a marking module, used to detect the unstructured medical knowledge text through a preset natural language processing model, mark erroneous words in the unstructured medical knowledge text, and obtain a marking result;
    a running module, used to call up the first code file of the medical source text to be processed, correct the erroneous words in the first code file according to the marking result to obtain a third code file, and run the third code file to obtain corrected unstructured medical knowledge text.
  8. The medical text structuring device according to claim 6, wherein the preset language recognition model is a BERT model, and the first acquisition module comprises:
    an input sub-module, used to query, after the first characteristic sentence is input into the BERT model, a word vector of each word in the first characteristic sentence through the BERT model;
    a selection sub-module, used to select one of the word vectors of the first characteristic sentence as a Query vector through an Attention mechanism in the BERT model, and use the other word vectors of the first characteristic sentence as Key vectors;
    a weighting sub-module, used to calculate similarity between the Query vector and each of the Key vectors to obtain weight coefficients, and perform a weighted operation on Value values corresponding to the Query vector and the Key vectors with the weight coefficients, so that the Attention mechanism outputs a first enhanced semantic feature vector corresponding to the Query vector;
    a linear transformation sub-module, used to perform linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the BERT model to obtain second enhanced semantic feature vectors;
    a combination sub-module, used to combine the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
  9. The medical text structuring device according to claim 6, wherein the medical text structuring device further comprises:
    an adding module, used to call up a corresponding cascading style sheet according to a preset style format and add the cascading style sheet to the second code file.
  10. The medical text structuring device according to claim 6, wherein the preset article semantic recognition model is an LSTM model, and the second acquisition module comprises:
    a first selection sub-module, used to select information to be discarded through a forget gate in the LSTM model;
    a second selection sub-module, used to select required information from the semantic feature vectors through an input gate in the LSTM model and the discarded information;
    an output sub-module, used to output the second characteristic sentence through an output gate in the LSTM model and the required information.
  11. A computer device, comprising a memory, a processor and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    grabbing an entire paragraph of unstructured medical knowledge text in a medical source text to be processed;
    identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
    after inputting the first characteristic sentences into a preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences;
    after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining second characteristic sentences output by the preset article semantic recognition model, wherein the second characteristic sentences contain a preset number of positions to be segmented determined by the preset article semantic recognition model according to context relationships of the unstructured medical knowledge text;
    calling up a first code file of the medical source text to be processed, querying the second characteristic sentences from the first code file, and inserting segmentation symbols in the first code file at positions corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file;
    running the second code file to display, on the medical source text to be processed, structured medical knowledge text corresponding to the unstructured medical knowledge text.
  12. The computer device according to claim 11, wherein, after grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed, the processor further implements the following steps when executing the computer-readable instructions:
    detecting the unstructured medical knowledge text through a preset natural language processing model, marking erroneous words in the unstructured medical knowledge text, and obtaining a marking result;
    calling up the first code file of the medical source text to be processed, correcting the erroneous words in the first code file according to the marking result to obtain a third code file, and running the third code file to obtain corrected unstructured medical knowledge text.
  13. The computer device according to claim 11, wherein the preset language recognition model is a BERT model;
    after inputting the first characteristic sentences into the preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences comprises:
    after inputting the first characteristic sentence into the BERT model, querying a word vector of each word in the first characteristic sentence through the BERT model;
    selecting one of the word vectors of the first characteristic sentence as a Query vector through an Attention mechanism in the BERT model, and using the other word vectors of the first characteristic sentence as Key vectors;
    calculating similarity between the Query vector and each of the Key vectors to obtain weight coefficients, and performing a weighted operation on Value values corresponding to the Query vector and the Key vectors with the weight coefficients, so that the Attention mechanism outputs a first enhanced semantic feature vector corresponding to the Query vector;
    performing linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the BERT model to obtain second enhanced semantic feature vectors;
    combining the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
  14. The computer device according to claim 11, wherein, after inserting the segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentences to obtain the second code file, the processor further implements the following step when executing the computer-readable instructions:
    calling up a corresponding cascading style sheet according to a preset style format, and nesting the cascading style sheet into the second code file.
  15. The computer device according to claim 11, wherein the preset article semantic recognition model is an LSTM model;
    after inputting all the semantic feature vectors into the preset article semantic recognition model, the processor further implements the following steps when executing the computer-readable instructions:
    selecting information to be discarded through a forget gate in the LSTM model;
    selecting required information from the semantic feature vectors through an input gate in the LSTM model and the discarded information;
    outputting the second characteristic sentence through an output gate in the LSTM model and the required information.
  16. One or more readable storage media storing computer-readable instructions, wherein, when the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps:
    grabbing an entire paragraph of unstructured medical knowledge text in a medical source text to be processed;
    identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
    after inputting the first characteristic sentences into a preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences;
    after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining second characteristic sentences output by the preset article semantic recognition model, wherein the second characteristic sentences contain a preset number of positions to be segmented determined by the preset article semantic recognition model according to context relationships of the unstructured medical knowledge text;
    calling up a first code file of the medical source text to be processed, querying the second characteristic sentences from the first code file, and inserting segmentation symbols in the first code file at positions corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file;
    running the second code file to display, on the medical source text to be processed, structured medical knowledge text corresponding to the unstructured medical knowledge text.
  17. The readable storage media according to claim 16, wherein, after grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further execute the following steps:
    detecting the unstructured medical knowledge text through a preset natural language processing model, marking erroneous words in the unstructured medical knowledge text, and obtaining a marking result;
    calling up the first code file of the medical source text to be processed, correcting the erroneous words in the first code file according to the marking result to obtain a third code file, and running the third code file to obtain corrected unstructured medical knowledge text.
  18. The readable storage media according to claim 16, wherein the preset language recognition model is a BERT model;
    after inputting the first characteristic sentences into the preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences comprises:
    after inputting the first characteristic sentence into the BERT model, querying a word vector of each word in the first characteristic sentence through the BERT model;
    selecting one of the word vectors of the first characteristic sentence as a Query vector through an Attention mechanism in the BERT model, and using the other word vectors of the first characteristic sentence as Key vectors;
    calculating similarity between the Query vector and each of the Key vectors to obtain weight coefficients, and performing a weighted operation on Value values corresponding to the Query vector and the Key vectors with the weight coefficients, so that the Attention mechanism outputs a first enhanced semantic feature vector corresponding to the Query vector;
    performing linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the BERT model to obtain second enhanced semantic feature vectors;
    combining the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
  19. The readable storage media according to claim 16, wherein, after inserting the segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentences to obtain the second code file, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further execute the following step:
    calling up a corresponding cascading style sheet according to a preset style format, and nesting the cascading style sheet into the second code file.
  20. The readable storage media according to claim 16, wherein the preset article semantic recognition model is an LSTM model;
    after inputting all the semantic feature vectors into the preset article semantic recognition model, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further execute the following steps:
    selecting information to be discarded through a forget gate in the LSTM model;
    selecting required information from the semantic feature vectors through an input gate in the LSTM model and the discarded information;
    outputting the second characteristic sentence through an output gate in the LSTM model and the required information.
PCT/CN2020/124215 2020-09-08 2020-10-28 Medical text structuring method and apparatus, computer device and storage medium WO2021164301A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010935255.2A CN112016274B (en) 2020-09-08 2020-09-08 Medical text structuring method, device, computer equipment and storage medium
CN202010935255.2 2020-09-08

Publications (1)

Publication Number Publication Date
WO2021164301A1 true WO2021164301A1 (en) 2021-08-26

Family

ID=73516342

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124215 WO2021164301A1 (en) 2020-09-08 2020-10-28 Medical text structuring method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112016274B (en)
WO (1) WO2021164301A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034204A (en) * 2022-05-12 2022-09-09 浙江大学 Method for generating structured medical text, computer device, storage medium and program product
CN116882496A (en) * 2023-09-07 2023-10-13 中南大学湘雅医院 Medical knowledge base construction method for multistage logic reasoning

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016274B (en) * 2020-09-08 2024-03-08 平安科技(深圳)有限公司 Medical text structuring method, device, computer equipment and storage medium
CN113138773B (en) * 2021-04-19 2024-04-16 杭州科技职业技术学院 Cloud computing distributed service clustering method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032648A (en) * 2019-03-19 2019-07-19 微医云(杭州)控股有限公司 A kind of case history structuring analytic method based on medical domain entity
CN110222654A (en) * 2019-06-10 2019-09-10 北京百度网讯科技有限公司 Text segmenting method, device, equipment and storage medium
CN111191456A (en) * 2018-11-15 2020-05-22 零氪科技(天津)有限公司 Method for identifying text segmentation by using sequence label
CN112016274A (en) * 2020-09-08 2020-12-01 平安科技(深圳)有限公司 Medical text structuring method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032739B (en) * 2019-04-18 2021-07-13 清华大学 Method and system for extracting named entities of Chinese electronic medical record

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191456A (en) * 2018-11-15 2020-05-22 零氪科技(天津)有限公司 Method for identifying text segmentation by using sequence label
CN110032648A (en) * 2019-03-19 2019-07-19 微医云(杭州)控股有限公司 A kind of case history structuring analytic method based on medical domain entity
CN110222654A (en) * 2019-06-10 2019-09-10 北京百度网讯科技有限公司 Text segmenting method, device, equipment and storage medium
CN112016274A (en) * 2020-09-08 2020-12-01 平安科技(深圳)有限公司 Medical text structuring method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, XIAOLU; MAO, XUEMIN: "Applications of Extensible Markup Language (XML) in Medical Cases", COMPUTER KNOWLEDGE AND TECHNOLOGY, vol. 8, no. 25, 5 September 2012 (2012-09-05), pages 5952 - 5954,5973, XP055838727 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034204A (en) * 2022-05-12 2022-09-09 浙江大学 Method for generating structured medical text, computer device, storage medium and program product
CN116882496A (en) * 2023-09-07 2023-10-13 中南大学湘雅医院 Medical knowledge base construction method for multistage logic reasoning
CN116882496B (en) * 2023-09-07 2023-12-05 中南大学湘雅医院 Medical knowledge base construction method for multistage logic reasoning

Also Published As

Publication number Publication date
CN112016274B (en) 2024-03-08
CN112016274A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
WO2021135910A1 (en) Machine reading comprehension-based information extraction method and related device
WO2021164301A1 (en) Medical text structuring method and apparatus, computer device and storage medium
US11636264B2 (en) Stylistic text rewriting for a target author
WO2021212749A1 (en) Method and apparatus for labelling named entity, computer device, and storage medium
US20210390873A1 (en) Deep knowledge tracing with transformers
WO2022142011A1 (en) Method and device for address recognition, computer device, and storage medium
US11914968B2 (en) Official document processing method, device, computer equipment and storage medium
WO2021114620A1 (en) Medical-record quality control method, apparatus, computer device, and storage medium
CN109977014B (en) Block chain-based code error identification method, device, equipment and storage medium
EP4006909B1 (en) Method, apparatus and device for quality control and storage medium
WO2022088671A1 (en) Automated question answering method and apparatus, device, and storage medium
WO2021218023A1 (en) Emotion determining method and apparatus for multiple rounds of questions and answers, computer device, and storage medium
CN110598210B (en) Entity recognition model training, entity recognition method, entity recognition device, entity recognition equipment and medium
US11615247B1 (en) Labeling method and apparatus for named entity recognition of legal instrument
JP2020135456A (en) Generation device, learning device, generation method and program
CN112036172A (en) Entity identification method and device based on abbreviated data of model and computer equipment
CN116821373A (en) Map-based prompt recommendation method, device, equipment and medium
CN112582073B (en) Medical information acquisition method, device, electronic equipment and medium
CN113283231B (en) Method for acquiring signature bit, setting system, signature system and storage medium
CN114385694A (en) Data processing method and device, computer equipment and storage medium
CN108932225A (en) For natural language demand to be converted into the method and system of semantic modeling language statement
CN116796730A (en) Text error correction method, device, equipment and storage medium based on artificial intelligence
Rohit et al. System for Enhancing Accuracy of Noisy Text using Deep Network Language Models
CN110826325A (en) Language model pre-training method and system based on confrontation training and electronic equipment
CN113434652B (en) Intelligent question-answering method, intelligent question-answering device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20920238

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20920238

Country of ref document: EP

Kind code of ref document: A1