WO2021164301A1 - Medical text structuring method and apparatus, computer device and storage medium - Google Patents

Medical text structuring method and apparatus, computer device and storage medium

Info

Publication number
WO2021164301A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
code file
medical
sentence
characteristic
Application number
PCT/CN2020/124215
Other languages
French (fr)
Chinese (zh)
Inventor
朱威
何义龙
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021164301A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to the field of intelligent decision-making in artificial intelligence, and in particular to a method, device, computer equipment, and storage medium for structuring medical text.
  • At present, a single medical source text contains a large amount of medical knowledge text, and this text involves many kinds of medical knowledge from the medical field.
  • When this medical knowledge text needs to be displayed in an interface, it has to be edited manually.
  • The editing is meant to make the text structured and easy to view, but the inventors realized that the text format of medical knowledge text in a source text is usually uneven, and most of it is presented in an unstructured form; as a result, manual editing is error-prone, editing efficiency is low, and editing takes a great deal of time.
  • In particular, when newly published medical knowledge text (such as new product instructions in the medical field) needs to be shown to users, the text must follow a specific structured format, for example correct segmentation and reasonable indentation. Producing structured medical text suitable for external display through manual editing is time-consuming and labor-intensive, so those skilled in the art urgently need a new technical solution to the above problems.
  • A medical text structuring method, including:
  • obtaining the second characteristic sentence output by the preset article semantic recognition model;
  • the second characteristic sentence includes a preset number of positions to be segmented, determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
  • A medical text structuring device, including:
  • a grabbing module, used to grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
  • a splitting module, used to identify all punctuation marks in the unstructured medical knowledge text and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
  • a first acquiring module, configured to obtain a semantic feature vector corresponding to each first characteristic sentence after the first characteristic sentences are input into the preset language recognition model;
  • a second acquiring module, configured to obtain the second characteristic sentence output by the preset article semantic recognition model after all the semantic feature vectors are input into it; the second characteristic sentence includes a preset number of positions to be segmented, determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
  • an insertion module, used to call up the first code file of the medical source text to be processed, query the second characteristic sentence in the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence, obtaining a second code file;
  • a display module, used to run the second code file so as to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
  • A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • obtaining the second characteristic sentence output by the preset article semantic recognition model;
  • the second characteristic sentence includes a preset number of positions to be segmented, determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
  • One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • obtaining the second characteristic sentence output by the preset article semantic recognition model;
  • the second characteristic sentence includes a preset number of positions to be segmented, determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
  • The above medical text structuring method, device, computer device and storage medium use models and segmentation symbols to replace the previous manual editing of the unstructured medical knowledge text in the medical source text to be processed, avoiding the high error rate and large time cost of manual editing and improving the efficiency of converting unstructured medical knowledge text into structured medical knowledge text.
  • FIG. 1 is a schematic diagram of an application environment of the medical text structuring method in an embodiment of the present application;
  • FIG. 2 is a flowchart of the medical text structuring method in an embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of the medical text structuring device in an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a computer device in an embodiment of the present application.
  • the medical text structuring method provided in this application can be applied in the application environment as shown in Fig. 1, in which the client communicates with the server through the network.
  • the client can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • In one embodiment, as shown in FIG. 2, a medical text structuring method is provided.
  • The method is described by taking its application to the server in FIG. 1 as an example and includes the following steps:
  • S10: Grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed.
  • Understandably, the medical source text to be processed may refer to a webpage containing medical-domain text, and the unstructured medical knowledge text may include, but is not limited to, the formulations of various medical drugs on the webpage, descriptions of their various therapeutic functions, product instructions for medical drugs, and the like.
  • Unstructured medical knowledge text refers to text without a fixed format, where the fixed format includes but is not limited to paragraph format, character format, indentation format and spacing format. In this embodiment the text in the medical source text to be processed is uploaded by different users, and the formats used by different users during editing are inconsistent, so when different users upload whole passages to the same medical source text, the passage finally displayed suffers from inconsistent text formatting.
  • In addition, the various input components or display components of the medical source text to be processed may be incompatible with the text, and copying a whole passage from a display component of one medical source text into an input component of another may turn previously structured medical knowledge text into unstructured medical knowledge text.
  • Specifically, in this embodiment the unstructured medical knowledge text is grabbed from the medical source text to be processed by recognizing the text: after all the text in the display interface of the medical source text is recognized, the text selected by the user can be taken as the unstructured medical knowledge text, or an NLP model can be used, and when the NLP model recognizes that the text in the medical source text to be processed has several inconsistent formats or no format at all, that text is grabbed as the unstructured medical knowledge text.
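  • The patent does not tie this grabbing step to any particular implementation; as a rough sketch only (the class and function names below are hypothetical), the visible text of a page's code file could be collected into one unstructured block with the Python standard library:

      from html.parser import HTMLParser

      class TextGrabber(HTMLParser):
          """Collects the visible text of a page so it can be treated as one
          unstructured block of medical knowledge text."""
          def __init__(self):
              super().__init__()
              self.chunks = []

          def handle_data(self, data):
              text = data.strip()
              if text:
                  self.chunks.append(text)

      def grab_unstructured_text(first_code_file_html: str) -> str:
          # Concatenating every text node deliberately ignores any original
          # paragraph structure, which is what makes the result "unstructured".
          parser = TextGrabber()
          parser.feed(first_code_file_html)
          return "".join(parser.chunks)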
  • S20: Identify all punctuation marks in the unstructured medical knowledge text, and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks.
  • Understandably, the punctuation marks in the unstructured medical knowledge text can be recognized by a punctuation-recognition component or by an NLP model, and the sentences in the not-yet-structured medical knowledge text are then segmented at the recognized punctuation marks to obtain the plurality of first characteristic sentences, where the punctuation marks used for splitting are symbols that end a complete sentence, such as periods, exclamation marks or question marks.
  • In this embodiment the text is split into multiple first characteristic sentences, each representing the features of one complete sentence. These features provide the connection relationships of a complete sentence for the subsequent semantic recognition process and avoid mixed recognition across sentence boundaries.
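  • A minimal sketch of this splitting step, assuming a simple regular expression over sentence-ending punctuation is enough (the patent equally allows a dedicated punctuation-recognition component or an NLP model):

      import re

      # Sentence-ending punctuation (Chinese and Western full stops, question marks
      # and exclamation marks), i.e. the symbols that can split a complete sentence.
      SENTENCE_END = re.compile(r"(?<=[。！？.!?])")

      def split_into_first_sentences(unstructured_text: str) -> list[str]:
          # Every non-empty piece becomes one "first characteristic sentence".
          return [s.strip() for s in SENTENCE_END.split(unstructured_text) if s.strip()]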
  • S30: After the first characteristic sentences are input into the preset language recognition model, obtain a semantic feature vector corresponding to each first characteristic sentence.
  • Understandably, the preset language recognition model can be a bert model, where the bert model captures descriptions of the first characteristic sentence at the sentence level and at the level of each of its words; the goal of the bert model is to be trained on a large-scale unlabeled corpus so as to obtain a representation of the rich semantic information contained in the first characteristic sentence.
  • The core of the bert model is the Transformer module, which is built with the Attention mechanism, and the Transformer modules so created can be assembled into the above bert model. This embodiment uses the word-to-sentence relationships in the bert model to obtain the semantic feature vector corresponding to each first characteristic sentence.
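  • As an illustration of how a bert model could map one first characteristic sentence to one semantic feature vector, a sketch using the Hugging Face transformers library is given below; the bert-base-chinese checkpoint and the mean pooling over token states are assumptions, not requirements of the patent:

      import torch
      from transformers import BertModel, BertTokenizer

      tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
      bert = BertModel.from_pretrained("bert-base-chinese")

      def sentence_to_semantic_vector(first_sentence: str) -> torch.Tensor:
          inputs = tokenizer(first_sentence, return_tensors="pt", truncation=True)
          with torch.no_grad():
              outputs = bert(**inputs)
          # One vector per sentence, built from the word-level (token-level) states.
          return outputs.last_hidden_state.mean(dim=1).squeeze(0)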
  • S40: After all the semantic feature vectors are input into the preset article semantic recognition model, obtain the second characteristic sentence output by the model; the second characteristic sentence includes a preset number of positions to be segmented, determined according to the contextual relationships of the unstructured medical knowledge text.
  • Understandably, the preset article semantic recognition model is an LSTM model, whose goal is to remember information over long spans so as to identify each complete sentence in the input text. The core processing of the LSTM model is carried out by three gates, namely the forget gate, the input gate and the output gate. When the LSTM model is combined with the context of the input text, a complete second characteristic sentence can be determined, and each second characteristic sentence forms two positions to be segmented.
  • S50: Call up the first code file of the medical source text to be processed, query the second characteristic sentence in the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file.
  • Understandably, the first code file is the background code file corresponding to the medical source text to be processed, and it can be called up through a scripting language.
  • Since the second characteristic sentence is converted from the unstructured medical knowledge text in the medical source text to be processed, the second characteristic sentence also has a text display position in the first code file of the medical source text (the text display position contains multiple second characteristic sentences). Specifically, the code corresponding to the text display position can be looked up in the first code file, and the characters corresponding to the second characteristic sentence are then identified at that text display position in the code in order to locate the second characteristic sentence.
  • A segmentation symbol can be understood as an html symbol. Two segmentation symbols are inserted at the two positions to be segmented corresponding to the text display position (that is, the text display position contains at least two positions to be segmented), so that the two segmentation symbols delimit one second characteristic sentence; the segmentation symbols include div symbols, the h1 to h6 heading symbols, and so on.
  • S60: Run the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text. Understandably, the second code file is the background code file that includes the segmentation symbols; to display the specific medical source text to be processed at this point, the second code file needs to be run.
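  • The lookup-and-insert mechanics are left to the scripting language; as an illustration only, the segmentation symbols could be inserted around each second characteristic sentence roughly as follows (h1 to h6 tags could be substituted for div where a heading is wanted, and real HTML would call for a proper parser rather than string replacement):

      def insert_segmentation_symbols(first_code_file: str,
                                      second_sentences: list[str]) -> str:
          """Wraps each second characteristic sentence found in the first code file
          in a div element, so that the opening and closing tags mark its two
          positions to be segmented."""
          second_code_file = first_code_file
          for sentence in second_sentences:
              # Assumption: the sentence occurs verbatim at its text display position.
              second_code_file = second_code_file.replace(
                  sentence, "<div>" + sentence + "</div>", 1)
          return second_code_file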
  • Further, after grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed, the method further includes:
  • detecting the unstructured medical knowledge text through a preset natural language processing model, marking the erroneous words in the unstructured text, and obtaining a marking result;
  • calling up the first code file of the medical source text to be processed, correcting the erroneous words in the first code file according to the marking result to obtain a third code file, and, after running the third code file, obtaining the corrected unstructured medical knowledge text.
  • Understandably, the preset natural language processing model can be an NLP model whose semantic recognition capability is used to mark repeated or misspelled words in the unstructured medical knowledge text, and the marking result is then used to correct the incorrectly written words in the first code file, where the correction includes deleting repeated words and fixing typos.
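  • A sketch of how the marking result could be applied to the first code file to obtain the third code file; the dictionary shape of the marking result is an assumption, since the patent only states that marked repeated words and typos are deleted or corrected:

      def correct_marked_words(first_code_file: str,
                               marking_result: dict[str, str]) -> str:
          """Keys are the marked (repeated or misspelled) words, values are their
          corrections; an empty string deletes a repeated word."""
          third_code_file = first_code_file
          for wrong, corrected in marking_result.items():
              third_code_file = third_code_file.replace(wrong, corrected)
          return third_code_file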
  • Further, the first characteristic sentences are stored in a blockchain, and the preset language recognition model is a bert model.
  • After the first characteristic sentences are input into the preset language recognition model, obtaining a semantic feature vector corresponding to each first characteristic sentence includes the steps detailed below, through which the semantic feature vector corresponding to the first characteristic sentence is obtained.
  • Understandably, this embodiment mainly uses the Attention mechanism in the bert model so that the model can focus its attention on the input first characteristic sentence. The Attention mechanism in this embodiment involves Query vectors, Key vectors and Value values; both the Query vectors and the Key vectors are derived from word vectors, and each word vector has a corresponding Value value. The essence of Attention can be described as a mapping from a query (Query) to a series of key-value (Key-Value) pairs.
  • Specifically, in this embodiment the first characteristic sentence is first input into the bert model, each word in the first characteristic sentence is looked up through the bert model and converted into a one-dimensional word vector, then one word vector of the first characteristic sentence is taken as the target Query vector and the other word vectors of the first characteristic sentence are taken as Key vectors, and the similarity between the Query vector and each Key vector is computed to obtain weight coefficients. Commonly used similarity functions include but are not limited to the dot product, concatenation and a perceptron. The weight coefficients are then normalized with a preset softmax function, and the normalized weight coefficients are used to perform a weighted summation over the Value values corresponding to the Query vector and the Key vectors, yielding the first enhanced semantic feature vector corresponding to the Query vector output by the Attention mechanism.
  • Finally, each Transformer Encoder built from the Attention mechanism performs data processing on the first enhanced semantic feature vector, where the data processing includes a residual connection (adding the word vector and the first enhanced semantic feature vector directly as the output), normalization of a layer of neural network nodes to zero mean and unit variance, and a linear transformation (the first enhanced semantic feature vector is linearly transformed to strengthen the expressive power of the bert model). After the second enhanced semantic feature vectors corresponding to all the word vectors are combined, the semantic feature vector corresponding to the first characteristic sentence is obtained.
  • This embodiment uses the bert model as the preset language recognition model for two purposes: first, it can learn the relationships between the first characteristic sentences, that is, connect the context; second, it obtains good sentence-level semantic representations (the second enhanced semantic feature vectors).
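  • A compact numerical sketch of the Query/Key/Value computation described above, using the dot product as the similarity function; the array shapes are assumptions, and a real bert model additionally scales the scores and uses multiple attention heads:

      import numpy as np

      def attention_vector(query: np.ndarray, keys: np.ndarray,
                           values: np.ndarray) -> np.ndarray:
          """Single-query attention: similarity scores between the Query vector and
          every Key vector are softmax-normalized into weight coefficients, which
          then weight the corresponding Value values."""
          scores = keys @ query                    # dot-product similarity
          weights = np.exp(scores - scores.max())
          weights /= weights.sum()                 # softmax normalization
          return weights @ values                  # first enhanced semantic feature vector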
  • The first characteristic sentences may also be stored in a node of a blockchain.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • A blockchain can include a blockchain underlying platform, a platform product service layer and an application service layer.
  • The decentralized, fully distributed DNS service provided by a blockchain can realize domain-name query and resolution through peer-to-peer data transmission between the nodes of the network, which can be used to ensure that the operating system and firmware of important infrastructure are not tampered with; it can also monitor the status and integrity of software, detect malicious tampering, and ensure that transmitted data have not been tampered with. Storing the first characteristic sentences in the blockchain therefore guarantees their privacy and security.
  • Further, the method also includes:
  • calling up a corresponding cascading style sheet according to a preset style format, and adding the cascading style sheet to the second code file.
  • Understandably, this embodiment mainly adds corresponding cascading style sheet settings, such as color, font size and borders, to the second code file to control the specific presentation format in the medical source text to be processed, for example the color and font-size properties and the box settings in CSS.
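  • As a small sketch of this step (the style rules and the injection point are assumptions), the preset cascading style sheet could be added to the second code file like this:

      PRESET_STYLE = (
          "<style>\n"
          "  div { color: #333333; font-size: 14px; border: 1px solid #dddddd; }\n"
          "</style>\n"
      )

      def add_style_sheet(second_code_file: str, style: str = PRESET_STYLE) -> str:
          # Inject the preset style before </head>; a <link> to an external CSS
          # file would work the same way.
          if "</head>" in second_code_file:
              return second_code_file.replace("</head>", style + "</head>", 1)
          return style + second_code_file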
  • Further, the preset article semantic recognition model is an LSTM model, and obtaining the second characteristic sentence output by the preset article semantic recognition model includes:
  • selecting the information to be discarded through the forget gate in the LSTM model;
  • selecting the required information from the semantic feature vectors through the input gate in the LSTM model and the discarded information;
  • outputting the second characteristic sentence through the output gate in the LSTM model and the required information.
  • Understandably, each gate designed in the LSTM model provides the ability to remove information from or add information to the cell state (which can be regarded as a semantic feature vector). Each gate consists of a sigmoid neural network layer and a pointwise multiplication operation; the sigmoid layer outputs a value between 0 and 1 describing how much of each component is allowed through, where 0 means nothing passes and 1 means everything passes.
  • The forget gate determines the information to be discarded from the cell state, where the discarded information is the subject corresponding to the previous semantic feature vector. The input gate updates the information stored in the cell state: specifically, the discarded information is removed from the semantic feature vector through the input gate, and the required information to be stored is determined from the remaining semantic feature vector. The output gate then determines the second characteristic sentence to be output, and the second characteristic sentence is output according to the required information determined by the input gate.
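  • For readers unfamiliar with these gates, a minimal numpy sketch of one LSTM step over a semantic feature vector is given below; the parameter layout is an assumption, and in practice a library LSTM (with a boundary classifier over its outputs) would be used rather than hand-written gates:

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def lstm_step(x, h_prev, c_prev, W, U, b):
          """One LSTM step: W, U and b each hold the parameters of the forget (f),
          input (i) and output (o) gates and of the candidate cell state (g)."""
          f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])  # forget gate: what to discard
          i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])  # input gate: what to store
          g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])  # candidate information
          c = f * c_prev + i * g                              # updated cell state
          o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])  # output gate
          h = o * np.tanh(c)                                  # output used downstream to
          return h, c                                         # decide the second sentence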
  • The above provides a medical text structuring method that uses models and segmentation symbols to replace the previous manual editing of the unstructured medical knowledge text in the medical source text to be processed, avoiding the high error rate and large time cost of manual editing and improving the efficiency of converting unstructured medical knowledge text into structured medical knowledge text.
  • This method can be applied to smart medical care to promote the construction of smart cities.
  • In one embodiment, a medical text structuring device is provided, and the device corresponds one-to-one to the medical text structuring method in the above embodiment.
  • the medical text structuring device includes a grabbing module 11, a splitting module 12, a first acquiring module 13, a second acquiring module 14, an inserting module 15 and a display module 16.
  • the detailed description of each functional module is as follows:
  • the grabbing module 11 is used to grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
  • the splitting module 12 is used to identify all punctuation marks in the unstructured medical knowledge text and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
  • the first acquiring module 13 is configured to obtain a semantic feature vector corresponding to each first characteristic sentence after the first characteristic sentences are input into the preset language recognition model;
  • the second acquiring module 14 is configured to obtain the second characteristic sentence output by the preset article semantic recognition model after all the semantic feature vectors are input into it; the second characteristic sentence includes a preset number of positions to be segmented, determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
  • the insertion module 15 is used to call up the first code file of the medical source text to be processed, query the second characteristic sentence in the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence, obtaining a second code file;
  • the display module 16 is configured to run the second code file so as to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
  • Further, the medical text structuring device also includes:
  • a marking module, used to detect the unstructured medical knowledge text through a preset natural language processing model, mark the erroneous words in the unstructured text, and obtain a marking result;
  • a running module, used to call up the first code file of the medical source text to be processed, correct the erroneous words in the first code file according to the marking result to obtain a third code file, and, after the third code file is run, obtain the corrected unstructured medical knowledge text.
  • Further, the preset language recognition model is a bert model, and the first acquiring module includes:
  • an input sub-module, used to query the word vector of each word in the first characteristic sentence through the bert model after the first characteristic sentence is input into the bert model;
  • a selection sub-module, configured to select one of the word vectors in the first characteristic sentence as the Query vector through the Attention mechanism in the bert model, and to use the other word vectors of the first characteristic sentence as the Key vectors;
  • a weighting operation sub-module, used to compute the similarity between the Query vector and each Key vector to obtain weight coefficients, and to perform a weighted operation on the Value values corresponding to the Query vector and the Key vectors using the weight coefficients, obtaining the first enhanced semantic feature vector corresponding to the Query vector output by the Attention mechanism;
  • a linear conversion sub-module, configured to perform a linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the bert model to obtain a second enhanced semantic feature vector;
  • a combination sub-module, used to combine the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
  • Further, the medical text structuring device also includes:
  • an adding module, used to call up a corresponding cascading style sheet according to a preset style format and add the cascading style sheet to the second code file.
  • Further, the preset article semantic recognition model is an LSTM model, and the second acquiring module includes:
  • a first selection sub-module, configured to select the information to be discarded through the forget gate in the LSTM model;
  • a second selection sub-module, configured to select the required information from the semantic feature vectors through the input gate in the LSTM model and the discarded information;
  • an output sub-module, configured to output the second characteristic sentence through the output gate in the LSTM model and the required information.
  • Each module in the above-mentioned medical text structuring device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • The above modules may be embedded in or independent of the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 4.
  • The computer device includes a processor, a memory, a network interface and a database connected through a system bus, where the processor of the computer device is used to provide computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and internal memory; the non-volatile storage medium stores an operating system, computer-readable instructions and a database.
  • The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium.
  • the database of the computer equipment is used to store multiple pieces of historical test data, and each piece of historical test data corresponds to a test problem record.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a medical text structuring method.
  • In one embodiment, a computer device is provided, which includes a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, where the processor, when executing the computer-readable instructions, implements the medical text structuring method described in the above embodiments.
  • In one embodiment, one or more readable storage media storing computer-readable instructions are provided.
  • The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media; the readable storage media store computer-readable instructions, and when the computer-readable instructions are executed by one or more processors, the one or more processors implement the medical text structuring method described in the above embodiments.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The present application relates to artificial intelligence technology applicable to the field of medical text processing, and in particular discloses a medical text structuring method and apparatus, a computer device, and a storage medium. The method comprises: acquiring unstructured medical knowledge text; splitting the unstructured text into a plurality of first characteristic sentences; inputting the first characteristic sentences into a preset language recognition model to obtain semantic feature vectors; inputting the semantic feature vectors into a preset article semantic recognition model to obtain the output second characteristic sentences; calling up a first code file of the medical source text to be processed; inserting segmentation symbols at the positions in the first code file corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file; and running the second code file so as to display, on the medical source text to be processed, structured medical knowledge text corresponding to the unstructured medical knowledge text. The present method improves the efficiency of conversion into structured medical knowledge text.

Description

Medical text structuring method, device, computer equipment and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 8, 2020, with application number 202010935255.2 and the invention title "Medical text structuring method, device, computer equipment and storage medium", the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the field of intelligent decision-making in artificial intelligence, and in particular to a medical text structuring method, device, computer equipment and storage medium.
Background
At present, a single medical source text contains a large amount of medical knowledge text, and this text involves many kinds of medical knowledge from the medical field. When this medical knowledge text needs to be displayed in an interface, it has to be edited manually to make it structured and easy to view. However, the inventors realized that the text format of medical knowledge text in a source text is usually uneven, and most of it is presented in an unstructured form; as a result, manual editing is error-prone, editing efficiency is low, and editing takes a great deal of time. In particular, when newly published medical knowledge text (such as new product instructions in the medical field) needs to be shown to users, the text must follow a specific structured format, for example correct segmentation and reasonable indentation. Producing structured medical text suitable for external display through manual editing is time-consuming and labor-intensive, so those skilled in the art urgently need a new technical solution to the above problems.
Summary of the Invention
Based on this, it is necessary to address the above technical problems by providing a medical text structuring method, device, computer equipment and storage medium, so as to avoid the problems of the high error rate and large time cost of manual editing and to improve the efficiency of converting unstructured medical knowledge text into structured medical knowledge text.
A medical text structuring method, including:
grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
after inputting the first characteristic sentences into a preset language recognition model, obtaining a semantic feature vector corresponding to each of the first characteristic sentences;
after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining the second characteristic sentence output by the preset article semantic recognition model, where the second characteristic sentence includes a preset number of positions to be segmented determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
calling up the first code file of the medical source text to be processed, querying the second characteristic sentence in the first code file, and inserting segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file;
running the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
A medical text structuring device, including:
a grabbing module, used to grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
a splitting module, used to identify all punctuation marks in the unstructured medical knowledge text and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
a first acquiring module, configured to obtain a semantic feature vector corresponding to each first characteristic sentence after the first characteristic sentences are input into the preset language recognition model;
a second acquiring module, configured to obtain the second characteristic sentence output by the preset article semantic recognition model after all the semantic feature vectors are input into it, where the second characteristic sentence includes a preset number of positions to be segmented determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
an insertion module, used to call up the first code file of the medical source text to be processed, query the second characteristic sentence in the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file;
a display module, used to run the second code file so as to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
after inputting the first characteristic sentences into a preset language recognition model, obtaining a semantic feature vector corresponding to each of the first characteristic sentences;
after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining the second characteristic sentence output by the preset article semantic recognition model, where the second characteristic sentence includes a preset number of positions to be segmented determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
calling up the first code file of the medical source text to be processed, querying the second characteristic sentence in the first code file, and inserting segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file;
running the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
after inputting the first characteristic sentences into a preset language recognition model, obtaining a semantic feature vector corresponding to each of the first characteristic sentences;
after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining the second characteristic sentence output by the preset article semantic recognition model, where the second characteristic sentence includes a preset number of positions to be segmented determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
calling up the first code file of the medical source text to be processed, querying the second characteristic sentence in the first code file, and inserting segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file;
running the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
The above medical text structuring method, device, computer equipment and storage medium use models and segmentation symbols to replace the previous manual editing of the unstructured medical knowledge text in the medical source text to be processed, avoiding the high error rate and large time cost of manual editing and improving the efficiency of converting unstructured medical knowledge text into structured medical knowledge text.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
FIG. 1 is a schematic diagram of an application environment of the medical text structuring method in an embodiment of the present application;
FIG. 2 is a flowchart of the medical text structuring method in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of the medical text structuring device in an embodiment of the present application;
FIG. 4 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The medical text structuring method provided in this application can be applied in the application environment shown in FIG. 1, in which a client communicates with a server through a network. The client may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a medical text structuring method is provided. The method is described by taking its application to the server in FIG. 1 as an example and includes the following steps:
S10: Grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed.
Understandably, the medical source text to be processed may refer to a webpage containing medical-domain text, and the unstructured medical knowledge text may include, but is not limited to, the formulations of various medical drugs on the webpage, descriptions of their various therapeutic functions, product instructions for medical drugs, and the like. Unstructured medical knowledge text refers to text without a fixed format, where the fixed format includes but is not limited to paragraph format, character format, indentation format and spacing format. Since in this embodiment the text in the medical source text to be processed is uploaded by different users, and the formats used by different users during editing are inconsistent, when different users upload whole passages to the same medical source text, the passage finally displayed suffers from inconsistent text formatting; moreover, the various input components or display components of the medical source text to be processed may be incompatible with the text, and copying a whole passage from a display component of one medical source text into an input component of another may turn previously structured medical knowledge text into unstructured medical knowledge text. Specifically, in this embodiment the unstructured medical knowledge text is grabbed from the medical source text to be processed by recognizing the text: after all the text in the display interface of the medical source text is recognized, the text selected by the user can be taken as the unstructured medical knowledge text, or an NLP model can be used, and when the NLP model recognizes that the text in the medical source text to be processed has several inconsistent formats or no format at all, that text is grabbed as the unstructured medical knowledge text.
S20: Identify all punctuation marks in the unstructured medical knowledge text, and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks.
Understandably, the punctuation marks in the unstructured medical knowledge text can be recognized by a punctuation-recognition component or by an NLP model, and the sentences in the not-yet-structured medical knowledge text are then segmented at the recognized punctuation marks to obtain the plurality of first characteristic sentences, where the punctuation marks used for splitting are symbols that end a complete sentence, such as periods, exclamation marks or question marks. In this embodiment the text is split into multiple first characteristic sentences, each representing the features of one complete sentence; these features provide the connection relationships of a complete sentence for the subsequent semantic recognition process and avoid mixed recognition across sentence boundaries.
S30: After the first characteristic sentences are input into the preset language recognition model, obtain a semantic feature vector corresponding to each first characteristic sentence.
Understandably, the preset language recognition model can be a bert model, where the bert model captures descriptions of the first characteristic sentence at the sentence level and at the level of each of its words; the goal of the bert model is to be trained on a large-scale unlabeled corpus so as to obtain a representation of the rich semantic information contained in the first characteristic sentence. The core of the bert model is the Transformer module, which is built with the Attention mechanism, and the Transformer modules so created can be assembled into the above bert model. This embodiment uses the word-to-sentence relationships in the bert model to obtain the semantic feature vector corresponding to each first characteristic sentence.
S40: After all the semantic feature vectors are input into the preset article semantic recognition model, obtain the second characteristic sentence output by the preset article semantic recognition model; the second characteristic sentence includes a preset number of positions to be segmented determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text.
Understandably, the preset article semantic recognition model is an LSTM model, whose goal is to remember information over long spans so as to identify each complete sentence in the input text. The core processing of the LSTM model is carried out by three gates, namely the forget gate, the input gate and the output gate; in addition, when the LSTM model is combined with the context of the input text, a complete second characteristic sentence can be determined, and each second characteristic sentence forms two positions to be segmented.
S50: Call up the first code file of the medical source text to be processed, query the second characteristic sentence in the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentence to obtain a second code file.
Understandably, the first code file is the background code file corresponding to the medical source text to be processed, and it can be called up through a scripting language. Since the second characteristic sentence is converted from the unstructured medical knowledge text in the medical source text to be processed, the second characteristic sentence also has a text display position in the first code file of the medical source text (the text display position contains multiple second characteristic sentences); specifically, the code corresponding to the text display position can be looked up in the first code file, and the characters corresponding to the second characteristic sentence are then identified at that text display position in the code in order to locate the second characteristic sentence. A segmentation symbol can be understood as an html symbol: two segmentation symbols are inserted at the two positions to be segmented corresponding to the text display position (that is, the text display position contains at least two positions to be segmented), so that the two segmentation symbols delimit one second characteristic sentence, where the segmentation symbols include div symbols, the h1 to h6 heading symbols, and so on.
S60: Run the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
Understandably, the second code file is the background code file that includes the segmentation symbols; to display the specific medical source text to be processed at this point, the second code file needs to be run.
Further, after grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed, the method further includes:
detecting the unstructured medical knowledge text through a preset natural language processing model, marking the erroneous words in the unstructured text, and obtaining a marking result;
calling up the first code file of the medical source text to be processed, correcting the erroneous words in the first code file according to the marking result to obtain a third code file, and, after running the third code file, obtaining the corrected unstructured medical knowledge text.
Understandably, the preset natural language processing model can be an NLP model whose semantic recognition capability is used to mark repeated or misspelled words in the unstructured medical knowledge text, and the marking result is then used to correct the incorrectly written words in the first code file, where the correction includes deleting repeated words and fixing typos.
Further, the first characteristic sentences are stored in a blockchain, and the preset language recognition model is a bert model.
After the first characteristic sentences are input into the preset language recognition model, obtaining a semantic feature vector corresponding to each first characteristic sentence includes:
after the first characteristic sentence is input into the bert model, querying the word vector of each word in the first characteristic sentence through the bert model;
selecting one of the word vectors in the first characteristic sentence as the Query vector through the Attention mechanism in the bert model, and using the other word vectors of the first characteristic sentence as the Key vectors;
computing the similarity between the Query vector and each Key vector to obtain weight coefficients, and performing a weighted operation on the Value values corresponding to the Query vector and the Key vectors using the weight coefficients, to obtain the first enhanced semantic feature vector corresponding to the Query vector output by the Attention mechanism;
performing a linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the bert model to obtain a second enhanced semantic feature vector;
combining the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
Understandably, this embodiment mainly uses the Attention mechanism in the BERT model to let the model focus its attention on the input first characteristic sentence. The Attention mechanism in this embodiment involves a Query vector, Key vectors and Value values, where both the Query vector and the Key vectors come from the word vectors, and each word vector has a corresponding Value; the essence of Attention can be described as a mapping from a Query to a series of Key-Value pairs. Specifically, in this embodiment the first characteristic sentence is first input into the BERT model, each word in the first characteristic sentence is queried through the BERT model, and each queried word is converted into a one-dimensional word vector. One word vector of the first characteristic sentence is then taken as the target Query vector, and the other word vectors in the first characteristic sentence are taken as Key vectors. Next, a similarity calculation is performed between the Query vector and each Key vector to obtain weight coefficients, where commonly used similarity functions include but are not limited to dot product, concatenation and perceptron; a preset softmax function is used to normalize the obtained weight coefficients, and a weighted sum is computed over the normalized weight coefficients and the Value values corresponding to the Query vector and the Key vectors, yielding the first enhanced semantic feature vector corresponding to the Query vector that the Attention mechanism finally outputs. Finally, each Transformer Encoder built on the Attention mechanism performs data processing on the first enhanced semantic feature vector, where the data processing includes a residual connection (adding the word vector and the first enhanced semantic feature vector directly, as the final output), zero-mean unit-variance normalization of the nodes of a given neural network layer, and linear transformation (linearly transforming the first enhanced semantic feature vector to enhance the expressive ability of the BERT model). After the second enhanced semantic feature vectors corresponding to all the word vectors are combined, the semantic feature vector corresponding to the first characteristic sentence is obtained. This embodiment uses the BERT model as the preset language recognition model, which achieves two purposes: 1. the relationships between the first characteristic sentences can be learned, that is, the context is taken into account; 2. sentence-level semantic representations (the second enhanced semantic feature vectors) are obtained well.
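To make the weighting described above concrete, the sketch below computes a single attention step with NumPy; it is a simplified illustration under stated assumptions (a single head, dot-product similarity, toy random data), not the BERT model of this embodiment.

```python
# Simplified single-head attention step, assuming scaled dot-product similarity
# and softmax normalization as described above. Shapes and values are toy data.

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(query: np.ndarray, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Weight the Value vectors by Query-Key similarity (scaled dot product)."""
    scores = keys @ query / np.sqrt(query.shape[0])   # similarity of the Query to each Key
    weights = softmax(scores)                          # normalized weight coefficients
    return weights @ values                            # first enhanced semantic feature vector

dim = 4
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(5, dim))               # one row per word in the sentence
query = word_vectors[0]                                # one word vector taken as the Query
keys = word_vectors[1:]                                # remaining word vectors as Keys
values = keys                                          # toy assumption: Value == Key
print(attention_step(query, keys, values))
```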
In addition, it should be emphasized that, in order to further ensure the privacy and security of the above first characteristic sentences, the first characteristic sentences may also be stored in a node of a blockchain. The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated and linked using cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like. The decentralized, fully distributed DNS service provided by the blockchain can realize domain name query and resolution through point-to-point data transmission between the nodes in the network; it can be used to ensure that the operating system and firmware of an important piece of infrastructure have not been tampered with, to monitor the status and integrity of software, to detect improper tampering, and to ensure that transmitted data has not been tampered with. Storing the first characteristic sentences in a blockchain can therefore ensure their privacy and security.
Further, after inserting the segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentences to obtain the second code file, the method further includes:
Calling up the corresponding cascading style sheet according to a preset style format, and adding the cascading style sheet to the second code file.
Understandably, this embodiment mainly adds cascading style sheet rules such as color, font size and box styling to the second code file, so that a specific format is presented on the medical source text to be processed, for example the color, font-size and box properties in CSS.
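As one possible illustration, and assuming a simple preset style format rather than any specific style used by the embodiment, the sketch below nests a CSS style block ahead of the structured text in the second code file; the dictionary of preset styles and the function name are hypothetical.

```python
# Hypothetical sketch: attach a preset style format to the second code file by
# prepending a <style> block (cascading style sheet) to the structured text.

PRESET_STYLES = {
    "default": "div { color: #333; font-size: 14px; box-sizing: border-box; }",
}

def add_style_sheet(second_code_file: str, style_name: str = "default") -> str:
    """Return the code file with the selected cascading style sheet prepended."""
    css = PRESET_STYLES.get(style_name, "")
    return f"<style>{css}</style>\n{second_code_file}" if css else second_code_file

print(add_style_sheet("<div>发热、咳嗽</div><div>建议多饮水</div>"))
```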
Further, the preset article semantic recognition model is an LSTM model.
After inputting all the semantic feature vectors into the preset article semantic recognition model, the method further includes:
Selecting the information to be discarded through the forget gate in the LSTM model;
Selecting the required information from the semantic feature vectors through the input gate in the LSTM model and the discarded information;
Outputting the second characteristic sentence through the output gate in the LSTM model and the required information.
Understandably, the LSTM model is a kind of gated RNN. The key to the LSTM model is the cell state; accordingly, the gates designed in the LSTM model provide the ability to remove information from, or add information to, the cell state (which here can be regarded as the semantic feature vector). Each gate contains a sigmoid neural network layer and a pointwise multiplication operation; the sigmoid layer outputs values between 0 and 1 that describe how much of each component is allowed through, where 0 means nothing passes and 1 means everything passes. The forget gate decides which information in the cell state is discarded; the discarded information is the subject corresponding to the previous semantic feature vector. The input gate updates the information stored in the cell state: it first discards the information to be discarded from the semantic feature vector and then determines, from the remaining semantic feature vector, the required information to be updated. The output gate decides the second characteristic sentence to output, outputting it according to the required information determined by the input gate.
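For reference, the following sketch runs one step of a standard LSTM cell with the three gates described above; the weight matrices are random placeholders and the whole fragment is only an assumption-level illustration of the gate mechanics, not the trained model of this embodiment.

```python
# Sketch of a single LSTM step with forget, input and output gates (standard
# LSTM cell equations). Weights are random placeholders; x stands in for one
# semantic feature vector.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate: what to discard
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate: what to keep
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate cell content
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate: what to emit
    c = f * c_prev + i * g                               # updated cell state
    h = o * np.tanh(c)                                   # output used downstream
    return h, c

dim = 8
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(dim, dim)) for k in "figo"}
U = {k: rng.normal(size=(dim, dim)) for k in "figo"}
b = {k: np.zeros(dim) for k in "figo"}
h, c = lstm_step(rng.normal(size=dim), np.zeros(dim), np.zeros(dim), W, U, b)
```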
In summary, the above provides a medical text structuring method that uses models and segmentation symbols to replace the previous manual editing of unstructured medical knowledge text in the medical source text to be processed, avoiding the high error rate and large time cost of manual editing and improving the efficiency of converting unstructured medical knowledge text into structured medical knowledge text. The method can be applied to smart healthcare, thereby promoting the construction of smart cities.
It should be understood that the size of the sequence number of each step in the foregoing embodiments does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
In one embodiment, a medical text structuring device is provided, and the medical text structuring device corresponds one-to-one to the medical text structuring method in the above embodiments. As shown in FIG. 3, the medical text structuring device includes a grabbing module 11, a splitting module 12, a first acquisition module 13, a second acquisition module 14, an insertion module 15 and a display module 16. The functional modules are described in detail as follows:
The grabbing module 11 is used to grab the entire paragraph of unstructured medical knowledge text in the medical source text to be processed;
The splitting module 12 is used to identify all punctuation marks in the unstructured medical knowledge text and split the unstructured text into multiple first characteristic sentences according to the punctuation marks;
The first acquisition module 13 is used to obtain, after the first characteristic sentences are input into the preset language recognition model, one semantic feature vector corresponding to each first characteristic sentence;
The second acquisition module 14 is used to obtain, after all the semantic feature vectors are input into the preset article semantic recognition model, the second characteristic sentences output by the preset article semantic recognition model, where the second characteristic sentences contain a preset number of positions to be segmented determined by the preset article semantic recognition model according to the context relationships of the unstructured medical knowledge text;
The insertion module 15 is used to call up the first code file of the medical source text to be processed, query the second characteristic sentences from the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file;
The display module 16 is used to run the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
Further, the medical text structuring device further includes:
A marking module, used to detect the unstructured medical knowledge text through a preset natural language processing model, mark the erroneous words in the unstructured medical knowledge text, and obtain a marking result;
A running module, used to call up the first code file of the medical source text to be processed, correct the erroneous words in the first code file according to the marking result to obtain a third code file, and run the third code file to obtain the corrected unstructured medical knowledge text.
Further, the preset language recognition model is a BERT model, and the first acquisition module includes:
An input sub-module, used to query, after the first characteristic sentence is input into the BERT model, the word vector of each word in the first characteristic sentence through the BERT model;
A selection sub-module, used to select one of the word vectors of the first characteristic sentence as the Query vector through the Attention mechanism in the BERT model, and use the other word vectors of the first characteristic sentence as Key vectors;
A weighting sub-module, used to calculate the similarity between the Query vector and each Key vector to obtain weight coefficients, and perform a weighted operation on the Value values corresponding to the Query vector and the Key vectors with the weight coefficients, so that the Attention mechanism outputs a first enhanced semantic feature vector corresponding to the Query vector;
A linear transformation sub-module, used to perform linear transformation on the first enhanced semantic feature vector through the multiple stacked Transformer Encoders in the BERT model to obtain second enhanced semantic feature vectors;
A combination sub-module, used to combine the second enhanced semantic feature vectors corresponding to the word vectors of all the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
Further, the medical text structuring device further includes:
An adding module, used to call up the corresponding cascading style sheet according to a preset style format and add the cascading style sheet to the second code file.
Further, the preset article semantic recognition model is an LSTM model, and the second acquisition module includes:
A first selection sub-module, used to select the information to be discarded through the forget gate in the LSTM model;
A second selection sub-module, used to select the required information from the semantic feature vectors through the input gate in the LSTM model and the discarded information;
An output sub-module, used to output the second characteristic sentence through the output gate in the LSTM model and the required information.
For the specific limitations of the medical text structuring device, reference may be made to the limitations of the medical text structuring method above, which will not be repeated here. Each module in the above medical text structuring device may be implemented in whole or in part by software, hardware or a combination thereof. The above modules may be embedded in, or independent of, the processor of a computer device in the form of hardware, or stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 4. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory; the readable storage medium stores an operating system, computer-readable instructions and a database, and the internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the readable storage medium. The database of the computer device is used to store multiple pieces of historical test data, each of which corresponds to a test problem record. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions, when executed by the processor, implement a medical text structuring method.
In one embodiment, a computer device is provided, including a memory, a processor and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the medical text structuring method described in the above embodiments when executing the computer-readable instructions.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media; the readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the medical text structuring method described in the above embodiments.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware, and that the computer-readable instructions can be stored in a non-volatile computer-readable storage medium or a volatile readable storage medium; when executed, the computer-readable instructions may include the processes of the above method embodiments. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the division of the above functional units and modules is used as an example; in practical applications, the above functions can be allocated to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and should all be included within the scope of protection of the present application.

Claims (20)

  1. A medical text structuring method, comprising:
    grabbing an entire paragraph of unstructured medical knowledge text in a medical source text to be processed;
    identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
    after inputting the first characteristic sentences into a preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences;
    after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining second characteristic sentences output by the preset article semantic recognition model, wherein the second characteristic sentences contain a preset number of positions to be segmented determined by the preset article semantic recognition model according to context relationships of the unstructured medical knowledge text;
    calling up a first code file of the medical source text to be processed, querying the second characteristic sentences from the first code file, and inserting segmentation symbols in the first code file at positions corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file;
    running the second code file to display, on the medical source text to be processed, structured medical knowledge text corresponding to the unstructured medical knowledge text.
  2. The medical text structuring method according to claim 1, wherein, after grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed, the method further comprises:
    detecting the unstructured medical knowledge text through a preset natural language processing model, marking erroneous words in the unstructured medical knowledge text, and obtaining a marking result;
    calling up the first code file of the medical source text to be processed, correcting the erroneous words in the first code file according to the marking result to obtain a third code file, and running the third code file to obtain corrected unstructured medical knowledge text.
  3. The medical text structuring method according to claim 1, wherein the preset language recognition model is a BERT model;
    after inputting the first characteristic sentences into the preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences comprises:
    after inputting the first characteristic sentence into the BERT model, querying a word vector of each word in the first characteristic sentence through the BERT model;
    selecting one of the word vectors of the first characteristic sentence as a Query vector through an Attention mechanism in the BERT model, and using the other word vectors of the first characteristic sentence as Key vectors;
    calculating similarity between the Query vector and each of the Key vectors to obtain weight coefficients, and performing a weighted operation on Value values corresponding to the Query vector and the Key vectors with the weight coefficients, so that the Attention mechanism outputs a first enhanced semantic feature vector corresponding to the Query vector;
    performing linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the BERT model to obtain second enhanced semantic feature vectors;
    combining the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
  4. The medical text structuring method according to claim 1, wherein, after inserting the segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentences to obtain the second code file, the method further comprises:
    calling up a corresponding cascading style sheet according to a preset style format, and nesting the cascading style sheet into the second code file.
  5. The medical text structuring method according to claim 1, wherein the preset article semantic recognition model is an LSTM model;
    after inputting all the semantic feature vectors into the preset article semantic recognition model, the method further comprises:
    selecting information to be discarded through a forget gate in the LSTM model;
    selecting required information from the semantic feature vectors through an input gate in the LSTM model and the discarded information;
    outputting the second characteristic sentence through an output gate in the LSTM model and the required information.
  6. A medical text structuring device, comprising:
    a grabbing module, used to grab an entire paragraph of unstructured medical knowledge text in a medical source text to be processed;
    a splitting module, used to identify all punctuation marks in the unstructured medical knowledge text and split the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
    a first acquisition module, used to obtain, after the first characteristic sentences are input into a preset language recognition model, one semantic feature vector corresponding to each of the first characteristic sentences;
    a second acquisition module, used to obtain, after all the semantic feature vectors are input into a preset article semantic recognition model, second characteristic sentences output by the preset article semantic recognition model, wherein the second characteristic sentences contain a preset number of positions to be segmented determined by the preset article semantic recognition model according to context relationships of the unstructured medical knowledge text;
    an insertion module, used to call up a first code file of the medical source text to be processed, query the second characteristic sentences from the first code file, and insert segmentation symbols in the first code file at positions corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file;
    a display module, used to run the second code file to display, on the medical source text to be processed, structured medical knowledge text corresponding to the unstructured medical knowledge text.
  7. The medical text structuring device according to claim 6, wherein the medical text structuring device further comprises:
    a marking module, used to detect the unstructured medical knowledge text through a preset natural language processing model, mark erroneous words in the unstructured medical knowledge text, and obtain a marking result;
    a running module, used to call up the first code file of the medical source text to be processed, correct the erroneous words in the first code file according to the marking result to obtain a third code file, and run the third code file to obtain corrected unstructured medical knowledge text.
  8. The medical text structuring device according to claim 6, wherein the preset language recognition model is a BERT model, and the first acquisition module comprises:
    an input sub-module, used to query, after the first characteristic sentence is input into the BERT model, a word vector of each word in the first characteristic sentence through the BERT model;
    a selection sub-module, used to select one of the word vectors of the first characteristic sentence as a Query vector through an Attention mechanism in the BERT model, and use the other word vectors of the first characteristic sentence as Key vectors;
    a weighting sub-module, used to calculate similarity between the Query vector and each of the Key vectors to obtain weight coefficients, and perform a weighted operation on Value values corresponding to the Query vector and the Key vectors with the weight coefficients, so that the Attention mechanism outputs a first enhanced semantic feature vector corresponding to the Query vector;
    a linear transformation sub-module, used to perform linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the BERT model to obtain second enhanced semantic feature vectors;
    a combination sub-module, used to combine the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
  9. The medical text structuring device according to claim 6, wherein the medical text structuring device further comprises:
    an adding module, used to call up a corresponding cascading style sheet according to a preset style format and add the cascading style sheet to the second code file.
  10. The medical text structuring device according to claim 6, wherein the preset article semantic recognition model is an LSTM model, and the second acquisition module comprises:
    a first selection sub-module, used to select information to be discarded through a forget gate in the LSTM model;
    a second selection sub-module, used to select required information from the semantic feature vectors through an input gate in the LSTM model and the discarded information;
    an output sub-module, used to output the second characteristic sentence through an output gate in the LSTM model and the required information.
  11. A computer device, comprising a memory, a processor and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    grabbing an entire paragraph of unstructured medical knowledge text in a medical source text to be processed;
    identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
    after inputting the first characteristic sentences into a preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences;
    after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining second characteristic sentences output by the preset article semantic recognition model, wherein the second characteristic sentences contain a preset number of positions to be segmented determined by the preset article semantic recognition model according to context relationships of the unstructured medical knowledge text;
    calling up a first code file of the medical source text to be processed, querying the second characteristic sentences from the first code file, and inserting segmentation symbols in the first code file at positions corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file;
    running the second code file to display, on the medical source text to be processed, structured medical knowledge text corresponding to the unstructured medical knowledge text.
  12. The computer device according to claim 11, wherein, after grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed, the processor further implements the following steps when executing the computer-readable instructions:
    detecting the unstructured medical knowledge text through a preset natural language processing model, marking erroneous words in the unstructured medical knowledge text, and obtaining a marking result;
    calling up the first code file of the medical source text to be processed, correcting the erroneous words in the first code file according to the marking result to obtain a third code file, and running the third code file to obtain corrected unstructured medical knowledge text.
  13. The computer device according to claim 11, wherein the preset language recognition model is a BERT model;
    after inputting the first characteristic sentences into the preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences comprises:
    after inputting the first characteristic sentence into the BERT model, querying a word vector of each word in the first characteristic sentence through the BERT model;
    selecting one of the word vectors of the first characteristic sentence as a Query vector through an Attention mechanism in the BERT model, and using the other word vectors of the first characteristic sentence as Key vectors;
    calculating similarity between the Query vector and each of the Key vectors to obtain weight coefficients, and performing a weighted operation on Value values corresponding to the Query vector and the Key vectors with the weight coefficients, so that the Attention mechanism outputs a first enhanced semantic feature vector corresponding to the Query vector;
    performing linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the BERT model to obtain second enhanced semantic feature vectors;
    combining the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
  14. The computer device according to claim 11, wherein, after inserting the segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentences to obtain the second code file, the processor further implements the following step when executing the computer-readable instructions:
    calling up a corresponding cascading style sheet according to a preset style format, and nesting the cascading style sheet into the second code file.
  15. The computer device according to claim 11, wherein the preset article semantic recognition model is an LSTM model;
    after inputting all the semantic feature vectors into the preset article semantic recognition model, the processor further implements the following steps when executing the computer-readable instructions:
    selecting information to be discarded through a forget gate in the LSTM model;
    selecting required information from the semantic feature vectors through an input gate in the LSTM model and the discarded information;
    outputting the second characteristic sentence through an output gate in the LSTM model and the required information.
  16. One or more readable storage media storing computer-readable instructions, wherein, when the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps:
    grabbing an entire paragraph of unstructured medical knowledge text in a medical source text to be processed;
    identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first characteristic sentences according to the punctuation marks;
    after inputting the first characteristic sentences into a preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences;
    after inputting all the semantic feature vectors into a preset article semantic recognition model, obtaining second characteristic sentences output by the preset article semantic recognition model, wherein the second characteristic sentences contain a preset number of positions to be segmented determined by the preset article semantic recognition model according to context relationships of the unstructured medical knowledge text;
    calling up a first code file of the medical source text to be processed, querying the second characteristic sentences from the first code file, and inserting segmentation symbols in the first code file at positions corresponding to the positions to be segmented of the second characteristic sentences to obtain a second code file;
    running the second code file to display, on the medical source text to be processed, structured medical knowledge text corresponding to the unstructured medical knowledge text.
  17. The readable storage media according to claim 16, wherein, after grabbing the entire paragraph of unstructured medical knowledge text in the medical source text to be processed, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further execute the following steps:
    detecting the unstructured medical knowledge text through a preset natural language processing model, marking erroneous words in the unstructured medical knowledge text, and obtaining a marking result;
    calling up the first code file of the medical source text to be processed, correcting the erroneous words in the first code file according to the marking result to obtain a third code file, and running the third code file to obtain corrected unstructured medical knowledge text.
  18. The readable storage media according to claim 16, wherein the preset language recognition model is a BERT model;
    after inputting the first characteristic sentences into the preset language recognition model, obtaining one semantic feature vector corresponding to each of the first characteristic sentences comprises:
    after inputting the first characteristic sentence into the BERT model, querying a word vector of each word in the first characteristic sentence through the BERT model;
    selecting one of the word vectors of the first characteristic sentence as a Query vector through an Attention mechanism in the BERT model, and using the other word vectors of the first characteristic sentence as Key vectors;
    calculating similarity between the Query vector and each of the Key vectors to obtain weight coefficients, and performing a weighted operation on Value values corresponding to the Query vector and the Key vectors with the weight coefficients, so that the Attention mechanism outputs a first enhanced semantic feature vector corresponding to the Query vector;
    performing linear transformation on the first enhanced semantic feature vector through multiple stacked Transformer Encoders in the BERT model to obtain second enhanced semantic feature vectors;
    combining the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first characteristic sentence to obtain the semantic feature vector corresponding to the first characteristic sentence.
  19. The readable storage media according to claim 16, wherein, after inserting the segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second characteristic sentences to obtain the second code file, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further execute the following step:
    calling up a corresponding cascading style sheet according to a preset style format, and nesting the cascading style sheet into the second code file.
  20. The readable storage media according to claim 16, wherein the preset article semantic recognition model is an LSTM model;
    after inputting all the semantic feature vectors into the preset article semantic recognition model, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further execute the following steps:
    selecting information to be discarded through a forget gate in the LSTM model;
    selecting required information from the semantic feature vectors through an input gate in the LSTM model and the discarded information;
    outputting the second characteristic sentence through an output gate in the LSTM model and the required information.
PCT/CN2020/124215 2020-09-08 2020-10-28 Medical text structuring method and apparatus, computer device and storage medium WO2021164301A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010935255.2A CN112016274B (en) 2020-09-08 2020-09-08 Medical text structuring method, device, computer equipment and storage medium
CN202010935255.2 2020-09-08

Publications (1)

Publication Number Publication Date
WO2021164301A1 true WO2021164301A1 (en) 2021-08-26

Family

ID=73516342

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124215 WO2021164301A1 (en) 2020-09-08 2020-10-28 Medical text structuring method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112016274B (en)
WO (1) WO2021164301A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034204A (en) * 2022-05-12 2022-09-09 浙江大学 Method for generating structured medical text, computer device, storage medium and program product
CN116882496A (en) * 2023-09-07 2023-10-13 中南大学湘雅医院 Medical knowledge base construction method for multistage logic reasoning

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016274B (en) * 2020-09-08 2024-03-08 平安科技(深圳)有限公司 Medical text structuring method, device, computer equipment and storage medium
CN113138773B (en) * 2021-04-19 2024-04-16 杭州科技职业技术学院 Cloud computing distributed service clustering method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032648A (en) * 2019-03-19 2019-07-19 微医云(杭州)控股有限公司 A kind of case history structuring analytic method based on medical domain entity
CN110222654A (en) * 2019-06-10 2019-09-10 北京百度网讯科技有限公司 Text segmenting method, device, equipment and storage medium
CN111191456A (en) * 2018-11-15 2020-05-22 零氪科技(天津)有限公司 Method for identifying text segmentation by using sequence label
CN112016274A (en) * 2020-09-08 2020-12-01 平安科技(深圳)有限公司 Medical text structuring method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032739B (en) * 2019-04-18 2021-07-13 清华大学 Method and system for extracting named entities of Chinese electronic medical record

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191456A (en) * 2018-11-15 2020-05-22 零氪科技(天津)有限公司 Method for identifying text segmentation by using sequence label
CN110032648A (en) * 2019-03-19 2019-07-19 微医云(杭州)控股有限公司 A kind of case history structuring analytic method based on medical domain entity
CN110222654A (en) * 2019-06-10 2019-09-10 北京百度网讯科技有限公司 Text segmenting method, device, equipment and storage medium
CN112016274A (en) * 2020-09-08 2020-12-01 平安科技(深圳)有限公司 Medical text structuring method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, XIAOLU; MAO, XUEMIN: "Applications of Extensible Markup Language (XML) in Medical Cases", COMPUTER KNOWLEDGE AND TECHNOLOGY, vol. 8, no. 25, 5 September 2012 (2012-09-05), pages 5952 - 5954,5973, XP055838727 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034204A (en) * 2022-05-12 2022-09-09 浙江大学 Method for generating structured medical text, computer device, storage medium and program product
CN116882496A (en) * 2023-09-07 2023-10-13 中南大学湘雅医院 Medical knowledge base construction method for multistage logic reasoning
CN116882496B (en) * 2023-09-07 2023-12-05 中南大学湘雅医院 Medical knowledge base construction method for multistage logic reasoning

Also Published As

Publication number Publication date
CN112016274B (en) 2024-03-08
CN112016274A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
WO2021135910A1 (en) Machine reading comprehension-based information extraction method and related device
WO2021164301A1 (en) Medical text structuring method and apparatus, computer device and storage medium
US11636264B2 (en) Stylistic text rewriting for a target author
WO2021212749A1 (en) Method and apparatus for labelling named entity, computer device, and storage medium
US20210390873A1 (en) Deep knowledge tracing with transformers
WO2022142011A1 (en) Method and device for address recognition, computer device, and storage medium
US11914968B2 (en) Official document processing method, device, computer equipment and storage medium
WO2021114620A1 (en) Medical-record quality control method, apparatus, computer device, and storage medium
CN109977014B (en) Block chain-based code error identification method, device, equipment and storage medium
EP4006909B1 (en) Method, apparatus and device for quality control and storage medium
WO2022088671A1 (en) Automated question answering method and apparatus, device, and storage medium
WO2021218023A1 (en) Emotion determining method and apparatus for multiple rounds of questions and answers, computer device, and storage medium
CN110598210B (en) Entity recognition model training, entity recognition method, entity recognition device, entity recognition equipment and medium
US11615247B1 (en) Labeling method and apparatus for named entity recognition of legal instrument
JP2020135456A (en) Generation device, learning device, generation method and program
CN112036172A (en) Entity identification method and device based on abbreviated data of model and computer equipment
CN116821373A (en) Map-based prompt recommendation method, device, equipment and medium
CN112582073B (en) Medical information acquisition method, device, electronic equipment and medium
CN113283231B (en) Method for acquiring signature bit, setting system, signature system and storage medium
CN114385694A (en) Data processing method and device, computer equipment and storage medium
CN108932225A (en) For natural language demand to be converted into the method and system of semantic modeling language statement
CN116796730A (en) Text error correction method, device, equipment and storage medium based on artificial intelligence
Rohit et al. System for Enhancing Accuracy of Noisy Text using Deep Network Language Models
CN110826325A (en) Language model pre-training method and system based on confrontation training and electronic equipment
CN113434652B (en) Intelligent question-answering method, intelligent question-answering device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20920238

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20920238

Country of ref document: EP

Kind code of ref document: A1