CN112016274B - Medical text structuring method, device, computer equipment and storage medium - Google Patents


Publication number
CN112016274B
Authority
CN
China
Prior art keywords
text
medical
code file
unstructured
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010935255.2A
Other languages
Chinese (zh)
Other versions
CN112016274A (en)
Inventor
朱威
何义龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010935255.2A
Priority to PCT/CN2020/124215 (WO2021164301A1)
Publication of CN112016274A
Application granted
Publication of CN112016274B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention relates to artificial intelligence technology applied to the field of medical text processing, and specifically discloses a medical text structuring method, device, computer equipment and storage medium. The method comprises the following steps: capturing unstructured medical knowledge text; splitting the unstructured text into a plurality of first feature sentences; inputting the first feature sentences into a preset language recognition model to obtain semantic feature vectors; inputting all the semantic feature vectors into a preset article semantic recognition model to obtain the second feature sentences it outputs; retrieving a first code file of the medical source text to be processed, and inserting segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second feature sentences, to obtain a second code file; and running the second code file to display, in the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text. The invention can improve the efficiency of converting unstructured medical knowledge text into structured medical knowledge text.

Description

Medical text structuring method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of intelligent decision making in artificial intelligence, and in particular, to a method, an apparatus, a computer device, and a storage medium for structuring medical text.
Background
At present, a single medical source text often contains a large amount of medical knowledge text covering many topics in the medical field. When this text needs to be displayed in an interface, it must be manually edited into a structured form so that it is easy to read. However, the formats of the medical knowledge texts in the source text are often irregular, and most of the text is unstructured, so manual editing is error-prone, inefficient and slow. In particular, when new medical knowledge texts (such as new product specifications in the medical field) are to be presented to users, the texts must follow a specific structured format, for example correct paragraph segmentation and reasonable indentation. Producing externally displayable structured medical text by manual editing is time-consuming and labor-intensive. A new solution to these problems is therefore needed.
Disclosure of Invention
Based on the foregoing, it is necessary to provide a method, an apparatus, a computer device and a storage medium for structuring a medical text, which are used for avoiding the problems of high error rate of manual editing and long time spent on manual editing, and improving the efficiency of converting unstructured medical knowledge text into structured medical knowledge text.
A medical text structuring method, comprising:
capturing a whole passage of unstructured medical knowledge text in a medical source text to be processed;
identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first feature sentences according to the punctuation marks;
inputting the first feature sentences into a preset language recognition model, and obtaining a semantic feature vector corresponding to each first feature sentence;
inputting all the semantic feature vectors into a preset article semantic recognition model, and obtaining second feature sentences output by the preset article semantic recognition model; the second feature sentences comprise a preset number of positions to be segmented, which are determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
retrieving a first code file of the medical source text to be processed, locating the second feature sentences in the first code file, and inserting segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second feature sentences, to obtain a second code file;
and running the second code file to display, in the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
A medical text structuring apparatus, comprising:
a grabbing module, used for capturing a whole passage of unstructured medical knowledge text in a medical source text to be processed;
a splitting module, used for identifying all punctuation marks in the unstructured medical knowledge text and splitting the unstructured text into a plurality of first feature sentences according to the punctuation marks;
a first obtaining module, used for inputting the first feature sentences into a preset language recognition model and obtaining a semantic feature vector corresponding to each first feature sentence;
a second obtaining module, used for inputting all the semantic feature vectors into a preset article semantic recognition model and obtaining second feature sentences output by the preset article semantic recognition model; the second feature sentences comprise a preset number of positions to be segmented, which are determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
an inserting module, used for retrieving a first code file of the medical source text to be processed, locating the second feature sentences in the first code file, and inserting segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second feature sentences, to obtain a second code file;
and a display module, used for running the second code file to display, in the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above-described medical text structuring method when executing the computer program.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the above-described medical text structuring method.
According to the medical text structuring method, device, computer equipment and storage medium described above, the manual editing previously applied to the unstructured medical knowledge text in the medical source text to be processed is replaced by models and segmentation symbols. This avoids the high error rate and long time consumption of manual editing, and improves the efficiency of converting unstructured medical knowledge text into structured medical knowledge text.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an application environment of a method for structuring a medical text according to an embodiment of the present invention;
FIG. 2 is a flow chart of a medical text structuring method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a medical text structuring apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a computer device in accordance with an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The medical text structuring method provided by the invention can be applied to an application environment as shown in fig. 1, wherein a client communicates with a server through a network. The clients may be, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a method for structuring medical text is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
S10, capturing a whole passage of unstructured medical knowledge text in a medical source text to be processed;
Understandably, the medical source text to be processed may be a web page containing unstructured medical knowledge text in the medical field. Such text may include, but is not limited to, formulas of medical drugs, descriptions of the therapeutic functions of medical drugs, and product specifications of medical drugs. Unstructured medical knowledge text is text without a fixed format, where the fixed format includes, but is not limited to, paragraph format, text format, indentation format and spacing format. In this embodiment, the text in the medical source text to be processed is uploaded by different users who use inconsistent formats while editing, so when different users upload whole passages to the same medical source text, the passages finally displayed have inconsistent formats. In addition, the input components or display components of a medical source text may be incompatible with certain text, so copying a whole passage from a display component of one medical source text into an input component of another may turn previously structured medical knowledge text into unstructured medical knowledge text. Specifically, the unstructured medical knowledge text may be captured in either of two ways. In the first, all text in the display interface of the medical source text to be processed is recognized, and the text selected by the user is taken as the unstructured medical knowledge text. In the second, an NLP model examines the text, and when it recognizes that the text has multiple inconsistent formats or no format, that text is captured as the unstructured medical knowledge text.
S20, identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured text into a plurality of first feature sentences according to the punctuation marks;
It can be understood that the punctuation marks in the unstructured medical knowledge text can be identified by a punctuation recognition component or by an NLP model. The text is then broken at the identified punctuation marks to obtain the plurality of first feature sentences. The punctuation marks used for splitting are those that end a complete sentence, such as periods, exclamation marks and question marks. In this embodiment, each first feature sentence is therefore one complete sentence. Treating each complete sentence as a unit preserves its internal relationships during the subsequent semantic recognition and prevents content from different sentences from being confused with one another.
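The splitting in step S20 can be sketched as follows. This is a minimal illustration that assumes a simple regular expression stands in for the punctuation recognition component; the set of sentence-ending marks and the function name split_into_feature_sentences are hypothetical choices, not taken from the patent.

```python
import re
from typing import List

# Sentence-ending marks, Western and full-width Chinese (an assumed set).
SENTENCE_END = "。！？.!?"


def split_into_feature_sentences(text: str) -> List[str]:
    """Split text into complete sentences, keeping each terminating mark."""
    end = re.escape(SENTENCE_END)
    # A run of non-terminators followed by any terminators that close it;
    # a trailing fragment without a terminator is also returned.
    pattern = rf"[^{end}]+[{end}]*"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]


if __name__ == "__main__":
    sample = "Take one tablet daily. Do not exceed the dose! Keep dry?"
    for sentence in split_into_feature_sentences(sample):
        print(sentence)
```

A rule this simple also splits at decimal points and abbreviations (for example "2.5 mg"), which is one reason a trained model might be preferred for the recognition step.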
S30, inputting the first feature sentences into a preset language recognition model, and obtaining a semantic feature vector corresponding to each first feature sentence;
It is understood that the preset language recognition model may be a BERT model. The BERT model can capture representations at both the sentence level and the word level for each first feature sentence; it is trained on a large-scale unlabeled corpus so that the representation it produces for a first feature sentence contains rich semantic information. The core of the BERT model is the Transformer module, which is built on the Attention mechanism, and Transformer modules are assembled to form the BERT model. In this embodiment, the semantic feature vector corresponding to each first feature sentence is obtained by using the relationships between words and sentences captured by the BERT model.
S40, inputting all the semantic feature vectors into a preset article semantic recognition model, and obtaining second feature sentences output by the preset article semantic recognition model; the second feature sentences comprise a preset number of positions to be segmented, which are determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
Understandably, the preset article semantic recognition model is an LSTM model, which is designed to retain information over long spans so that it can recognize each complete sentence in the input text. The core processing of the LSTM model is performed by three gates: a forget gate, an input gate and an output gate. By taking the context of the input text into account, the LSTM model can determine each complete second feature sentence, and one second feature sentence defines two positions to be segmented.
S50, retrieving a first code file of the medical source text to be processed, locating the second feature sentences in the first code file, and inserting segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second feature sentences, to obtain a second code file;
Understandably, the first code file is the background code file corresponding to the medical source text to be processed, and it can be retrieved through a scripting language. Because the second feature sentences are derived from the unstructured medical knowledge text in the medical source text to be processed, the first code file contains a text display position that holds a plurality of second feature sentences. Specifically, a second feature sentence can be located by writing query code that searches the text display position in the first code file and identifies the text matching the second feature sentence. The segmentation symbols can be understood as HTML tags. Two segmentation symbols are inserted at the two positions to be segmented of a second feature sentence within the text display position (the text display position therefore contains at least two positions to be segmented), and together they delimit that second feature sentence. The segmentation symbols include div tags, the h1 to h6 heading tags, and the like.
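The insertion in step S50 can be sketched as follows, assuming the first code file is available as an HTML string and each second feature sentence appears in it verbatim (no HTML escaping or intervening markup). The function names insert_segmentation and build_second_code_file are hypothetical.

```python
from typing import List


def insert_segmentation(code: str, sentence: str, tag: str = "div") -> str:
    """Wrap the first occurrence of `sentence` in `code` with <tag>...</tag>."""
    start = code.find(sentence)
    if start == -1:
        return code  # sentence not found: leave the code file unchanged
    end = start + len(sentence)
    # The two positions to be segmented are the start and end of the sentence.
    return f"{code[:start]}<{tag}>{sentence}</{tag}>{code[end:]}"


def build_second_code_file(first_code: str, sentences: List[str]) -> str:
    """Insert segmentation tags around every second feature sentence."""
    second_code = first_code
    for sentence in sentences:
        second_code = insert_segmentation(second_code, sentence)
    return second_code


if __name__ == "__main__":
    first = "<body>Dosage: one tablet daily. Storage: keep dry.</body>"
    print(build_second_code_file(
        first, ["Dosage: one tablet daily.", "Storage: keep dry."]))
```

Plain string search keeps the sketch short; a production version would more likely operate on a parsed document tree so that text split across tags is still found.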
S60, running the second code file to display, in the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
It is understood that the second code file is the background code file with the segmentation symbols inserted, and it must be run in order to display the structured text in the medical source text to be processed.
Further, after capturing the whole unstructured medical knowledge text in the medical source text to be processed, the method further comprises:
detecting the unstructured medical knowledge text through a preset natural language processing model, marking the words with errors in the unstructured text, and obtaining marking results;
and calling a first code file of the medical source text to be processed, correcting the words with errors in the first code file according to the marking result to obtain a third code file, and operating the third code file to obtain the corrected unstructured medical knowledge text.
Understandably, the preset natural language processing model may be an NLP model. Its semantic recognition capability is used to mark repeated or wrongly written words in the unstructured medical knowledge text, and the corresponding words in the first code file are then corrected according to the marking result, where the correction includes deleting repeated words and wrongly written words.
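The mark-then-correct flow can be illustrated with a deliberately simple stand-in. The sketch below assumes a regular expression that detects immediately repeated words in place of the NLP model, and it operates on plain text rather than on the first code file; the names mark_repeated_words and correct_repeated_words are hypothetical.

```python
import re
from typing import List, Tuple

# A word followed by one or more repetitions of itself.
REPEAT = re.compile(r"\b(\w+)(?:\s+\1\b)+", flags=re.IGNORECASE)


def mark_repeated_words(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for each run of an immediately repeated word."""
    return [(m.start(), m.end(), m.group(1)) for m in REPEAT.finditer(text)]


def correct_repeated_words(text: str) -> str:
    """Collapse each marked run to a single occurrence of the word."""
    return REPEAT.sub(r"\1", text)


if __name__ == "__main__":
    sample = "Take the the tablet with with water."
    print(mark_repeated_words(sample))
    print(correct_repeated_words(sample))
```

Detecting wrongly written words needs a dictionary or a language model and is outside the scope of this sketch.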
Further, the first feature sentences are stored in a blockchain; the preset language recognition model is a BERT model;
inputting the first feature sentences into the preset language recognition model and obtaining a semantic feature vector corresponding to each first feature sentence comprises:
after a first feature sentence is input into the BERT model, looking up word vectors of all words in the first feature sentence through the BERT model;
selecting, through an Attention mechanism in the BERT model, one word vector in the first feature sentence as a Query vector, and taking the other word vectors of the first feature sentence as Key vectors;
performing similarity calculation between the Query vector and each Key vector to obtain weight coefficients, and using the weight coefficients to perform a weighted operation on the Values corresponding to the Query vector and the Key vectors, to obtain a first enhanced semantic feature vector, corresponding to the Query vector, output by the Attention mechanism;
performing linear transformation on the first enhanced semantic feature vector through a plurality of stacked Transformer Encoders in the BERT model to obtain a second enhanced semantic feature vector;
and combining the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first feature sentence to obtain the semantic feature vector corresponding to the first feature sentence.
Understandably, in this embodiment the Attention mechanism in the BERT model focuses the model's attention on the input first feature sentence. The Attention mechanism involves Query vectors, Key vectors and Values: the Query vector and the Key vectors are derived from the word vectors, each word vector has a corresponding Value, and Attention can essentially be described as mapping a Query to a series of Key-Value pairs. Specifically, after a first feature sentence is input into the BERT model, the model looks up each word in the sentence and converts it into a one-dimensional word vector. One of the word vectors is taken as the target Query vector, and the other word vectors in the sentence are taken as Key vectors. Similarity is then calculated between the Query vector and each Key vector to obtain weight coefficients; common similarity functions include, but are not limited to, the dot product, concatenation and a perceptron. The weight coefficients are normalized by a preset softmax function, and the normalized weight coefficients are used to compute a weighted sum of the corresponding Values, which gives the first enhanced semantic feature vector, corresponding to the Query vector, output by the Attention mechanism. Each Transformer Encoder built on the Attention mechanism then processes the first enhanced semantic feature vector, where the processing includes a linear transformation, and the stacked Transformer Encoders output the second enhanced semantic feature vector. Finally, the second enhanced semantic feature vectors corresponding to the word vectors of all words in the first feature sentence are combined into the semantic feature vector of that sentence. Using the BERT model as the preset language recognition model achieves two objectives: 1. the relationships between the first feature sentences, that is, their context, can be learned; 2. a good sentence-level semantic representation (the second enhanced semantic feature vector) is obtained.
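The Query/Key/Value weighting can be sketched in plain Python. This is an illustration under stated assumptions, not the patent's implementation: it uses the dot product as the similarity function, reuses the raw word vectors as Keys and Values (a real BERT model applies learned projections and scaling), and, as in standard self-attention, includes the Query's own vector among the Keys. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Normalize scores into weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, keys: List[Vector],
           values: List[Vector]) -> Tuple[Vector, Vector]:
    """Return the weighted sum of values and the weights that produced it."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(dim)]
    return context, weights


def enhance_sentence(word_vectors: List[Vector]) -> List[Vector]:
    """Use each word vector as the Query in turn; one output per word."""
    return [attend(q, word_vectors, word_vectors)[0] for q in word_vectors]


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in enhance_sentence(vectors):
        print([round(x, 3) for x in row])
```

Each output row corresponds to what the text calls the first enhanced semantic feature vector for one word; the encoder stack that follows is not modeled here.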
It should be emphasized that, to further ensure the privacy and security of the first feature sentences, the first feature sentences may also be stored in a node of a blockchain. Blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like. The decentralized, fully distributed DNS service provided by a blockchain resolves domain names through point-to-point data transmission among the nodes of the network. It can be used to ensure that the operating system and firmware of critical infrastructure have not been tampered with, to monitor the state and integrity of software and detect malicious tampering, and to ensure that transmitted data has not been tampered with. Storing the first feature sentences in the blockchain therefore helps ensure their privacy and security.
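The idea of data blocks linked by cryptographic hashes can be illustrated with a single-process sketch. It assumes SHA-256 over a JSON payload and is only a hash-linked list: it shows tamper evidence, not distribution, consensus or access control, and the block layout and function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder previous-hash for the first block


def block_hash(index: int, data: str, prev_hash: str) -> str:
    """Hash the block contents together with the previous block's hash."""
    payload = json.dumps({"index": index, "data": data, "prev": prev_hash},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain: List[Dict], data: str) -> List[Dict]:
    """Append a block that stores one first feature sentence."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    index = len(chain)
    chain.append({"index": index, "data": data, "prev": prev_hash,
                  "hash": block_hash(index, data, prev_hash)})
    return chain


def is_valid(chain: List[Dict]) -> bool:
    """Recompute every hash; any edited block breaks the chain."""
    prev = GENESIS
    for i, blk in enumerate(chain):
        expected = block_hash(i, blk["data"], prev)
        if blk["index"] != i or blk["prev"] != prev or blk["hash"] != expected:
            return False
        prev = blk["hash"]
    return True


if __name__ == "__main__":
    chain: List[Dict] = []
    append_block(chain, "Take one tablet daily.")
    append_block(chain, "Keep dry.")
    print(is_valid(chain))
```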
Further, the inserting the segmentation symbol into the first code file at the position corresponding to the position to be segmented of the second feature sentence, to obtain a second code file, further includes:
and calling a corresponding cascading style sheet according to a preset style format, and adding the cascading style sheet into the second code file.
As can be appreciated, this embodiment adds to the second code file a cascading style sheet that sets properties such as color, font size and box (frame) layout, so that the text is displayed in a specific format in the medical source text to be processed; examples are the color property, the font-size property and the box properties in CSS.
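Adding a style sheet to the second code file can be sketched as follows, assuming the second code file is an HTML string and the preset style format is given as a mapping of CSS properties to values. The helper names css_rule and add_style_sheet, and the sample property values, are hypothetical.

```python
from typing import Dict, List


def css_rule(selector: str, declarations: Dict[str, str]) -> str:
    """Render one CSS rule, e.g. 'div { color: #333333; }'."""
    body = " ".join(f"{prop}: {value};" for prop, value in declarations.items())
    return f"{selector} {{ {body} }}"


def add_style_sheet(second_code: str, rules: List[str]) -> str:
    """Embed the rules in a <style> element, inside <head> when present."""
    style = "<style>\n" + "\n".join(rules) + "\n</style>\n"
    marker = "</head>"
    if marker in second_code:
        return second_code.replace(marker, style + marker, 1)
    return style + second_code  # no <head>: prepend the style element


if __name__ == "__main__":
    rule = css_rule("div", {"color": "#333333", "font-size": "14px",
                            "margin": "8px 0"})
    page = "<html><head></head><body><div>Keep dry.</div></body></html>"
    print(add_style_sheet(page, [rule]))
```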
Further, the preset article semantic recognition model is an LSTM model;
after inputting all the semantic feature vectors into the preset article semantic recognition model, the method comprises:
selecting information to discard through a forget gate in the LSTM model;
selecting required information from the semantic feature vectors through an input gate in the LSTM model and the discarded information;
and outputting the second feature sentence through an output gate in the LSTM model and the required information.
It is understood that the LSTM model is a gated RNN whose key component is the cell state, so each gate in the LSTM model is designed to remove information from or add information to the cell state (which can be regarded as a semantic feature vector). Each gate contains a Sigmoid neural network layer and a pointwise multiplication operation. The Sigmoid layer outputs a value between 0 and 1 that describes how much of each component is allowed through: 0 means nothing passes and 1 means everything passes. The forget gate determines which information in the cell state to discard, where the discarded information relates to the subject associated with the semantic feature vector. The input gate updates the information stored in the cell state: the information to discard is first removed from the semantic feature vector, and the required information to be updated is then determined from what remains. The output gate determines the second feature sentence to output, which is produced from the required information determined at the input gate.
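The three gates can be made concrete with one time step of a standard LSTM cell in plain Python. This sketch follows the textbook LSTM formulation rather than any parameterization given in the patent; the parameter layout (a dictionary mapping each gate name to a weight matrix and bias) and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, b: Vector, v: Vector) -> Vector:
    """Compute W @ v + b for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Tuple[Matrix, Vector]]) -> Tuple[Vector, Vector]:
    """One LSTM time step; params maps 'f', 'i', 'g', 'o' to (W, b)."""
    v = h_prev + x  # concatenate previous hidden state and current input
    f = [sigmoid(z) for z in affine(*params["f"], v)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], v)]    # input gate
    g = [math.tanh(z) for z in affine(*params["g"], v)]  # candidate values
    o = [sigmoid(z) for z in affine(*params["o"], v)]    # output gate
    # Pointwise: keep part of the old cell state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


if __name__ == "__main__":
    zero = ([[0.0, 0.0]], [0.0])  # hidden size 1, input size 1
    params = {"f": zero, "i": zero, "g": zero, "o": zero}
    print(lstm_step([1.0], [0.0], [2.0], params))
```

With all-zero parameters every gate outputs 0.5, so half of the previous cell state is kept, which matches the description of a Sigmoid output between 0 (nothing passes) and 1 (everything passes).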
In summary, the medical text structuring method described above uses models and segmentation symbols to replace the manual editing previously applied to unstructured medical knowledge text in a medical source text to be processed. This avoids the high error rate and long time consumption of manual editing, and improves the efficiency of converting unstructured medical knowledge text into structured medical knowledge text. The method can be applied to smart healthcare, thereby promoting the construction of smart cities.
It should be understood that the sequence numbers of the steps in the foregoing embodiment do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present invention in any way.
In one embodiment, a medical text structuring apparatus is provided, which corresponds one-to-one to the medical text structuring method in the above embodiment. As shown in fig. 3, the medical text structuring apparatus comprises a grabbing module 11, a splitting module 12, a first obtaining module 13, a second obtaining module 14, an inserting module 15 and a display module 16. The functional modules are described in detail as follows:
a grabbing module 11, configured to capture a whole passage of unstructured medical knowledge text in a medical source text to be processed;
a splitting module 12, configured to identify all punctuation marks in the unstructured medical knowledge text, and split the unstructured text into a plurality of first feature sentences according to the punctuation marks;
a first obtaining module 13, configured to input the first feature sentences into a preset language recognition model and obtain a semantic feature vector corresponding to each first feature sentence;
a second obtaining module 14, configured to input all the semantic feature vectors into a preset article semantic recognition model and obtain second feature sentences output by the preset article semantic recognition model; the second feature sentences comprise a preset number of positions to be segmented, which are determined by the preset article semantic recognition model according to the contextual relationships of the unstructured medical knowledge text;
an inserting module 15, configured to retrieve a first code file of the medical source text to be processed, locate the second feature sentences in the first code file, and insert segmentation symbols in the first code file at the positions corresponding to the positions to be segmented of the second feature sentences, to obtain a second code file;
a display module 16, configured to run the second code file to display, in the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
Further, the medical text structuring apparatus further comprises:
the marking module is used for detecting the unstructured medical knowledge text through a preset natural language processing model, marking the words with errors in the unstructured text and obtaining marking results;
and the operation module is used for calling out the first code file of the medical source text to be processed, correcting the words with errors in the first code file according to the marking result to obtain a third code file, and operating the third code file to obtain the corrected unstructured medical knowledge text.
Further, the preset language recognition model is a BERT model, and the first obtaining module includes:
an input sub-module, configured to look up, through the BERT model, the word vector of each word in the first feature sentence after the first feature sentence is input into the BERT model;
a selecting sub-module, configured to select, through the Attention mechanism in the BERT model, one word vector in the first feature sentence as a Query vector, and to take the other word vectors of the first feature sentence as Key vectors;
a weighting operation sub-module, configured to compute the similarity between the Query vector and each Key vector to obtain weight coefficients, and to use the weight coefficients to weight the Value values corresponding to the Query vector and the Key vectors, to obtain a first enhanced semantic feature vector, output by the Attention mechanism, corresponding to the Query vector;
a linear conversion sub-module, configured to linearly transform the first enhanced semantic feature vector through a plurality of stacked Transformer Encoders in the BERT model to obtain a second enhanced semantic feature vector;
and a combination sub-module, configured to combine the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first feature sentence, to obtain the semantic feature vector corresponding to the first feature sentence.
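The weighting operation can be sketched in plain Python. The text only says "similarity calculation"; the scaled dot product followed by a softmax used here is the form commonly used in Transformer models and is an assumption, as is the function name `attention_output`.

```python
import math


def attention_output(query, keys, values):
    """Return (weighted sum of values, weights) for one Query vector.

    Weights are a softmax over scaled dot-product similarities between
    the Query vector and each Key vector (an assumed similarity measure).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * value[i] for w, value in zip(weights, values)) for i in range(dim)]
    return output, weights
```

The returned `output` corresponds to what the text calls the first enhanced semantic feature vector for the chosen Query vector.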
Further, the medical text structuring apparatus further comprises:
an adding module, configured to retrieve a cascading style sheet corresponding to a preset style format and add the cascading style sheet to the second code file.
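One possible way to add a style sheet to the second code file is to embed it in a `<style>` element. The sketch below assumes the code file is an HTML string; the function name `add_style_sheet` and the choice of placing the rules just before `</head>` are illustrative assumptions, not details given in the text.

```python
import re


def add_style_sheet(second_code_file: str, css_rules: str) -> str:
    """Embed CSS rules in a <style> element, before </head> when one exists."""
    style_block = f"<style>\n{css_rules}\n</style>\n"
    match = re.search(r"</head\s*>", second_code_file, flags=re.IGNORECASE)
    if match is None:
        # No head element found: prepend the style block to the file.
        return style_block + second_code_file
    position = match.start()
    return second_code_file[:position] + style_block + second_code_file[position:]
```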
Further, the preset article semantic recognition model is an LSTM model, and the second obtaining module includes:
a first selecting sub-module, configured to select the information to be discarded through the forget gate in the LSTM model;
a second selecting sub-module, configured to select the required information from the semantic feature vectors through the input gate in the LSTM model and the information to be discarded;
and an output sub-module, configured to output the second feature sentence through the output gate in the LSTM model and the required information.
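To make the roles of the three gates concrete, here is a minimal sketch of one step of a standard LSTM cell with scalar inputs and state. The parameter layout (`params` mapping a gate name to an input weight, a recurrent weight, and a bias) is a hypothetical simplification; a real model operates on vectors with learned weight matrices.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps a name to (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))          # share of old cell state kept
    admit = sigmoid(pre_activation("input"))            # share of new information admitted
    candidate = math.tanh(pre_activation("candidate"))  # proposed new information
    expose = sigmoid(pre_activation("output"))          # share of cell state exposed
    c = forget * c_prev + admit * candidate
    h = expose * math.tanh(c)
    return h, c
```

With all parameters set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step.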
For specific limitations of the medical text structuring apparatus, reference may be made to the limitations of the medical text structuring method above, which are not repeated here. The modules in the medical text structuring apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or be independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores the data involved in the medical text structuring method. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements a medical text structuring method.
In one embodiment, a computer device is provided that includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the medical text structuring method in the above embodiments, such as steps S10 to S60 shown in fig. 2. Alternatively, when executing the computer program, the processor implements the functions of the modules/units of the medical text structuring apparatus in the above embodiments, such as the functions of modules 11 to 16 shown in fig. 3. To avoid repetition, details are not repeated here.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the medical text structuring method in the above embodiments, such as steps S10 to S60 shown in fig. 2. Alternatively, when executed by the processor, the computer program implements the functions of the modules/units of the medical text structuring apparatus in the above embodiments, such as the functions of modules 11 to 16 shown in fig. 3. To avoid repetition, details are not repeated here.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may include the steps of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
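The idea of data blocks linked by cryptographic means can be illustrated with a minimal hash chain. This is a sketch only: the block layout, the use of SHA-256, and the names `make_block` and `chain_is_valid` are assumptions, and consensus, networking, and transaction handling are left out entirely.

```python
import hashlib
import json


def _digest(data: str, previous_hash: str) -> str:
    body = {"data": data, "previous_hash": previous_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def make_block(data: str, previous_hash: str) -> dict:
    """Create a block whose hash covers its data and the previous block's hash."""
    return {"data": data, "previous_hash": previous_hash, "hash": _digest(data, previous_hash)}


def chain_is_valid(chain: list[dict]) -> bool:
    """Check every block's own hash and its link to the preceding block."""
    for index, block in enumerate(chain):
        if block["hash"] != _digest(block["data"], block["previous_hash"]):
            return False
        if index > 0 and block["previous_hash"] != chain[index - 1]["hash"]:
            return False
    return True
```

Because each hash covers the previous block's hash, altering the data of any block invalidates that block and, once its hash is recomputed, every link after it.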
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division into functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (10)

1. A medical text structuring method, comprising:
capturing, by an NLP model, an entire passage of unstructured medical knowledge text from a medical source text to be processed; the unstructured medical knowledge text is derived from various medical-field texts on web pages; the NLP model treats text in the medical source text to be processed that has multiple inconsistent formats, or no format, as the unstructured medical knowledge text;
identifying all punctuation marks in the unstructured medical knowledge text, and splitting the unstructured medical knowledge text into a plurality of first feature sentences according to the punctuation marks;
inputting the first feature sentences into a preset language recognition model, and then obtaining a semantic feature vector corresponding to each first feature sentence;
inputting all the semantic feature vectors into a preset article semantic recognition model, and then obtaining a second feature sentence output by the preset article semantic recognition model; the second feature sentences comprise a preset number of positions to be segmented, which the preset article semantic recognition model determines according to the contextual relationships within the unstructured medical knowledge text; one of the second feature sentences forms two positions to be segmented;
retrieving a first code file of the medical source text to be processed, locating the second feature sentence in the first code file, and inserting a segmentation symbol at the position in the first code file that corresponds to a position to be segmented of the second feature sentence, to obtain a second code file; the segmentation symbol is an HTML symbol;
and running the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
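The insertion step of claim 1 can be sketched as a string operation on the first code file. The sketch assumes that the two positions to be segmented are immediately before and after the second feature sentence and that the segmentation symbol is `<br>`; neither detail is fixed by the claim beyond the symbol being an HTML symbol. The function name is hypothetical.

```python
def insert_segmentation_symbols(first_code_file: str,
                                second_feature_sentence: str,
                                symbol: str = "<br>") -> str:
    """Insert the symbol before and after the first occurrence of the sentence."""
    start = first_code_file.find(second_feature_sentence)
    if start == -1:
        # Sentence not found: return the code file unchanged.
        return first_code_file
    end = start + len(second_feature_sentence)
    return (first_code_file[:start] + symbol + second_feature_sentence
            + symbol + first_code_file[end:])
```

The returned string plays the role of the second code file, which a browser can then render with the new breaks in place.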
2. The medical text structuring method according to claim 1, wherein after capturing the entire passage of unstructured medical knowledge text from the medical source text to be processed, the method further comprises:
checking the unstructured medical knowledge text with a preset natural language processing model, and marking the erroneous words in the unstructured medical knowledge text, to obtain a marking result;
and retrieving the first code file of the medical source text to be processed, correcting the erroneous words in the first code file according to the marking result to obtain a third code file, and running the third code file to obtain the corrected unstructured medical knowledge text.
3. The medical text structuring method according to claim 1, wherein the preset language recognition model is a BERT model;
inputting the first feature sentences into the preset language recognition model and then obtaining the semantic feature vector corresponding to each first feature sentence comprises:
after the first feature sentence is input into the BERT model, looking up the word vector of each word in the first feature sentence through the BERT model;
selecting, through the Attention mechanism in the BERT model, one word vector in the first feature sentence as a Query vector, and taking the other word vectors of the first feature sentence as Key vectors;
computing the similarity between the Query vector and each Key vector to obtain weight coefficients, and using the weight coefficients to weight the Value values corresponding to the Query vector and the Key vectors, to obtain a first enhanced semantic feature vector, output by the Attention mechanism, corresponding to the Query vector;
linearly transforming the first enhanced semantic feature vector through a plurality of stacked Transformer Encoders in the BERT model to obtain a second enhanced semantic feature vector;
and combining the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first feature sentence, to obtain the semantic feature vector corresponding to the first feature sentence.
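The claim does not say how the per-word vectors are combined into one sentence vector. As one plausible reading, the sketch below averages them element-wise (mean pooling); concatenation or using a dedicated sentence token would be other options. The function name `combine_word_vectors` is hypothetical.

```python
def combine_word_vectors(word_vectors: list[list[float]]) -> list[float]:
    """Average equally sized word vectors element-wise (assumed mean pooling)."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    dim = len(word_vectors[0])
    if any(len(vector) != dim for vector in word_vectors):
        raise ValueError("all word vectors must have the same length")
    count = len(word_vectors)
    return [sum(vector[i] for vector in word_vectors) / count for i in range(dim)]
```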
4. The medical text structuring method according to claim 1, wherein after inserting the segmentation symbol at the position in the first code file that corresponds to the position to be segmented of the second feature sentence to obtain the second code file, the method further comprises:
retrieving a cascading style sheet corresponding to a preset style format, and embedding the cascading style sheet in the second code file.
5. The medical text structuring method according to claim 1, wherein the preset article semantic recognition model is an LSTM model;
inputting all the semantic feature vectors into the preset article semantic recognition model and then obtaining the second feature sentence output by the preset article semantic recognition model comprises:
selecting the information to be discarded through the forget gate in the LSTM model;
selecting the required information from the semantic feature vectors through the input gate in the LSTM model and the information to be discarded;
and outputting the second feature sentence through the output gate in the LSTM model and the required information.
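For reference, the three selection steps of claim 5 correspond to the gates of a standard LSTM cell, which are usually written as follows. The notation is the conventional one and is not taken from the patent: x_t is the input at step t (here, a semantic feature vector), h_t the hidden state, c_t the cell state, σ the logistic sigmoid, and ⊙ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate: what to discard)} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate: what to admit)} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate information)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(updated cell state)} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(output at step } t\text{)}
\end{aligned}
```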
6. A medical text structuring apparatus, comprising:
a capturing module, configured to capture, by an NLP model, an entire passage of unstructured medical knowledge text from a medical source text to be processed; the unstructured medical knowledge text is derived from various medical-field texts on web pages; the NLP model treats text in the medical source text to be processed that has multiple inconsistent formats, or no format, as the unstructured medical knowledge text;
a splitting module, configured to identify all punctuation marks in the unstructured medical knowledge text and split the unstructured medical knowledge text into a plurality of first feature sentences according to the punctuation marks;
a first obtaining module, configured to input the first feature sentences into a preset language recognition model and obtain a semantic feature vector corresponding to each first feature sentence;
a second obtaining module, configured to input all the semantic feature vectors into a preset article semantic recognition model and obtain a second feature sentence output by the preset article semantic recognition model; the second feature sentences comprise a preset number of positions to be segmented, which the preset article semantic recognition model determines according to the contextual relationships within the unstructured medical knowledge text; one of the second feature sentences forms two positions to be segmented;
an inserting module, configured to retrieve a first code file of the medical source text to be processed, locate the second feature sentence in the first code file, and insert a segmentation symbol at the position in the first code file that corresponds to a position to be segmented of the second feature sentence, to obtain a second code file; the segmentation symbol is an HTML symbol;
and a display module, configured to run the second code file to display, on the medical source text to be processed, the structured medical knowledge text corresponding to the unstructured medical knowledge text.
7. The medical text structuring apparatus according to claim 6, further comprising:
a marking module, configured to check the unstructured medical knowledge text with a preset natural language processing model and mark the erroneous words in the unstructured medical knowledge text, to obtain a marking result;
and an operation module, configured to retrieve the first code file of the medical source text to be processed, correct the erroneous words in the first code file according to the marking result to obtain a third code file, and run the third code file to obtain the corrected unstructured medical knowledge text.
8. The medical text structuring apparatus according to claim 6, wherein the preset language recognition model is a BERT model; the first obtaining module includes:
an input sub-module, configured to look up, through the BERT model, the word vector of each word in the first feature sentence after the first feature sentence is input into the BERT model;
a selecting sub-module, configured to select, through the Attention mechanism in the BERT model, one word vector in the first feature sentence as a Query vector, and to take the other word vectors of the first feature sentence as Key vectors;
a weighting operation sub-module, configured to compute the similarity between the Query vector and each Key vector to obtain weight coefficients, and to use the weight coefficients to weight the Value values corresponding to the Query vector and the Key vectors, to obtain a first enhanced semantic feature vector, output by the Attention mechanism, corresponding to the Query vector;
a linear conversion sub-module, configured to linearly transform the first enhanced semantic feature vector through a plurality of stacked Transformer Encoders in the BERT model to obtain a second enhanced semantic feature vector;
and a combination sub-module, configured to combine the second enhanced semantic feature vectors corresponding to the word vectors of the words in the first feature sentence, to obtain the semantic feature vector corresponding to the first feature sentence.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the medical text structuring method according to any of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, which stores a computer program, characterized in that the computer program, when executed by a processor, implements the medical text structuring method according to any one of claims 1 to 5.
CN202010935255.2A 2020-09-08 2020-09-08 Medical text structuring method, device, computer equipment and storage medium Active CN112016274B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010935255.2A CN112016274B (en) 2020-09-08 2020-09-08 Medical text structuring method, device, computer equipment and storage medium
PCT/CN2020/124215 WO2021164301A1 (en) 2020-09-08 2020-10-28 Medical text structuring method and apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010935255.2A CN112016274B (en) 2020-09-08 2020-09-08 Medical text structuring method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112016274A (en) 2020-12-01
CN112016274B (en) 2024-03-08

Family

ID=73516342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010935255.2A Active CN112016274B (en) 2020-09-08 2020-09-08 Medical text structuring method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112016274B (en)
WO (1) WO2021164301A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016274B (en) * 2020-09-08 2024-03-08 平安科技(深圳)有限公司 Medical text structuring method, device, computer equipment and storage medium
CN113138773B (en) * 2021-04-19 2024-04-16 杭州科技职业技术学院 Cloud computing distributed service clustering method
CN115034204B (en) * 2022-05-12 2023-05-23 浙江大学 Method for generating structured medical text, computer device and storage medium
CN114996457B (en) * 2022-06-24 2024-09-20 联仁健康医疗大数据科技股份有限公司 Data processing method and device, electronic equipment and storage medium
CN117725197A (en) * 2023-03-28 2024-03-19 书行科技(北京)有限公司 Method, device, equipment and storage medium for determining abstract of search result
CN116882496B (en) * 2023-09-07 2023-12-05 中南大学湘雅医院 Medical knowledge base construction method for multistage logic reasoning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032648A (en) * 2019-03-19 2019-07-19 微医云(杭州)控股有限公司 A kind of case history structuring analytic method based on medical domain entity
CN110032739A (en) * 2019-04-18 2019-07-19 清华大学 Chinese electronic health record name entity abstracting method and system
CN111191456A (en) * 2018-11-15 2020-05-22 零氪科技(天津)有限公司 Method for identifying text segmentation by using sequence label

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222654A (en) * 2019-06-10 2019-09-10 北京百度网讯科技有限公司 Text segmenting method, device, equipment and storage medium
CN112016274B (en) * 2020-09-08 2024-03-08 平安科技(深圳)有限公司 Medical text structuring method, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
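Wang Xiaolu et al. Research on XML-based case structuring of medical records. Computer Knowledge and Technology (《电脑知识与技术》). 2012, Vol. 8 (No. 25), pp. 5952-5954, 5957. *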

Also Published As

Publication number Publication date
CN112016274A (en) 2020-12-01
WO2021164301A1 (en) 2021-08-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant