CN110019711A - Control method and device for structured processing of medical text data

Info

Publication number
CN110019711A
Authority
CN
China
Prior art keywords
medicine
text data
entity
sequence
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711205811.5A
Other languages
Chinese (zh)
Inventor
罗震
吴谨准
贾虎
徐盛
顾春宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BASEBIT (SHANGHAI) INFORMATION TECHNOLOGY Co.,Ltd.
Original Assignee
吴谨准
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 吴谨准
Priority to CN201711205811.5A
Publication of CN110019711A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The present invention provides a control method for structured processing of medical text data, comprising the following steps: a. performing entity extraction on the medical text data based on a structural model to obtain multiple medical entity mappings, wherein the structural model comprises multiple label sequences, the label sequences are formed by model training on the basis of manual annotation, and the medical text data comprises multiple word embeddings; b. combining the multiple medical entity mappings to obtain the structured text. The present invention also provides a control device for structured processing of medical text data, comprising an entity extraction device and a structuring device. By using entity extraction, the present invention avoids dependence on a matching dictionary, improves extraction quality, generalization ability, and scalability, and reduces maintenance cost.

Description

Control method and device for structured processing of medical text data
Technical field
The invention belongs to the technical field of information processing, and in particular relates to a method for processing medical text using artificial intelligence technology, especially a control method and device for structured processing of medical text data.
Background
Artificial intelligence (AI) refers to the intelligence exhibited by machines made by humans. It usually means intelligence realized by ordinary computers. Artificial intelligence includes weak artificial intelligence and strong artificial intelligence. Weak artificial intelligence (also called narrow artificial intelligence) is generally understood as artificial intelligence technology that focuses on solving problems in a specific field, and can also be regarded as a technical tool applied to that field.
Natural language processing is an important branch of narrow artificial intelligence. It focuses on processing and applying natural language and is widely used in human-computer interaction. Its scope includes information retrieval, information extraction, machine translation, text-to-speech, word segmentation, part-of-speech tagging, and automatic summarization.
In practical applications in the field of health and medical big data, the word segmentation and tagging techniques of natural language processing can be used to analyze medical records that doctors write in natural language, and to extract from them information such as the patient's symptoms, medical information, and events. Acquiring and standardizing this information plays an important role in doctors' clinical research and in building applications such as AI-assisted diagnosis and treatment systems.
At present there is no control method that performs natural language processing specifically for the medical big data field; that is, medical text is still analyzed with traditional word segmentation and tagging methods. Traditional word segmentation works as follows: build a dictionary; build a scoring model from the frequency with which adjacent characters co-occur; and handle unrecognized new words with auxiliary methods. This brings two defects: first, processing takes a long time and the response is not fast enough; second, the matching success rate is low when new words are encountered. Medical text is characterized by a relatively simple language structure, a large amount of specialized vocabulary, and many long words. Recognizing medical text with existing segmentation and tagging methods therefore slows the response further. The reason is that existing dictionary-based segmentation methods are designed mainly for ordinary language structures and are not well suited to technical terms, so new words appear frequently when medical text is recognized and the matching success rate drops further.
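The weakness of dictionary-based segmentation described above can be illustrated with a minimal sketch of forward maximum matching, one common dictionary method. The function name, the toy dictionary, and the English sample strings are all assumptions made for illustration; the source does not specify which dictionary algorithm it has in mind.

```python
def forward_max_match(text, dictionary, max_len=7):
    """Greedy dictionary segmentation: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary that lacks the term "angina".
toy_dictionary = {"patient", "has", "fever"}

known = forward_max_match("patienthasfever", toy_dictionary)
unknown = forward_max_match("patienthasangina", toy_dictionary)
print(known)    # ['patient', 'has', 'fever']
print(unknown)  # ['patient', 'has', 'a', 'n', 'g', 'i', 'n', 'a']
```

The unknown term falls apart into single characters, which is the kind of matching failure on new words that the background section attributes to dictionary methods.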
Summary of the invention
In view of the technical defects of the prior art, according to one aspect of the present invention, a control method for structured processing of medical text data is provided, which performs structured processing on medical text data corresponding to natural language to obtain structured text, and comprises the following steps:
a. performing entity extraction on the medical text data based on a structural model to obtain multiple medical entity mappings, wherein the structural model comprises multiple label sequences, the label sequences are formed by model training on the basis of manual annotation, and the medical text data comprises multiple word embeddings;
b. combining the multiple medical entity mappings to obtain the structured text.
Preferably, step a comprises the following steps:
a1. converting the medical text data into a two-dimensional word embedding matrix and inputting it into a bidirectional long short-term memory network;
a2. the bidirectional long short-term memory network outputting a two-dimensional medical text data matrix whose length is the sequence length corresponding to the medical text data and whose width is a designated length;
a3. passing the two-dimensional medical text data matrix into a conditional random field, and taking the word embeddings corresponding to the label sequence with the highest score as the medical entity mappings, wherein the score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and each label sequence corresponds to one score.
Preferably, the following steps are also executed before step a:
i. converting a standard word sequence into a two-dimensional word embedding matrix and inputting it into the bidirectional long short-term memory network;
ii. the bidirectional long short-term memory network outputting a two-dimensional standard word sequence matrix whose length is the sequence length corresponding to the standard word sequence and whose width is a designated length, and passing the two-dimensional standard word sequence matrix into the conditional random field;
iii. the conditional random field calculating the conditional probability of the structural model and obtaining the loss value of the structural model, and updating the weights of each layer of the structural model with the backpropagation algorithm to optimize the loss value;
iv. repeating steps i, ii, and iii until the structural model converges.
Preferably, step b comprises the following steps:
b1. performing a word segmentation operation on the medical text data to obtain a text segmentation result, and performing a word segmentation operation on the multiple medical entity mappings to obtain a medical entity mapping segmentation result, wherein the segmentation operation is completed by a word segmentation model, and the word segmentation model is formed by model training on the basis of manual annotation;
b2. matching the text segmentation result against the medical entity mapping segmentation result, and selecting multiple preferred medical entity mappings from the multiple medical entity mappings based on the matching result;
b3. combining the multiple preferred medical entity mappings to obtain the structured text.
Preferably, the following steps are included after step b:
c. inputting the multiple medical entity mappings into a database for conversion to obtain multiple standard information fragments;
d. combining the multiple standard information fragments to obtain standard structured text.
According to another aspect of the present invention, a control device for structured processing of medical text data is also provided, which performs structured processing on medical text data corresponding to natural language to obtain structured text, comprising:
an entity extraction device, configured to perform entity extraction on the medical text data based on a structural model to obtain multiple medical entity mappings, wherein the structural model comprises multiple label sequences, the label sequences are formed by model training on the basis of manual annotation, and the medical text data comprises multiple word embeddings;
a structuring device, configured to combine the multiple medical entity mappings to obtain the structured text.
Preferably, the entity extraction device comprises the following devices:
a first input device, configured to convert the medical text data into a two-dimensional word embedding matrix and input it into a bidirectional long short-term memory network;
a first output device, configured so that the bidirectional long short-term memory network outputs a two-dimensional medical text data matrix whose length is the sequence length corresponding to the medical text data and whose width is a designated length;
a first acquisition device, configured to pass the two-dimensional medical text data matrix into a conditional random field and take the word embeddings corresponding to the label sequence with the highest score as the medical entity mappings, wherein the score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and each label sequence corresponds to one score.
Preferably, the control device further comprises the following devices:
a second input device, configured to convert a standard word sequence into a two-dimensional word embedding matrix and input it into the bidirectional long short-term memory network;
a second output device, configured so that the bidirectional long short-term memory network outputs a two-dimensional standard word sequence matrix whose length is the sequence length corresponding to the standard word sequence and whose width is a designated length, and to pass the two-dimensional standard word sequence matrix into the conditional random field;
a second acquisition device, configured so that the conditional random field calculates the conditional probability of the structural model and obtains the loss value of the structural model, and to update the weights of each layer of the structural model with the backpropagation algorithm to optimize the loss value.
Preferably, the structuring device comprises the following devices:
a segmentation device, configured to perform a word segmentation operation on the medical text data to obtain a text segmentation result and to perform a word segmentation operation on the multiple medical entity mappings to obtain a medical entity mapping segmentation result, wherein the segmentation operation is completed by a word segmentation model, and the word segmentation model is formed by model training on the basis of manual annotation;
a matching device, configured to match the text segmentation result against the medical entity mapping segmentation result and to select multiple preferred medical entity mappings from the multiple medical entity mappings based on the matching result;
a combination device, configured to combine the multiple preferred medical entity mappings to obtain the structured text.
Preferably, the control device further comprises the following devices:
a conversion device, configured to input the multiple medical entity mappings into a database for conversion to obtain multiple standard information fragments;
a standard combination device, configured to combine the multiple standard information fragments to obtain standard structured text.
The present invention performs entity extraction on medical text data with a structural model comprising multiple label sequences to obtain medical entity mappings, and forms the final structured text from those mappings. By using entity extraction, the invention avoids dependence on a matching dictionary, improves extraction quality, generalization ability, and scalability, and reduces maintenance cost. The invention can also update the structural model automatically, which suits the continual growth of specialized vocabulary in the medical industry.
Brief description of the drawings
Other features, objects, and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 shows a flow chart of a control method for structured processing of medical text data according to a specific embodiment of the present invention;
Fig. 2 shows the flow of combining multiple medical entity mappings to obtain the structured text according to a specific embodiment of the present invention;
Fig. 3 shows a flow chart of a control method for structured processing of medical text data according to the first embodiment of the present invention;
Fig. 4 shows a flow chart for training the structural model according to the second embodiment of the present invention;
Fig. 5 shows a flow chart of a control method for structured processing of medical text data to obtain standard structured text according to the fourth embodiment of the present invention;
Fig. 6 shows a functional block diagram of a control device for structured processing of medical text data according to another specific embodiment of the present invention; and
Fig. 7 shows a functional block diagram of a control device for training the structural model according to the fifth embodiment of the present invention.
Detailed description of the embodiments
Fig. 1 shows a flow chart of a control method for structured processing of medical text data according to a specific embodiment of the present invention. The method performs structured processing on medical text data corresponding to natural language to obtain structured text. Those skilled in the art will understand that the content of the medical text data includes both natural language and medical technical terms. Before the following steps are executed, the medical text data can be converted into multiple word embeddings, one per single character, so that each character is represented by a fixed-length vector for use in subsequent data processing. Specifically, the medical text data can be understood as a character sequence, for example "A, B, C, D, E, F", and the goal is to obtain a vector (usually low-dimensional) for each character. If A corresponds to the vector [0.3 0.7] and B corresponds to [-0.3 0.6], and so on for every character, the resulting vectors are all the word embeddings included in the medical text data, and subsequent operations can be carried out on this basis. Those skilled in the art will understand that this example is for illustration only and does not limit the invention.
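The character-to-vector conversion described above can be sketched as a simple lookup table. This is a minimal illustration, assuming randomly initialized two-dimensional vectors; in practice the vectors would be learned, and the helper names `build_embedding_table` and `embed` are hypothetical.

```python
import random


def build_embedding_table(vocabulary, dim, seed=0):
    """Assign each character a fixed-length vector (random here; learned in practice)."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in vocabulary}


def embed(text, table):
    """Turn a character sequence into a two-dimensional matrix, one row per character."""
    return [table[ch] for ch in text]


table = build_embedding_table("ABCDEF", dim=2)
matrix = embed("ABCDEF", table)
print(len(matrix), len(matrix[0]))  # 6 2
```

The resulting matrix has one row per character and one column per embedding dimension, which is the two-dimensional form the later steps feed into the network.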
Step S101 is executed first: entity extraction is performed on the medical text data based on the structural model to obtain multiple medical entity mappings. The structural model comprises multiple label sequences, which are formed by model training on the basis of manual annotation, and the medical text data comprises multiple word embeddings. Specifically, the label sequences can be understood as being set manually according to the professional categories of medical terminology. For example, label sequences can correspond to symptom and symptom modifier; examination item and examination result; disease and disease modifier; treatment and treatment modifier; drug and drug modifier; time of onset; and so on. Those skilled in the art will understand that, compared with building a dictionary, setting multiple label sequences is faster and the data volume is significantly smaller than that of a dictionary. More specifically, the medical entity mappings are obtained by mapping the word embeddings of the medical text data, segment by segment, through the multiple label sequences. For example, suppose the text content of the medical text data is "the patient came for a visit at 9 a.m. and reported a sudden fever of 37 degrees last night, but is currently in good spirits", and the label sequences include "time of onset", "symptom", and "symptom description". After entity extraction with these label sequences, the result is "time: yesterday, symptom: fever, symptom description: 37 degrees". The medical text data is thereby refined and finally stored in a large medical database. As this description shows, this step does not need to identify the context of every word in the medical text, which greatly reduces the processing time. The purpose of this step is to screen the medical text based on the structural model and thus contribute to the accumulation of big data.
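As a rough illustration of how per-token labels can be turned into entity mappings such as "symptom: fever", the sketch below groups consecutive labeled tokens. The B-/I-/O tagging scheme, the English tokens, and the function name `extract_entities` are assumptions for illustration; the source does not state how its label sequences are encoded.

```python
def extract_entities(tokens, labels):
    """Group consecutive tokens that share a B-/I- label into (category, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity begins; close the previous one if it is still open.
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent label ends the current entity.
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical tokens and labels, loosely following the fever example in the text.
tokens = ["fever", "started", "last", "night", "reaching", "37", "degrees"]
labels = ["B-symptom", "O", "B-time", "I-time", "O",
          "B-symptom_description", "I-symptom_description"]
entities = extract_entities(tokens, labels)
print(entities)
# [('symptom', 'fever'), ('time', 'last night'), ('symptom_description', '37 degrees')]
```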
Next, step S102 is executed: the multiple medical entity mappings are combined to obtain the structured text. Specifically, this step arranges the results of step S101. The structured text may or may not be ordered according to ordinary natural-language logic. For example, the medical entity mappings obtained in step S101 can be laid out directly in modular form, that is, each medical entity mapping is assigned to a corresponding module, and each module forms part of the medical big data. Continuing the example from step S101, after "time: yesterday, symptom: fever, symptom description: 37 degrees" is obtained, one way of combining gives "fever of 37 degrees yesterday"; another way places "yesterday", "fever", and "37 degrees" into different modules. A module can correspond to the population of a particular region, so that statistics can be compiled on the medical big data of that population.
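The two ways of combining entity mappings mentioned above, a single ordered string or separate modules, might look like the following sketch. The function names, the category keys, and the chosen ordering are illustrative assumptions, not details given in the source.

```python
def combine_as_text(entities, order):
    """Join entity values into one string, following a chosen category order."""
    by_category = dict(entities)
    return " ".join(by_category[c] for c in order if c in by_category)


def combine_as_modules(entities):
    """Place each entity value into the module (bucket) for its category."""
    modules = {}
    for category, value in entities:
        modules.setdefault(category, []).append(value)
    return modules


entities = [("time", "yesterday"), ("symptom", "fever"),
            ("symptom_description", "37 degrees")]

sentence = combine_as_text(entities, ["time", "symptom", "symptom_description"])
modules = combine_as_modules(entities)
print(sentence)  # yesterday fever 37 degrees
print(modules)
# {'time': ['yesterday'], 'symptom': ['fever'], 'symptom_description': ['37 degrees']}
```

The modular form lends itself to aggregation, for example counting how often "fever" appears in the symptom module across many records.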
Specifically, Fig. 2 shows a specific implementation of step S102, the flow of combining multiple medical entity mappings to obtain the structured text, which comprises the following steps:
Step S1021 is executed: a word segmentation operation is performed on the medical text data to obtain a text segmentation result, and on the multiple medical entity mappings to obtain a medical entity mapping segmentation result. The segmentation operation is completed by a word segmentation model, which is formed by model training on the basis of manual annotation. Specifically, to reduce the errors produced by entity extraction, the medical entity mappings can be further optimized after entity extraction is complete. This first requires word segmentation of both the medical text data and the medical entity mappings. Word segmentation is a common prior-art technique: one approach is based on string matching, that is, scanning the character string and counting a match whenever a substring is identical to a dictionary word; another approach is based on statistics and machine learning. Those skilled in the art can implement this step with existing segmentation algorithms.
Next, step S1022 is executed: the text segmentation result is matched against the medical entity mapping segmentation result, and multiple preferred medical entity mappings are selected from the medical entity mappings based on the matching result. Specifically, this step proofreads the medical entity mappings against the medical text data, because the medical text data is the original text and is therefore more objective as a reference. Preferably, in addition to proofreading against the medical text data, the medical entity mappings can also be proofread against their own logical structure, that is, by judging the logical continuity of all the medical entity mapping segmentation results, and the preferred medical entity mappings are then obtained on this basis.
Next, step S1023 is executed: the multiple preferred medical entity mappings are combined to obtain the structured text.
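Steps S1021 to S1023 can be sketched as follows. The matching rule used here (an entity mapping is kept only if its tokens appear as a contiguous run in the segmented text) is an assumption, since the source does not define the matching criterion, and the whitespace-based `segment` function is only a stand-in for the trained segmentation model.

```python
def segment(text):
    """Stand-in for the trained segmentation model: whitespace split (assumption)."""
    return text.split()


def contains_run(haystack, needle):
    """True if the token list `needle` occurs contiguously inside `haystack`."""
    n = len(needle)
    return n > 0 and any(haystack[i:i + n] == needle
                         for i in range(len(haystack) - n + 1))


def filter_entities(text, entities):
    """Keep only the entity mappings whose segmentation matches the segmented text."""
    text_tokens = segment(text)
    return [(category, value) for category, value in entities
            if contains_run(text_tokens, segment(value))]


text = "sudden fever to 37 degrees last night"
# The last candidate is a hypothetical extraction error ("feve").
candidates = [("symptom", "fever"), ("symptom_description", "37 degrees"),
              ("symptom", "feve")]
preferred = filter_entities(text, candidates)
print(preferred)  # [('symptom', 'fever'), ('symptom_description', '37 degrees')]
```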
Fig. 3 shows a flow chart of a control method for structured processing of medical text data according to the first embodiment of the present invention, which specifically comprises the following steps:
Step S201 is executed: the medical text data is converted into a two-dimensional word embedding matrix and input into a bidirectional long short-term memory network. Specifically, the conversion can be understood with reference to the description of word embeddings for Fig. 1, that is, the medical text data is represented as a matrix of vectors. To explain the bidirectional long short-term memory network, the long short-term memory network (LSTM) must be introduced first. The LSTM was developed to solve problems of the recurrent neural network (RNN) and achieves this by improving the RNN's hidden layer. An LSTM is still essentially an RNN and can be understood as an improved network built on the RNN architecture, in which multiple layers cooperate and are reused to achieve better results. An LSTM includes at least a cell for memory, an input gate and an output gate for parameter input and output, and a forget gate for forgetting. On this basis, the bidirectional long short-term memory network (BiLSTM) can be understood as an improvement on the bidirectional recurrent neural network (BiRNN). A BiRNN differs from an RNN in that it can access both past and future context. Its basic idea is to process each training sequence with two RNNs, one forward and one backward, both connected to the same output layer, so that the output layer receives complete past and future context for every point in the input sequence. Correspondingly, the BiLSTM adds bidirectional cell units to the BiRNN, as can be understood from the first half of this paragraph. Those skilled in the art will understand that these network architectures correspond to specific algorithms in practice, but since this is not the focus of the invention, it is not described further here.
Next, step S202 is executed: the BiLSTM outputs a two-dimensional medical text data matrix whose length is the sequence length corresponding to the medical text data and whose width is a designated length. Specifically, the sequence length is determined by the length of the sequence of word embeddings in the medical text data, and the designated length is a manually preset value that can be set differently for different algorithms. More specifically, the two-dimensional medical text data matrix can be understood as the matrix obtained after the BiLSTM has processed the vectors of all word embeddings.
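To make the input and output shapes of steps S201 and S202 concrete, here is a minimal pure-Python sketch of a BiLSTM forward pass with standard input, forget, and output gates. The weights are random and untrained, and treating twice the hidden size as the "designated length" is an assumption; the source does not say how the output width is produced.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm_params(input_dim, hidden_dim, rng):
    """One weight matrix and bias per gate: input (i), forget (f), output (o), candidate (g)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {gate: (matrix(hidden_dim, input_dim + hidden_dim), [0.0] * hidden_dim)
            for gate in ("i", "f", "o", "g")}


def affine(weights, bias, vector):
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_pass(inputs, params, hidden_dim):
    """Run one LSTM direction and return the hidden state at every position."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    outputs = []
    for x in inputs:
        z = x + h  # concatenate the input with the previous hidden state
        i = [sigmoid(v) for v in affine(*params["i"], z)]
        f = [sigmoid(v) for v in affine(*params["f"], z)]
        o = [sigmoid(v) for v in affine(*params["o"], z)]
        g = [math.tanh(v) for v in affine(*params["g"], z)]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs.append(h)
    return outputs


def bilstm(inputs, forward_params, backward_params, hidden_dim):
    """Concatenate forward and backward hidden states at each position."""
    forward = lstm_pass(inputs, forward_params, hidden_dim)
    backward = lstm_pass(inputs[::-1], backward_params, hidden_dim)[::-1]
    return [fw + bw for fw, bw in zip(forward, backward)]


rng = random.Random(0)
hidden_dim = 3
# Six characters, each with a hypothetical 2-dimensional embedding.
embeddings = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(6)]
features = bilstm(embeddings, make_lstm_params(2, hidden_dim, rng),
                  make_lstm_params(2, hidden_dim, rng), hidden_dim)
print(len(features), len(features[0]))  # 6 6
```

The output keeps one row per input character (the sequence length) and has a fixed width chosen by the designer, which matches the matrix shape described in step S202.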
Next, step S203 is executed: the two-dimensional medical text data matrix is passed into a conditional random field, and the word embeddings corresponding to the label sequence with the highest score are taken as the medical entity mappings. The score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and each label sequence corresponds to one score. Specifically, the conditional random field (CRF) is a common algorithm in natural language processing, often used for syntactic analysis, named entity recognition, part-of-speech tagging, and so on. It uses a Markov chain as the score transition model for the hidden variables and infers the hidden variables from the observable states; the scores are obtained from statistics over an annotated corpus, and the CRF is a discriminative model. In essence, a CRF combines a Markov chain over the hidden variables with scores from observable states to hidden variables. In prior-art applications, taking part-of-speech scoring as an example, the part-of-speech labels are assumed to satisfy the Markov property: the current part of speech has a score transition relationship only with the previous part of speech and is independent of the parts of speech at other positions. For example, the score for an adjective followed by another adjective is 0.5, followed by a modifying particle 0.5, and followed by a verb 0.
Further, as the above description of the CRF and the prior art shows, a CRF is usually applied to judge targets in an adjacency relationship; in the part-of-speech example above, the judgment is confined to the score transition between the current part of speech and the previous or next one. In this step, the CRF is instead used to score how well a word embedding matches a label sequence. Unlike the prior art, this step makes a global judgment rather than one based on positional (adjacent) relationships: to decide whether a word embedding can serve as a medical entity mapping, the relationship between the label sequence and all label sequences in the structural model is considered to obtain the score of that label sequence. The specific algorithm can be implemented with the prior art. This technical solution is possible because the invention does not need to obtain the text value corresponding to each word embedding; it only needs to use the CRF to screen out the word embeddings that match the label sequences. More specifically, the prior art would judge relationships among all word embeddings, whereas the invention judges the relationship between word embeddings and label sequences. This is the process of judging "based on the structural model according to the global information of the medical text data". Those skilled in the art will understand that the data volume of the label sequences is significantly smaller than that of all the word embeddings, so avoiding judgment by positional (adjacent) relationships can significantly improve computational efficiency. Meanwhile, using entity extraction avoids dependence on a matching dictionary, improves extraction quality, generalization ability, and scalability, and reduces maintenance cost.
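Selecting the label sequence with the highest score is commonly done with Viterbi decoding in a linear-chain CRF. The sketch below shows that standard technique as an illustration only: the source leaves its scoring algorithm to the prior art and describes its judgment as global, so the use of adjacent-label transition scores here is an assumption, as are the label names and all score values.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label index sequence and its total score.

    emissions[t][k]:   score of label k at position t (for example, from a BiLSTM).
    transitions[j][k]: score of moving from label j to label k.
    """
    num_labels = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for k in range(num_labels):
            # Best previous label for reaching label k at this position.
            best_prev = max(range(num_labels),
                            key=lambda j: scores[j] + transitions[j][k])
            new_scores.append(scores[best_prev] + transitions[best_prev][k] + step[k])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    best_last = max(range(num_labels), key=lambda k: scores[k])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[best_last]


# Hypothetical labels and scores for a three-position input.
label_names = ["O", "B-symptom", "I-symptom"]
emissions = [[2.0, 1.0, 0.0],
             [0.0, 2.0, 1.5],
             [0.5, 0.0, 2.0]]
transitions = [[0.0, 0.0, -5.0],   # "O" followed by "I-symptom" is penalized
               [0.0, -1.0, 1.0],
               [0.0, -1.0, 0.5]]
path, score = viterbi(emissions, transitions)
print([label_names[k] for k in path], score)
# ['O', 'B-symptom', 'I-symptom'] 7.0
```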
Next, step S204 is executed: the multiple medical entity mappings are combined to obtain the structured text. This step can be understood with reference to step S102.
Fig. 4 shows a flow chart for training the structural model according to the second embodiment of the present invention. Specifically, the process of training the structural model can be carried out before step S101 shown in Fig. 1 is executed, and comprises the following steps:
Step S301 is executed first: a standard word sequence is converted into a two-dimensional word embedding matrix and input into the bidirectional long short-term memory network. Specifically, the "standard word sequence" here plays the role of the "medical text data" in step S201, so this step can be understood with reference to the description of step S201.
Next, step S302 is executed: the bidirectional long short-term memory network outputs a two-dimensional standard word sequence matrix whose length is the sequence length corresponding to the standard word sequence and whose width is a designated length, and the two-dimensional standard word sequence matrix is passed into the conditional random field. Specifically, the "standard word sequence" here plays the role of the "medical text data" in step S202, so this step can be understood with reference to the description of step S202.
Further, step S303 is executed: the conditional random field calculates the conditional probability of the structural model and obtains the loss value of the structural model, and the weights of each layer of the structural model are updated using the backpropagation algorithm to optimize the loss value. Specifically, the standard word sequence mentioned in steps S301 and S302 can be understood as a training sample, which can vary according to the actual training conditions. The calculation of the conditional probability of the label sequence by the conditional random field can be understood as the following problem: given a conditional random field $P(Y \mid X)$, an input sequence $x$ and an output sequence $y$, compute the conditional probabilities $P(Y_i = y_i \mid x)$ and $P(Y_{i-1} = y_{i-1}, Y_i = y_i \mid x)$ and the corresponding mathematical expectations, where $x$ and $y$ are the scores corresponding to the label sequence; the specific formulas can be implemented in combination with the prior art. Correspondingly, after the conditional probabilities of all label sequences of the structural model are obtained, the mean square error of all conditional probabilities of the label sequences is calculated. The loss value of the structural model is obtained from this mean square error, and the loss value is used to judge whether the current structural model is sufficiently complete.
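The probability computation named here is the standard one for a linear-chain conditional random field and is usually solved with the forward-backward algorithm. The sketch below is a generic textbook version, not code from the patent: the score matrices `emissions` (positions by labels) and `transitions` (labels by labels) are assumed inputs, and plain exponentials are used, which is only safe for small toy scores.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def forward_backward(emissions: Matrix, transitions: Matrix) -> Tuple[Matrix, Matrix, float]:
    """Return forward values, backward values and the partition function Z."""
    steps, labels = len(emissions), len(emissions[0])
    alpha = [[0.0] * labels for _ in range(steps)]
    beta = [[1.0] * labels for _ in range(steps)]
    for k in range(labels):
        alpha[0][k] = math.exp(emissions[0][k])
    for t in range(1, steps):
        for k in range(labels):
            alpha[t][k] = sum(
                alpha[t - 1][j] * math.exp(transitions[j][k] + emissions[t][k])
                for j in range(labels)
            )
    for t in range(steps - 2, -1, -1):
        for j in range(labels):
            beta[t][j] = sum(
                math.exp(transitions[j][k] + emissions[t + 1][k]) * beta[t + 1][k]
                for k in range(labels)
            )
    return alpha, beta, sum(alpha[steps - 1])


def unary_marginals(emissions: Matrix, transitions: Matrix) -> Matrix:
    """P(Y_i = k | x) for every position i and label k."""
    alpha, beta, z = forward_backward(emissions, transitions)
    return [
        [alpha[t][k] * beta[t][k] / z for k in range(len(emissions[0]))]
        for t in range(len(emissions))
    ]


def pairwise_marginal(emissions: Matrix, transitions: Matrix, t: int, j: int, k: int) -> float:
    """P(Y_{t-1} = j, Y_t = k | x) for 1 <= t < len(emissions)."""
    alpha, beta, z = forward_backward(emissions, transitions)
    return alpha[t - 1][j] * math.exp(transitions[j][k] + emissions[t][k]) * beta[t][k] / z
```

At each position the unary marginals sum to 1, and summing the pairwise marginal over `j` recovers the unary marginal for label `k`, which gives a quick consistency check on any implementation.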
Further, after the loss value is obtained, it is also optimized in order to improve the accuracy of the current structural model. Specifically, backpropagation is currently the most commonly used and most effective algorithm for training artificial neural networks (abbreviated "ANN"). Its main idea is as follows: the training data are fed into the input layer of the ANN, pass through the hidden layers, and finally reach the output layer, which outputs a result; this is the forward propagation process of the ANN. Because the output of the ANN differs from the actual result, the error between the estimated value and the actual value is calculated and propagated backward from the output layer through the hidden layers until it reaches the input layer. During backpropagation, the values of the parameters are adjusted according to the error. The above process is iterated until convergence.
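The forward pass, error computation, backward pass and parameter update described above can be seen in a deliberately tiny example. The sketch below trains a network with a single tanh hidden unit by gradient descent on a mean squared error; the network shape, starting weights, learning rate and stopping tolerance are all assumptions chosen to keep the example small, and it stands in for the far larger model in the text.

```python
import math
from typing import List, Sequence, Tuple


def train(
    samples: Sequence[Tuple[float, float]],
    lr: float = 0.1,
    epochs: int = 200,
    tol: float = 1e-9,
) -> Tuple[Tuple[float, float, float, float], List[float]]:
    """Fit y ~ w2 * tanh(w1 * x + b1) + b2 and return (params, loss history)."""
    w1, b1, w2, b2 = 0.5, 0.0, 0.5, 0.0  # fixed starting weights (assumed)
    history: List[float] = []
    n = len(samples)
    for _ in range(epochs):
        g_w1 = g_b1 = g_w2 = g_b2 = 0.0
        loss = 0.0
        for x, y in samples:
            # Forward propagation: input -> hidden -> output.
            h = math.tanh(w1 * x + b1)
            y_hat = w2 * h + b2
            err = y_hat - y
            loss += err * err
            # Backpropagation: push the error from the output back to the input.
            d_out = 2.0 * err
            g_w2 += d_out * h
            g_b2 += d_out
            d_pre = d_out * w2 * (1.0 - h * h)  # derivative of tanh is 1 - tanh^2
            g_w1 += d_pre * x
            g_b1 += d_pre
        loss /= n
        history.append(loss)
        # Stop once the loss no longer changes noticeably (treated as convergence).
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break
        # Adjust every parameter against its averaged gradient.
        w1 -= lr * g_w1 / n
        b1 -= lr * g_b1 / n
        w2 -= lr * g_w2 / n
        b2 -= lr * g_b2 / n
    return (w1, b1, w2, b2), history


# Toy training set: learn y = 2x on a few points.
SAMPLES = [(-1.0, -2.0), (-0.5, -1.0), (0.0, 0.0), (0.5, 1.0), (1.0, 2.0)]
```

Running `train(SAMPLES)` returns the fitted parameters together with the loss recorded at each iteration, so the decrease in error from one pass to the next can be inspected directly.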
Those skilled in the art will appreciate that, in order to obtain a structural model that approaches the ideal, steps S301, S302 and S303 must be repeated until the structural model converges; that is, steps S301 to S303 form one training cycle.
Fig. 5 shows the fourth embodiment of the present invention, a flowchart of a control method that performs structuring processing on medicine text data to obtain standard structured text, which specifically includes the following steps:
Step S501 is executed: entity extraction is performed on the medicine text data based on the structural model to obtain multiple medicine entity mappings, where the structural model includes multiple label sequences, the label sequences are formed by manual annotation, and the medicine text data includes multiple word embeddings. Step S502 is executed: the multiple medicine entity mappings are combined to obtain the structured text. Steps S501 and S502 can be understood with reference to steps S101 and S102.
Further, step S503 is executed: the multiple medicine entity mappings are input into a database and converted to obtain multiple standard information segments. Those skilled in the art will appreciate that, in order to make the format and content of the final structured text more standardized and uniform, a database can be established in advance and used to convert the medicine entity mappings into standard information, for example into the format [type, description, value, time, additional information], so that each dimension of the information in the medical text is expressed in a standard way.
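The conversion into the [type, description, value, time, additional information] format can be sketched as a lookup against a table of standard terms. Everything below is illustrative: the `StandardSegment` record, the in-memory dictionary standing in for the database, and the sample terms are assumptions, not details given in the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class StandardSegment:
    """One standard information segment in the five-field format."""
    type: str
    description: str
    value: Optional[str] = None
    time: Optional[str] = None
    additional_info: Optional[str] = None


# Hypothetical stand-in for the pre-established database:
# surface form -> (standard type, standard description).
STANDARD_TERMS: Dict[str, Tuple[str, str]] = {
    "fever": ("symptom", "fever"),
    "pyrexia": ("symptom", "fever"),
    "body temperature": ("measurement", "body temperature"),
}


def to_standard_segment(
    surface: str,
    value: Optional[str] = None,
    time: Optional[str] = None,
    additional_info: Optional[str] = None,
) -> Optional[StandardSegment]:
    """Convert one medicine entity mapping; return None if the term is unknown."""
    key = surface.strip().lower()
    if key not in STANDARD_TERMS:
        return None
    type_, description = STANDARD_TERMS[key]
    return StandardSegment(type_, description, value, time, additional_info)
```

With this table, the surface forms "fever" and "Pyrexia" both map to the same standard description, which is the kind of normalization the database is meant to provide.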
Further, step S504 is executed: the multiple standard information segments are combined to obtain the standard structured text. Specifically, the segments can be combined according to the logical structure of natural language or without following that logical structure; either way achieves the object of the present invention.
As another embodiment of the present invention, Fig. 6 shows a functional block diagram of a control device for structuring medicine text data, which performs structuring processing on medicine text data corresponding to natural language to obtain structured text, comprising:
Entity extraction device 10, used to perform entity extraction on the medicine text data based on the structural model to obtain multiple medicine entity mappings, where the structural model includes multiple label sequences, the label sequences are formed through model training on the basis of manual annotation, and the medicine text data includes multiple word embeddings.
Structuring device 20, used to combine the multiple medicine entity mappings to obtain the structured text.
Conversion device 30, used to input the multiple medicine entity mappings into a database and convert them to obtain multiple standard information segments.
Standard combination device 40, used to combine the multiple standard information segments to obtain the standard structured text.
Preferably, the entity extraction device 10 includes the following devices:
First input device 101, used to convert the medicine text data into a word-embedding two-dimensional matrix and input it into a bidirectional long short-term memory network;
First output device 102, used for the bidirectional long short-term memory network to output a medicine text data two-dimensional matrix whose length is the sequence length corresponding to the medicine text data and whose width is a designated length;
First acquisition device 103, used to pass the medicine text data two-dimensional matrix into the conditional random field and obtain the word embeddings whose score is greater than a first score threshold as the medicine entity mappings, where the score is determined by the conditional random field based on the structural model according to the global information of the medicine text data, and each score corresponds to one of the word embeddings.
Preferably, the structuring device 20 further includes the following devices:
Word segmentation device 201, used to perform a word segmentation operation on the medicine text data to obtain a text word segmentation result, and to perform a word segmentation operation on the multiple medicine entity mappings to obtain a medicine entity mapping word segmentation result;
Matching device 202, used to match the text word segmentation result against the medicine entity mapping word segmentation result, and to filter out multiple preferred medicine entity mappings from the multiple medicine entity mappings based on the matching result;
Combination device 203, used to combine the multiple preferred medicine entity mappings to obtain the structured text.
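The segment, match and combine flow of these three devices can be sketched as follows. The text does not state the matching criterion, so this sketch assumes one: an entity mapping is kept only if its tokens appear as a contiguous run in the segmented text. The whitespace tokenizer is likewise a hypothetical stand-in for a real word segmentation step.

```python
from typing import List, Sequence


def segment(text: str) -> List[str]:
    """Hypothetical stand-in for word segmentation: lowercase and split on whitespace."""
    return text.lower().split()


def contains_run(haystack: Sequence[str], needle: Sequence[str]) -> bool:
    """True if `needle` occurs as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(
        list(haystack[i:i + n]) == list(needle)
        for i in range(len(haystack) - n + 1)
    )


def filter_entity_mappings(text: str, entity_mappings: Sequence[str]) -> List[str]:
    """Keep the entity mappings whose segmentation matches the text's segmentation."""
    text_tokens = segment(text)
    preferred = []
    for entity in entity_mappings:
        entity_tokens = segment(entity)
        if entity_tokens and contains_run(text_tokens, entity_tokens):
            preferred.append(entity)
    return preferred


def combine(preferred: Sequence[str]) -> str:
    """Join the preferred entity mappings into one structured line (format assumed)."""
    return "; ".join(preferred)
```

For the text "Patient reports chest pain for two days" and the candidates "chest pain", "two days", "pain chest" and "headache", only the first two survive the filter, because the other two never occur as a run of tokens in the text.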
Fig. 7 shows the fifth embodiment of the present invention, a functional block diagram of a control device for training the structural model, comprising:
Second input device 104, used to convert the standard word sequence into a word-embedding two-dimensional matrix and input it into a bidirectional long short-term memory network;
Second output device 105, used for the bidirectional long short-term memory network to output a standard word sequence two-dimensional matrix whose length is the sequence length corresponding to the standard word sequence and whose width is a designated length, and to pass the standard word sequence two-dimensional matrix into the conditional random field;
Second acquisition device 106, used for the conditional random field to calculate the conditional probability of the structural model and obtain the loss value of the structural model, and to update the weights of each layer of the structural model using the backpropagation algorithm and optimize the loss value.
Those skilled in the art will appreciate that the function of each device in Fig. 6 and Fig. 7 above can be implemented in hardware, in software executed by a processor, or in a combination of the two. Specifically, if implemented as software modules, the program can be burned into the processor in advance, or the software can be installed into a preset system; if implemented in hardware, a field-programmable gate array (FPGA) can be used to fix the corresponding functions in hardware.
Further, the software modules can be stored in RAM, flash memory, ROM, EPROM, a hard disk, or any other form of storage medium known in the art. The storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. As a variation, the storage medium can be an integral part of the processor, or the processor and the storage medium can both reside on an application-specific integrated circuit (ASIC).
Further, the hardware can be a general-purpose processor capable of implementing the specific functions, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or a combination of the above. As a variation, it can also be implemented by a combination of computing devices, for example a combination of a DSP and a microprocessor, a combination of multiple microprocessors, or a combination of one or more microprocessors in communication with a DSP.
Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the particular embodiments described; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims (10)

1. A control method for structuring medicine text data, for performing structuring processing on medicine text data corresponding to natural language to obtain structured text, characterized by comprising the following steps:
a. performing entity extraction on the medicine text data based on a structural model to obtain multiple medicine entity mappings, wherein the structural model includes multiple label sequences, the label sequences are formed through model training on the basis of manual annotation, and the medicine text data includes multiple word embeddings;
b. combining the multiple medicine entity mappings to obtain the structured text.
2. The control method according to claim 1, characterized in that step a includes the following steps:
a1. converting the medicine text data into a word-embedding two-dimensional matrix and inputting it into a bidirectional long short-term memory network;
a2. the bidirectional long short-term memory network outputting a medicine text data two-dimensional matrix whose length is the sequence length corresponding to the medicine text data and whose width is a designated length;
a3. passing the medicine text data two-dimensional matrix into a conditional random field and taking the word embeddings corresponding to the label sequence with the maximum score as the medicine entity mappings, wherein the score is determined by the conditional random field based on the structural model according to the global information of the medicine text data, and each label sequence corresponds to one score.
3. The control method according to claim 2, characterized in that the following steps are also executed before step a:
i. converting a standard word sequence into a word-embedding two-dimensional matrix and inputting it into the bidirectional long short-term memory network;
ii. the bidirectional long short-term memory network outputting a standard word sequence two-dimensional matrix whose length is the sequence length corresponding to the standard word sequence and whose width is a designated length, and passing the standard word sequence two-dimensional matrix into the conditional random field;
iii. the conditional random field calculating the conditional probability of the structural model and obtaining the loss value of the structural model, and updating the weights of each layer of the structural model using the backpropagation algorithm and optimizing the loss value;
iv. repeating steps i, ii and iii until the structural model converges.
4. The control method according to any one of claims 1 to 4, characterized in that step b includes the following steps:
b1. performing a word segmentation operation on the medicine text data to obtain a text word segmentation result, and performing a word segmentation operation on the multiple medicine entity mappings to obtain a medicine entity mapping word segmentation result, wherein the word segmentation operation is performed by a word segmentation model, and the word segmentation model is formed through model training on the basis of manual annotation;
b2. matching the text word segmentation result against the medicine entity mapping word segmentation result, and filtering out multiple preferred medicine entity mappings from the multiple medicine entity mappings based on the matching result;
b3. combining the multiple preferred medicine entity mappings to obtain the structured text.
5. The control method according to any one of claims 1 to 5, characterized in that the following steps are included after step b:
c. inputting the multiple medicine entity mappings into a database and converting them to obtain multiple standard information segments;
d. combining the multiple standard information segments to obtain standard structured text.
6. A control device for structuring medicine text data, for performing structuring processing on medicine text data corresponding to natural language to obtain structured text, characterized by comprising:
an entity extraction device, used to perform entity extraction on the medicine text data based on a structural model to obtain multiple medicine entity mappings, wherein the structural model includes multiple label sequences, the label sequences are formed through model training on the basis of manual annotation, and the medicine text data includes multiple word embeddings;
a structuring device, used to combine the multiple medicine entity mappings to obtain the structured text.
7. The control device according to claim 6, characterized in that the entity extraction device includes the following devices:
a first input device, used to convert the medicine text data into a word-embedding two-dimensional matrix and input it into a bidirectional long short-term memory network;
a first output device, used for the bidirectional long short-term memory network to output a medicine text data two-dimensional matrix whose length is the sequence length corresponding to the medicine text data and whose width is a designated length;
a first acquisition device, used to pass the medicine text data two-dimensional matrix into a conditional random field and take the word embeddings corresponding to the label sequence with the maximum score as the medicine entity mappings, wherein the score is determined by the conditional random field based on the structural model according to the global information of the medicine text data, and each label sequence corresponds to one score.
8. The control device according to claim 7, characterized in that the control device further includes the following devices:
a second input device, used to convert a standard word sequence into a word-embedding two-dimensional matrix and input it into the bidirectional long short-term memory network;
a second output device, used for the bidirectional long short-term memory network to output a standard word sequence two-dimensional matrix whose length is the sequence length corresponding to the standard word sequence and whose width is a designated length, and to pass the standard word sequence two-dimensional matrix into the conditional random field;
a second acquisition device, used for the conditional random field to calculate the conditional probability of the structural model and obtain the loss value of the structural model, and to update the weights of each layer of the structural model using the backpropagation algorithm and optimize the loss value.
9. The control device according to any one of claims 6 to 8, characterized in that the structuring device includes the following devices:
a word segmentation device, used to perform a word segmentation operation on the medicine text data to obtain a text word segmentation result, and to perform a word segmentation operation on the multiple medicine entity mappings to obtain a medicine entity mapping word segmentation result, wherein the word segmentation operation is performed by a word segmentation model, and the word segmentation model is formed through model training on the basis of manual annotation;
a matching device, used to match the text word segmentation result against the medicine entity mapping word segmentation result, and to filter out multiple preferred medicine entity mappings from the multiple medicine entity mappings based on the matching result;
a combination device, used to combine the multiple preferred medicine entity mappings to obtain the structured text.
10. The control device according to any one of claims 6 to 9, characterized in that the control device further includes the following devices:
a conversion device, used to input the multiple medicine entity mappings into a database and convert them to obtain multiple standard information segments;
a standard combination device, used to combine the multiple standard information segments to obtain standard structured text.
CN201711205811.5A 2017-11-27 2017-11-27 A kind of control method and device of pair of medicine text data structureization processing Pending CN110019711A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711205811.5A CN110019711A (en) 2017-11-27 2017-11-27 A kind of control method and device of pair of medicine text data structureization processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711205811.5A CN110019711A (en) 2017-11-27 2017-11-27 A kind of control method and device of pair of medicine text data structureization processing

Publications (1)

Publication Number Publication Date
CN110019711A true CN110019711A (en) 2019-07-16

Family

ID=67186798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711205811.5A Pending CN110019711A (en) 2017-11-27 2017-11-27 A kind of control method and device of pair of medicine text data structureization processing

Country Status (1)

Country Link
CN (1) CN110019711A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120265521A1 (en) * 2005-05-05 2012-10-18 Scott Miller Methods and systems relating to information extraction
CN105260360A (en) * 2015-10-27 2016-01-20 小米科技有限责任公司 Named entity identification method and device
CN105955952A (en) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 Information extraction method based on bi-directional recurrent neural network
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN106980608A (en) * 2017-03-16 2017-07-25 四川大学 A kind of Chinese electronic health record participle and name entity recognition method and system
CN106980609A (en) * 2017-03-21 2017-07-25 大连理工大学 A kind of name entity recognition method of the condition random field of word-based vector representation

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110491519A (en) * 2019-07-17 2019-11-22 上海明品医学数据科技有限公司 A kind of method of inspection of medical data
CN110491519B (en) * 2019-07-17 2024-01-02 上海明品医学数据科技有限公司 Medical data checking method
CN111326226A (en) * 2020-02-14 2020-06-23 腾讯科技(深圳)有限公司 Analysis processing and display method, device, equipment and storage medium of electronic medical record
CN111326226B (en) * 2020-02-14 2023-03-03 腾讯科技(深圳)有限公司 Analysis processing and display method, device, equipment and storage medium of electronic medical record
CN111859968A (en) * 2020-06-15 2020-10-30 深圳航天科创实业有限公司 Text structuring method, text structuring device and terminal equipment
CN112507722A (en) * 2020-11-30 2021-03-16 北京百度网讯科技有限公司 Electronic medical record content quality control method and device
CN112507722B (en) * 2020-11-30 2023-08-01 北京百度网讯科技有限公司 Method and device for controlling content quality of electronic medical record
CN113689951A (en) * 2021-08-04 2021-11-23 翼健(上海)信息科技有限公司 Intelligent diagnosis guiding method, system and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200323

Address after: 361001 Department of Pediatrics, the First Affiliated Hospital of Xiamen University, No.55 Zhenhai Road, Siming District, Xiamen City, Fujian Province

Applicant after: Wu Jinzhun

Applicant after: BASEBIT (SHANGHAI) INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 361003 Department of Pediatrics, First Affiliated Hospital of Xiamen University, No. 55 Zhenhai Road, Siming District, Xiamen City, Fujian Province

Applicant before: Wu Jinzhun

RJ01 Rejection of invention patent application after publication

Application publication date: 20190716