CN107729392A

CN107729392A - Text structuring method, apparatus, system and non-volatile storage medium

Info

Publication number: CN107729392A
Application number: CN201710852183.3A
Authority: CN
Inventors: 梁会营; 郑永升; 戎术
Original assignee: According To Hangzhou Medical Technology Co Ltd; Guangzhou Women and Childrens Medical Center
Current assignee: According To Hangzhou Medical Technology Co Ltd; Guangzhou Women and Childrens Medical Center
Priority date: 2017-09-19
Filing date: 2017-09-19
Publication date: 2018-02-23
Anticipated expiration: 2037-09-19
Also published as: CN111680090B; CN111680089A; CN107729392B; CN111680094B; CN111680089B; CN111680094A; CN111680090A

Abstract

The present invention relates to a text structuring method, apparatus, system and non-volatile storage medium. The method includes: obtaining unstructured text, preprocessing the unstructured text, and splitting the preprocessed unstructured text into multiple clauses; obtaining the structured entries of a structured text and a question-and-answer (Q&A) database corresponding to the structured entries; posing the questions in the Q&A database and matching the content of each clause to the corresponding structured entry, to obtain clause structuring results; and obtaining the structured text from the clause structuring results. By drawing on the Q&A database, the text structuring method, apparatus, system and non-volatile storage medium provided by the invention can fully convert unstructured text information into structured information, with good conversion quality and high accuracy. Because clause structuring is carried out by two LSTM networks, the varied modes of expression found in free text can be handled, giving good robustness.

Description

Text structuring method, apparatus, system and non-volatile storage medium

Technical field

The present invention relates to the technical field of natural language processing, and in particular to a text structuring method, apparatus, system and non-volatile storage medium.

Background

Structuring means that, after analysis, the information contained in a text is decomposed into multiple interrelated parts with a clear hierarchy among them. Text structuring, in turn, means converting unstructured text into structured text, so that structured forms of expression (such as itemized lists, tables, structure diagrams and flow charts) make the information more objective and vivid.

Today, in the era of big data, a great deal of free text is produced, particularly in the medical field, and the ever-growing volume of medical text data poses a new challenge to the whole medical industry: when doctors diagnose and treat patients, a large amount of medical text is generated. Most of this medical text data is semi-structured or unstructured. Converting semi-structured or unstructured medical text data into structured data that a computer can analyze and process would enable breakthroughs in research applications, clinical diagnosis and treatment, data sharing and dissemination, and so on.

Traditional medical text structuring essentially relies on doctors manually processing medical text data based on their medical experience. This approach is not only time-consuming and laborious, but its accuracy also falls short of what is required.

To address the conversion of unstructured text, Chinese invention patent document CN03124897 discloses a method and apparatus for structuring text. The method comprises the following steps: inputting structuring rules; obtaining unstructured text information; performing syntactic analysis on the unstructured text information to produce small text fragments; finding, among the text units of the unstructured text information, the text fragments defined in the structuring rules; and structuring the text fragments of the unstructured text information according to the conditions specified in the structuring rules. The apparatus comprises: an input device for unstructured text information; an input device and a storage device for the structuring rules; an extraction device for extracting small text units from the unstructured text information; a structuring device for producing structured text information according to the structuring rules; and a processing device for the text units in the structured text information.

Although the method and apparatus provided by that patent document do convert unstructured text into structured text, their conversion efficiency is clearly poor and their conversion accuracy is not encouraging either.

As another example, Chinese invention patent application document CN201610405133 discloses an electronic medical record text structuring method comprising the following steps: S1, loading a medical knowledge base; S2, reading in the electronic medical record text; S3, segmenting short sentences using the forward maximum matching algorithm, to obtain the words together with their parts of speech and relative positions in the sentence; S4, determining whether the description of disease information in the short sentence is semantically positive or negative; S5, extracting disease information elements; S6, repeating steps S2 to S5 until all content of interest in the electronic medical record has been obtained; S7, merging different expressions of the disease information elements, that is, merging identical disease information according to a dictionary of medical synonyms and removing redundancy; S8, storing the elements of the disease description in structure/class form, completing the structuring process.

The structuring method provided by that patent document can effectively extract disease-related information from the descriptive text of a medical record and form a structured representation of it, enabling in-depth study of how diseases occur, how they are diagnosed, how effective treatment is, and so on. Here too, however, the conversion efficiency is poor and the accuracy is low.

In summary, how to improve the conversion efficiency and accuracy of text structuring is one of the problems that those skilled in the art urgently need to solve.

Summary of the invention

To solve the above problems, an object of the present invention is to provide a text structuring method, apparatus, system and non-volatile storage medium with good conversion efficiency and high accuracy.

To achieve the above object, one aspect of the present invention provides a text structuring method, the method including:

obtaining unstructured text, preprocessing the unstructured text, and splitting the preprocessed unstructured text into multiple clauses;

obtaining the structured entries of a structured text and a Q&A database corresponding to the structured entries;

posing the questions in the Q&A database and matching the content of each clause to the corresponding structured entry, to obtain clause structuring results; and

obtaining the structured text from the clause structuring results.

Further, the preprocessing includes: replacing the numbers and special characters in the unstructured text with unified symbols.

Preferably, obtaining the structured entries of the structured text and the Q&A database corresponding to the structured entries includes:

classifying the structured entries to obtain classification results; and

setting a question template for each classification result, and forming the Q&A database corresponding to the structured entries from the question templates.

Further, posing the questions in the Q&A database and matching the content of each clause to the corresponding structured entry, to obtain the clause structuring results, includes:

performing word segmentation on the clause and inputting the resulting clause segmentation result into a first LSTM network, the first LSTM network performing a first decoding process on the clause segmentation result to obtain a clause decoding result;

generating, based on the Q&A database, a question corresponding to the clause, performing word segmentation on the question, and inputting the resulting question segmentation result into a second LSTM network, the second LSTM network performing a second decoding process on the question segmentation result to obtain a question decoding result; and

combining, pairing and modeling the first LSTM network and the second LSTM network according to the clause decoding result and the question decoding result, thereby obtaining the clause structuring result.

Preferably, obtaining the structured text from the clause structuring results includes:

merging the multiple clause structuring results to obtain a paragraph structuring result; and

post-processing the paragraph structuring result to obtain the structured text.

Further, after the structured text is obtained from the clause structuring results, the method also includes:

converting the structured text into a vector and storing it in a result database, and comparing the vector of the structured text with the other vectors stored in the result database for similarity, to obtain texts similar to the structured text; and

calculating the similarity between the structured text and the similar texts.

Another aspect of the present invention provides a text structuring apparatus, including:

a preprocessing module for obtaining unstructured text, preprocessing the unstructured text, and splitting the preprocessed unstructured text into multiple clauses;

an entry acquisition module for obtaining the structured entries of a structured text and a Q&A database corresponding to the structured entries;

a clause structuring module for posing the questions in the Q&A database and matching the content of each clause to the corresponding structured entry, to obtain clause structuring results; and

a text forming module for obtaining the structured text from the clause structuring results.

Further, the preprocessing module is also used to replace the numbers and special characters in the unstructured text with unified symbols.

Preferably, the entry acquisition module includes:

a classification module for classifying the structured entries to obtain classification results; and

a forming module for setting a question template for each classification result and forming the Q&A database corresponding to the structured entries from the question templates.

Further, the clause structuring module includes:

a clause processing module for performing word segmentation on the clause and inputting the resulting clause segmentation result into a first LSTM network, the first LSTM network performing a first decoding process on the clause segmentation result to obtain a clause decoding result;

a question processing module for generating, based on the Q&A database, a question corresponding to the clause, performing word segmentation on the question, and inputting the resulting question segmentation result into a second LSTM network, the second LSTM network performing a second decoding process on the question segmentation result to obtain a question decoding result; and

a combination and matching module for combining, pairing and modeling the first LSTM network and the second LSTM network according to the clause decoding result and the question decoding result, to obtain the clause structuring result.

Preferably, the module for obtaining the structured text includes:

a merging module for merging the multiple clause structuring results to obtain a paragraph structuring result; and

a post-processing module for post-processing the paragraph structuring result to obtain the structured text.

Further, the text structuring apparatus also includes:

a similarity judgment module for converting the structured text into a vector and storing it in a result database, and comparing the vector of the structured text with the other vectors stored in the result database for similarity, to obtain texts similar to the structured text; and

a similarity calculation module for calculating the similarity between the structured text and the similar texts.

A further aspect of the present invention provides a text structuring system that includes the foregoing text structuring apparatus.

Yet another aspect of the present invention provides a non-volatile storage medium on which a text structuring program is stored, the text structuring program being executed by a computer to implement the text structuring method, including:

instruction a: obtaining unstructured text, preprocessing the unstructured text, and splitting the preprocessed unstructured text into multiple clauses;

instruction b: obtaining the structured entries of a structured text and a Q&A database corresponding to the structured entries;

instruction c: posing the questions in the Q&A database and matching the content of each clause to the corresponding structured entry, to obtain clause structuring results;

instruction d: obtaining the structured text from the clause structuring results.

As described above, by drawing on the Q&A database, the text structuring method, apparatus, system and non-volatile storage medium provided by the present invention can fully convert unstructured text information into structured information, with good conversion quality and high accuracy. Because clause structuring is carried out by two LSTM networks, the varied modes of expression found in free text can be handled, giving good robustness.

To make the above content of the present invention more readily understood, preferred embodiments are described in detail below with reference to the accompanying drawings.

Brief description of the drawings

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Fig. 1 is a flow chart of the text structuring method provided by the first preferred embodiment of the present invention;

Fig. 2 is a schematic diagram of a medical record text provided by the first preferred embodiment of the present invention;

Fig. 3 is a schematic diagram of the word segmentation result of the medical record text provided by the first preferred embodiment of the present invention;

Fig. 4 is a schematic diagram of the structuring result of the medical record text provided by the first preferred embodiment of the present invention;

Fig. 5 is a flow chart of the text structuring method provided by the second preferred embodiment of the present invention;

Fig. 6 is a module connection diagram of the text structuring apparatus provided by the third preferred embodiment of the present invention;

Fig. 7 is a module connection diagram of the text structuring apparatus provided by the fourth preferred embodiment of the present invention.

Detailed description of the embodiments

The embodiments of the present invention are illustrated below by specific examples, and those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. Although the invention is described in conjunction with preferred embodiments, this does not mean that the features of the invention are limited to those embodiments. On the contrary, the purpose of describing the invention in conjunction with embodiments is to cover other alternatives or modifications that may be derived from the claims of the invention. The following description includes many specific details in order to provide a thorough understanding of the invention; the invention may also be practiced without these details. In addition, some specific details are omitted from the description to avoid confusing or obscuring the focus of the invention.

Fig. 1 shows a flow chart of a text structuring method according to the first preferred embodiment of the present invention. The text structuring method includes step S10, step S20, step S30 and step S40. Specifically, in step S10, the text structuring apparatus 1 obtains unstructured text, preprocesses the unstructured text, and splits the preprocessed unstructured text into multiple clauses; in step S20, the text structuring apparatus 1 obtains the structured entries of a structured text and a Q&A database corresponding to the structured entries; in step S30, the text structuring apparatus 1 poses the questions in the Q&A database and matches the content of each clause to the corresponding structured entry, to obtain clause structuring results; in step S40, the text structuring apparatus 1 obtains the structured text from the clause structuring results.

Here, the text structuring apparatus 1 includes, but is not limited to, user equipment, network equipment, or a device formed by integrating user equipment and network equipment over a network. User equipment includes, but is not limited to, client devices such as computers, smartphones and PDAs. Network equipment includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed of multiple servers. Here, a cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a virtual supercomputer made up of a group of loosely coupled computers. Networks include, but are not limited to, the Internet, wide area networks, metropolitan area networks, local area networks, VPNs, wireless ad hoc networks, and so on.

Of course, those skilled in the art will understand that the above text structuring apparatus 1 is only an example; other existing or future text structuring apparatuses, if applicable to the present application, should also fall within its scope of protection and are incorporated herein by reference.

Specifically, in step S10, the text structuring apparatus 1 obtains unstructured text, preprocesses the unstructured text, and splits the preprocessed unstructured text into multiple clauses.

The unstructured text is, for example, descriptive text in a medical record containing information such as disease symptoms, medical history and a summary of the condition, or a football or basketball news text containing information such as a player's goals and assists; it may of course also be another type of free text.

Here, the text structuring apparatus 1 obtains the unstructured text entered by a user, such as the original medical record text entered by a doctor, and preprocesses it, converting non-standard unstructured text into standardized unstructured text to facilitate the subsequent conversion. Preferably, the preprocessing includes: (1) converting all characters in the unstructured text into full-width (double-byte) characters, so that handling a single character type simplifies the operation and improves conversion performance; (2) replacing the numbers and special characters in the unstructured text with unified symbols to simplify subsequent conversion, where a unified symbol is a symbol that does not otherwise occur in the unstructured text, such as {Number} or {special}; for example, numbers such as 10, 1073, 1.763 and -0.74 and special characters such as # and @ that occur in the unstructured text are replaced with {Number} and {special} respectively; (3) removing the non-visible characters from the unstructured text to simplify the conversion problem and further improve conversion performance. After preprocessing, the text structuring apparatus 1 breaks the preprocessed unstructured text at its full stops, splitting it into a series of clauses. For example, suppose a preprocessed unstructured text reads "Scattered patchy shadows of increased density are visible in the lower lobe of the right lung. No definite abnormal-density shadows are seen in the rest of the lung parenchyma. Neither hilum is enlarged, the trachea and bronchi are unobstructed, and no enlarged lymph nodes are seen in the mediastinum. No effusion is seen in either pleural cavity, and there is no pleural thickening." Breaking it at the punctuation marks then yields the clauses "Scattered patchy shadows of increased density are visible in the lower lobe of the right lung", "No definite abnormal-density shadows are seen in the rest of the lung parenchyma", "Neither hilum is enlarged, the trachea and bronchi are unobstructed, and no enlarged lymph nodes are seen in the mediastinum", and "No effusion is seen in either pleural cavity, and there is no pleural thickening".
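The preprocessing and clause splitting described above can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the function names, the regular expressions and the choice of `#` and `@` as the only special characters are assumptions, and step (1), the character-width conversion, is left out for brevity.

```python
import re
import unicodedata

NUMBER_TOKEN = "{Number}"    # unified symbol for numbers (from the description)
SPECIAL_TOKEN = "{special}"  # unified symbol for special characters

# Assumed patterns: signed integers/decimals, and '#'/'@' as special characters.
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")
SPECIAL_RE = re.compile(r"[#@]")
# Break on Chinese or ASCII full stops, swallowing trailing whitespace.
SENTENCE_END_RE = re.compile(r"[。.]\s*")


def remove_invisible(text):
    """Drop control (Cc) and format (Cf) characters, including tabs and newlines."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cf")
    )


def mask(text, pattern, token, originals):
    """Replace every match with `token`, remembering the original strings in order."""
    def remember(match):
        originals[token].append(match.group(0))
        return token
    return pattern.sub(remember, text)


def preprocess(text):
    """Return (clauses, originals); `originals` lets post-processing restore values."""
    originals = {NUMBER_TOKEN: [], SPECIAL_TOKEN: []}
    text = remove_invisible(text)
    text = mask(text, NUMBER_RE, NUMBER_TOKEN, originals)
    text = mask(text, SPECIAL_RE, SPECIAL_TOKEN, originals)
    clauses = [c.strip() for c in SENTENCE_END_RE.split(text) if c.strip()]
    return clauses, originals
```

Masking numbers before splitting means that the decimal point in a value such as 1.763 is not mistaken for a full stop.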

In step S20, the text structuring apparatus 1 obtains the structured entries of the structured text and the Q&A database corresponding to the structured entries.

Here, the text structuring apparatus 1 obtains, for the structured text to be generated, the structured entries in the structured text and the Q&A database corresponding to them. That is, the apparatus 1 first obtains the structured text format the user wants and extracts each entry from that format, then sets questions for each entry to build the Q&A database. Alternatively, if the structured text format obtained by the apparatus 1 already has associated question-and-answer data, the apparatus 1 directly obtains each entry in the structured text and uses the corresponding Q&A database directly for the subsequent structuring.

Preferably, in this preferred embodiment, after the structured entries in the structured text are obtained, the entries are first classified into different types, such as numerical value or location, to obtain classification results. A question template is then set for each classification result, that is, for each type of structured entry, and the Q&A database corresponding to the structured entries is formed from the question templates. Here, the Q&A database can be trained on a large amount of real text so that question templates are continually added, ensuring that all the information in the unstructured text can be covered.
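As a rough sketch of how per-type question templates might be turned into a Q&A database, consider the following. The entry types, the template wording and the name `build_qa_database` are all hypothetical; the description only states that entries are classified and that each class receives its own templates.

```python
# Hypothetical templates, one list per entry type. The description mentions
# numerical and location-like types; "presence" is added for illustration.
QUESTION_TEMPLATES = {
    "presence": ["Is {entry} present?"],
    "numeric": ["What is the value of {entry}?"],
    "location": ["Where is {entry} located?"],
}


def build_qa_database(entries):
    """Map each structured entry to the questions generated from its type.

    `entries` is an iterable of (entry_name, entry_type) pairs; an entry whose
    type has no template simply gets an empty question list.
    """
    database = {}
    for name, kind in entries:
        templates = QUESTION_TEMPLATES.get(kind, [])
        database[name] = [t.format(entry=name) for t in templates]
    return database


qa_database = build_qa_database(
    [("body temperature", "numeric"), ("the lesion", "location")]
)
```

Adding a template to `QUESTION_TEMPLATES` extends coverage for every entry of that type, which matches the idea of continually adding question templates as more text is seen.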

In step S30, the text structuring apparatus 1 poses the questions in the Q&A database and matches the content of each clause to the corresponding structured entry, to obtain clause structuring results. That is, for each clause, the questions in the Q&A database are posed to it, and its content is matched to the corresponding structured entries according to the answers, yielding multiple clause structuring results.

Specifically, step S30 includes step S31, step S32 and step S33. In step S31, the text structuring apparatus 1 performs word segmentation on the clause and inputs the resulting clause segmentation result into the first LSTM network, which performs a first decoding process on the clause segmentation result to obtain a clause decoding result; in step S32, the text structuring apparatus 1 generates, based on the Q&A database, a question corresponding to the clause, performs word segmentation on the question, and inputs the resulting question segmentation result into the second LSTM network, which performs a second decoding process on the question segmentation result to obtain a question decoding result; in step S33, the text structuring apparatus 1 combines, pairs and models the first LSTM network and the second LSTM network according to the clause decoding result and the question decoding result, thereby obtaining the clause structuring result.

Here, a segmentation algorithm, such as the forward maximum matching algorithm, is used to segment the unstructured text. For example, the text structuring apparatus 1 obtains the unstructured medical record text shown in Fig. 2 and applies the segmentation algorithm to each clause, obtaining the clause segmentation results shown in Fig. 3. Likewise, the questions formed from the Q&A database for each clause are segmented with the segmentation algorithm to obtain question segmentation results. The clause segmentation result is then input into the first LSTM (long short-term memory artificial neural network) network, and the question segmentation result is input into the second LSTM network. The first LSTM network decodes the clause segmentation result and the second LSTM network decodes the question segmentation result; based on the clause decoding result and the question decoding result, the text structuring apparatus 1 combines, pairs and models the two LSTM networks, thereby obtaining the structuring result of each clause.
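Forward maximum matching, the segmentation algorithm named above, can be sketched in a few lines. The vocabulary below is a toy assumption for illustration; a real system would use a much larger, domain-specific dictionary, and the description does not specify one.

```python
def forward_max_match(text, vocabulary, max_len=None):
    """Greedy left-to-right segmentation: at each position take the longest
    vocabulary word that matches, falling back to a single character."""
    if max_len is None:
        max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy vocabulary: "right lung", "lower lobe", "right lung lower lobe",
# "visible", "patchy".
vocabulary = {"右肺", "下叶", "右肺下叶", "可见", "斑片状"}
tokens = forward_max_match("右肺下叶可见斑片状", vocabulary)
```

Because the longest match wins, "右肺下叶" is kept as one token rather than being split into "右肺" and "下叶".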

For each clause, a new question is generated from the clause's current structuring result and the Q&A database, and the newly generated question is segmented to obtain a new clause structuring result. That result in turn serves as the basis for forming the next question, and clause structuring is carried out once more, and so on, until all related questions in the Q&A database have been asked. In other words, for each clause, its current structuring result determines whether follow-up questions are needed. If the initial structuring result of a clause is "an XX lesion is present", follow-up questions are asked: Where is the XX lesion located? What is the morphology of the XX lesion? Specifically, if the initial structuring result of a clause is "Are shadows or lesions visible in the lower lobe of the right lung?", the follow-up is "What is the morphology of the shadows or lesions visible in the lower lobe of the right lung?", and so on, until all related questions in the Q&A database have been asked, yielding the structured result shown in Fig. 4.
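The follow-up questioning loop can be illustrated with a small queue-driven sketch. Here the two-LSTM matcher is replaced by a plain callable, `answer`, and the layout of the follow-up table is an assumption made for the example; neither is taken from the description.

```python
from collections import deque


def structure_clause(clause, root_questions, follow_ups, answer):
    """Ask questions about one clause until no related question remains.

    `answer(clause, question)` stands in for the two-LSTM matcher and returns
    an answer string, or None when the clause does not answer the question.
    `follow_ups` maps (question, answer) to the questions to ask next.
    """
    results = {}
    queue = deque(root_questions)
    while queue:
        question = queue.popleft()
        if question in results:
            continue  # already answered; avoid asking twice
        reply = answer(clause, question)
        if reply is None:
            continue  # unanswered questions are left out of the result
        results[question] = reply
        queue.extend(follow_ups.get((question, reply), []))
    return results


# Hypothetical data standing in for a trained matcher and a Q&A database.
stub_answers = {
    "Is a lesion present?": "yes",
    "Where is the lesion located?": "lower lobe of the right lung",
    "What is the morphology of the lesion?": "patchy",
}
follow_ups = {
    ("Is a lesion present?", "yes"): [
        "Where is the lesion located?",
        "What is the morphology of the lesion?",
    ],
}
result = structure_clause(
    "Patchy shadows are visible in the lower lobe of the right lung",
    ["Is a lesion present?"],
    follow_ups,
    lambda clause, question: stub_answers.get(question),
)
```

Keying the follow-ups on the (question, answer) pair means that location and morphology are only asked once a lesion has been reported as present.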

In step S40, the text structuring apparatus 1 obtains the structured text from the clause structuring results. Preferably, the text structuring apparatus 1 first merges the multiple clause structuring results to obtain a paragraph structuring result, and then post-processes the paragraph structuring result to obtain the structured text. Here, merging means combining all answered questions from every clause into a final result, which gives the paragraph structuring result. Post-processing includes: (1) normalizing related descriptions in the structured text, for example normalizing "I°", "one degree" and "I degree" in descriptions of tonsil size to "1 degree", and, in the classification of rales, normalizing "bubbling rales" to "coarse moist rales", "medium bubbling rales" to "medium rales" and "fine bubbling rales" to "fine rales"; (2) restoring the numbers and special characters that were replaced with unified symbols during preprocessing, for example restoring the values 10, 1073, 1.763 and -0.74 from the earlier example, so that the content of the structured text remains consistent with that of the original unstructured text.
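Merging and post-processing can be sketched as three small helpers. The normalization table holds only the tonsil-size example from the text, and the function names and the left-to-right restoration order are assumptions made for illustration.

```python
# Normalization table built from the tonsil-size example in the description.
NORMALIZATION = {"I°": "1 degree", "one degree": "1 degree", "I degree": "1 degree"}


def merge_clause_results(clause_results):
    """Combine the answered questions of all clauses into one paragraph result.
    The first answer found for a question is kept (an assumed tie-break)."""
    merged = {}
    for result in clause_results:
        for question, reply in result.items():
            if reply is not None:
                merged.setdefault(question, reply)
    return merged


def normalize_value(value):
    """Map a known variant to its standard form; leave other values unchanged."""
    return NORMALIZATION.get(value, value)


def restore_placeholders(text, token, originals):
    """Put the original values back in place of `token`, left to right.
    If there are more tokens than stored values, the extra tokens are kept."""
    values = iter(originals)
    parts = text.split(token)
    restored = [parts[0]]
    for part in parts[1:]:
        restored.append(next(values, token))
        restored.append(part)
    return "".join(restored)
```

For example, `restore_placeholders("T {Number}, HR {Number}", "{Number}", ["37.2", "80"])` gives back `"T 37.2, HR 80"`.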

As a variation of the above embodiment, and as shown in Fig. 5, the second preferred embodiment of the present invention provides a text structuring method that includes step S10', step S20', step S30', step S40', step S50 and step S60.

Specifically, in step S10', the text structuring apparatus 1 obtains unstructured text, preprocesses the unstructured text, and splits the preprocessed unstructured text into multiple clauses; in step S20', the text structuring apparatus 1 obtains the structured entries of a structured text and a Q&A database corresponding to the structured entries; in step S30', the text structuring apparatus 1 poses the questions in the Q&A database and matches the content of each clause to the corresponding structured entry, to obtain clause structuring results; in step S40', the text structuring apparatus 1 obtains the structured text from the clause structuring results; in step S50, the text structuring apparatus 1 converts the structured text into a vector, stores it in a result database, and compares the vector of the structured text with the other vectors stored in the result database for similarity, to obtain texts similar to the structured text; in step S60, the text structuring apparatus 1 calculates the similarity between the structured text and the similar texts. Steps S10', S20', S30' and S40' are identical or substantially identical to steps S10, S20, S30 and S40 described with reference to Fig. 1, so they are not repeated here and are incorporated herein by reference.

In step S50, the text structuring apparatus 1 converts the structured text into a vector, stores it in the result database, and compares the vector of the structured text with the other vectors stored in the result database for similarity, to obtain texts similar to the structured text.

Here, the Euclidean distance between the vector of the structured text and the other text vectors stored in the database is computed, and similarity is compared by how near or far that distance is, in order to find texts in the database that are similar to the structured text. For a medical record, for instance, case texts similar to that record can be retrieved from the medical record library in this way, which helps the doctor diagnose and treat the disease.
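A minimal sketch of retrieval by Euclidean distance is shown below. How a structured text becomes a vector is not specified in the description, so the vectors are simply taken as given; the names `euclidean` and `most_similar` and the toy record identifiers are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query, stored, k=3):
    """Return the k stored items closest to `query` as (id, distance) pairs,
    nearest first. `stored` maps a text id to its vector."""
    ranked = sorted(
        ((text_id, euclidean(query, vector)) for text_id, vector in stored.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]


# Toy result database with two-dimensional vectors.
stored = {"record-a": [0.0, 0.0], "record-b": [3.0, 4.0], "record-c": [1.0, 0.0]}
neighbours = most_similar([0.0, 0.0], stored, k=2)
```

A linear scan like this is adequate for a small result database; a larger one would typically use an index structure instead.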

Further, in step S60, the text structuring device 1 calculates the similarity between the structured text and each similar text. That is, for each similar text retrieved from the database, the text structuring device 1 calculates its similarity to the structured text and outputs that similarity to the user, so that the user can compare and judge the texts.

For medical record text, similarity comparison and similarity calculation can efficiently produce similar medical records and report how similar each one is, which can greatly assist doctors in diagnosing and treating disease.
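By way of illustration only, the retrieval of steps S50 and S60 can be sketched as follows. The function names `find_similar` and `similarity`, the in-memory dictionary standing in for the result database, and the mapping from distance to a similarity score, 1 / (1 + d), are all assumptions made for this sketch; the description does not specify how the similarity value of step S60 is derived.

```python
import math


def find_similar(query, stored, top_k=3):
    """Return the top_k (key, distance) pairs closest to query by Euclidean distance.

    `stored` is a hypothetical stand-in for the result database: a mapping
    from a text identifier to its vector.
    """
    ranked = sorted(
        ((key, math.dist(query, vector)) for key, vector in stored.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:top_k]


def similarity(distance):
    """Assumed mapping from distance to a score in (0, 1]; not taken from the text."""
    return 1.0 / (1.0 + distance)
```

For instance, with stored vectors `{"a": [0, 0], "b": [3, 4], "c": [1, 0]}` and query `[0, 0]`, the two nearest records are `a` and `c`, and each returned distance can be passed to `similarity` to obtain the value shown to the user.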

Fig. 6 is a schematic diagram of a text structuring device according to a third preferred embodiment of the present invention. The text structuring device 1 includes a preprocessing module 100, an entry acquisition module 200, a clause structuring module 300 and a text forming module 400. Specifically, the preprocessing module 100 is configured to obtain unstructured text, preprocess the unstructured text, and decompose the preprocessed unstructured text into multiple clauses; the entry acquisition module 200 is configured to obtain the structured entries in the structured text and the Q&A database corresponding to the structured entries; the clause structuring module 300 is configured to pose the questions in the Q&A database and match the content of each clause to the corresponding structured entry, to obtain clause structuring results; and the text forming module 400 is configured to obtain the structured text from the clause structuring results.

Specifically, the preprocessing module 100 is configured to obtain unstructured text, preprocess the unstructured text, and decompose the preprocessed unstructured text into multiple clauses.

Here, the preprocessing module 100 obtains the unstructured text entered by the user, such as the original medical record text entered by a doctor, and preprocesses it, converting non-standard unstructured text into standardized unstructured text to facilitate the subsequent conversion. Preferably, the preprocessing includes: (1) converting all characters in the unstructured text into full-width characters, so that characters of the same kind are handled uniformly, which simplifies the operation and improves conversion performance; (2) replacing the numbers and special characters in the unstructured text with unified symbols to simplify subsequent conversion, where a unified symbol is a symbol that cannot occur in the unstructured text, such as {Number} or {special}; for example, numbers such as 10, 1073, 1.763 and -0.74 and special characters such as # and @ appearing in the unstructured text are replaced with {Number} and {special}, respectively; (3) removing the non-visible characters from the unstructured text, to simplify the conversion problem and further improve conversion performance. After preprocessing is complete, the preprocessing module 100 splits the preprocessed unstructured text at its full stops, decomposing the unstructured text into a series of clauses.

For example, suppose a preprocessed unstructured text reads: "Scattered patchy shadows of increased density are seen in the lower lobe of the right lung. No definite abnormal-density shadows are seen in the remaining lung parenchyma. Neither hilum is enlarged, the trachea and bronchi are patent, and no enlarged lymph nodes are seen in the mediastinum. No effusion is seen in either pleural cavity, and the pleura is not thickened." After it is split according to the punctuation marks in the unstructured text, the resulting series of clauses is "Scattered patchy shadows of increased density are seen in the lower lobe of the right lung", "No definite abnormal-density shadows are seen in the remaining lung parenchyma", "Neither hilum is enlarged, the trachea and bronchi are patent, and no enlarged lymph nodes are seen in the mediastinum", and "No effusion is seen in either pleural cavity, and the pleura is not thickened".
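The three preprocessing operations and the splitting at full stops described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the preprocessing module 100: the function name `preprocess`, the set of special characters (only # and @), the treatment of a leading minus sign as part of a number, and the use of Unicode "C" categories to identify non-visible characters are all choices made for this sketch.

```python
import re
import unicodedata

# (1) Map half-width ASCII letters, digits and punctuation to full-width forms.
TO_FULL_WIDTH = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}

# (2) Full-width numbers (optional sign and decimal part) and an illustrative
# set of special characters; the actual character set is not given in the text.
NUMBER = re.compile(r"－?[０-９]+(?:．[０-９]+)?")
SPECIAL = re.compile(r"[＃＠]")


def preprocess(text):
    """Normalize raw text and return (clauses, numbers, specials)."""
    text = text.translate(TO_FULL_WIDTH)
    # Remember the replaced tokens so that post-processing can restore them.
    numbers = NUMBER.findall(text)
    text = NUMBER.sub("{Number}", text)
    specials = SPECIAL.findall(text)
    text = SPECIAL.sub("{special}", text)
    # (3) Drop control and format characters (Unicode categories starting with "C").
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))
    # Split into clauses at the Chinese full stop and discard empty pieces.
    clauses = [part for part in text.split("。") if part]
    return clauses, numbers, specials
```

The replaced numbers and special characters are returned alongside the clauses because the post-processing stage later restores them to their original positions.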

The entry acquisition module 200 is configured to obtain the structured entries in the structured text and the Q&A database corresponding to the structured entries.

Here, the entry acquisition module 200 obtains, according to the structured text to be generated, the structured entries in the structured text and the Q&A database corresponding to the structured entries. That is, the entry acquisition module 200 first obtains the structured text format desired by the user, extracts each entry in that format, and then sets questions for each entry to build the Q&A database; alternatively, if the structured text format obtained by the entry acquisition module 200 already has associated question-and-answer data, the entry acquisition module 200 directly obtains each entry in the structured text and uses the corresponding Q&A database for the subsequent structuring processing.

Preferably, in this preferred embodiment, the entry acquisition module 200 includes a classification unit 201 and a forming unit 202. After the structured entries in the structured text have been obtained, the classification unit 201 first classifies the structured entries into different types, such as numerical value or location, to obtain classification results. The forming unit 202 then sets a question template for each classification result, that is, formulates a question template for each type of structured entry, and forms the Q&A database corresponding to the structured entries according to the question templates. Here, the Q&A database can be refined on a large amount of real text, with question templates continually added, to ensure that all information in the unstructured text can be covered.
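The relationship between entry types, question templates and the resulting Q&A database can be illustrated with the following sketch. The type labels `numeric` and `location`, the template wording and the function name `build_qa_database` are hypothetical; the description states only that one question template is formulated per entry type.

```python
# Hypothetical question templates, one per entry type.
QUESTION_TEMPLATES = {
    "numeric": "What is the value of {entry}?",
    "location": "Where is {entry} located?",
}


def build_qa_database(entry_types):
    """Map each structured entry to the question generated from its type's template."""
    return {
        entry: QUESTION_TEMPLATES[kind].format(entry=entry)
        for entry, kind in entry_types.items()
    }
```

Adding a new entry type then amounts to adding one template, which is consistent with the idea that templates are added continually as more text is examined.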

The clause structuring module 300 is configured to pose the questions in the Q&A database and match the content of each clause to the corresponding structured entry, to obtain clause structuring results. For each clause, the questions in the Q&A database are posed to it, and its content is matched to the corresponding structured entries according to the results of the questioning, thereby obtaining multiple clause structuring results.

Specifically, the clause structuring module 300 includes a clause processing unit 301, a question processing unit 302 and a combination pairing unit 303. The clause processing unit 301 is configured to perform word segmentation on a clause and input the resulting clause segmentation result to a first LSTM network, which performs first decoding processing on the clause segmentation result to obtain a clause decoding result. The question processing unit 302 is configured to generate the question corresponding to the clause based on the Q&A database, perform word segmentation on the question, and input the resulting question segmentation result to a second LSTM network, which performs second decoding processing on the question segmentation result to obtain a question decoding result. The combination pairing unit 303 is configured to combine, pair and model the first LSTM network and the second LSTM network according to the clause decoding result and the question decoding result, thereby obtaining the clause structuring result.

Here, a word segmentation algorithm, such as the forward maximum matching algorithm, is used to segment the unstructured text. For example, the clause processing unit 301 obtains the unstructured medical record text shown in Fig. 2 and applies the segmentation algorithm to each clause, yielding the clause segmentation result shown in Fig. 3. Similarly, the questions formed from the Q&A database for each clause are also segmented with the segmentation algorithm to obtain question segmentation results. The clause processing unit 301 then inputs the clause segmentation result to the first LSTM (long short-term memory artificial neural network) network, and the question processing unit 302 inputs the question segmentation result to the second LSTM network. In the combination pairing unit 303, the first LSTM network decodes the clause segmentation result and the second LSTM network decodes the question segmentation result, and the combination pairing unit 303 combines, pairs and models the two LSTM networks based on the clause decoding result and the question decoding result, thereby obtaining the structuring result of each clause.
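Forward maximum matching, named above as one possible segmentation algorithm, can be sketched as follows. The function name `forward_max_match`, the example vocabulary and the fallback of emitting a single character when no dictionary word matches are assumptions of this sketch; the description does not give the dictionary or the details of the algorithm used.

```python
def forward_max_match(text, vocabulary, max_len=None):
    """Segment text left to right, always taking the longest dictionary word.

    A single character is emitted when no dictionary word starts at the
    current position (an assumed fallback for out-of-vocabulary text).
    """
    if max_len is None:
        max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

Because the longest candidate is tried first, a vocabulary containing both "密度" and "密度增高" yields the longer word whenever the text allows it.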

The text forming module 400 is configured to obtain the structured text from the clause structuring results. Preferably, the text forming module 400 includes a merging unit 401 and a post-processing unit 402. The merging unit 401 first merges the multiple clause structuring results to obtain a paragraph structuring result, and the post-processing unit 402 then post-processes the paragraph structuring result to obtain the structured text. Here, merging means combining all questions that received an answer in any clause to obtain the final answers, thereby producing the paragraph structuring result. Post-processing includes: (1) normalizing the relevant descriptions in the structured text, for example normalizing "I°", "one degree" and "I degree" in descriptions of tonsil size to "1 degree", and, in the classification of rales, normalizing "coarse bubbling rales" to "coarse moist rales", "medium bubbling rales" to "medium moist rales" and "fine bubbling rales" to "fine moist rales"; (2) restoring the numbers and special characters that were replaced with unified symbols during preprocessing, for example restoring the 10, 1073, 1.763 and -0.74 that were replaced with the unified symbol in the foregoing embodiment to the original 10, 1073, 1.763 and -0.74, so that the content of the structured text remains consistent with the original unstructured text.
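The merging and post-processing described above can be sketched as follows. The dictionary representation of a clause structuring result, the rule that the first answered clause wins when two clauses answer the same entry, the contents of the `SYNONYMS` table and all function names are assumptions made for illustration; the description does not specify how conflicting answers are resolved.

```python
# Illustrative normalization table; real tables would be domain specific.
SYNONYMS = {"I°": "1 degree", "one degree": "1 degree", "I degree": "1 degree"}


def merge_clause_results(clause_results):
    """Merge per-clause results, keeping the first non-empty answer per entry."""
    merged = {}
    for result in clause_results:
        for entry, answer in result.items():
            if answer is not None and entry not in merged:
                merged[entry] = answer
    return merged


def normalize(answer):
    """Replace a known variant description with its standard form."""
    return SYNONYMS.get(answer, answer)


def restore(text, placeholder, originals):
    """Put the original tokens back in place of the placeholder, in order."""
    for value in originals:
        text = text.replace(placeholder, value, 1)
    return text
```

For example, `restore("{Number} and {Number}", "{Number}", ["10", "1073"])` puts the two numbers back in the order in which they were removed.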

As a variation of the above embodiment, as shown in Fig. 7, a fourth preferred embodiment of the present invention provides a text structuring device that further includes a similarity judgment module 500 and a similarity calculation module 600.

Specifically, the similarity judgment module 500 is configured to convert the structured text into a vector, store it in the result database, and compare the vector of the structured text with the other vectors stored in the result database for similarity, to obtain similar texts of the structured text.

Here, the Euclidean distance between the structured text vector and each of the other text vectors stored in the database is computed, and similarity is compared according to the magnitude of that distance, so as to find texts in the database that are similar to the structured text. For example, for a medical record, the device can retrieve records similar to it from the medical record library, thereby assisting the doctor in diagnosing and treating the disease.

Further, the similarity calculation module 600 is configured to calculate the similarity between the structured text and each similar text. That is, for each similar text retrieved from the database, the text structuring device 1 calculates its similarity to the structured text and outputs that similarity to the user, so that the user can compare and judge the texts.

In medical diagnosis, similarity comparison and similarity calculation can efficiently produce similar medical records and report how similar each one is, which can greatly assist doctors in diagnosing and treating disease.

As a variation of the above embodiments, the present invention also provides a text structuring system that includes the text structuring device of the foregoing embodiments.

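As a variation of the above embodiments, the present invention also provides a non-volatile storage medium on which a text structuring program is stored. The text structuring program is executed by a computer to implement the text structuring method, including: obtaining unstructured text, preprocessing the unstructured text, and decomposing the preprocessed unstructured text into multiple clauses; obtaining the structured entries in the structured text and the Q&A database corresponding to the structured entries; posing the questions in the Q&A database and matching the content of each clause to the corresponding structured entry, to obtain clause structuring results; and obtaining the structured text from the clause structuring results.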

As described above, the text structuring method, device, system and non-volatile storage medium disclosed by the present invention use a Q&A database to convert unstructured text information fully into structured information, with good conversion results and high accuracy. Because clause structuring is performed by two LSTM networks, the varied forms of expression found in free text can be handled, giving good robustness.

In summary, the above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.

Claims

1. A text structuring method, characterized in that the method comprises:

obtaining unstructured text, preprocessing the unstructured text, and decomposing the preprocessed unstructured text into multiple clauses;

obtaining the structured entries in a structured text and the Q&A database corresponding to the structured entries;

posing the questions in the Q&A database and matching the content of each clause to the corresponding structured entry, to obtain clause structuring results;

obtaining the structured text from the clause structuring results.
2. The text structuring method according to claim 1, characterized in that the preprocessing comprises: replacing the numbers and special characters in the unstructured text with unified symbols.
3. The text structuring method according to claim 1, characterized in that obtaining the structured entries in the structured text and the Q&A database corresponding to the structured entries comprises:

classifying the structured entries to obtain classification results;

setting a question template for each of the classification results, and forming the Q&A database corresponding to the structured entries according to the question templates.
4. The text structuring method according to claim 1, characterized in that posing the questions in the Q&A database and matching the content of each clause to the corresponding structured entry, to obtain the clause structuring results, comprises:

performing word segmentation on the clause and inputting the resulting clause segmentation result to a first LSTM network, the first LSTM network performing first decoding processing on the clause segmentation result to obtain a clause decoding result;

generating the question corresponding to the clause based on the Q&A database, performing word segmentation on the question, and inputting the resulting question segmentation result to a second LSTM network, the second LSTM network performing second decoding processing on the question segmentation result to obtain a question decoding result;

combining, pairing and modeling the first LSTM network and the second LSTM network according to the clause decoding result and the question decoding result, thereby obtaining the clause structuring result.
5. The text structuring method according to claim 1, characterized in that obtaining the structured text from the clause structuring results comprises:

merging the multiple clause structuring results to obtain a paragraph structuring result;

post-processing the paragraph structuring result to obtain the structured text.
6. The text structuring method according to claim 1, characterized in that, after obtaining the structured text from the clause structuring results, the method further comprises:

converting the structured text into a vector, storing it in a result database, and comparing the vector of the structured text with the other vectors stored in the result database for similarity, to obtain similar texts of the structured text;

calculating the similarity between the structured text and each similar text.
7. A text structuring device, characterized in that the text structuring device comprises:

a preprocessing module, configured to obtain unstructured text, preprocess the unstructured text, and decompose the preprocessed unstructured text into multiple clauses;

an entry acquisition module, configured to obtain the structured entries in a structured text and the Q&A database corresponding to the structured entries;

a clause structuring module, configured to pose the questions in the Q&A database and match the content of each clause to the corresponding structured entry, to obtain clause structuring results;

a text forming module, configured to obtain the structured text from the clause structuring results.
8. The text structuring device according to claim 7, characterized in that the preprocessing module is further configured to replace the numbers and special characters in the unstructured text with unified symbols.
9. The text structuring device according to claim 7, characterized in that the entry acquisition module comprises:

a classification unit, configured to classify the structured entries to obtain classification results;

a forming unit, configured to set a question template for each of the classification results and form the Q&A database corresponding to the structured entries according to the question templates.
10. The text structuring device according to claim 7, characterized in that the clause structuring module comprises:

a clause processing unit, configured to perform word segmentation on the clause and input the resulting clause segmentation result to a first LSTM network, the first LSTM network performing first decoding processing on the clause segmentation result to obtain a clause decoding result;

a question processing unit, configured to generate the question corresponding to the clause based on the Q&A database, perform word segmentation on the question, and input the resulting question segmentation result to a second LSTM network, the second LSTM network performing second decoding processing on the question segmentation result to obtain a question decoding result;

a combination pairing unit, configured to combine, pair and model the first LSTM network and the second LSTM network according to the clause decoding result and the question decoding result, to obtain the clause structuring result.
11. The text structuring device according to claim 7, characterized in that the text forming module comprises:

a merging unit, configured to merge the multiple clause structuring results to obtain a paragraph structuring result;

a post-processing unit, configured to post-process the paragraph structuring result to obtain the structured text.
12. The text structuring device according to claim 7, characterized in that the text structuring device further comprises:

a similarity judgment module, configured to convert the structured text into a vector, store it in a result database, and compare the vector of the structured text with the other vectors stored in the result database for similarity, to obtain similar texts of the structured text;

a similarity calculation module, configured to calculate the similarity between the structured text and each similar text.
13. A text structuring system, characterized by comprising the text structuring device according to any one of claims 7 to 12.
14. A non-volatile storage medium, characterized in that a text structuring program is stored on the storage medium, the text structuring program being executed by a computer to implement a text structuring method, comprising:

instruction a: obtaining unstructured text, preprocessing the unstructured text, and decomposing the preprocessed unstructured text into multiple clauses;

instruction b: obtaining the structured entries in a structured text and the Q&A database corresponding to the structured entries;

instruction c: posing the questions in the Q&A database and matching the content of each clause to the corresponding structured entry, to obtain clause structuring results;

instruction d: obtaining the structured text from the clause structuring results.