CN109299179A

CN109299179A - Structured data extraction device, method and storage medium

Info

Publication number: CN109299179A
Application number: CN201811196902.1A
Authority: CN
Inventors: 许娟; 刘超; 刘宁
Original assignee: Siemens Healthineers Ltd
Current assignee: Siemens digital medical technology (Shanghai) Co., Ltd
Priority date: 2018-10-15
Filing date: 2018-10-15
Publication date: 2019-02-01

Abstract

The present invention provides an information extracting device, method and storage medium. A plurality of training texts are obtained, together with preset markup information indicating the position of target information in each training text; first text representation information corresponding to each of the training texts is obtained; and a positioning strategy is determined using the first text representation information and the markup information. A target text and second text representation information corresponding to the target text are obtained; the positioning strategy is used to determine estimated information on the position of the target information in the second text representation information; and the target information is extracted from the target text according to the estimated information. The technical solution derives the positioning strategy from the positional regularities of the target information in the text representation information of the training texts, and thereby gains the ability to analyze the semantics of a text in order to locate the target information in it, improving the efficiency and accuracy of information extraction.

Description

Structured data extraction device, method and storage medium

Technical field

The present invention relates to the field of data processing, and in particular to a structured data extraction device, method and storage medium.

Background art

Unstructured data refers to data that cannot be represented with a unified structure, such as documents, pictures, audio and video. At present, much widely used data takes the form of unstructured data, for example various documents, medical reports and work reports. Each item of data in structured data includes a data type and a corresponding data value. Compared with structured data, unstructured data is inconvenient for operations such as information extraction, retrieval and statistics using computer technology.

Chinese patent application CN1497473A discloses a method and apparatus for rule-based conversion of unstructured text information into structured form. The method comprises the steps of: inputting structuring rules; obtaining unstructured text information; performing syntactic analysis on the unstructured text information to generate small text fragments; finding, among the text units of the unstructured text information, the text fragments defined in the structuring rules; and structuring the text fragments of the unstructured text information according to the conditions specified in the structuring rules. The apparatus includes: an input device for unstructured text information; an input device and a storage device for structuring rules; an extraction device for extracting small text units from the unstructured text information; a structuring device for generating structured text information according to the structuring rules; and a processing device for the text units in the structured text information.

Chinese patent application CN101154159A discloses an efficient and easy-to-operate system (1) for generating and running software applications for medical imaging. The system includes at least one framework (2, 22, 34, 45, 58, 69, 81), which has a service layer (3) and a toolbox layer (4) arranged on the service layer as an application programming interface, wherein the functions of the toolbox layer (4) and of the service layer (3) are each grouped into a plurality of components, and these components are arranged in a strict hierarchy such that any component can only ever be accessed from a higher-level component.

Chinese patent application CN107644671A discloses a method for supporting a reporting physician in evaluating an image data set, together with an image recording system, a computer program and an electronically readable data medium. In this method for supporting a reporting physician in evaluating an image data set of a patient recorded with an image recording system (10), the image data set is automatically processed by at least one preprocessing algorithm for display to the reporting physician, wherein the at least one preprocessing algorithm and/or at least one preprocessing parameter that parameterizes the at least one preprocessing algorithm is selected automatically by an artificial-intelligence selection algorithm according to at least one of the following: recording information (1) describing the recording and/or the recording region of the image data set, and/or at least one item of additional information (3a, 3b, 3c, 3d) about previous examinations of the patient.

Summary of the invention

In view of this, the present invention proposes an information extracting device, method and storage medium, to improve the efficiency and accuracy of information extraction.

The embodiments provide an information extracting device, which may include:

a learning module, configured to obtain a plurality of training texts and preset markup information indicating the position of target information in each of the training texts; obtain first text representation information corresponding to each training text in the plurality of training texts; and determine a positioning strategy using the first text representation information and the markup information; and

an extraction module, configured to obtain a target text and second text representation information corresponding to the target text; determine, using the positioning strategy, estimated information on the position of the target information in the second text representation information; and extract the target information from the target text according to the estimated information.

As can be seen, the information extracting device of the embodiments derives the positioning strategy from the positional regularities of the target information in the text representation information of the training texts, and thereby gains the ability to analyze the semantics of a text in order to locate the target information in it, which improves the efficiency and accuracy of information extraction.

In some embodiments, the learning module may include:

an element unit, configured to determine a plurality of preset information elements corresponding to the target information; obtain, from the markup information, element markup information for each of the plurality of information elements, the element markup information indicating the position of the corresponding information element in each training text; and determine, using the first text representation information and the element markup information, an element positioning strategy for each of the plurality of information elements as the positioning strategy;

wherein the extraction module is configured to determine, using the element positioning strategies, the element position of each of the plurality of information elements in the second text representation information, and to extract from the target text, according to the element positions, the text content corresponding to the plurality of information elements as the target information.

In this way, determining a separate positioning strategy for each information element in the target information makes the extraction of the target information more accurate.

In some embodiments, the extraction module may include:

an element extraction unit, configured to extract from the target text, according to the element positions, the text content corresponding to each of the plurality of information elements; and

a data generating unit, configured to generate structured data from the text content as the target information; wherein the structured data includes a plurality of data entries, and each data entry includes the element identifier of one of the plurality of information elements and the corresponding text content.

Organizing the target information extracted from the target text into structured data makes the extracted information convenient for computer use, for example for retrieval and statistics.
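As an illustration only, the data entries described above might be sketched as follows in Python. The names `DataEntry` and `build_structured_data`, the element identifiers and the use of character offsets as element positions are assumptions made for this sketch and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass
class DataEntry:
    element_id: str  # identifier of the information element (hypothetical naming)
    text: str        # text content extracted for that element


def build_structured_data(target_text, element_positions):
    """Turn element positions into data entries.

    element_positions maps an element identifier to a (start, end) pair of
    character offsets in target_text, with end exclusive (an assumption).
    """
    return [
        DataEntry(element_id, target_text[start:end])
        for element_id, (start, end) in element_positions.items()
    ]


def span_of(text, phrase):
    """Helper for the example: locate a phrase and return its offsets."""
    start = text.index(phrase)
    return start, start + len(phrase)


report = "A 12 mm calcification in the left lung."
positions = {
    "lesion_type": span_of(report, "calcification"),
    "size": span_of(report, "12 mm"),
    "location": span_of(report, "left lung"),
}
records = build_structured_data(report, positions)
```

Each entry pairs an element identifier with its text, so the result can be indexed or counted like any other tabular record.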

In some embodiments, the information extracting device may further include:

a label determining module, configured to obtain a preset content label for each second training text in a plurality of second training texts; determine a classification strategy using the first text representation information and the content labels; and determine the content label of the target text using the classification strategy and the second text representation information;

an information output module, configured to generate a data set corresponding to the target text from the content label of the target text and the target information extracted from the target text, and to output the data set;

wherein the learning module is configured to select, as the plurality of training texts, those second training texts among the plurality of second training texts that have a first content label, and to take the determined positioning strategy as the positioning strategy corresponding to the first content label; and

the extraction module is configured to, when the content label of the target text is the first content label, extract the target information from the target text using the positioning strategy corresponding to the first content label.

By classifying the content of the target text and using the positioning strategy corresponding to its class, the accuracy of information extraction can be further improved.
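The classify-then-locate flow above can be sketched roughly as follows. This is a toy stand-in: the keyword lists replace a learned classification strategy, the regular expressions replace learned positioning strategies, and all names (`LABEL_KEYWORDS`, `LABEL_STRATEGIES`, `classify`, `extract_with_label`) are hypothetical.

```python
import re

# Hypothetical keyword lists standing in for a learned classification strategy.
LABEL_KEYWORDS = {
    "radiology": {"ct", "nodule", "lesion"},
    "finance": {"revenue", "profit", "expenditure"},
}

# Hypothetical per-label positioning strategies, here plain regular expressions.
LABEL_STRATEGIES = {
    "radiology": re.compile(r"\d+(?:\.\d+)?\s*(?:mm|cm)"),
    "finance": re.compile(r"\$\d[\d,]*"),
}


def classify(text):
    """Return the content label with the most keyword hits, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {label: len(words & kws) for label, kws in LABEL_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract_with_label(text):
    """Pick the positioning strategy that matches the text's content label."""
    label = classify(text)
    if label is None:
        return None, None
    match = LABEL_STRATEGIES[label].search(text)
    return label, (match.group() if match else None)
```

The point of the sketch is the dispatch: the content label chooses which positioning strategy is applied, so each strategy only has to handle one kind of text.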

In some embodiments, the information extracting device may further include:

an element setting module, configured to obtain, from the markup information, the information elements included in the target information of each training text in the plurality of training texts; and to determine, according to the content labels of the plurality of training texts and the information elements included in the target information of each training text, the information element set corresponding to each content label in a plurality of content labels, the information element set including at least one information element;

wherein the extraction module is configured to obtain the information element set corresponding to the content label of the target text as a target element set; for each information element in the target element set, extract the text content corresponding to that information element from the target text using the positioning strategy; and take the element identifier of each information element in the target element set, together with the corresponding text content, as the target information.

In this way, because the information element set corresponding to each content label is determined from the information elements included in the target information of the training texts, there is no need to preset configuration information for the information element set of each content label, which reduces manual work. When more information needs to be extracted, once the training texts have been manually annotated, the above process can be run to update the information element set corresponding to each content label, which improves the scalability of the solution.
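A minimal sketch of how an information element set per content label could be derived from annotated training texts is given below. The input shape (pairs of a content label and the annotated element identifiers) and the function name `element_sets_by_label` are assumptions for illustration.

```python
from collections import defaultdict


def element_sets_by_label(training_samples):
    """Collect, for each content label, every information element that was
    annotated in at least one training text carrying that label.

    training_samples is an iterable of (content_label, element_ids) pairs,
    a hypothetical simplification of the markup information.
    """
    sets = defaultdict(set)
    for label, element_ids in training_samples:
        sets[label].update(element_ids)
    return dict(sets)


samples = [
    ("radiology", ["lesion_type", "size"]),
    ("radiology", ["location"]),
    ("finance", ["date", "revenue"]),
]
element_sets = element_sets_by_label(samples)
```

Annotating a new element in any training text and re-running the function extends the set for that label, with no separate configuration file to maintain.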

In some embodiments, the information extracting device may further include:

a segmentation module, configured to divide each second training text in a plurality of second training texts into a plurality of training text segments, and take the training text segments corresponding to the plurality of second training texts as the plurality of training texts; and to divide a second target text into a plurality of target text segments, and take the plurality of target text segments as the target text;

an information output module, configured to generate a data set corresponding to the second target text from information of the second target text and the target information extracted from the plurality of target text segments, and to output the data set.

In this way, by dividing a text into shorter text segments and using the text segments as training texts and target texts, the accuracy of information extraction can be improved.
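One simple way to divide a text into shorter segments is to split at sentence-ending punctuation, as sketched below. The application does not specify a segmentation rule, so the punctuation set and the function name `split_into_segments` are assumptions; a naive rule like this would also split inside decimals such as "1.5".

```python
import re

# Sentence-ending punctuation, Western and Chinese (an assumed rule).
_SEGMENT = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_into_segments(text):
    """Split text into sentence-like segments, dropping empty pieces."""
    segments = []
    for match in _SEGMENT.finditer(text):
        segment = match.group().strip()
        if segment:
            segments.append(segment)
    return segments


report = "No thrombus is seen. A 5 mm nodule is present in the left lung."
segments = split_into_segments(report)
```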

In some embodiments, the learning module may be configured to determine, using the position markup information corresponding to each training text, the start position and end position of the target information in the first text representation information of that training text, and to determine the positioning strategy using the first text representation information, start position and end position corresponding to each training text;

and the extraction module may be configured to determine, using the positioning strategy, the start position and end position of the target information in the second text representation information; map the start position and end position in the second text representation information to a start position and end position in the target text; and take the text content between the start position and the end position in the target text as the target information.

As can be seen, indicating the position of the target information by a start position and an end position makes it possible to extract target information consisting of multiple consecutive characters or words, which improves the efficiency and flexibility of information extraction.
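The mapping from positions in the representation back to positions in the text can be illustrated with a small sketch. It assumes that the text representation holds one vector per whitespace-separated token, so that a predicted start and end position are token indices; the function names are hypothetical.

```python
import re


def tokenize_with_offsets(text):
    """Return (token, char_start, char_end) for each whitespace-separated token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def extract_span(text, start_token, end_token):
    """Map an inclusive token range to character offsets and cut out the text."""
    tokens = tokenize_with_offsets(text)
    char_start = tokens[start_token][1]
    char_end = tokens[end_token][2]
    return text[char_start:char_end]


text = "Findings: small calcified nodule in right upper lobe"
# Suppose the positioning strategy predicted start token 2 and end token 3.
lesion = extract_span(text, 2, 3)
```

Because the span is defined by two positions, the extracted content can cover any number of consecutive tokens.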

The embodiments further provide an information extracting method, which may include:

obtaining a plurality of training texts and preset markup information indicating the position of target information in each of the training texts;

obtaining first text representation information corresponding to each training text in the plurality of training texts;

determining a positioning strategy using the first text representation information and the markup information; and

providing the positioning strategy to a computing device, so that the computing device uses the positioning strategy to determine estimated information on the position of the target information in second text representation information corresponding to a target text, and extracts the target information from the target text according to the estimated information.

As can be seen, the information extracting method of the embodiments derives the positioning strategy from the positional regularities of the target information in the text representation information of the training texts, and thereby gains the ability to analyze the semantics of a text in order to locate the target information in it, which improves the efficiency and accuracy of information extraction.

In some embodiments, determining a positioning strategy using the first text representation information and the markup information includes:

determining a plurality of preset information elements corresponding to the target information, and obtaining from the markup information the element markup information of each of the plurality of information elements, the element markup information indicating the position of the corresponding information element in each training text; and

determining, using the first text representation information and the element markup information, an element positioning strategy for each of the plurality of information elements as the positioning strategy, so that the computing device uses the element positioning strategies to determine the element position of each of the plurality of information elements in the second text representation information, and extracts from the target text, according to the element positions, the text content corresponding to the plurality of information elements as the target information.

In some embodiments, the method may further include:

providing a preset data generation strategy to the computing device, so that the computing device extracts from the target text, according to the element positions, the text content corresponding to each of the plurality of information elements, and generates structured data as the target information according to the data generation strategy and the text content; wherein the structured data includes a plurality of data entries, and each data entry includes the element identifier of one of the plurality of information elements and the corresponding text content.

In some embodiments, the method may further include:

obtaining a preset content label for each training text in the plurality of training texts, and determining a classification strategy using the first text representation information and the content labels; and

providing the classification strategy to the computing device, so that the computing device determines the content label of the target text using the classification strategy and the second text representation information, generates a data set corresponding to the target text from the content label of the target text and the target information extracted from the target text, and outputs the data set.

In some embodiments, the method may further include:

obtaining, from the markup information, the information elements of the target information of each training text in the plurality of training texts, and determining, according to the content labels of the plurality of training texts and the information elements of the target information of each training text, the information element set corresponding to each content label in a plurality of content labels, the information element set including at least one information element; and

providing information on the information element set corresponding to each content label to the computing device, so that the computing device obtains, according to that information, the information element set corresponding to the content label of the target text as a target element set; for each information element in the target element set, extracts the text content corresponding to that information element from the target text using the positioning strategy; and takes the element identifier of each information element in the target element set, together with the corresponding text content, as the target information.

Determining the information element set corresponding to each content label from the information elements included in the target information of the training texts reduces manual work and improves the scalability of the solution.

In some embodiments, the method may further include:

providing a segmentation strategy to the computing device, so that the computing device divides a second target text into a plurality of target text segments, takes the plurality of target text segments as the target text, generates a data set corresponding to the second target text from information of the second target text and the target information of the plurality of target text segments, and outputs the data set.

In some embodiments, determining a positioning strategy using the first text representation information and the markup information includes:

determining, using the position markup information, the start position and end position of the target information in the first text representation information, and determining the positioning strategy using the first text representation information, the start position and the end position, so that the computing device uses the positioning strategy to determine the start position and end position of the target information in the second text representation information; maps the start position and end position in the second text representation information to a start position and end position in the target text; and extracts the content between the start position and the end position in the target text as the target information.

The embodiments also provide an information extracting method, which may include:

obtaining a target text and second text representation information corresponding to the target text;

determining, using a preset positioning strategy, estimated information on the position of target information in the second text representation information; and

extracting the target information from the target text according to the estimated information;

wherein the positioning strategy is determined using the first text representation information corresponding to each training text in a plurality of training texts and preset markup information indicating the position of the target information in each of the training texts.

As can be seen, the information extracting method of the embodiments analyzes the semantics of the target text using the preset positioning strategy in order to locate the target information in it, which improves the efficiency and accuracy of information extraction.

The embodiments also provide an information extracting device including a processor and a memory, the memory storing computer-readable instructions that cause the processor to implement the information extracting method of the embodiments.

As can be seen, the information extracting device of the embodiments determines a positioning strategy from training texts, analyzes the semantics of the target text and locates the target information in it using the positioning strategy, which improves the efficiency and accuracy of information extraction.

The embodiments also provide a computer-readable storage medium storing computer-readable instructions, characterized in that the instructions cause a processor to implement the information extracting method of the embodiments.

As can be seen, the storage medium of the embodiments enables a processor to determine a positioning strategy from training texts, analyze the semantics of the target text and locate the target information in it using the positioning strategy, which improves the efficiency and accuracy of information extraction.

Brief description of the drawings

The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the above and other features and advantages of the invention become clearer to those skilled in the art. In the drawings:

Fig. 1 is a schematic diagram of an information extracting device according to an embodiment of the present application.

Fig. 2 is a schematic diagram of an information extracting device according to an embodiment of the present application.

Fig. 3 is a schematic diagram of a system according to an embodiment of the present application.

Fig. 4 is a flow chart of an information extracting method according to an embodiment of the present application.

Fig. 5 is a flow chart of an information extracting method according to an embodiment of the present application.

Fig. 6 is a flow chart of an information extracting method based on NLP deep learning according to an embodiment of the present application.

Fig. 7 is a schematic diagram of a text classification model based on NLP technology according to an embodiment of the present application.

Fig. 8 is a schematic diagram of a reinforcement learning model based on a deep Q-network (DQN) according to an embodiment of the present application.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below by way of embodiments.

The embodiments of the present application propose a technical solution for extracting target information from unstructured text: a positioning strategy for the target information is obtained from training texts, and the target information is extracted from a target text using the positioning strategy. Fig. 1 is a schematic diagram of an information extracting device according to an embodiment of the present application. As shown in Fig. 1, the information extracting device 10 may include a learning module 11 and an extraction module 12.

The learning module 11 may obtain a plurality of training texts and preset markup information indicating the position of target information in each of the training texts; obtain the text representation information corresponding to each training text in the plurality of training texts (hereinafter, for ease of reference, the first text representation information); and determine a positioning strategy using the first text representation information and the markup information.

The extraction module 12 may obtain a target text and the text representation information corresponding to the target text (hereinafter, for ease of reference, the second text representation information); determine, using the positioning strategy, estimated information on the position of the target information in the second text representation information; and extract the target information from the target text according to the estimated information.

Here, a training text is a text that carries markup information and is used to train the positioning strategy, and a target text is a text to be processed from which information needs to be extracted. Training texts and target texts are unstructured text data and may each be a sentence, a paragraph, a whole document, and so on. They may be obtained from files in a preset format, such as txt or DOC.

Target information refers to information carrying one or more items of knowledge of interest. For example, for medical records, the target information may be selected from patient name, examination item name, diagnosis result, treatment, and so on; for medical examination reports (for example radiological examination reports, ultrasound examination reports, etc.), it may be selected from patient identifier, examination item name, lesion type, location, size, and so on; for financial reports, from date, income, expenditure, profit, growth rate, and so on; and for news articles, from time, place, persons, event type, and so on. In different application scenarios, technicians may set the type and number of items of target information to be extracted as needed, which is not limited here.

Markup information indicates the position of the target information in each training text. In some embodiments, the positions of the target information in the training texts may be identified manually and recorded using preset symbols or marks; these symbols or marks constitute the markup information. For example, the text content corresponding to the target information in a training text may be marked with highlighting. As another example, it may be marked with predetermined symbols (such as brackets, asterisks, etc.).
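To make the symbol-based marking concrete, the sketch below reads a training text in which target information has been wrapped in square brackets and recovers both the plain text and the marked positions. The bracket convention and the function name `parse_bracket_markup` are assumptions for illustration; the application leaves the choice of symbols open.

```python
import re


def parse_bracket_markup(marked_text):
    """Strip [..] markers and return (plain_text, spans).

    Each span is a (start, end) pair of character offsets into plain_text,
    with end exclusive, covering one marked piece of target information.
    """
    plain_parts = []
    spans = []
    cursor = 0      # read position in marked_text
    plain_len = 0   # length of the plain text built so far
    for match in re.finditer(r"\[([^\[\]]+)\]", marked_text):
        before = marked_text[cursor:match.start()]
        content = match.group(1)
        plain_parts.append(before)
        plain_len += len(before)
        spans.append((plain_len, plain_len + len(content)))
        plain_parts.append(content)
        plain_len += len(content)
        cursor = match.end()
    plain_parts.append(marked_text[cursor:])
    return "".join(plain_parts), spans


marked = "A [thrombus] is seen near a [calcification point]."
plain, spans = parse_bracket_markup(marked)
```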

Text representation information refers to a representation of a text in a semantic space, such as a vector or a matrix, into which the text is converted so that a computer can conveniently perform computation and logical processing on it. Texts with similar meanings have similar text representation information. A text may be converted into text representation information using various methods from natural language processing (NLP) technology, for example word2vector (word to vector), StringToWordVector, sentence embedding, and so on.
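The idea that similar texts receive similar representations can be illustrated with a deliberately simple stand-in. The sketch below uses bag-of-words count vectors and cosine similarity, which capture word overlap only and not meaning; it is not one of the embedding methods named above, and the function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as word counts (a crude stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


left = bag_of_words("small nodule in left lung")
right = bag_of_words("small nodule in right lung")
other = bag_of_words("revenue increased this quarter")

sim_close = cosine(left, right)   # four of five words shared
sim_far = cosine(left, other)     # no words shared
```

A learned embedding would additionally place texts with different wording but similar meaning close together, which this count-based version cannot do.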

A positioning strategy is a mechanism for identifying and locating the target information in the text representation information of a target text. It may be determined using the training texts and the markup information. In some examples, the positioning strategy may be obtained by search using means such as regular expressions. In other examples, artificial intelligence technology may be used to determine it. For example, a machine learning model may be trained with the training texts and the markup information so that the model learns the positioning strategy, and the trained model is then used to extract the target information from the target text. The machine learning model may include one or more machine learning modules, which may be selected from, but are not limited to, supervised learning models, reinforcement learning models, deep learning models, and so on.
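For the regular-expression case mentioned above, a positioning strategy might look like the following sketch. It operates directly on the raw text rather than on a text representation, which is a simplification, and the pattern and the names `SIZE_PATTERN` and `locate` are hypothetical.

```python
import re

# Hypothetical hand-written strategy for one information element: lesion size.
SIZE_PATTERN = re.compile(r"\d+(?:\.\d+)?\s*(?:mm|cm)")


def locate(pattern, text):
    """Return the (start, end) offsets of the first match, or None."""
    match = pattern.search(text)
    return (match.start(), match.end()) if match else None


text = "Nodule measuring 1.5 cm in the left lobe."
span = locate(SIZE_PATTERN, text)
size = text[span[0]:span[1]] if span else None
```

A rule like this is easy to inspect but only covers the phrasings it was written for, which is one motivation for learning the strategy from annotated training texts instead.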

Estimated information is the position of the target information in the second text representation information as predicted by the positioning strategy. Its form may depend on the form of the text representation information. For example, when the words of the text are present in the second text representation information in the form of vectors, the estimated information may be the identifier, position, etc. of the vectors corresponding to the target information.

A simple example is given below (in practice the target information may include multiple items of information to be extracted; this example is only intended to aid understanding). Suppose the target texts are radiological examination reports written manually by radiologists (for example CT, magnetic resonance, etc.) and the target information is the lesion type. A large number of examination reports may be collected as training texts, and professionals annotate the target information in them (for example text such as "thrombus", "tumor", "calcification point"). The training texts and the markup information are supplied to the information extracting device 10 for learning. The device 10 may obtain the text representation information of each training text and derive the positioning strategy for the target information from the text representation information and the markup information of the target information. When a newly generated examination report is obtained, the information extracting device 10 may take it as the target text, analyze the second text representation information of the target text using the positioning strategy, predict the position of the target information in the second text representation information, determine from that position the position of the target information in the target text, and extract the target information from that position in the target text. The training texts and target texts may be obtained by text recognition of paper reports using technology such as OCR, or may be texts entered manually into a computing device.

In some embodiments, the device 10 may further include a processor 14 and a memory 13. The memory 13 may store computer-readable instructions corresponding to the learning module 11 and the extraction module 12 described above; these instructions can be executed by the processor 14 to implement the functions of the learning module 11 and the extraction module 12.

In some embodiments, the target information may include multiple items of information to be extracted. In some examples, when determining the positioning strategy, the learning module 11 may generate a positioning strategy that locates the multiple items simultaneously, in which case the estimated information on the position of the target information obtained with the positioning strategy includes estimated position information for each item. In other examples, the learning module 11 may instead generate a separate positioning strategy for each item and use these positioning strategies in turn to determine the estimated position information of each item. Fig. 2 is a schematic diagram of an information extracting device according to an embodiment of the present application. As shown in Fig. 2, the learning module 11 may include an element unit 112.

Element unit 112 can determine a plurality of preset information elements corresponding to the target information and obtain, from the markup information, the element markup information of each of the information elements, where the element markup information indicates the position of the corresponding information element in each training text. It can then use the first text representation information and the element markup information to determine an element positioning strategy for each information element, and these element positioning strategies serve as the positioning strategy.

As noted above, the target information may include a plurality of information elements, such as lesion type, size, quantity, and location. The markup information may therefore include position markup information for each information element. In some examples, different information elements of the target information can be marked in different ways in the markup information. For example, different highlight colors can be used to mark different information elements. As another example, different symbols can be used. In this way, the position markup information of each information element can be identified in the markup information.

In this case, extraction module 12 can use the element positioning strategies to determine the element position of each information element in the second text representation information, and extract from the target text, according to the element positions, the text content corresponding to the information elements as the target information.

In the embodiments, when the target information includes a plurality of information elements, the target information extracted from the target text can be organized as structured data. As shown in Fig. 2, extraction module 12 may include an element extraction unit 122 and a data generating unit 123.

Element extraction unit 122 can use each element positioning strategy to determine the element position of each information element in the second text representation information, and extract from the target text, according to the element positions, the text content corresponding to each information element.

Data generating unit 123 can use the text content to generate structured data as the target information. The structured data includes a plurality of data entries, each of which includes the element identifier of one of the information elements and the corresponding text content.
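The data entries described above can be illustrated with a minimal Python sketch. The names `DataEntry` and `build_structured_data`, the element identifiers, and the convention of character offsets with an exclusive end are assumptions made for illustration; the embodiments do not prescribe a concrete data layout.

```python
from dataclasses import dataclass


@dataclass
class DataEntry:
    element_id: str  # identifier of the information element (hypothetical field name)
    text: str        # text content extracted for that element


def build_structured_data(target_text, element_positions):
    """Build one data entry per information element.

    element_positions maps an element identifier to (start, end) character
    offsets in target_text, with end exclusive (an assumed convention).
    """
    return [
        DataEntry(element_id, target_text[start:end])
        for element_id, (start, end) in element_positions.items()
    ]


report = "A 12 mm nodule is seen in the right upper lobe."
positions = {"size": (2, 7), "lesion_type": (8, 14), "location": (30, 46)}
entries = build_structured_data(report, positions)
```

Here each entry pairs an element identifier with the text found at the element position, which is the form that later retrieval or statistics would consume.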

A target text may relate to any of several fields (for example, medical examination reports may cover different body parts and different examination modalities), and the terminology and text content of target texts from different fields may differ considerably. To improve the accuracy of information extraction, positioning strategies can be obtained separately for different types of target text. In some embodiments, as shown in Fig. 2, information extracting device 10 can also include a label determining module 15 and a message output module 16.

Label determining module 15 can obtain the preset content label of each of a plurality of training texts, determine a classification policy using the first text representation information and the content labels, and determine the content label of the target text using the classification policy and the second text representation information.

The content label indicates the preset category to which the content of the target text belongs. A plurality of content labels can be preset according to the different fields the target texts may involve. For example, for medical examination reports, content labels such as pulmonary thrombus, lung tumor, cardiovascular stenosis, thyroid nodule, and thyroid calcification can be set. The training texts can be classified in advance to determine their labels, which are then used to determine the classification policy.

Study module 11 can select, from a plurality of second training texts, those second training texts that have a first content label as the plurality of training texts, and take the positioning strategy determined from them as the positioning strategy corresponding to the first content label.

Extraction module 12 can, when the content label of the target text is the first content label, use the positioning strategy corresponding to the first content label to extract the target information from the target text.

Message output module 16 can use the content label of the target text and the target information extracted from the target text to generate a data set corresponding to the target text, and output the data set.

Target texts from different fields may differ considerably not only in aspects such as terminology; the target information to be extracted from them may also differ. In some embodiments, the information elements to be extracted from a target text can be determined according to its content label. For example, as shown in Fig. 2, information extracting device 10 can also include an element setting module 17.

Element setting module 17 can obtain, from the markup information, the information elements included in the target information of each of the plurality of training texts, and determine the information element set corresponding to each of a plurality of content labels according to the content labels of the training texts and the information elements included in the target information of each training text, where each information element set includes at least one information element. For example, the set of all information elements corresponding to all training texts having the same content label can be taken as the information element set corresponding to that content label. As another example, from all information elements corresponding to all training texts having the same content label, a subset can be selected according to, for example, frequency or number of occurrences and taken as the information element set corresponding to that content label.
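The two ways of deriving an information element set per content label can be sketched as follows. This is an illustrative Python sketch only: the function name `element_sets_by_label`, the `min_count` threshold, and the sample labels and elements are assumptions, and the embodiments do not fix a particular selection rule.

```python
from collections import Counter, defaultdict


def element_sets_by_label(training_items, min_count=1):
    """Derive an information element set for each content label.

    training_items: iterable of (content_label, annotated_element_names).
    An element is kept for a label when it is annotated in at least
    min_count training texts carrying that label. min_count=1 yields the
    union of all elements; a larger value selects by frequency.
    """
    counts = defaultdict(Counter)
    for label, elements in training_items:
        counts[label].update(set(elements))  # count each element once per text
    return {
        label: {name for name, n in counter.items() if n >= min_count}
        for label, counter in counts.items()
    }


items = [
    ("pulmonary thrombus", ["type", "location"]),
    ("pulmonary thrombus", ["type", "location", "size"]),
    ("thyroid nodule", ["size", "count"]),
]
all_elements = element_sets_by_label(items)           # union variant
frequent = element_sets_by_label(items, min_count=2)  # frequency-based variant
```

With a threshold of 2, the sample label "thyroid nodule" ends up with an empty set, so a real threshold would have to be chosen so that every set keeps at least one information element.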

Extraction module 12 can then obtain the information element set corresponding to the content label of the target text as the target element set; for each information element in the target element set, use the positioning strategy to extract the text content corresponding to that information element from the target text; and take the element identifier of each information element in the target element set, together with the corresponding text content, as the target information.

When the training texts and target texts are long, the content of their different parts may also differ considerably. Therefore, in some embodiments, the training texts and target texts can be divided into a plurality of text segments, and learning and extraction can be performed on these text segments separately. For example, as shown in Fig. 2, device 10 can also include a segmentation module 18 and a message output module 16.

Segmentation module 18 can divide each of a plurality of second training texts into a plurality of training text segments and take the training text segments corresponding to the second training texts as the plurality of training texts; it can likewise divide a second target text into a plurality of target text segments and take these target text segments as the target text.

The length or segmentation strategy of each text segment (both training text segments and target text segments) can be set as needed; for example, segments can be clauses, sentences, or paragraphs.
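A configurable segmentation strategy of this kind might look like the following Python sketch. The function name `split_segments`, the `granularity` parameter, and the punctuation-based rules are assumptions for illustration; a production system would more likely rely on a dedicated NLP tokenizer.

```python
import re

# Split after Latin punctuation only when whitespace follows (so "1.2 cm" stays
# intact) and directly after full-width CJK punctuation.
_PATTERNS = {
    "paragraph": r"\n\s*\n",
    "sentence": r"(?<=[.!?])\s+|(?<=[。！？])",
    "clause": r"(?<=[,;.!?])\s+|(?<=[，；。！？])",
}


def split_segments(text, granularity="sentence"):
    """Split text into segments at the requested granularity."""
    if granularity not in _PATTERNS:
        raise ValueError(f"unknown granularity: {granularity}")
    parts = re.split(_PATTERNS[granularity], text)
    return [part.strip() for part in parts if part.strip()]


sentences = split_segments("No thrombus. A 1.2 cm nodule is seen.")
clauses = split_segments("No thrombus, no nodule.", "clause")
```

The same function can be applied to training texts and target texts so that both are segmented under the same strategy.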

Message output module 16 can use the information of the second target text and the target information extracted from the plurality of target text segments to generate the data set corresponding to the second target text, and output the data set.

In the embodiments, each item of target information to be extracted may consist of one or more characters or words, and the position of the target information can be expressed by a start position and an end position. For example, study module 11 can use the position markup information corresponding to each training text to determine the start position and end position of the target information in the first text representation information of that training text, and determine the positioning strategy using the first text representation information, start position, and end position corresponding to each training text. Extraction module 12 can use the positioning strategy to determine the start position and end position of the target information in the second text representation information; map them to a start position and end position in the target text; and take the text content between the start position and end position in the target text as the target information.
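The mapping from positions in the text representation back to positions in the target text can be sketched in Python as below. The sketch assumes that the representation has one position per whitespace-delimited token and that predicted spans are inclusive token indices; both are illustrative assumptions, as are the function names.

```python
import re


def tokenize_with_offsets(text):
    """Return tokens and, for each token, its (start, end) character span."""
    matches = list(re.finditer(r"\S+", text))
    return [m.group() for m in matches], [m.span() for m in matches]


def map_span_to_text(text, offsets, start_tok, end_tok):
    """Map an inclusive token span in the representation to the original text."""
    char_start = offsets[start_tok][0]
    char_end = offsets[end_tok][1]
    return text[char_start:char_end]


sentence = "Filling defect in the right lower lobe artery"
tokens, offsets = tokenize_with_offsets(sentence)
extracted = map_span_to_text(sentence, offsets, 4, 6)  # span as if predicted
```

Keeping the character offsets alongside the tokens is what allows a multi-word span to be cut out of the original text unchanged.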

Fig. 3 is a schematic diagram of a system of an embodiment of the present application. As shown in Fig. 3, the system includes a first device 31, a network 32, and a second device 33.

First device 31 can be a single physical device or a set of physical devices, such as a server or a server cluster. First device 31 can communicate with second device 33 through network 32. In the embodiments, the system may contain a plurality of second devices 33, each of which can communicate with first device 31 according to the schemes of the embodiments in order to extract target information from its own target texts.

First device 31 may include a processor 311, a training database 312, a policy database 313, and a communication module 314. Processor 311 can use the training texts and corresponding markup information stored in training database 312 to determine a positioning strategy and store it in policy database 313. Training database 312 can store various training data, such as training texts and their markup information. Policy database 313 can store the various strategies involved in generating the positioning strategy or extracting target information, including but not limited to positioning strategies (including element positioning strategies), classification policies, and text segmentation strategies.

In some embodiments, first device 31 can generate the positioning strategy and supply it to second device 33, so that second device 33 uses the positioning strategy to extract target information from the target text.

Fig. 4 is a flowchart of an information extracting method of an embodiment of the present application. Method 40 can be performed by first device 31. As shown in Fig. 4, method 40 may include the following steps.

Step S41: obtain a plurality of training texts and preset markup information indicating the position of the target information in each of the training texts.

Step S42: obtain the first text representation information corresponding to each of the plurality of training texts.

Step S43: determine a positioning strategy using the first text representation information and the markup information.

Step S44: provide the positioning strategy to a computing device, so that the computing device uses the positioning strategy to determine estimated information on the position of the target information in the second text representation information corresponding to a target text, and extracts the target information from the target text according to the estimated information. The computing device can be second device 33.

Fig. 5 is a flowchart of an information extracting method of an embodiment of the present application. Method 50 can be performed by second device 33. As shown in Fig. 5, method 50 may include the following steps.

Step S51: obtain a target text and the second text representation information corresponding to the target text.

Step S52: use a preset positioning strategy to determine estimated information on the position of the target information in the second text representation information.

Step S53: extract the target information from the target text according to the estimated information.

The positioning strategy is determined using the first text representation information corresponding to each of a plurality of training texts and preset markup information indicating the position of the target information in each of the training texts.
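The division of work between methods 40 and 50 can be illustrated with a deliberately simple Python sketch. The "positioning strategy" here is a toy phrase lookup that stands in for the learned strategy of the embodiments; it is not the method described in this application, and the function names and the JSON serialization are assumptions used only to show a strategy being learned on one device and applied on another.

```python
import json


def learn_strategy(training_texts, annotations):
    """Steps S41 to S43 (toy stand-in): remember the annotated phrases.

    annotations[i] is the (start, end) character span, end exclusive, of the
    target information in training_texts[i].
    """
    phrases = sorted({text[s:e] for text, (s, e) in zip(training_texts, annotations)})
    return {"kind": "phrase_lookup", "phrases": phrases}


def publish_strategy(strategy):
    """Step S44: serialize the strategy so it can be sent to another device."""
    return json.dumps(strategy)


def extract(target_text, serialized_strategy):
    """Steps S51 to S53: estimate the position with the strategy, then extract."""
    strategy = json.loads(serialized_strategy)
    for phrase in strategy["phrases"]:
        start = target_text.find(phrase)  # estimated start position
        if start != -1:
            return target_text[start:start + len(phrase)]
    return None


payload = publish_strategy(
    learn_strategy(["Thrombus in left lung.", "A tumor is visible."], [(0, 8), (2, 7)])
)
result = extract("Small tumor near the hilum.", payload)
```

Only the serialized strategy crosses the device boundary; the training texts and markup information stay on the device that learned it.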

As can be seen, the positioning strategy used by the information extracting method of the embodiments is able to analyze the semantics of the target text in order to locate the target information in it, which improves the efficiency and accuracy of information extraction.

In other embodiments, first device 31 can instead supply the training data and the generation mechanism of the positioning strategy to second device 33; second device 33 then generates the positioning strategy using the generation mechanism, trains it using the training data, and uses it to extract target information from the target text.

In the embodiments, first device 31 can obtain new training data as needed and supply it to second device 33. Second device 33 uses the new training data to optimize the positioning strategy; that is, it further trains the positioning strategy with the new training texts and their markup information.

In some embodiments, the target information may include several items to be extracted, and first device 31 can generate a separate positioning strategy for each item. For example, first device 31 can determine a plurality of preset information elements corresponding to the target information and obtain, from the markup information, the element markup information of each information element, where the element markup information indicates the position of the corresponding information element in each training text. It can then use the first text representation information and the element markup information to determine an element positioning strategy for each information element as the positioning strategy, so that second device 33 uses the element positioning strategies to determine the element position of each information element in the second text representation information and extracts from the target text, according to the element positions, the text content corresponding to the information elements as the target information. Determining a separate positioning strategy for each information element in the target information makes the extraction of the target information more accurate.

In the embodiments, when the target information includes a plurality of information elements, first device 31 can provide second device 33 with a generation strategy for structured data. For example, first device 31 can supply a preset data generation strategy to second device 33, so that second device 33 extracts from the target text, according to the element positions, the text content corresponding to each information element, and generates structured data as the target information according to the data generation strategy and the text content. The structured data includes a plurality of data entries, each of which includes the element identifier of one of the information elements and the corresponding text content. Organizing the target information extracted from the target text as structured data makes the extracted information easy for computers to use, for example for retrieval and statistics.

In some embodiments, when target texts may relate to several fields, positioning strategies can be obtained separately for different types of target text. For example, first device 31 can obtain the preset content label of each of the plurality of training texts and determine a classification policy using the first text representation information and the content labels. First device 31 can supply the classification policy to second device 33, so that second device 33 determines the content label of the target text using the classification policy and the second text representation information, generates the data set corresponding to the target text using the content label of the target text and the target information extracted from the target text, and outputs the data set. Classifying the content of the target text and using the positioning strategy corresponding to its category can further improve the accuracy of information extraction.

In some embodiments, the information elements to be extracted from a target text can be determined according to its content label. For example, first device 31 can obtain, from the markup information, the information elements of the target information of each of the plurality of training texts, and determine the information element set corresponding to each of a plurality of content labels according to the content labels of the training texts and the information elements of the target information of each training text, where each information element set includes at least one information element. First device 31 can supply the information on the information element set corresponding to each content label to second device 33, so that second device 33 obtains from this information the information element set corresponding to the content label of the target text as the target element set; for each information element in the target element set, uses the positioning strategy to extract the corresponding text content from the target text; and takes the element identifier of each information element in the target element set, together with the corresponding text content, as the target information. Determining the information element set corresponding to each content label from the information elements included in the target information of each training text reduces manual work and improves the scalability of the scheme.

In some embodiments, the training texts and target texts can be divided into a plurality of text segments, and learning and extraction can be performed on these text segments separately. For example, first device 31 can supply a segmentation strategy to second device 33, so that second device 33 divides a second target text into a plurality of target text segments, takes the target text segments as the target text, generates the data set corresponding to the second target text using the information of the second target text and the target information of the target text segments, and outputs the data set. Dividing texts into shorter text segments and using the segments as training texts and target texts can improve the accuracy of information extraction.

In the embodiments, each item of target information to be extracted may consist of one or more characters or words, and the position of the target information can be expressed by a start position and an end position. For example, first device 31 can use the position markup information to determine the start position and end position of the target information in the first text representation information, and determine the positioning strategy using the first text representation information, the start position, and the end position, so that second device 33 uses the positioning strategy to determine the start position and end position of the target information in the second text representation information; maps them to a start position and end position in the target text; and extracts the content between the start position and end position in the target text as the target information. Expressing the position of the target information by a start position and an end position makes it possible to extract target information consisting of several consecutive characters or words, which improves the efficiency and flexibility of information extraction.

In some embodiments, the information extracting method can be implemented with machine learning models. Fig. 6 is a flowchart of an information extracting method based on NLP deep learning according to an embodiment of the present application. This embodiment uses a text classification model and a reinforcement learning model to extract the target information. As shown in Fig. 6, the method may include the following steps.

Step S61: preprocess the training texts.

A large number of training texts are collected in advance as a corpus. NLP techniques can be used to divide each text into paragraphs, sentences, and words. A vocabulary for the corpus is built from the resulting words.
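Step S61 can be sketched in Python as follows. The regular-expression splitting, the lowercasing, and the reserved `<pad>` and `<unk>` entries are assumptions chosen for a compact illustration; the embodiment only states that texts are divided into paragraphs, sentences, and words and that a vocabulary is built.

```python
import re


def preprocess(text):
    """Divide a text into paragraphs, each a list of sentences, each a list of words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        result.append([re.findall(r"\w+", s.lower()) for s in sentences])
    return result


def build_vocabulary(corpus):
    """Assign an integer id to every distinct word in the corpus."""
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids (an assumed convention)
    for text in corpus:
        for paragraph in preprocess(text):
            for sentence in paragraph:
                for word in sentence:
                    vocab.setdefault(word, len(vocab))
    return vocab


def encode(words, vocab):
    """Map words to ids, falling back to the unknown-word id."""
    return [vocab.get(word, vocab["<unk>"]) for word in words]


corpus = ["No thrombus. Lungs clear.\n\nNo thrombus seen."]
vocab = build_vocabulary(corpus)
ids = encode(["no", "nodule"], vocab)
```

The integer ids are what an embedding layer would later turn into word vectors.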

Step S62: train the text classification model.

A group of texts is selected as the training data set, and each text is divided into sentences and words. For each sentence, an expert in the field of the text (such as a medical expert or an economist) manually annotates the corresponding content label according to predefined content labels. With the annotated sentences as training texts, a text classification model based on a deep learning network is trained; its input is a sentence and its output is the content label corresponding to that sentence. In some embodiments, the text classification model needs to convert a sentence into text representation information while classifying it. The text classification model can be made to output the text representation information of these sentences for subsequent use in determining the positioning strategy.

Fig. 7 is a schematic diagram of a text classification model based on NLP technology according to an embodiment of the present application. As shown in Fig. 7, NLP transformation model 73 converts each sentence in training data 72 into a list of word vectors through embedding module 731, feeds the word vector list into neural network model 732 (such as an RNN or LSTM), and concatenates the outputs of neural network model 732 in splicing module 733 to obtain the text representation information of the sentence. Text classification model 71 learns from the text representation information of the sentences and the content labels of the sentences in training data 72 to obtain the classification policy. Thereafter, text classification model 71 can use the classification policy to classify sentences in the target text.

Step S63: train the reinforcement learning (RL) model.

For each sentence in the training data set, an expert in the field can manually mark the position of the target information in the sentence according to a predefined annotation scheme. For example, different information elements can be marked by highlighting in different colors.

In this embodiment, the positioning strategy is obtained by training a reinforcement learning (RL) model. Fig. 8 is a schematic diagram of a reinforcement learning model based on a deep Q-network (DQN) according to an embodiment of the present application. As shown in Fig. 8, the inputs of RL model 74 are the text representation information of the sentence output by NLP transformation model 73 and the initial state and target state of the position of the target information. Each sentence in training data 72 is one environment of RL model 74. The initial state is the initial position of an information element in the sentence, for example the beginning and end of the sentence; the target state is the start position and end position of the information element annotated in the sentence (that is, the position markup information of the information element). RL model 74 builds a state-transition matrix from all possible states 741 of the start position and end position of the information element, and computes the quality value (Q value) of each state transition from the text representation information of the sentence. RL model 74 is trained by moving the selected start position and end position left or right: if the state moves closer to the target state, a reward is given; if it moves farther from the target state, a penalty is given. This training process iteratively updates the states and continually optimizes the DQN; training ends when the current state equals the target state and the DQN has converged. The training process can be carried out separately for each information element in a sentence. Different RL models can be trained for different content labels to handle sentences with different content labels.
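The reward scheme and Q-value update described above can be illustrated with a small tabular sketch in Python. It is a simplification under stated assumptions, not the DQN of the embodiment: a lookup table replaces the neural network, the table is specific to one sentence rather than conditioned on text representation information, the updates sweep over all states deterministically with a learning rate of 1, a move that does not bring the span closer is penalized, and the values of the reward (+1, -1) and discount factor (0.9) are chosen for illustration.

```python
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # shift start or end by one token


def distance(state, target):
    return abs(state[0] - target[0]) + abs(state[1] - target[1])


def step(state, action, n_tokens):
    start, end = state[0] + action[0], state[1] + action[1]
    if 0 <= start <= end < n_tokens:
        return (start, end)
    return state  # invalid moves leave the state unchanged


def reward(state, nxt, target):
    # +1 when the move brings the span closer to the annotated span, else -1
    return 1.0 if distance(nxt, target) < distance(state, target) else -1.0


def train_q(n_tokens, target, sweeps=200, gamma=0.9):
    """Tabular Q-value updates swept over every (start, end) state."""
    q = {}
    states = [(s, e) for s in range(n_tokens) for e in range(s, n_tokens)]
    for _ in range(sweeps):
        for state in states:
            if state == target:
                continue  # terminal state: its values stay at 0
            for a, action in enumerate(ACTIONS):
                nxt = step(state, action, n_tokens)
                best_next = max(q.get((nxt, b), 0.0) for b in range(len(ACTIONS)))
                q[(state, a)] = reward(state, nxt, target) + gamma * best_next
    return q


def locate(q, n_tokens):
    """Follow the highest Q value from the initial state (whole sentence)."""
    state = (0, n_tokens - 1)
    for _ in range(2 * n_tokens):
        values = [q.get((state, a), 0.0) for a in range(len(ACTIONS))]
        if max(values) <= 0.0:
            break  # no move is expected to improve the span
        state = step(state, ACTIONS[values.index(max(values))], n_tokens)
    return state


tokens = "Filling defect in the right lower lobe artery".split()
q = train_q(len(tokens), target=(4, 6))
start, end = locate(q, len(tokens))
span = " ".join(tokens[start:end + 1])
```

In the embodiment the Q values are computed by a network from the text representation information, which is what would let a trained model locate spans in sentences it has not seen; the table above only shows how the reward for moving closer shapes the values.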

Step S64: extract the target information and generate structured data.

The technique of step S61 can be used to divide target text 75 into target sentences. Each target sentence is input into the pre-trained text classification model 71 to obtain its content label. NLP transformation model 73 converts the target sentence into text representation information. According to the content label, the text representation information of the target sentence is input into the corresponding trained RL model 74, which predicts the position of the target information in the text representation information of the target sentence. The position predicted by RL model 74 can then be mapped to text in the target sentence, and that text content is extracted. The content label of the sentence and the extracted text content can be combined to generate structured data, for example {content label: pulmonary thrombus; presence: present; type: subsegmental pulmonary thrombus; location: right lung lobe; ……}.
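The flow of step S64 for a single sentence can be sketched in Python as below. The trained models are replaced by hypothetical stand-ins (`toy_classify` and `toy_locate_position`), and the record keys are assumptions; the sketch only shows how the content label selects the locators and how predicted spans become structured data.

```python
def extract_record(sentence, classify, locators):
    """Classify a sentence, locate each element for its label, build a record.

    classify: sentence -> content label.
    locators: content label -> {element name: locate function}, where a locate
    function takes the token list and returns an inclusive (start, end) token
    span, or None when the element is absent.
    """
    tokens = sentence.split()
    label = classify(sentence)
    record = {"content_label": label}
    for element, locate_fn in locators.get(label, {}).items():
        span = locate_fn(tokens)
        if span is not None:
            start, end = span
            record[element] = " ".join(tokens[start:end + 1])
    return record


# Hypothetical stand-ins for the trained classification and RL models.
def toy_classify(sentence):
    return "pulmonary thrombus" if "thrombus" in sentence.lower() else "other"


def toy_locate_position(tokens):
    if "right" in tokens and "lobe" in tokens:
        return tokens.index("right"), tokens.index("lobe")
    return None


locators = {"pulmonary thrombus": {"position": toy_locate_position}}
record = extract_record("Subsegmental thrombus in the right lower lobe", toy_classify, locators)
```

Sentences whose label has no registered locators yield a record with the content label only.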

Text classification model 71 and RL model 74 can use various deep learning networks, such as CNN or RNN.

In this way, deep learning models are used to understand and analyze reports in semantic space, achieving highly reliable information extraction.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An information extracting device, characterized by comprising:

a study module, configured to obtain a plurality of training texts and preset markup information indicating the position of target information in each of the training texts; obtain first text representation information corresponding to each of the plurality of training texts; and determine a positioning strategy using the first text representation information and the markup information; and

an extraction module, configured to obtain a target text and second text representation information corresponding to the target text; use the positioning strategy to determine estimated information on the position of the target information in the second text representation information; and extract the target information from the target text according to the estimated information.

2. The information extracting device according to claim 1, characterized in that the study module comprises:

an element unit, configured to determine a plurality of preset information elements corresponding to the target information; obtain, from the markup information, element markup information of each of the plurality of information elements, the element markup information indicating the position of the corresponding information element in each training text; and use the first text representation information and the element markup information to determine an element positioning strategy of each of the plurality of information elements as the positioning strategy;

wherein the extraction module is configured to use the element positioning strategy to determine the element position of each of the plurality of information elements in the second text representation information, and extract from the target text, according to the element positions, the text content corresponding to the plurality of information elements as the target information.

3. The information extracting device according to claim 2, characterized in that the extraction module comprises:

an element extraction unit, configured to extract from the target text, according to the element positions, the text content corresponding to each of the plurality of information elements; and

a data generating unit, configured to generate structured data as the target information using the text content; wherein the structured data includes a plurality of data entries, each data entry including the element identifier of one of the plurality of information elements and the corresponding text content.

4. The information extracting device according to any one of claims 1 to 3, characterized by further comprising:

a label determining module, configured to obtain a preset content label of each of a plurality of second training texts; determine a classification policy using the first text representation information and the content labels; and determine the content label of the target text using the classification policy and the second text representation information; and

a message output module, configured to generate a data set corresponding to the target text using the content label of the target text and the target information extracted from the target text, and output the data set;

wherein the study module is configured to select, from the plurality of second training texts, a plurality of second training texts having a first content label as the plurality of training texts, and determine the positioning strategy as the positioning strategy corresponding to the first content label; and

the extraction module is configured to, when the content label of the target text is the first content label, extract the target information from the target text using the positioning strategy corresponding to the first content label.

5. The information extracting device according to claim 4, characterized by further comprising:

an element setting module, configured to obtain, from the markup information, the information elements included in the target information of each of the plurality of training texts, and determine, according to the content labels of the plurality of training texts and the information elements included in the target information of each training text, an information element set corresponding to each of a plurality of content labels, the information element set including at least one information element;

wherein the extraction module is configured to obtain the information element set corresponding to the content label of the target text as a target element set; for each information element in the target element set, extract the text content corresponding to the information element from the target text using the positioning strategy; and take the element identifier of each information element in the target element set and the corresponding text content as the target information.

6. The information extracting device according to any one of claims 1 to 3, characterized by further comprising:

a segmentation module, configured to divide each of a plurality of second training texts into a plurality of training text segments and take the plurality of training text segments corresponding to the plurality of second training texts as the plurality of training texts; and divide a second target text into a plurality of target text segments and take the plurality of target text segments as the target text; and

a message output module, configured to generate a data set corresponding to the second target text using the information of the second target text and the target information extracted from the plurality of target text segments, and output the data set.

7. The information extracting device according to any one of claims 1-3, characterized in that:

the learning module is configured to determine, using the position markup information corresponding to each training text, the start position and end position of the target information in the first text representation information of each training text, and to determine the positioning strategy using the first text representation information corresponding to each training text, the start position, and the end position;

the extraction module is configured to: determine, using the positioning strategy, the start position and end position of the target information in the second text representation information; map the start position and the end position in the second text representation information to a start position and an end position in the target text; and take the text content between the start position and the end position in the target text as the target information.

8. An information extracting method, characterized by comprising:

obtaining a plurality of training texts and preset markup information indicating the position of target information in each of the training texts;

providing the positioning strategy to a computing device, so that the computing device uses the positioning strategy to determine estimated information on the position of the target information in second text representation information corresponding to a target text, and extracts the target information from the target text according to the estimated information.

9. The information extracting method according to claim 8, characterized in that generating a positioning strategy using the first text representation information and the markup information comprises:

determining a plurality of preset information elements corresponding to the target information, and obtaining, from the markup information, element markup information of each information element in the plurality of information elements, the element markup information indicating the position of the corresponding information element in each training text;

determining, using the first text representation information and the element markup information, an element positioning strategy for each information element in the plurality of information elements as the positioning strategy, so that the computing device uses the element positioning strategies to determine the element position of each information element in the second text representation information, and extracts, according to the element positions, the text content corresponding to the plurality of information elements from the target text as the target information.

10. The information extracting method according to claim 9, characterized by further comprising:

providing a preset data generation strategy to the computing device, so that the computing device extracts, according to the element positions, the text content corresponding to each information element in the plurality of information elements from the target text, and generates structured data as the target information according to the data generation strategy and the text content; wherein the structured data comprises a plurality of data entries, each data entry comprising the element identifier of one information element in the plurality of information elements and the corresponding text content.

11. The information extracting method according to any one of claims 8-10, characterized by further comprising:

obtaining a preset content label of each training text in the plurality of training texts, and determining a classification strategy using the first text representation information and the content labels;

providing the classification strategy to the computing device, so that the computing device determines the content label of the target text using the classification strategy and the second text representation information, generates a data set corresponding to the target text using the content label of the target text and the target information extracted from the target text, and outputs the data set.

12. The information extracting method according to claim 11, characterized by further comprising:

obtaining, from the markup information, the information elements of the target information of each training text in the plurality of training texts, and determining, according to the content labels of the plurality of training texts and the information elements of the target information of each training text, an information element set corresponding to each content label in a plurality of content labels, the information element set comprising at least one information element;

providing information on the information element set corresponding to each content label to the computing device, so that the computing device: obtains, according to that information, the information element set corresponding to the content label of the target text as a target element set; for each information element in the target element set, extracts the text content corresponding to the information element from the target text using the positioning strategy; and takes the element identifier of each information element in the target element set, together with the corresponding text content, as the target information.

13. The information extracting method according to any one of claims 8-10, characterized by further comprising:

providing a segmentation strategy to the computing device, so that the computing device divides a second target text into a plurality of target text segments, takes the plurality of target text segments as the target text, generates a data set corresponding to the second target text using information of the second target text and the target information of the plurality of target text segments, and outputs the data set.

14. The information extracting method according to any one of claims 8-10, characterized in that determining a positioning strategy using the first text representation information and the markup information comprises:

determining, using the position markup information, the start position and end position of the target information in the first text representation information, and determining the positioning strategy using the first text representation information, the start position, and the end position, so that the computing device: determines, using the positioning strategy, the start position and end position of the target information in the second text representation information; maps the start position and the end position in the second text representation information to a start position and an end position in the target text; and extracts the content between the start position and the end position in the target text as the target information.

15. An information extracting method, characterized by comprising:

determining, using a preset positioning strategy, estimated information on the position of target information in second text representation information; and

wherein the positioning strategy is determined using first text representation information corresponding to each training text in a plurality of training texts and preset markup information indicating the position of the target information in each of the training texts.

16. An information extracting device, characterized by comprising a processor and a memory, the memory storing computer-readable instructions, the instructions being configured to cause the processor to implement the information extracting method according to any one of claims 8-14.

17. A computer-readable storage medium storing computer-readable instructions, characterized in that the instructions are configured to cause a processor to implement the information extracting method according to any one of claims 8-14.
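
By way of illustration only, the position-mapping and data-entry steps recited in claims 7, 10 and 14 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the text representation is assumed to be a simple token sequence, the start and end positions are assumed to be inclusive token indices already produced by some positioning strategy (the strategy itself is not modeled here), and the names `tokenize_with_offsets`, `extract_span` and `build_entries` are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Token = Tuple[str, int, int]  # (token text, start char offset, end char offset)


def tokenize_with_offsets(text: str) -> List[Token]:
    """Assumed representation: word/punctuation tokens with their character spans."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def extract_span(text: str, tokens: List[Token], start_tok: int, end_tok: int) -> str:
    """Map an inclusive token span back to character offsets and slice the text."""
    char_start = tokens[start_tok][1]
    char_end = tokens[end_tok][2]
    return text[char_start:char_end]


def build_entries(text: str,
                  predicted: Dict[str, Tuple[int, int]]) -> List[Dict[str, str]]:
    """Build one data entry (element identifier + text content) per information element.

    `predicted` stands in for the output of a positioning strategy:
    element identifier -> (start token index, end token index).
    """
    tokens = tokenize_with_offsets(text)
    return [{"element": element_id,
             "content": extract_span(text, tokens, start, end)}
            for element_id, (start, end) in predicted.items()]


if __name__ == "__main__":
    report = "Tumor size: 2.5 cm, located in the left lung."
    # Hypothetical positions, as if returned by a trained positioning strategy.
    print(build_entries(report, {"size": (3, 6), "location": (11, 12)}))
```

Keeping the character offsets alongside each token is what allows a position found in the representation to be mapped back to the original text without re-searching it.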