CN107590448A

CN107590448A - Method for automatically obtaining QTL data from literature

Info

Publication number: CN107590448A
Application number: CN201710761497.2A
Authority: CN
Inventors: 袁晓辉
Original assignee: Wuhan Ancient Gene Technology Co Ltd
Current assignee: Wuhan Ancient Gene Technology Co Ltd
Priority date: 2017-08-30
Filing date: 2017-08-30
Publication date: 2018-01-16

Abstract

The invention belongs to the field of bioinformatics and in particular relates to a method for automatically obtaining QTL data from literature. The method uses text mining to automatically extract and analyze information such as QTLs and gene functions from the relevant literature: computer data mining techniques obtain QTL information automatically from documents in PDF format. This addresses the current problems of manual literature reading, namely the large workload, the slow speed, and the inability to process newly published data in a timely manner. The method can also greatly reduce the workload of database construction.

Description

Method for automatically obtaining QTL data from literature

Technical field

The invention belongs to the field of bioinformatics and in particular relates to a method for automatically obtaining QTL data from literature.

Background art

With the rapid growth of the literature, biological researchers publish large amounts of data in papers, and obtaining these data quickly has become a challenge. It is often difficult to find the information of interest promptly and effectively by reading the papers manually. How to obtain useful information automatically from massive data has therefore become a problem that bioinformatics urgently needs to solve, and mining the literature with natural language processing and machine learning methods has become an important means of solving it.

A QTL (quantitative trait locus) is important genome annotation information. At present, however, QTL information is obtained mainly by reading the literature manually, which involves a heavy workload, is slow, and hinders timely updates.

Summary of the invention

The technical problem to be solved by the invention is to provide a method for quickly and automatically obtaining QTL data from literature.

To achieve the above purpose, the method of the invention comprises the following steps:

Step 1. Extract the structure and content of tables from documents in PDF format

Three-line tables in the document are analyzed using an image recognition method. The page is scanned line by line to quickly locate the table lines and hence the position of the table. By locating the row and column dividing lines and applying OCR, the structure and content of the table are extracted. Finally, the caption of the table is extracted by keyword from above the first line of the three-line table.

Step 2. Screen for tables containing QTL information

If molecular marker information appears in a table, the table is treated as a candidate QTL table.

Step 3. Extract information from the screened tables

For a standard table, the header fields are extracted directly and compared with predefined database fields to determine the content type of each column. A complex table is converted into a simple table using the following five rules, which handle multi-row headers and missing information:

Rule 1: if more than 60% of the cells in a table are empty, the table is discarded.

Rule 2: the basic criterion for deciding that a table contains QTL information is that it contains molecular marker information. To test this, the first four columns of the table are extracted and the cell contents are fuzzy-matched with regular expressions to check whether they contain the words "marker", "interval", or "loci".

Rule 3: where one phenotype, linkage group, or other item in the table corresponds to multiple molecular markers (1:N), the number of table rows is determined using the molecular marker as the baseline.

Rule 4: where one set of molecular marker, phenotype, or linkage group information in the table corresponds to multiple rows of other information, the consecutive blank cells below it are filled with that molecular marker, phenotype, or linkage group information.

Rule 5: where the table contains no phenotype or parental information, a dependency-tree-based syntactic parser extracts this information from the table caption and adds it to the result.

Step 4. Obtain QTL information from the document text

Some of the screened tables contain incomplete information. To fill in the missing information, processing proceeds in three steps. First, the title and notes of the table are scanned and the descriptive sentences related to the table are extracted. Second, these sentences are analyzed with dictionary matching templates to extract the information missing from the table; for example, the template F$num:$num is used to match species information. Third, if the second step returns no result, a dependency-tree-based syntactic parser is used to mine the missing table information contained in the sentences.

Step 5. Standardize and correct the mining results of Steps 3 and 4

The results of table and text mining are standardized and corrected in three respects:

(1) Abbreviation expansion: the place where an abbreviation first appears is required to give its full form.

(2) Validity check: records without a trait or molecular marker are deleted from the final result. In addition, the results are checked against a prior-knowledge database, and contradictory records are deleted from the final result.

(3) Duplicate check: records whose marker, phenotype, age, region, parent, and method information are all identical are kept only once in the final result.

The invention uses text mining to automatically extract and analyze information such as QTLs and gene functions from the relevant literature. It obtains QTL information automatically from PDF documents using computer data mining techniques, thereby addressing the large workload and slow speed of manual literature reading and its inability to process newly published data in time. The method can also greatly reduce the workload of database construction.

Brief description of the drawings

Fig. 1: Schematic diagram of a three-line table.

Fig. 2: Schematic diagram of Rule 3.

Fig. 3: Schematic diagram of caption information mining.

Fig. 4: Schematic diagram of a Stanford Parser decision tree.

Fig. 5: Flowchart of QTL data mining.

Embodiment

Step 1. Extract the structure and content of tables from documents in PDF format

In academic documents in PDF format, tables are generally presented as three-line tables, as shown in Fig. 1. Because a PDF page can be regarded as an image that is naturally free of noise, we use an image recognition method to analyze three-line tables. A three-line table typically separates the header from the data area with three table lines of equal length. The length of each table line can be regarded as the number of consecutive black pixels. By scanning the page line by line, we can quickly locate the table lines and hence the table, and from the positions of the table lines we can distinguish the header area from the data area.

First, we determine the vertical dividing lines (all vertical lines in Fig. 1 except the one below "Assignment"). Because the columns of data are separated by blank space, we judge whether a position is a vertical dividing line according to whether the number of black points in the vertical direction within the table area equals the number of horizontal lines. Next, we determine the horizontal dividing lines (the horizontal lines in the data area of Fig. 1). In the data area the rows are likewise separated by entirely blank space, so we locate the horizontal dividing lines of the data area according to whether the number of consecutive white pixels equals the length of the table. Finally, we identify headers that span multiple columns. In a manner similar to locating the table lines, we find the position of consecutive black points within the header area; this position is a dividing line of the header area. The vertical dividing line (the one below "Assignment" in Fig. 1) is then determined from the number of black points between this dividing line and the middle table line.

Through the above steps, by locating the horizontal and vertical dividing lines and applying OCR, we extract the complete structure and content of the table. Finally, we extract the caption of the table from above the first line of the three-line table according to the keyword "Table".
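The line-scanning idea in this step can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the page has already been binarized into a list of rows with 1 for black and 0 for white, that each table line is one pixel thick, and that the function names `find_table_lines` and `blank_column_runs` are hypothetical.

```python
from typing import List, Tuple

def longest_black_run(row: List[int]) -> int:
    """Length of the longest run of consecutive black pixels (1 = black)."""
    best = current = 0
    for pixel in row:
        current = current + 1 if pixel == 1 else 0
        best = max(best, current)
    return best

def find_table_lines(page: List[List[int]], min_length: int) -> List[int]:
    """Row indices whose longest black run reaches min_length.

    Consecutive matching rows are reported once (their first row).
    """
    lines = []
    previous_hit = False
    for y, row in enumerate(page):
        hit = longest_black_run(row) >= min_length
        if hit and not previous_hit:
            lines.append(y)
        previous_hit = hit
    return lines

def blank_column_runs(page: List[List[int]], line_rows: List[int]) -> List[Tuple[int, int]]:
    """Runs of columns whose black-pixel count equals the number of table lines.

    Assumption: one-pixel-thick lines, so a column crossing only the
    horizontal lines contains exactly len(line_rows) black pixels.
    """
    top, bottom = line_rows[0], line_rows[-1]
    width = len(page[0])
    runs, start = [], None
    for x in range(width):
        black = sum(page[y][x] for y in range(top, bottom + 1))
        is_blank = black == len(line_rows)
        if is_blank and start is None:
            start = x
        elif not is_blank and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, width - 1))
    return runs
```

Each returned run of blank columns is a candidate column separator; a real page would additionally need binarization, tolerance for line thickness, and OCR on the resulting cells.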

Step 2. Screen for tables containing QTL information

A scientific paper contains multiple tables, and the program must select those containing QTL information. The fields that clearly indicate that a table's content relates to QTL information are the molecular marker and the trait. According to statistics on the literature, however, trait information sometimes does not appear as a column of the table and may instead appear in the table title or table notes. Molecular marker information, by contrast, is generally stored in the table itself, so if molecular marker information appears in a table, we treat the table as a candidate QTL table.
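A minimal sketch of this screening test is given below. It assumes the table has already been extracted as a list of rows of cell strings, and it borrows the keyword test that this description gives for Rule 2 (the words "marker", "interval", and "loci" in the first four columns); the function name `is_candidate_qtl_table` is hypothetical, and case-insensitive substring matching stands in for "fuzzy matching".

```python
import re
from typing import List

# Keywords named in the description; matching is case-insensitive.
MARKER_WORDS = re.compile(r"marker|interval|loci", re.IGNORECASE)

def is_candidate_qtl_table(table: List[List[str]], max_columns: int = 4) -> bool:
    """True if any cell in the first max_columns columns mentions a marker keyword."""
    return any(
        MARKER_WORDS.search(cell) is not None
        for row in table
        for cell in row[:max_columns]
    )
```

Restricting the search to the leading columns keeps the test cheap and reflects where marker columns usually sit; tables that fail the test are simply kept as non-QTL tables.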

Step 3. Extract information from the screened tables

Tables in the literature fall into two types. A standard table has its header in the first row only and has no cells spanning multiple rows or columns. A complex table has a header spanning multiple rows or contains cells spanning multiple rows or columns. Processing a standard table is simple: the header fields are extracted directly and compared with predefined database fields to determine the content type of each column. A complex table requires further processing to convert it into a simple table. We have summarized five conversion rules that specifically handle multi-row headers and missing information. These five rules correctly process 94% of QTL tables.

Rule 1: if more than 60% of the cells in a table are empty, the table is discarded.

Rule 2: the basic criterion for deciding that a table contains QTL information is that it contains molecular marker information. To test this, the first four columns of the table are extracted and the cell contents are fuzzy-matched with regular expressions to check whether they contain words such as "marker", "interval", and "loci".

Rule 3: where one phenotype, linkage group, or other item in the table corresponds to multiple molecular markers (1:N), we determine the number of table rows using the molecular marker as the baseline. As shown in Fig. 2, the phenotype "No. of pods per plant" in the table corresponds to multiple molecular markers. While scanning the table, the program finds that there are several blank cells under this phenotype name and then checks the content of the molecular marker column. If the cells to the left and right contain content, the cell satisfies Rule 3, and the blank cells are automatically filled with the phenotype name in the result.

Rule 4: where one set of molecular marker, phenotype, or linkage group information in the table corresponds to multiple rows of other information, we fill the consecutive blank cells below it with that molecular marker, phenotype, or linkage group information.

Rule 5: where the table contains no phenotype or parental information, we use a dependency-tree-based syntactic parser to extract this information from the table caption and add it to the result.
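Rule 1 and the blank-cell filling of Rules 3 and 4 can be sketched as follows. This is an illustrative simplification under stated assumptions: the table is a rectangular list of rows of strings with a single header row, the columns to fill are supplied by the caller, and the helper names (`empty_ratio`, `fill_down`, `to_simple_table`) and the marker names in the usage example are invented for illustration.

```python
from typing import List, Optional, Sequence

def empty_ratio(table: List[List[str]]) -> float:
    """Fraction of cells that are empty or whitespace only."""
    cells = [cell for row in table for cell in row]
    if not cells:
        return 1.0
    return sum(1 for cell in cells if not cell.strip()) / len(cells)

def fill_down(table: List[List[str]], columns: Sequence[int],
              header_rows: int = 1) -> List[List[str]]:
    """Fill consecutive blank cells with the last non-blank value above them."""
    filled = [list(row) for row in table]
    for col in columns:
        last_value = ""
        for row in filled[header_rows:]:
            if row[col].strip():
                last_value = row[col]
            else:
                row[col] = last_value
    return filled

def to_simple_table(table: List[List[str]],
                    fill_columns: Sequence[int]) -> Optional[List[List[str]]]:
    """Apply Rule 1 (discard if over 60% empty), then fill blanks downward."""
    if empty_ratio(table) > 0.6:
        return None
    return fill_down(table, fill_columns)

# Usage: one phenotype spanning three marker rows (marker names are made up).
example = [
    ["Trait", "Marker", "LOD"],
    ["No. of pods per plant", "M001", "3.2"],
    ["", "M002", "2.9"],
    ["", "M003", "4.1"],
]
simple = to_simple_table(example, fill_columns=[0])
```

After conversion every data row carries its own phenotype, so each row can be stored as an independent record keyed by its molecular marker.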

Step 4. Obtain QTL information from the document text

Statistics on the tables show that some tables contain incomplete information. In some tables, information such as trait and parent is distributed in the title, the notes, or elsewhere in the document. To fill in the information missing from a table, the invention proceeds in three steps. First, the title and notes of the table are scanned and the descriptive sentences related to the table are extracted. Second, these sentences are first analyzed with simple dictionary matching templates to extract the missing information; the matching templates are shown in Table 1. Third, if the second step returns no result, a dependency-tree-based syntactic parser is used to mine the missing table information contained in complex sentences. The dependency-tree-based natural language processing algorithm decomposes a sentence (including punctuation) into lexical units, such as noun phrases, verb phrases, punctuation marks, locative phrases, prepositional phrases, adverbial phrases, adjective phrases, determiner phrases, and quantifier phrases, and then generates the dependency relations among these phrases in the sentence, forming a tree-shaped relation graph. Text information can then be mined further from the dependency tree using predefined grammar rules.

Table 1. Sample matching templates

For example, there is no parental information in Table 1; the parental information is stored in the table title. As shown in Fig. 3, we use the Stanford Parser to find the phenotype (trigonelline biosynthesis) and the parental information (Essex and Forrest) in the table title and add them to the final analysis result.
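The dictionary-template step (the second of the three steps above) might look like the sketch below. It assumes that `$num` in a template stands for a run of digits, which is an interpretation of the F$num:$num example given in the summary; the field name "species" follows the wording of this description, and `compile_template` and `extract_missing` are hypothetical helpers. The dependency-parser fallback of the third step is not shown.

```python
import re
from typing import Dict, Iterable, Pattern

def compile_template(template: str) -> Pattern[str]:
    """Turn a template such as 'F$num:$num' into a regex; $num matches digits."""
    parts = re.split(r"(\$num)", template)
    body = "".join(r"\d+" if part == "$num" else re.escape(part) for part in parts)
    return re.compile(r"\b" + body + r"\b")

# Assumed template dictionary: field name -> compiled template.
TEMPLATES: Dict[str, Pattern[str]] = {"species": compile_template("F$num:$num")}

def extract_missing(sentences: Iterable[str],
                    templates: Dict[str, Pattern[str]] = TEMPLATES) -> Dict[str, str]:
    """Return the first template match per field across the caption sentences."""
    sentences = list(sentences)
    found: Dict[str, str] = {}
    for field, regex in templates.items():
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                found[field] = match.group(0)
                break
    return found
```

An empty result from `extract_missing` corresponds to the case where the second step finds nothing and the sentences are handed on to the syntactic parser.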

Step 5. Standardize and correct the mining results of Steps 3 and 4

Abbreviations, errors, and duplicate records inevitably arise during table conversion and text mining. The invention standardizes and corrects the results of table and text mining in three respects. (1) Abbreviation expansion: by the conventions of article writing, the full form must be given where an abbreviation first appears. If an abbreviation appears in a table, we scan the whole article, find its full form by whole-word matching, and add it to the output. (2) Validity check: records without a trait or molecular marker are deleted from the final result. In addition, to ensure the correctness of the mining results, we check them against a prior-knowledge database (for example, a table of known correspondences between linkage groups and molecular markers) and delete any contradictory records from the final result. (3) Duplicate check: records whose marker, phenotype, age, region, parent, and method information are all identical are kept only once in the final result.
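The validity and duplicate checks can be sketched as below. Several points are assumptions rather than details from the source: records are represented as dictionaries, a record is treated as invalid when either the marker or the phenotype is missing, the prior-knowledge database is reduced to a marker-to-linkage-group mapping, and the field names and `clean_records` are hypothetical. Abbreviation expansion is not covered.

```python
from typing import Dict, List, Optional

# Fields that must all match for two records to count as duplicates.
KEY_FIELDS = ("marker", "phenotype", "age", "region", "parent", "method")

def clean_records(records: List[Dict[str, str]],
                  known_linkage: Optional[Dict[str, str]] = None) -> List[Dict[str, str]]:
    """Drop invalid, contradictory, and duplicate records, keeping input order."""
    known_linkage = known_linkage or {}
    seen = set()
    result = []
    for record in records:
        # Validity: a record needs both a marker and a phenotype (assumption).
        if not record.get("marker") or not record.get("phenotype"):
            continue
        # Contradiction: linkage group disagrees with prior knowledge.
        expected = known_linkage.get(record["marker"])
        actual = record.get("linkage_group")
        if expected is not None and actual is not None and actual != expected:
            continue
        # Duplicates: identical on every key field.
        key = tuple(record.get(field) for field in KEY_FIELDS)
        if key in seen:
            continue
        seen.add(key)
        result.append(record)
    return result
```

Keeping the first occurrence of each duplicate preserves the order in which records were mined, which makes the cleaned output easier to trace back to its source table.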

Step 6. Output the QTL information obtained in Step 5

The data extraction results are displayed as web page tables, including all tables in the document judged to be QTL tables and all non-QTL tables. The list of all data extracted from the QTL tables includes both complete and incomplete information.
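As a minimal illustration of web-table output, the sketch below renders extracted rows as an HTML table using only the Python standard library. The layout and the function name `render_table` are assumptions; the source does not specify how the page is generated.

```python
from html import escape
from typing import List, Sequence

def render_table(headers: Sequence[str], rows: List[Sequence[object]]) -> str:
    """Render headers and rows as an HTML table, escaping all cell text."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"
```

Escaping every cell matters here because OCR output may contain characters such as `<` or `&` that would otherwise break the page.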

Claims

1. A method for quickly and automatically obtaining QTL data from literature, characterized by comprising:

Step 1. Extract the structure and content of tables from documents in PDF format

Three-line tables in the document are analyzed using an image recognition method; the page is scanned line by line to quickly locate the table lines and hence the position of the table; by locating the row and column dividing lines and applying OCR, the structure and content of the table are extracted; finally, the caption of the table is extracted by keyword from above the first line of the three-line table;

Step 2. Screen for tables containing QTL information

If molecular marker information appears in a table, the table is treated as a candidate QTL table;

Step 3. Extract information from the screened tables

For a standard table, the header fields are extracted directly and compared with predefined database fields to determine the content type of each column; a complex table is converted into a simple table using the following five rules, which handle multi-row headers and missing information:

Rule 1: if more than 60% of the cells in a table are empty, the table is discarded;

Rule 2: the basic criterion for deciding that a table contains QTL information is that it contains molecular marker information; to test this, the first four columns of the table are extracted and the cell contents are fuzzy-matched with regular expressions to check whether they contain the words "marker", "interval", or "loci";

Rule 3: where one phenotype, linkage group, or other item in the table corresponds to multiple molecular markers (1:N), the number of table rows is determined using the molecular marker as the baseline;

Rule 4: where one set of molecular marker, phenotype, or linkage group information in the table corresponds to multiple rows of other information, the consecutive blank cells below it are filled with that molecular marker, phenotype, or linkage group information;

Rule 5: where the table contains no phenotype or parental information, a dependency-tree-based syntactic parser extracts this information from the table caption and adds it to the result;

Step 4. Obtain QTL information from the document text

Some of the screened tables contain incomplete information; to fill in the missing information, processing proceeds in three steps: first, the title and notes of the table are scanned and the descriptive sentences related to the table are extracted; second, these sentences are analyzed with dictionary matching templates to extract the information missing from the table; third, if the second step returns no result, a dependency-tree-based syntactic parser is used to mine the missing table information contained in the sentences;

Step 5. Standardize and correct the mining results of Steps 3 and 4

The results of table and text mining are standardized and corrected in three respects:

(1) Abbreviation expansion: the place where an abbreviation first appears is required to give its full form;

(2) Validity check: records without a trait or molecular marker are deleted from the final result; in addition, the results are checked against a prior-knowledge database, and contradictory records are deleted from the final result;

(3) Duplicate check: records whose marker, phenotype, age, region, parent, and method information are all identical are kept only once in the final result.
2. The method according to claim 1, characterized by further comprising Step 6: outputting the QTL information obtained in Step 5 in the form of a web page table.
3. The method according to claim 1 or 2, characterized in that, in Step 1, the columns of data are separated by blank space, and whether a position is a vertical dividing line is judged according to whether the number of black points in the vertical direction within the table area equals the number of horizontal lines.