CN107590448A - Method for automatically obtaining QTL data from the literature - Google Patents

Method for automatically obtaining QTL data from the literature

Info

Publication number
CN107590448A
Authority
CN
China
Prior art keywords
information
qtl
rule
document
line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710761497.2A
Other languages
Chinese (zh)
Inventor
袁晓辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Ancient Gene Technology Co Ltd
Original Assignee
Wuhan Ancient Gene Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Ancient Gene Technology Co Ltd
Priority to CN201710761497.2A
Publication of CN107590448A
Legal status: Pending

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention belongs to the field of bioinformatics and in particular relates to a method for automatically obtaining QTL data from the literature. Using text mining, the method automatically mines and analyzes information such as QTLs and gene function from the relevant literature, applying computer data mining techniques to extract QTL information from PDF documents. This addresses the problems of current manual literature reading: the workload is large, the speed is slow, and newly published data cannot be processed in time. The method can also greatly reduce the workload of database construction.

Description

Method for automatically obtaining QTL data from the literature
Technical field
The invention belongs to the field of bioinformatics and in particular relates to a method for automatically obtaining QTL data from the literature.
Background
As the literature grows rapidly, biological researchers publish large amounts of data in papers, and obtaining these data quickly has become a challenge. Reading the papers by hand rarely finds the information of interest in a timely and effective way. How to extract useful information automatically from massive data has therefore become a pressing problem in bioinformatics, and mining the literature with natural language processing and machine learning has become an important means of solving it.
A QTL (quantitative trait locus) is an important piece of genome annotation information. At present, however, QTL information is obtained mainly by reading the literature manually, which involves a heavy workload, is slow, and hinders timely updates.
Summary of the invention
The technical problem to be solved by the invention is to provide a method for quickly and automatically obtaining QTL data from the literature.
To achieve the above purpose, the method of the invention comprises the following steps:
Step 1. Extract the structure and content of tables from PDF documents
Three-line tables in the document are analyzed using image recognition. The page is scanned row by row to quickly locate the table lines and hence the table; the row and column dividing lines are then located and, combined with OCR, the structure and content of the table are extracted. Finally, the caption of the table is extracted by keyword from the area above the first line of the three-line table.
Step 2. Screen for tables containing QTL information
If molecular marker information appears in a table, the table is taken as a candidate QTL table.
Step 3. Extract information from the screened tables
For a standard table, the header field contents are extracted directly and compared with the predefined database fields to determine the content type of each column. For a complex table, the following five rules are applied to handle multi-row headers and missing information and to convert it into a simple table:
Rule 1. If more than 60% of the cells in a table are empty, the table is discarded.
Rule 2. The basic criterion for a table to contain QTL information is that it contains molecular marker information. To test this, the first four columns of the table are extracted and the cell contents are fuzzy-matched with regular expressions to check for the words marker, interval and loci.
Rule 3. Where one phenotype, linkage group or other item in the table corresponds to multiple molecular markers (1:N), the number of table rows is determined using the molecular marker as the reference.
Rule 4. Where one group of molecular marker, phenotype or linkage group information corresponds to multiple rows of other information, the consecutive blank cells beneath it are filled with that molecular marker, phenotype or linkage group information.
Rule 5. Where the table contains no phenotype or parental information, a dependency-tree-based syntactic parser extracts this information from the table caption and adds it to the result.
Step 4. Obtain QTL information from the document text
Some screened tables are incomplete. To fill in the missing information, processing proceeds in three steps. First, the title and notes of the table are scanned and the descriptive sentences related to the table are extracted. Second, these sentences are analyzed with dictionary matching templates to extract the information missing from the table; for example, the template F $num:$num is used to match species information. Third, if the second step returns nothing, a dependency-tree-based syntactic parser is used to mine the missing table information contained in the descriptive sentences.
Step 5. Standardize and correct the results of Steps 3 and 4
The results of table and text mining are standardized and corrected in three respects:
(1) Abbreviation resolution: the full form must be given where an abbreviation first appears.
(2) Validity check: records without a trait or molecular marker are deleted from the final result; in addition, the results are checked against a prior-knowledge database, and records that contradict it are deleted from the final result.
(3) Duplicate check: where records have identical marker, phenotype, age, region, parent and method information, only one copy is kept in the final result.
Using text mining, the invention automatically mines and analyzes information such as QTLs and gene function from the relevant literature. It applies computer data mining techniques to obtain QTL information automatically from PDF documents, thereby addressing the problems of current manual literature reading: the workload is large, the speed is slow, and newly published data cannot be processed in time. The method can also greatly reduce the workload of database construction.
Brief description of the drawings
Fig. 1 is a schematic diagram of a three-line table.
Fig. 2 is a schematic diagram of Rule 3.
Fig. 3 is a schematic diagram of caption information mining.
Fig. 4 is a schematic diagram of the Stanford Parser decision tree.
Fig. 5 is a flow chart of QTL data mining.
Embodiment
Step 1. Extract the structure and content of tables from PDF documents
In academic documents in PDF format, tables in papers are usually presented as three-line tables, as shown in Fig. 1. Because a PDF page can be treated as an image that is naturally free of noise, we use image recognition to analyze three-line tables. A three-line table typically uses three table lines of identical length to delimit the header and the data field. The length of each table line can be treated as a count of consecutive black pixels. By scanning the page row by row we can quickly locate the table lines and hence the table, and from the line positions we can then distinguish the header field from the data field.
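The row-by-row scan described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the page has already been rasterized into a binary grid (1 = black, 0 = white), and the names `longest_black_run`, `find_three_line_table` and the `min_length` threshold are hypothetical.

```python
from typing import List, Optional, Tuple

Page = List[List[int]]  # assumed input: 1 = black pixel, 0 = white pixel


def longest_black_run(row: List[int]) -> Tuple[int, int]:
    """Return (length, start_column) of the longest run of black pixels."""
    best_len, best_start = 0, -1
    run_len, run_start = 0, 0
    for x, pixel in enumerate(row):
        if pixel:
            if run_len == 0:
                run_start = x
            run_len += 1
            if run_len > best_len:
                best_len, best_start = run_len, run_start
        else:
            run_len = 0
    return best_len, best_start


def find_three_line_table(
    page: Page, min_length: int
) -> Optional[Tuple[int, int, int, int, int]]:
    """Scan the page row by row and return the first three rules that share
    the same start column and length, as (top_y, middle_y, bottom_y, x0, x1)
    with x1 exclusive. Returns None when no such triple exists."""
    rules = []  # (y, start, length) of each candidate rule
    prev_hit = False
    for y, row in enumerate(page):
        length, start = longest_black_run(row)
        hit = length >= min_length
        if hit and not prev_hit:  # a rule several pixels thick counts once
            rules.append((y, start, length))
        prev_hit = hit
    for i in range(len(rules) - 2):
        (y0, s0, l0), (y1, s1, l1), (y2, s2, l2) = rules[i:i + 3]
        if (s0, l0) == (s1, l1) == (s2, l2):
            return y0, y1, y2, s0, s0 + l0
    return None
```

Under these assumptions, the rows between the top and middle rule form the header field and the rows between the middle and bottom rule form the data field.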
First, we determine the vertical dividing lines (all vertical lines in Fig. 1 except the vertical line below "Assignment"). Because the columns of data are separated by blank space, we decide whether a position is a vertical dividing line by checking whether the number of black pixels in that column of the table region equals the number of horizontal lines. Next, we determine the horizontal dividing lines (the horizontal lines in the data field of Fig. 1). In the data field the rows are likewise separated by blank space, so we locate the horizontal dividing lines of the data field by checking whether the number of consecutive white pixels equals the table width. Finally, we identify header cells that span multiple columns. Using a method similar to the one for finding table lines, we locate runs of consecutive black pixels within the header field; such a run marks a dividing line inside the header. The vertical dividing line (the vertical line below "Assignment" in Fig. 1) is then determined from the number of black pixels between this dividing line and the middle line.
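The two separator tests in this paragraph can be illustrated with a short sketch. It assumes the table region has been cropped to a binary grid that includes the horizontal rules, so a pixel column containing only the rules has exactly as many black pixels as there are rules. The function names are hypothetical, and the spanning-header step is not covered here.

```python
from typing import List, Tuple

Region = List[List[int]]  # assumed input: 1 = black pixel, 0 = white pixel


def column_gaps(region: Region, n_rules: int) -> List[Tuple[int, int]]:
    """Return (start, end) pixel-column ranges (end exclusive) whose only
    black pixels are those of the horizontal rules, i.e. blank separators."""
    width = len(region[0])
    gaps, start = [], None
    for x in range(width):
        black = sum(row[x] for row in region)
        if black == n_rules:
            if start is None:
                start = x
        elif start is not None:
            gaps.append((start, x))
            start = None
    if start is not None:
        gaps.append((start, width))
    return gaps


def blank_rows(data_region: Region) -> List[int]:
    """Return indices of pixel rows in the data field that are entirely
    white, i.e. whose run of white pixels equals the table width."""
    return [y for y, row in enumerate(data_region) if not any(row)]
```

Each returned gap can then be reduced to a single dividing position, for example its midpoint, before the cells are passed to OCR.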
Through the above steps, by locating the horizontal and vertical dividing lines and combining them with OCR, we extract the complete structure and content of the table. Finally, above the first line of the three-line table, we extract the caption of the table using the keyword "Table".
Step 2. Screen for tables containing QTL information
A scientific paper contains multiple tables, and the program must select those that contain QTL information. The fields that explicitly indicate that a table's content relates to QTL information are the molecular marker and the trait. According to statistics on the literature, however, trait information sometimes does not appear as a column of the table and may instead appear in the table title or table notes. Molecular marker information, by contrast, is almost always stored in the table itself, so if molecular marker information appears in a table, we treat that table as a candidate QTL table.
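A minimal sketch of this screening step, using the regular-expression test on the first four columns that the method describes under Rule 2. The table is assumed to be a list of rows of cell strings; the exact pattern, the added singular form "locus", and the function name are illustrative assumptions rather than the patent's own expression.

```python
import re
from typing import List

# Fuzzy match for the marker vocabulary named in the method.
# "locus" is added here as an assumption; the source lists only
# marker, interval and loci.
MARKER_PATTERN = re.compile(r"\b(marker|interval|loc(?:us|i))", re.IGNORECASE)


def is_candidate_qtl_table(table: List[List[str]], max_columns: int = 4) -> bool:
    """Return True if any cell in the first `max_columns` columns mentions
    molecular marker vocabulary."""
    for row in table:
        for cell in row[:max_columns]:
            if cell and MARKER_PATTERN.search(cell):
                return True
    return False
```

Tables for which the function returns False would be reported as non-QTL tables rather than processed further.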
Step 3. Extract information from the screened tables
Tables in the literature fall into two kinds. A standard table has its header in the first row and has no cells that span multiple rows or columns. A complex table has a header that spans multiple rows, or contains cells that span multiple rows or columns. Processing a standard table is simple: the header field contents are extracted directly and compared with the predefined database fields to determine the content type of each column. A complex table needs further processing to convert it into a simple table. We have summarized five conversion rules that specifically handle multi-row headers and missing information. These five rules correctly process 94% of QTL tables.
Rule 1. If more than 60% of the cells in a table are empty, the table is discarded.
Rule 2. The basic criterion for a table to contain QTL information is that it contains molecular marker information. To test this, the first four columns of the table are extracted and the cell contents are fuzzy-matched with regular expressions to check for words such as marker, interval and loci.
Rule 3. Where one phenotype, linkage group or other item in the table corresponds to multiple molecular markers (1:N), we determine the number of table rows using the molecular marker as the reference. As shown in Fig. 2, the phenotype "No. of pods per plant" in the table corresponds to multiple molecular markers. While scanning the table, the program finds multiple blank cells under this phenotype and then checks the content of the molecular marker column; if the cells to the left and right contain content, Rule 3 is satisfied and the blank cells are automatically filled with the phenotype name in the result.
Rule 4. Where one group of molecular marker, phenotype or linkage group information corresponds to multiple rows of other information, we fill the consecutive blank cells beneath it with that molecular marker, phenotype or linkage group information.
Rule 5. Where the table contains no phenotype or parental information, we use a dependency-tree-based syntactic parser to extract this information from the table caption and add it to the result.
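Rules 1, 3 and 4 lend themselves to a short illustration. The sketch below assumes the data rows are already available as lists of cell strings; the helper names and the marker names in the usage note are hypothetical. Rule 2 (marker vocabulary) and Rule 5 (caption parsing) are outside its scope.

```python
from typing import List, Optional, Sequence

Table = List[List[str]]


def empty_fraction(table: Table) -> float:
    """Fraction of cells that are empty or whitespace only."""
    cells = [cell for row in table for cell in row]
    if not cells:
        return 1.0
    return sum(1 for cell in cells if not cell.strip()) / len(cells)


def fill_down(table: Table, columns: Sequence[int]) -> Table:
    """Rules 3 and 4: in the given columns, copy the last non-empty value
    into the consecutive blank cells beneath it. Returns a new table."""
    filled = [list(row) for row in table]
    last = {}
    for row in filled:
        for c in columns:
            if row[c].strip():
                last[c] = row[c]
            elif c in last:
                row[c] = last[c]
    return filled


def simplify(table: Table, fill_columns: Sequence[int],
             max_empty: float = 0.6) -> Optional[Table]:
    """Rule 1 followed by Rules 3 and 4. Returns None for a discarded table."""
    if empty_fraction(table) > max_empty:
        return None
    return fill_down(table, fill_columns)
```

For instance, a phenotype column holding "No. of pods per plant" followed by two blank cells, next to a marker column with three marker names, comes back with the phenotype repeated on all three rows, so the marker column fixes the row count.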
Step 4. Obtain QTL information from the document text
Statistics on the tables show that some tables are incomplete. In some tables, information such as the trait and the parents is located in the title, the notes or elsewhere in the document. To fill in the information missing from the table, the invention proceeds in three steps. First, the title and notes of the table are scanned and the descriptive sentences related to the table are extracted. Second, these sentences are first analyzed with simple dictionary matching templates to extract the information missing from the table; the matching templates are shown in Table 1. Third, if the second step returns nothing, a dependency-tree-based syntactic parser is used to mine the missing table information contained in complex descriptive sentences. A dependency-tree-based natural language processing algorithm decomposes a sentence (including punctuation) into lexical units such as noun phrases, verb phrases, punctuation marks, locative phrases, prepositional phrases, adverbial phrases, adjective phrases, determiner phrases and quantifier phrases, then derives the dependency relations among these phrases in the sentence, forming a tree-shaped relation graph. Text information can then be mined further by applying predefined grammar rules to the dependency tree.
Table 1. Sample matching templates
For example, Table 1 contains no parental information; the parental information is stored in the table title. As shown in Fig. 3, we use the Stanford Parser to find the phenotype (trigonelline biosynthesis) and the parental information (Essex and Forrest) in the table title and add them to the final analysis result.
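The second step, dictionary template matching, can be sketched with regular expressions. The template syntax below follows the one example given in this document, F $num:$num; how the patent actually compiles its templates is not stated, so `compile_template`, `mine_sentences` and the field name `generation` are assumptions made for illustration.

```python
import re
from typing import Dict, List


def compile_template(template: str) -> "re.Pattern[str]":
    """Turn a template such as 'F $num:$num' into a regular expression.
    '$num' stands for a run of digits; spaces in the template allow
    optional whitespace in the text."""
    parts = []
    for token in template.split():
        literals = [re.escape(piece) for piece in token.split("$num")]
        parts.append(r"(\d+)".join(literals))
    return re.compile(r"\b" + r"\s*".join(parts))


def mine_sentences(sentences: List[str],
                   templates: Dict[str, str]) -> Dict[str, str]:
    """Return the first matching text for each field. An empty result
    signals that the parser-based third step should be tried."""
    found = {}
    for field, template in templates.items():
        pattern = compile_template(template)
        for sentence in sentences:
            match = pattern.search(sentence)
            if match:
                found[field] = match.group(0)
                break
    return found
```

The dependency-tree step itself is not sketched here, since it depends on an external parser such as the Stanford Parser named in the text.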
Step 5. Standardize and correct the results of Steps 3 and 4
Abbreviations, errors and duplicate records inevitably arise during table conversion and text mining. The invention standardizes and corrects the results of table and text mining in three respects. (1) Abbreviation resolution: by the conventions of scientific writing, the full form must be given where an abbreviation first appears. If an abbreviation appears in a table, we scan the whole article, look up its full form by whole-word matching, and add it to the output. (2) Validity check: records without a trait or molecular marker are deleted from the final result. In addition, to ensure the correctness of the mining results, we check them against a prior-knowledge database (for example, a table of known correspondences between linkage groups and molecular markers) and delete any contradictory records from the final result. (3) Duplicate check: where records have identical marker, phenotype, age, region, parent and method information, only one copy is kept in the final result.
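The three checks can be sketched as below. Records are assumed to be dictionaries keyed by field name, and the prior-knowledge database is reduced to a hypothetical marker-to-linkage-group mapping. The abbreviation lookup uses a simple "full form (ABBR)" heuristic with initial-letter matching, which is an assumption and not necessarily the whole-word matching the patent uses.

```python
from typing import Dict, List, Optional

# The six fields the duplicate check compares, as listed in the text.
KEY_FIELDS = ("marker", "phenotype", "age", "region", "parent", "method")


def find_full_form(abbreviation: str, text: str) -> Optional[str]:
    """Heuristic: find '(ABBR)' and accept the preceding words as the full
    form if their initials spell the abbreviation."""
    pos = text.find(f"({abbreviation})")
    if pos == -1:
        return None
    candidate = text[:pos].split()[-len(abbreviation):]
    initials = "".join(word[0] for word in candidate)
    if initials.lower() == abbreviation.lower():
        return " ".join(candidate)
    return None


def clean_records(records: List[Dict[str, str]],
                  known_linkage: Optional[Dict[str, str]] = None
                  ) -> List[Dict[str, str]]:
    """Apply the validity check and the duplicate check, keeping order."""
    seen = set()
    cleaned = []
    for rec in records:
        # Validity: a record needs both a trait (phenotype) and a marker.
        if not rec.get("phenotype") or not rec.get("marker"):
            continue
        # Prior knowledge: drop records whose linkage group contradicts it.
        if known_linkage is not None:
            expected = known_linkage.get(rec["marker"])
            if expected is not None and rec.get("linkage_group") not in (None, expected):
                continue
        # Duplicates: keep one copy per identical set of key fields.
        key = tuple(rec.get(field) for field in KEY_FIELDS)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

Records whose marker is absent from the hypothetical `known_linkage` mapping pass the prior-knowledge check unchanged, since there is nothing to contradict.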
Step 6. Output the QTL data obtained in Step 5
The extraction results are displayed as a web table, including all tables in the document judged to be QTL tables and all non-QTL tables. The data extracted from the QTL tables are listed in full, including both complete and incomplete records.
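As a rough illustration of the web-table output, the sketch below renders extracted records as an HTML table using only the Python standard library. The function name and the layout are assumptions; the patent does not describe how its web page is generated.

```python
from html import escape
from typing import List, Sequence


def render_table(headers: Sequence[str], rows: List[Sequence[object]]) -> str:
    """Render headers and rows as a minimal HTML table, escaping cell text."""
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"
```

Incomplete records can be shown in the same table with empty cells, so that complete and incomplete information appear side by side.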

Claims (3)

  1. A method for quickly and automatically obtaining QTL data from the literature, characterized in that:
    Step 1. The structure and content of tables are extracted from PDF documents
    Three-line tables in the document are analyzed using image recognition. The page is scanned row by row to quickly locate the table lines and hence the table; the row and column dividing lines are then located and, combined with OCR, the structure and content of the table are extracted; finally, the caption of the table is extracted by keyword from the area above the first line of the three-line table;
    Step 2. Tables containing QTL information are screened
    If molecular marker information appears in a table, the table is taken as a candidate QTL table;
    Step 3. Information is extracted from the screened tables
    For a standard table, the header field contents are extracted directly and compared with the predefined database fields to determine the content type of each column; for a complex table, the following five rules are applied to handle multi-row headers and missing information and to convert it into a simple table:
    Rule 1. If more than 60% of the cells in a table are empty, the table is discarded;
    Rule 2. The basic criterion for a table to contain QTL information is that it contains molecular marker information; to test this, the first four columns of the table are extracted and the cell contents are fuzzy-matched with regular expressions to check for the words marker, interval and loci;
    Rule 3. Where one phenotype, linkage group or other item in the table corresponds to multiple molecular markers (1:N), the number of table rows is determined using the molecular marker as the reference;
    Rule 4. Where one group of molecular marker, phenotype or linkage group information corresponds to multiple rows of other information, the consecutive blank cells beneath it are filled with that molecular marker, phenotype or linkage group information;
    Rule 5. Where the table contains no phenotype or parental information, a dependency-tree-based syntactic parser extracts this information from the table caption and adds it to the result;
    Step 4. QTL information is obtained from the document text
    Some screened tables are incomplete; to fill in the missing information, processing proceeds in three steps: first, the title and notes of the table are scanned and the descriptive sentences related to the table are extracted; second, these sentences are analyzed with dictionary matching templates to extract the information missing from the table; third, if the second step returns nothing, a dependency-tree-based syntactic parser is used to mine the missing table information contained in the descriptive sentences;
    Step 5. The results of Steps 3 and 4 are standardized and corrected
    The results of table and text mining are standardized and corrected in three respects:
    (1) Abbreviation resolution: the full form must be given where an abbreviation first appears;
    (2) Validity check: records without a trait or molecular marker are deleted from the final result; in addition, the results are checked against a prior-knowledge database, and records that contradict it are deleted from the final result;
    (3) Duplicate check: where records have identical marker, phenotype, age, region, parent and method information, only one copy is kept in the final result.
  2. The method according to claim 1, characterized by further comprising Step 6: the QTL data obtained in Step 5 are output as a web table.
  3. The method according to claim 1 or 2, characterized in that, in Step 1, the columns of data are separated by blank space, and whether a position is a vertical dividing line is judged by whether the number of vertical black pixels in the table region equals the number of horizontal lines.
CN201710761497.2A 2017-08-30 2017-08-30 Method for automatically obtaining QTL data from the literature Pending CN107590448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710761497.2A CN107590448A (en) 2017-08-30 2017-08-30 Method for automatically obtaining QTL data from the literature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710761497.2A CN107590448A (en) 2017-08-30 2017-08-30 Method for automatically obtaining QTL data from the literature

Publications (1)

Publication Number Publication Date
CN107590448A true CN107590448A (en) 2018-01-16

Family

ID=61050221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710761497.2A Pending CN107590448A (en) 2017-08-30 2017-08-30 Method for automatically obtaining QTL data from the literature

Country Status (1)

Country Link
CN (1) CN107590448A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101556606A (en) * 2009-05-20 2009-10-14 同方知网(北京)技术有限公司 Data mining method based on extraction of Web numerical value tables

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姜楠楠: "基于文档集的生物信息挖掘模型研究与实现", 《中国优秀硕士学位论文全文数据库》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569489A (en) * 2018-06-05 2019-12-13 北京国双科技有限公司 Form data analysis method and device based on PDF file
CN110968667A (en) * 2019-11-27 2020-04-07 广西大学 Periodical and literature table extraction method based on text state characteristics
CN112232198A (en) * 2020-10-15 2021-01-15 北京来也网络科技有限公司 Table content extraction method, device, equipment and medium based on RPA and AI
WO2022116827A1 (en) * 2020-12-03 2022-06-09 International Business Machines Corporation Automatic delineation and extraction of tabular data in portable document format using graph neural networks
US11599711B2 (en) 2020-12-03 2023-03-07 International Business Machines Corporation Automatic delineation and extraction of tabular data in portable document format using graph neural networks
GB2616556A (en) * 2020-12-03 2023-09-13 Ibm Automatic delineation and extraction of tabular data in portable document format using graph neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180116
