CN103914440A - Intelligent extracting method for project characteristic indexes in transmission and transformation project word document table contents - Google Patents


Info

Publication number
CN103914440A
Authority
CN
China
Prior art keywords
paragraph
outline
project
rank
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410081331.2A
Other languages
Chinese (zh)
Inventor
吴烈鑫
刘志明
陈锟
张章亮
李国勇
陈铭
王彦峰
侯凯
陈宝珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Grid Planning Research Center of Guangdong Power Grid Co Ltd
Original Assignee
Grid Planning Research Center of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Grid Planning Research Center of Guangdong Power Grid Co Ltd
Priority to CN201410081331.2A
Publication of CN103914440A
Legal status: Pending


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent method for extracting project characteristic indexes from the table contents of transmission and transformation project Word documents. The method comprises, in order: creating an index library for the project characteristic indexes, extracting the outline levels, and extracting the table contents that hold the project characteristic indexes. The created index library serves as the basic database for index extraction. When the characteristic indexes of a transmission and transformation project are extracted, outline level extraction is performed first and table content extraction second, after which the required indexes are obtained. The method automatically extracts the required project characteristic index information from design documents and improves the working efficiency of experts in design review.

Description

Intelligent extraction method for project characteristic indexes in the table contents of transmission and transformation project Word documents
Technical field
The present invention relates to methods for extracting project characteristic indexes from Word documents of power transmission and transformation projects, and specifically to an intelligent method for extracting project characteristic indexes from the table contents of such documents.
Background art
In the review of a transmission and transformation project, the review expert must read the review report repeatedly to extract the content of the review indexes from a large amount of text and tables, and only then can give an overall review opinion. In this process the expert has to search the document repeatedly and record the relevant index content. All of this is done manually, which greatly reduces the expert's efficiency and accuracy.
Project characteristic index information is found mainly in the engineering design report. Different characteristic indexes are distributed across different chapters, where they are either described in body text or presented in tables. Reading through the report and searching it loosely for characteristic index information is inefficient.
Summary of the invention
The object of the present invention is to provide an intelligent method for extracting project characteristic indexes from the table contents of transmission and transformation project Word documents. The method automatically extracts the required project characteristic index information from design documents and improves the working efficiency of experts in design review.
The above object is achieved by the following technical solution:
An intelligent method for extracting project characteristic indexes from the table contents of transmission and transformation project Word documents, characterized in that the method comprises, in order, creating an index library of project characteristic indexes, extracting outline levels, and extracting table contents. The created index library serves as the basic database for index extraction. When the characteristic indexes of a transmission and transformation project are extracted, outline level extraction is performed first and table content extraction second;
The outline level extraction comprises the following steps in order:
(1) initialize the Word document and record the number of paragraphs occupied by each table;
(2) traverse each paragraph in the Word document, parse its paragraph attributes, and record its paragraph number;
(3) determine from the paragraph attributes whether the paragraph is inside a table. If it is, skip the number of paragraphs occupied by that table, record the sequence number of the table in the document and the outline title under which it sits, and return to step (2). If it is not, continue with step (4);
(4) examine the paragraph attribute. If its value is not body text, take the outline level value of the paragraph directly and record the level. If its value is body text, set the outline level to body text and return to step (2);
(5) where the outline level is body text, parse the paragraph content with regular expressions according to the following rules:
A. a custom outline paragraph begins with digits or letters, and the parts of the outline numbering are separated by "."; if the content that follows the numbering begins with a digit, a space must separate the two;
B. filter out paragraphs that begin with a digit but are not outline entries;
C. use regular expressions to derive the outline level of the paragraph from the digits and letters;
The table content extraction comprises the following steps in order:
(1) obtain from the created index library the chapter title under which the project characteristic index sits, together with the row and column headers and the expression needed for table extraction;
(2) match against the result of the outline level extraction to obtain the tables under the corresponding outline title and its child outline nodes, together with their sequence numbers;
(3) locate the corresponding tables in the document directly by table number;
(4) traverse each table, determine the unique cell from the configured row and column headers, and extract the project characteristic index directly from the table, thereby obtaining the required transmission and transformation project characteristic indexes.
Compared with the prior art, the present invention can extract project characteristic indexes from the table contents of transmission and transformation project Word documents, improving the working efficiency of experts in design review.
Brief description of the drawings
The present invention is described in further detail below with reference to the drawings and a specific embodiment.
Fig. 1 is the overall flow diagram of the intelligent extraction method of the present invention;
Fig. 2 is the flow diagram of outline level extraction in the intelligent extraction method of the present invention;
Fig. 3 is the flow diagram of table content extraction in the intelligent extraction method of the present invention;
Fig. 4 is a schematic diagram of the index library created in the intelligent extraction method of the present invention;
Fig. 5 is a schematic diagram of the extraction result of the intelligent extraction method of the present invention.
Embodiment
As shown in Figs. 1 to 5, the intelligent method of the present invention for extracting project characteristic indexes from the table contents of transmission and transformation project Word documents comprises, in order, creating an index library of project characteristic indexes, extracting outline levels, and extracting table contents. The created index library serves as the basic database for index extraction. When the characteristic indexes of a transmission and transformation project are extracted, outline level extraction is performed first and table content extraction second.
First, the project characteristic index information is defined. The engineering design report is then vectorized: the outline titles in the report are extracted, and the textual descriptions are separated from the tabular descriptions. The project characteristic information is extracted in several ways and displayed in summary form, which makes it easy for review experts to check the characteristic index information and improves their review efficiency. The overall flow of the intelligent extraction is shown in Fig. 1.
Creation of the index library
The index library is the unit that organizes and stores the project characteristic indexes. The indexes are organized and stored as a tree. Each index comprises basic information and extraction method information, and one index may have several extraction methods. The basic information mainly includes the index name, discipline, project type, voltage level, and index unit. The extraction method information includes the review stage, chapter title, whether the index is extracted from a table, row header, column header, expression, keywords, extraction method, and so on. The index library is the basic data for intelligent extraction, and review experts can define their own project characteristic index data. The created index library is shown in Fig. 4.
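The tree organization described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `IndexNode` and `ExtractionMethod`, the field names, and the sample values are hypothetical and are not taken from the patent or from Fig. 4.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class ExtractionMethod:
    """One way of extracting an index (field names are illustrative)."""
    review_stage: str
    chapter_title: str
    from_table: bool
    row_header: Optional[str] = None
    column_header: Optional[str] = None
    expression: Optional[str] = None
    keywords: List[str] = field(default_factory=list)


@dataclass
class IndexNode:
    """A node of the index tree: basic information plus extraction methods."""
    name: str
    discipline: str = ""
    project_type: str = ""
    voltage_level: str = ""
    unit: str = ""
    methods: List[ExtractionMethod] = field(default_factory=list)
    children: List["IndexNode"] = field(default_factory=list)

    def walk(self) -> Iterator["IndexNode"]:
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Hypothetical sample entry; the values are invented for illustration only.
root = IndexNode(name="Substation indexes")
capacity = IndexNode(
    name="Main transformer capacity",
    discipline="Primary electrical",
    project_type="Substation",
    voltage_level="220kV",
    unit="MVA",
    methods=[
        ExtractionMethod(
            review_stage="Feasibility study",
            chapter_title="Construction scale",
            from_table=True,
            row_header="Main transformer capacity (MVA)",
            column_header="Current phase",
        )
    ],
)
root.children.append(capacity)
names = [node.name for node in root.walk()]
```

Keeping the extraction methods in a list on each node reflects the statement that one index may have several extraction methods.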
Outline level extraction in document preprocessing
In Word, every paragraph has an outline level attribute: either body text or a specific level, such as level 1, level 2, level 3, and so on. When editing a Word document, an author can use Word's built-in paragraph headings, list numbering, and similar features. These are referred to here as "outline levels", and the built-in outline levels in Word form a tree-like data structure. An author can also type paragraph numbering by hand, distinguishing headings with conventional digit and letter markers. Such headings are called "custom outline levels", for example "1 title 1" or "a title a", and Word itself cannot recognize them. Extraction of the outline levels of a Word document therefore has to cover both Word's built-in outline levels and custom outline levels.
When the outline levels of a document are extracted, the method records at the same time the paragraph of each outline entry, the level of that entry, and the outline level under which each table sits in the document. The outline level extraction flow is shown in Fig. 2.
Outline level extraction in a Word document comprises the following steps in order:
1. initialize the Word document and record the number of paragraphs occupied by each table;
2. traverse each paragraph in the Word document, parse its paragraph attributes, and record its paragraph number;
3. determine from the paragraph attributes whether the paragraph is inside a table. If it is, skip the number of paragraphs occupied by that table, record the sequence number of the table in the document and the outline title under which it sits, and return to step 2. If it is not, continue with step 4;
4. examine the paragraph attribute. If its value is not body text, take the outline level value of the paragraph directly and record the level. If its value is body text, set the outline level to body text and return to step 2;
5. where the outline level is body text, parse the paragraph content with regular expressions according to the following rules:
A. a custom outline paragraph begins with digits or letters, and the parts of the outline numbering are separated by "."; if the content that follows the numbering begins with a digit, a space must separate the two, as in "1 110kV power distribution equipment";
B. filter out paragraphs that begin with a digit but are not outline entries, such as a paragraph beginning "220 kilovolt Wei Tang substation mainly serves the southwest of Huicheng District". The regular expression filters out cases where the digits are followed by specific strings such as "kV", "mA", "kilovolt", or "time";
C. use regular expressions to derive the outline level of the paragraph from the digits and letters.
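The traversal and the regular expression rules above can be sketched as follows. This is a simplified illustration, not the implementation behind Fig. 2: the `Paragraph` record stands in for whatever a Word API would return, the names `custom_level` and `extract_outline` are hypothetical, and the two patterns are one possible reading of rules A to C. A single-letter word at the start of an English sentence would be mistaken for a letter heading, which the sketch does not try to handle.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Rule A: numbering built from digits or single letters separated by ".",
# followed by whitespace and then the heading text.
HEADING_RE = re.compile(r"^((?:\d+|[A-Za-z])(?:\.(?:\d+|[A-Za-z]))*)\.?\s+(\S.*)$")
# Rule B: a leading number followed by a unit-like string is not a heading.
# The unit list here is illustrative and incomplete.
NOT_HEADING_RE = re.compile(r"^\d+(?:\.\d+)?\s*(?:kV|mA|kilovolt)")


def custom_level(text: str) -> Optional[Tuple[int, str]]:
    """Return (level, title) for a custom outline paragraph, else None."""
    text = text.strip()
    if NOT_HEADING_RE.match(text):
        return None
    m = HEADING_RE.match(text)
    if m is None:
        return None
    numbering, title = m.groups()
    # Rule C: the level is the number of parts in the numbering.
    return numbering.count(".") + 1, title


@dataclass
class Paragraph:
    text: str
    outline_level: Optional[int] = None  # None stands for "body text"
    in_table: bool = False


def extract_outline(paragraphs: List[Paragraph], table_sizes: List[int]):
    """Collect headings and the heading under which each table sits."""
    headings = []  # (paragraph number, level, title)
    tables = []    # (table sequence number, enclosing heading title)
    current_title = None
    i = 0
    while i < len(paragraphs):
        p = paragraphs[i]
        if p.in_table:
            seq = len(tables)
            tables.append((seq + 1, current_title))
            i += max(1, table_sizes[seq])  # skip the table's paragraphs
            continue
        if p.outline_level is not None:
            level, title = p.outline_level, p.text
        else:
            parsed = custom_level(p.text)
            if parsed is None:
                i += 1
                continue
            level, title = parsed
        headings.append((i + 1, level, title))
        current_title = title
        i += 1
    return headings, tables


doc = [
    Paragraph("Construction scale", outline_level=1),
    Paragraph("1 110kV power distribution equipment"),
    Paragraph("220 kV Weitang substation mainly serves Huicheng District"),
    Paragraph("Item", in_table=True),
    Paragraph("Value", in_table=True),
    Paragraph("1.1 Main transformer"),
]
headings, tables = extract_outline(doc, table_sizes=[2])
```

In this sample the paragraph starting with "220 kV" is rejected by the unit filter, while "1 110kV ..." is accepted because a space separates the numbering from the digits that follow.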
Table content extraction
The table content extraction flow is shown in Fig. 3. Table content extraction comprises the following steps in order:
1. obtain from the created index library the chapter title under which the project characteristic index sits, together with the row and column headers and the expression needed for table extraction;
2. match against the result of the outline level extraction to obtain the tables under the corresponding outline title and its child outline nodes, together with their sequence numbers;
3. locate the corresponding tables in the document directly by table number;
4. traverse each table, determine the unique cell from the configured row and column headers, and extract the project characteristic index directly from the table, thereby obtaining the required transmission and transformation project characteristic indexes, as shown in Fig. 5. Instead of being read directly from the table in this step, the index may also be extracted by means of an expression.
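The four steps above can be sketched as a lookup over the outline result. The sketch assumes that headings are available as (level, title) pairs in document order, that each table is recorded with the position of its enclosing heading, and that a table's first row and first column hold its headers. The function names and the sample data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Table = List[List[str]]  # rows of cell text; row 0 and column 0 hold headers


def tables_under(chapter_title: str,
                 headings: List[Tuple[int, str]],
                 table_locations: List[Tuple[int, int]]) -> List[int]:
    """Sequence numbers of tables under a chapter and its child headings."""
    for start, (level, title) in enumerate(headings):
        if title == chapter_title:
            break
    else:
        return []
    end = start + 1
    # Child nodes are the following headings with a deeper level.
    while end < len(headings) and headings[end][0] > level:
        end += 1
    return [seq for seq, h in table_locations if start <= h < end]


def find_cell(table: Table, row_header: str, column_header: str) -> Optional[str]:
    """Return the cell addressed by the configured row and column headers."""
    if not table:
        return None
    try:
        col = table[0].index(column_header)
    except ValueError:
        return None
    for row in table[1:]:
        if row and row[0] == row_header and col < len(row):
            return row[col]
    return None


def extract_index(chapter_title: str, row_header: str, column_header: str,
                  headings, table_locations, tables: Dict[int, Table]) -> Optional[str]:
    """Try each candidate table in order and return the first matching cell."""
    for seq in tables_under(chapter_title, headings, table_locations):
        value = find_cell(tables[seq], row_header, column_header)
        if value is not None:
            return value
    return None


# Hypothetical sample data, invented for illustration only.
headings = [(1, "Construction scale"), (2, "Main transformer"), (1, "Investment estimate")]
table_locations = [(1, 1), (2, 2)]  # (table sequence number, index into headings)
tables = {
    1: [["Item", "Current phase", "Final phase"],
        ["Main transformer capacity (MVA)", "2x180", "4x180"]],
    2: [["Item", "Amount"],
        ["Static investment", "n/a"]],
}
value = extract_index("Construction scale", "Main transformer capacity (MVA)",
                      "Current phase", headings, table_locations, tables)
```

Restricting the search to tables under the configured chapter keeps a row and column header pair from matching an unrelated table elsewhere in the report.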
Display of project characteristic information
The extracted technical indexes are displayed per project, by project number and in tabular form, for experts to consult during review. The expert can also modify an extracted index value directly and locate it in the document, where it is highlighted.
The above embodiment does not limit the scope of protection of the present invention, and embodiments of the invention are not limited to it. Any other modification, replacement, or change made to the above in light of the foregoing content, using ordinary technical knowledge and customary means of the art and without departing from the basic technical idea of the invention, falls within the scope of protection of the present invention.

Claims (1)

1. An intelligent method for extracting project characteristic indexes from the table contents of transmission and transformation project Word documents, characterized in that the method comprises, in order, creating an index library of project characteristic indexes, extracting outline levels, and extracting table contents, wherein the created index library serves as the basic database for index extraction, and when the characteristic indexes of a transmission and transformation project are extracted, outline level extraction is performed first and table content extraction second;
The outline level extraction comprises the following steps in order:
(1) initialize the Word document and record the number of paragraphs occupied by each table;
(2) traverse each paragraph in the Word document, parse its paragraph attributes, and record its paragraph number;
(3) determine from the paragraph attributes whether the paragraph is inside a table; if it is, skip the number of paragraphs occupied by that table, record the sequence number of the table in the document and the outline title under which it sits, and return to step (2); if it is not, continue with step (4);
(4) examine the paragraph attribute; if its value is not body text, take the outline level value of the paragraph directly and record the level; if its value is body text, set the outline level to body text and return to step (2);
(5) where the outline level is body text, parse the paragraph content with regular expressions according to the following rules:
A. a custom outline paragraph begins with digits or letters, and the parts of the outline numbering are separated by "."; if the content that follows the numbering begins with a digit, a space must separate the two;
B. filter out paragraphs that begin with a digit but are not outline entries;
C. use regular expressions to derive the outline level of the paragraph from the digits and letters;
The table content extraction comprises the following steps in order:
(1) obtain from the created index library the chapter title under which the project characteristic index sits, together with the row and column headers and the expression needed for table extraction;
(2) match against the result of the outline level extraction to obtain the tables under the corresponding outline title and its child outline nodes, together with their sequence numbers;
(3) locate the corresponding tables in the document directly by table number;
(4) traverse each table, determine the unique cell from the configured row and column headers, and extract the project characteristic index directly from the table, thereby obtaining the required transmission and transformation project characteristic indexes.
CN201410081331.2A 2014-03-06 2014-03-06 Intelligent extracting method for project characteristic indexes in transmission and transformation project word document table contents Pending CN103914440A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410081331.2A CN103914440A (en) 2014-03-06 2014-03-06 Intelligent extracting method for project characteristic indexes in transmission and transformation project word document table contents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410081331.2A CN103914440A (en) 2014-03-06 2014-03-06 Intelligent extracting method for project characteristic indexes in transmission and transformation project word document table contents

Publications (1)

Publication Number Publication Date
CN103914440A true CN103914440A (en) 2014-07-09

Family

ID=51040134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410081331.2A Pending CN103914440A (en) 2014-03-06 2014-03-06 Intelligent extracting method for project characteristic indexes in transmission and transformation project word document table contents

Country Status (1)

Country Link
CN (1) CN103914440A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2291010A1 (en) * 2008-06-05 2011-03-02 Peking University Founder Group Co., Ltd Structure processing method and apparatus for layout file
CN101556606A (en) * 2009-05-20 2009-10-14 同方知网(北京)技术有限公司 Data mining method based on extraction of Web numerical value tables
CN103399857A (en) * 2013-07-01 2013-11-20 北京航空航天大学 General method for extracting document structural information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘力: "科技文档信息抽取与格式化技术研究", 《中国优秀硕士学位论文全文数据库•信息科技辑》 *
杨桢 等: "基于正则表达式的信息抽取系统在国防技术监测中的应用", 《北京理工大学学报》 *
赵洪 等: "Web表格信息抽取研究综述", 《现代图书情报技术》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104504221A (en) * 2015-01-13 2015-04-08 北京恒华伟业科技股份有限公司 Evaluation data processing method and system
CN105184513A (en) * 2015-10-16 2015-12-23 北京恒华伟业科技股份有限公司 Evaluation result output method and system
CN105184514A (en) * 2015-10-19 2015-12-23 广东电网有限责任公司电网规划研究中心 Power grid design index extraction method based on sequence label
CN105389302A (en) * 2015-10-19 2016-03-09 广东电网有限责任公司电网规划研究中心 Power grid design review index structure information identification method
CN105389302B (en) * 2015-10-19 2017-11-28 广东电网有限责任公司电网规划研究中心 A kind of electrical reticulation design appraised index structural information recognition methods
CN105373885A (en) * 2015-11-10 2016-03-02 国网福建省电力有限公司 Electric power engineering design review and technical economic evaluation information system
CN108073678A (en) * 2017-11-06 2018-05-25 广东广业开元科技有限公司 Applied to document analyzing and processing method, system and the device in big data analysis
CN108073678B (en) * 2017-11-06 2020-08-28 广东广业开元科技有限公司 Document analysis processing method, system and device applied to big data analysis
CN113361256A (en) * 2021-06-24 2021-09-07 上海真虹信息科技有限公司 Rapid Word document parsing method based on Aspose technology

Similar Documents

Publication Publication Date Title
CN103914440A (en) Intelligent extracting method for project characteristic indexes in transmission and transformation project word document table contents
CN103927296A (en) Intelligent extracting method for engineering characteristic indexes in paragraph contents of word document of transmission and transformation project
CN104331446B (en) A kind of massive data processing method mapped based on internal memory
CN101763422B (en) Method for storing vector data and indexing space
CN106066866A (en) A kind of automatic abstracting method of english literature key phrase and system
CN102156711B (en) Cloud storage based power full text retrieval method and system
CN104063365B (en) The method that object is inserted into PDF document
CN104537116A (en) Book search method based on tag
CN104063519B (en) BPA power grid data analyzing and managing method and system based on EXCEL
CN104035956A (en) Time-series data storage method based on distributive column storage
CN107766433A (en) A kind of range query method and device based on Geo BTree
CN105335488A (en) Knowledge base construction method
CN104572978A (en) User behavior counting method for power scheduling automatic system based on log
CN107463711A (en) A kind of tag match method and device of data
CN102508901A (en) Content-based massive image search method and content-based massive image search system
CN103020283B (en) A kind of semantic retrieving method of the dynamic restructuring based on background knowledge
CN103150632B (en) Flood control based on water conservation cloud platform is taked precautions against drought the construction method of bulletin generation system
CN103699555A (en) Multisource data real-time database data generation method applicable to scheduling and transformer substation integrated system
CN104679829A (en) Quick search method and apparatus of license plate numbers
CN104408128B (en) A kind of reading optimization method indexed based on B+ trees asynchronous refresh
CN107944659A (en) A kind of substation supervisory control system auto report completing method and apparatus
CN103455964B (en) A kind of case clue analytic system based on case information and method
CN106649879A (en) Method for intelligent recommendation of professional book in library
CN104615782A (en) Address matching method based on sliding window maximum matching algorithm
CN109033370A (en) A kind of method and device that searching similar shop, the method and device of shop access

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140709
