CN112597353B

CN112597353B - Text information automatic extraction method

Info

Publication number: CN112597353B
Application number: CN202011507003.6A
Authority: CN
Inventors: 刘金硕; 王晨阳; 邓娟; 黄朔; 刘宁; 唐浩洲
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2024-03-08
Anticipated expiration: 2040-12-18
Also published as: CN112597353A

Abstract

The invention discloses an automatic text information extraction method. Parameter information about the target objects in bidding documents is currently extracted by hand, which consumes a great deal of labor and time. The invention uses natural language processing techniques to extract parameter information from bidding text automatically, and comprises three parts: structuring the bidding text, extracting the target object parameter information, and generating an extraction report. Structuring the bidding text comprises extracting bookmark information with pypdf2, recognizing the pdf bidding text with pdfplumber, cleaning the text with regular expressions, and then parsing the text into a structure by rule matching. Target object parameter extraction uses regular expressions to accurately identify and extract the technical parameters of the target object from the structured text. Finally, an extraction report is built from the information gathered during the process and gives an overview of the whole extraction.

Description

Text information automatic extraction method

Technical Field

The invention belongs to the field of computer technology and relates to an automatic text information extraction method, in particular to a method for automatically extracting target object parameter information from bidding text.

Background

As information technology becomes increasingly intelligent and automated, it has brought great convenience to daily life: text can be converted into images automatically, and images back into text. Some specific fields, however, require specific kinds of information, and the prior art has difficulty extracting such information in a targeted way; automatic extraction of target object parameter information from bidding text is one example.

The bidding document is the concentrated expression of the procurement requirements, and its quality directly determines whether the bidding succeeds or fails. Compiling standard bidding documents from past bidding documents can unify bidding practice, improve bidding quality, raise the level of management, increase the project success rate and shorten the time needed to prepare bidding documents. However, standard bidding documents are currently compiled by hand; the target object parameters and technical requirements in particular require many skilled professionals to spend a great deal of time and effort extracting information.

A technique for extracting such specific information is therefore urgently needed.

Disclosure of Invention

To solve the above technical problem, the invention provides an automatic text information extraction method that automatically extracts target object parameter information from bidding documents and replaces the current time-consuming and labor-intensive manual extraction.

The technical solution adopted by the invention is an automatic text information extraction method comprising the following steps:

step 1: carrying out batch preprocessing on the input text, and converting the input text into pdf format text;

step 2: carrying out structuring treatment on the pdf format text;

the specific implementation of the step 2 comprises the following sub-steps:

step 2.1: inputting a batch of pdf format text;

step 2.2: extracting bookmark information from the pdf format text using pypdf2, matching the bookmark names against a constructed regular expression to obtain the chapter bookmarks, and saving the name and page position of each matched bookmark;

the rule for extracting chapter bookmarks according to the bookmark name is: pattern= "(chapter|part)";

step 2.3: dividing the pdf format text based on the chapter bookmark information extracted in the step 2.2 to obtain each chapter text in the file respectively;

step 2.4: for each chapter text obtained in step 2.3, constructing a specific regular expression to divide the chapter text and obtain the name and position of each section in the chapter;

the regular expression for extracting sections is: pattern= ", section";

step 3: locating, identifying and extracting the relevant information of the specified target object;

step 4: building an extraction report from the intermediate information of the above steps and generating the extraction result.

Preferably, in step 1, the input text is batch-preprocessed using the WinAPI: the python win32 library is used to call Word's underlying VBA interface to convert Word format text into pdf format text.

Preferably, the specific implementation of step 2.3 comprises the following sub-steps:

step 2.3.1: constructing a regular expression from the bookmark names to locate each chapter, and cutting the original pdf file according to the page position information of each chapter;

step 2.3.2: recognizing the text of each extracted chapter using pdfplumber;

step 2.3.3: cleaning the text using natural language processing techniques, removing invalid interfering text such as spaces, line breaks, headers, footers, annotations and page numbers;

step 2.3.4: saving the cleaned text of each chapter to a txt file.

Preferably, the specific implementation of the step 3 comprises the following sub-steps:

step 3.1: constructing a regular expression to locate the chapter in which the relevant information is found; the regular expression for the chapter containing the technical parameter content is: pattern= "(technical |parameter|requirement),"; regular expressions for other kinds of information are constructed on the same principle as the technical parameter expression;

step 3.2: within the chapter obtained in step 3.1, using the section information obtained in step 2.4 to construct a rule that matches the specific technical parameter section; the regular expression matching the technical parameter section is: pattern= "((technical|parameter) | requirement) | (..technical|parameter),";

step 3.3: accurately locating the position of the specific parameter text within the section by regular expression matching; the regular expression for locating the specific parameter text is: pattern= "\W? D + \w [_4e00-_9fa5] (technical |parameter| requirement) [_4e00-_9fa5] (|: is? ";

step 3.4: starting from the position located in step 3.3, identifying parameters line by line and extracting the corresponding parameter types, parameter names and parameter values;

step 3.5: storing the parameter names and parameter values extracted in step 3.4, together with the target object type and the extraction source (file name), as key-value pairs in a python dictionary;

step 3.6: saving the parameters extracted from the batch of files to a json file to facilitate subsequent processing.

Preferably, in step 3.2, because a single bidding document usually contains parameter information for several bid packages, accurate identification and extraction requires a further division: if a bidding document contains parameter information for several bid packages, another rule is constructed to divide the technical parameter section extracted in step 3.2 by bid package, giving the technical parameter section content of each bid package; the regular expression for dividing the technical parameter section by bid package is: pattern= ", x (packet|term); the technical parameter section of the specified target object is then obtained by screening the divided per-package content.

Preferably, the technical parameter section belonging to the specified target object is obtained by screening as follows: whether the current technical parameter section belongs to the specified target object is determined by checking whether the beginning of each bid package's technical parameter section content contains the name of the specified target object.

Preferably, the specific implementation of step 3.4 comprises the following sub-steps:

step 3.4.1: constructing specific rules and examining the lines of text in turn; judging whether the current line is a primary title (such as "1. Parameter information"); if yes, the title indicates the parameter type, the current parameter type is set to the title text after cleaning, and processing of this line ends; if not, go to step 3.4.2;

the regular expression for judging whether the text is a primary title is as follows: pattern= "\W? D + \w [_4e00\u9fa 5] + (: |:)? ";

step 3.4.2: judging whether the line is a secondary title containing text in the format "parameter name: parameter value" (e.g., "1.1 capacity: greater than 12000 t/h"); if yes, the text is a formatted description of a target object parameter, the parameter name and parameter value are extracted with a regular expression and stored in a dictionary, together with the current parameter type, as one parameter item, and processing of this line ends; if not, go to step 3.4.3;

the regular expression for judging whether the text is a secondary title containing "parameter name: parameter value" is as follows: pattern= "\W? D+w+d+w [_4e00\u9fa5 ] + (|:) + ";

step 3.4.3: judging whether the line is a secondary title (such as "2.2 the valve components of the coal mining machine are fitted with filters"); if yes, the text is a free-text description of a parameter, the whole text is taken as the parameter name, the parameter value is set to empty, the current parameter item is stored, and processing of this line ends; if not, go to step 3.4.4;

the regular expression for judging whether the text is a secondary title is as follows: pattern= "\W? D+w+d+w [_4e00\u9fa5 ] + (|:) + ";

step 3.4.4: judging whether the line is a three-level title; if yes, the title numbering is removed from the line, the remaining text is appended to the parameter value of the previous parameter item, and processing of this line ends; if not, go to step 3.4.5;

the regular expression for judging whether the text is a three-level title is as follows: pattern= "\W? D+ (\w+ \d+) {2} \w ";

step 3.4.5: if none of the above conditions is met, the line is a continuation of the previous parameter item and is appended directly to the parameter value of the previous parameter item;

step 3.4.6: repeating steps 3.4.1 to 3.4.5 until the content of the bid package ends.
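The title-level cascade described in these sub-steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the four regular expressions are hypothetical English-style stand-ins (the published patterns target Chinese text and are garbled in this translation), and the function and field names are assumptions introduced for the example.

```python
import re

# Hypothetical stand-ins for the published patterns (assumptions for illustration).
PRIMARY = re.compile(r"^\d+\.\s+(?P<title>[^:：]+?)[:：]?$")            # "1. Parameter information"
SECONDARY_KV = re.compile(r"^\d+\.\d+\s+(?P<name>[^:：]+)[:：]\s*(?P<value>.+)$")  # "1.1 name: value"
SECONDARY = re.compile(r"^\d+\.\d+\s+(?P<text>.+)$")                    # "2.2 free text"
TERTIARY = re.compile(r"^\d+\.\d+\.\d+\s+(?P<text>.+)$")                # "2.2.1 detail"


def extract_parameters(lines):
    """Classify each line by title level and build a list of parameter items."""
    items, current_type = [], None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = PRIMARY.match(line)
        if m:  # primary title: sets the current parameter type
            current_type = m["title"].strip()
            continue
        m = SECONDARY_KV.match(line)
        if m:  # secondary title in "name: value" form
            items.append({"type": current_type, "name": m["name"].strip(),
                          "value": m["value"].strip()})
            continue
        m = SECONDARY.match(line)
        if m:  # secondary title with free text: name only, empty value
            # Numbering is stripped here; keeping the full line is equally possible.
            items.append({"type": current_type, "name": m["text"].strip(), "value": ""})
            continue
        m = TERTIARY.match(line)
        text = m["text"].strip() if m else line  # tertiary title or plain continuation
        if items:
            items[-1]["value"] = (items[-1]["value"] + " " + text).strip()
    return items


sample = [
    "1. Parameter information",
    "1.1 Capacity: greater than 12000 t/h",
    "2.2 Valve parts are provided with filters",
    "2.2.1 Filters are replaceable",
    "without dismantling the valve",
]
for item in extract_parameters(sample):
    print(item)
```

The checks run from the most specific heading form to the least specific one, so a line is consumed by the first rule that matches and anything left over is treated as a continuation of the previous item.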

Preferably, in step 4, the extraction report is generated by using the intermediate information in the above process, including the number of files, the total number of packets identified in the files, the number of packets of the specified object obtained by screening, whether the parameter content is successfully located, and the total number of extracted parameter items.

Compared with the current manual extraction, the method for automatically extracting target object parameter information from bidding documents uses natural language processing techniques to structure the bidding documents and to extract the target object parameter information automatically and efficiently. This saves a large amount of labor and material resources and provides a solid foundation for compiling standard bidding documents and for other data analysis.

Drawings

FIG. 1 is an overall flow chart of an embodiment of the present invention;

FIG. 2 is a text structure of an embodiment of the present invention;

FIG. 3 is a text structuring flow chart of an embodiment of the present invention;

FIG. 4 is a flowchart of object technical parameter positioning identification and extraction according to an embodiment of the present invention;

FIG. 5 is a flow chart of the parameter row-by-row extraction according to an embodiment of the present invention;

FIG. 6 is an extraction report according to an embodiment of the present invention.

Detailed Description

To help those of ordinary skill in the art understand and practice the invention, it is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.

This embodiment further explains the invention through the automatic extraction of target object parameter information from bidding text.

Bidding is an important part of project management in enterprises, and bidding documents have relatively standardized writing requirements and text content. If bidding documents are studied as a corpus to support functions such as management, application, feedback and iterative updating of standard bidding documents, the working efficiency of staff in the field can be improved significantly, bidding quality and risk control can be improved effectively, enterprise bidding management can move toward intelligent and electronic operation, and a gap in the application of computer technology in the bidding field can be filled.

To facilitate the compilation of standard bidding documents, the method focuses on structuring the bidding corpus and on a method for extracting the final target object parameter information. The invention achieves automatic extraction of target object parameter information using natural language processing techniques.

In this embodiment, FIG. 1 shows the flowchart of the method for automatically extracting target object parameter information from bidding text: the original bidding text is processed with natural language processing techniques and the corresponding technical parameter information is finally extracted. Automatic extraction requires a clear understanding of the characteristics of bidding documents, which have developed their own idiomatic style and writing conventions over a long period:

(1) The bidding text contains a large number of domain-specific idioms and technical terms and has strong domain characteristics.

(2) The structure of the bidding text is relatively fixed, and the idiomatic style and text structure are relatively uniform.

Based on these characteristics of bidding documents, an automatic extraction strategy for the technical parameter information of the target object is proposed and the extraction flow is determined.

In this embodiment, the standard structure of a bidding text is shown in FIG. 2. Based on this text structure, the method for automatically extracting target object parameter information from bidding text provided by the invention comprises the following steps:

step 1: batch-preprocessing the bidding text using the WinAPI so that all subsequent input files are in pdf format;

in this embodiment, most bidding documents are pdf files but some are still in Word format; to allow uniform processing afterwards, the python win32 library is used to call Word's underlying VBA interface and convert all Word files to pdf files in batch.
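The batch conversion in step 1 might look like the sketch below. It assumes a Windows machine with Microsoft Word and the pywin32 package installed; the function names and folder arguments are hypothetical and not taken from the patent. The value 17 is Word's `wdFormatPDF` save format.

```python
from pathlib import Path

WD_FORMAT_PDF = 17  # Word's WdSaveFormat value for PDF output


def pdf_target(doc_path: Path, out_dir: Path) -> Path:
    """Return the PDF path that a Word file would be converted to."""
    return out_dir / (doc_path.stem + ".pdf")


def convert_folder(in_dir: str, out_dir: str) -> list[Path]:
    """Convert every .doc/.docx file in in_dir to PDF (Windows + Word only)."""
    # Imported lazily so the helper above stays usable on other platforms.
    import win32com.client  # provided by the pywin32 package

    src, dst = Path(in_dir).resolve(), Path(out_dir).resolve()
    dst.mkdir(parents=True, exist_ok=True)
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    converted = []
    try:
        for doc_path in sorted(src.iterdir()):
            # Skip non-Word files and Word's "~$" lock files.
            if doc_path.suffix.lower() not in {".doc", ".docx"} or doc_path.name.startswith("~$"):
                continue
            target = pdf_target(doc_path, dst)
            doc = word.Documents.Open(str(doc_path))
            try:
                doc.SaveAs(str(target), FileFormat=WD_FORMAT_PDF)
            finally:
                doc.Close(False)  # close without saving changes to the source
            converted.append(target)
    finally:
        word.Quit()
    return converted
```

Files that are already pdf would simply be copied or left in place; only the Word inputs need this step.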

Step 2: structuring the bidding documents;

in this embodiment, FIG. 3 shows the text structuring flowchart. The text is divided according to the characteristics of its structure using specific rules:

step 2.1: inputting a batch of pdf files;

step 2.2: extracting bookmark information from the pdf file using pypdf2, matching the bookmark names against a constructed regular expression to obtain the chapter bookmarks, and saving the name and page position of each matched bookmark;

a regular expression is a logical formula for operating on character strings: a "rule string" is built from predefined special characters and combinations of them, and that rule string expresses a filtering logic for strings.

A regular expression can be used to determine whether a given string satisfies its filtering logic (known as "matching"), or to extract a desired part from the string. Using these two capabilities of regular expressions together with other natural language processing techniques, target object parameter information can be extracted from bidding text automatically.

In this embodiment, the rule for extracting the chapter bookmarks according to the bookmark names is as follows:

pattern= "(chapter|part)";
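As a rough illustration of step 2.2, the sketch below filters a bookmark list down to chapter-level entries and derives the page range each chapter covers. It assumes the bookmarks have already been read from the pdf outline (for example with pypdf2) into `(title, start_page)` tuples; the pattern is a placeholder adapted to English titles, and all names are hypothetical.

```python
import re

# Placeholder pattern: the published rule was written for Chinese chapter
# titles, so adapt it to the bookmark names of the real documents.
CHAPTER_PATTERN = re.compile(r"(chapter|part)", re.IGNORECASE)


def chapter_ranges(bookmarks, total_pages, pattern=CHAPTER_PATTERN):
    """Keep chapter-level bookmarks and attach the page range of each one.

    bookmarks: list of (title, zero_based_start_page) in document order.
    Returns dicts with name, start (inclusive) and end (exclusive) pages.
    """
    chapters = [(title, page) for title, page in bookmarks if pattern.search(title)]
    ranges = []
    for i, (title, start) in enumerate(chapters):
        # A chapter ends where the next chapter starts, or at the end of the file.
        end = chapters[i + 1][1] if i + 1 < len(chapters) else total_pages
        ranges.append({"name": title, "start": start, "end": end})
    return ranges


bookmarks = [
    ("Chapter 1 Invitation to bid", 0),
    ("1.1 Overview", 1),
    ("Chapter 5 Technical requirements for goods", 40),
    ("Appendix", 90),
]
for chapter in chapter_ranges(bookmarks, total_pages=100):
    print(chapter)
```

The saved name and page range are what the later chapter-splitting step needs in order to cut the pdf.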

step 2.3: dividing the pdf file according to the chapter bookmark information extracted in step 2.2 to obtain the content of each chapter in the file;

step 2.3.2: identifying the intercepted chapter text by using a pdfplumber;

step 2.3.4: and storing the cleaned text of each chapter to a txt file.

Step 2.4: for the text of each chapter processed in step 2.3, constructing a specific regular expression to divide the chapter content and obtain the name and position of each section in the chapter;

the regular expression for extracting sections is as follows:

pattern= ", section" No.) "

In step 2.3 of this embodiment, the text is divided into chapters using the pdf bookmark information, each chapter is recognized using the pdfplumber library in python, and the text is finally cleaned using natural language processing techniques to obtain relatively standardized and clean chapter text.
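The cleaning and section-splitting steps could be approximated as below. The sketch assumes each page's text is already available as a string (for instance from pdfplumber's `page.extract_text()`); the header/footer heuristic, the page-number rule and the section heading pattern are all assumptions made for illustration, not the rules used in the patent.

```python
import re
from collections import Counter

PAGE_NUMBER = re.compile(r"^-?\s*\d+\s*-?$")  # e.g. "12" or "- 12 -"; also drops bare numeric lines
SECTION_HEADING = re.compile(r"^Section\s+\d+\b", re.IGNORECASE)  # hypothetical rule


def clean_pages(pages):
    """Drop blank lines, page numbers and repeated headers/footers."""
    page_lines = [[ln.strip() for ln in p.splitlines() if ln.strip()] for p in pages]
    # A line present on more than half of the pages is treated as a header or footer.
    counts = Counter(ln for lines in page_lines for ln in set(lines))
    repeated = {ln for ln, n in counts.items() if len(pages) > 1 and n > len(pages) / 2}
    cleaned = []
    for lines in page_lines:
        for ln in lines:
            if ln in repeated or PAGE_NUMBER.match(ln):
                continue
            cleaned.append(re.sub(r"\s+", " ", ln))  # collapse runs of whitespace
    return cleaned


def split_sections(lines):
    """Return (section_name, line_index) for every section heading."""
    return [(ln, i) for i, ln in enumerate(lines) if SECTION_HEADING.match(ln)]


pages = [
    "Bid document 2020\nSection 1 Technical requirements\n1.1 Capacity:   12000 t/h\n- 1 -",
    "Bid document 2020\n1.2 Voltage: 3300 V\nSection 2 Drawings\n- 2 -",
]
cleaned = clean_pages(pages)
print(cleaned)
print(split_sections(cleaned))
```

The section positions returned here play the role of the section names and positions saved in step 2.4.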

Step 3: constructing corresponding rules with regular expressions to locate, identify and extract the parameter information of the specified target object;

in this embodiment, FIG. 4 shows the flowchart for locating, identifying and extracting the technical parameters of the target object. Rules are constructed from the line-level characteristics of the target object parameter descriptions and from the structural information obtained earlier in order to locate the parameter content of the specified target object; further rules built on the line format of the parameter text are then used for accurate extraction. The overall locating, identification and extraction steps are as follows:

step 3.1: in this embodiment, as shown in FIG. 2, the chapter containing the technical parameters is named "the fifth chapter cargo technical requirement", and the regular expression for locating the chapter containing the technical parameter content is:

pattern= ", technical |parameter| requirement".

Step 3.2: within the chapter obtained in step 3.1, a rule is constructed from the section information obtained in step 2.4 to match the specific technical parameter section; in this embodiment, as shown in FIG. 2, the section containing the technical parameters is named "first section technical requirement".

The regular expression matching the technical parameter section is as follows:

pattern= "((technical|parameter.)). Requirements) | (technologic parameter)' the following:

step 3.3: because a bidding document usually contains parameter information for several bid packages, a further rule is constructed to divide the content by bid package so that identification and extraction are accurate, giving the parameter content of each bid package; in this embodiment, as shown in FIG. 2, the bid packages are named, for example, "first package coal mining machine" and "second package scraper conveyor";

the regular expression matching each bid package is as follows:

pattern= ", fifth (packet|term)," v "

Step 3.4: from the bid packages divided in the previous step, the parameter content of the bid package to which the specified target object belongs is selected.

In this embodiment, the bid packages are screened as follows: the name of the target object to which a bid package belongs, such as "first package coal mining machine" or "technical parameters of the coal mining machine", is usually mentioned at the beginning of the bid package's text or in the first few introductory lines, so whether the current bid package belongs to the specified target object is determined by checking whether the beginning of the bid package's text contains the name of the specified target object. In this embodiment, the first ten lines of each bid package's text are checked for the name of the specified target object.
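Dividing the technical parameter section into bid packages and screening them by the target object name can be sketched as follows. The heading pattern and function names are hypothetical; only the idea of checking the first ten lines for the target object name comes from the text above.

```python
import re

# Hypothetical heading rule, e.g. "Package 1 Coal mining machine".
PACKAGE_HEADING = re.compile(r"^(?:Package|Item)\s+\d+\b", re.IGNORECASE)


def split_packages(lines):
    """Split section lines into one list of lines per bid package."""
    packages, current = [], None
    for line in lines:
        if PACKAGE_HEADING.match(line):
            current = [line]          # a heading opens a new package
            packages.append(current)
        elif current is not None:
            current.append(line)      # body line of the open package
    return packages


def packages_for_target(packages, target_name, head_lines=10):
    """Keep packages whose first head_lines lines mention the target name."""
    wanted = target_name.lower()
    return [p for p in packages if any(wanted in ln.lower() for ln in p[:head_lines])]


section = [
    "Package 1 Coal mining machine",
    "1. Technical parameters of the coal mining machine",
    "1.1 Capacity: greater than 12000 t/h",
    "Package 2 Scraper conveyor",
    "1. Technical parameters of the scraper conveyor",
]
packages = split_packages(section)
selected = packages_for_target(packages, "coal mining machine")
print(len(packages), len(selected))
```

Limiting the check to the opening lines avoids selecting a package merely because it mentions another package's equipment somewhere in its body.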

Step 3.5: accurately locating the starting position of the technical parameter content by regular expression matching; because the technical parameter content has a fixed structure, and even a specific position, in each bid package's text, the parameter content can be located accurately from its characteristic wording.

The regular expression for locating the starting position of the parameter content is as follows:

pattern= "\W? D + \w [_4e00-_9fa5] (technical |parameter| requirement) [_4e00-_9fa5] (|: is? "

Step 3.6: starting from the position located in step 3.5, identifying parameters line by line and extracting the corresponding parameter types, parameter names and parameter values;

in this embodiment, FIG. 5 shows the algorithm flowchart for line-by-line parameter identification and extraction. The parameter type is determined from the title structure of the parameters, and the parameter name and parameter value are extracted accurately from the line-level characteristics of the parameter description. The line-by-line identification and extraction follows the title-level sub-steps set out in the Disclosure of Invention above.

Step 3.7: storing the parameter names and parameter values extracted in step 3.6, together with the target object type and the extraction source (file name), as key-value pairs in a python dictionary;
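One possible shape for the stored record, and for the json file into which the batch results are later saved, is sketched below. The key names (`source`, `target`, `parameters`) and function names are assumptions chosen for the example; the patent only specifies that names, values, target object type and source file name are kept as key-value pairs.

```python
import json


def build_record(source_file, target_type, items):
    """Bundle the extracted parameter items with their provenance."""
    return {
        "source": source_file,    # file the parameters were extracted from
        "target": target_type,    # target object type, e.g. the equipment name
        "parameters": items,      # list of {"type", "name", "value"} dicts
    }


def records_to_json(records):
    """Serialize records; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def save_records(records, path):
    """Write the records of a whole batch to one json file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(records_to_json(records))


record = build_record(
    "tender_01.pdf",
    "coal mining machine",
    [{"type": "Parameter information", "name": "Capacity", "value": "greater than 12000 t/h"}],
)
print(records_to_json([record]))
```

Keeping the parameters as a list rather than a name-to-value mapping avoids losing items when two parameters share the same name.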

Step 4: building an extraction report from the intermediate information of the above steps and generating the extraction result;

in this embodiment, as shown in FIG. 6, the parameter extraction report is generated mainly from the intermediate information of the above process, such as the number of files, the total number of bid packages identified in those files, the number of bid packages of the specified target object obtained by screening, whether the parameter content was located successfully, and the total number of extracted parameter items, so as to summarize the whole extraction process.
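A report of this kind can be aggregated from per-file results, as in the following sketch. The `FileResult` fields and the report keys are hypothetical names that mirror the quantities listed above; the actual layout of the report in FIG. 6 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class FileResult:
    """Intermediate information collected while processing one file."""
    file_name: str
    packages_found: int     # bid packages identified in the file
    target_packages: int    # packages belonging to the specified target object
    located: bool           # whether the parameter content was located
    parameter_items: int    # number of extracted parameter items


def build_report(results):
    """Aggregate per-file results into one summary dictionary."""
    return {
        "files": len(results),
        "packages_found": sum(r.packages_found for r in results),
        "target_packages": sum(r.target_packages for r in results),
        "files_located": sum(1 for r in results if r.located),
        "files_not_located": [r.file_name for r in results if not r.located],
        "parameter_items": sum(r.parameter_items for r in results),
    }


results = [
    FileResult("tender_01.pdf", 3, 1, True, 42),
    FileResult("tender_02.pdf", 2, 0, False, 0),
]
print(build_report(results))
```

Listing the files where locating failed makes it easy to review those documents by hand.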

The method is suitable not only for extracting the corresponding parameters from bidding documents but also for other scenarios in which specific information needs to be extracted from text.

It should be understood that the foregoing detailed description of the preferred embodiments is not to be regarded as limiting the scope of patent protection of the invention. Those skilled in the art may make substitutions or modifications without departing from the scope protected by the claims of the invention, and all such substitutions and modifications fall within the scope of protection of the invention, which is defined by the appended claims.

Claims

1. The automatic text information extraction method is characterized by comprising the following steps of:

step 2: carrying out structuring treatment on the pdf format text;

the specific implementation of the step 2 comprises the following sub-steps:

step 2.1: inputting a batch of pdf format text;

the regular expression for extracting sections is: pattern= ", section";

the specific implementation of the step 3 comprises the following sub-steps:

step 3.5: storing the parameter names and parameter values extracted in step 3.4, together with the target object type and the source file name, as key-value pairs in a python dictionary;

step 3.6: the parameters extracted from the batch files are stored in json files;

2. The automatic text information extraction method according to claim 1, wherein: in step 1, the input text is batch-preprocessed using the WinAPI, and the python win32 library is used to call Word's underlying VBA interface to convert Word format text into pdf format text.

3. The method for automatically extracting text information according to claim 1, wherein the specific implementation of step 2.3 comprises the following sub-steps:

step 2.3.2: recognizing the text of each extracted chapter using pdfplumber;

step 2.3.3: cleaning the text using natural language processing techniques to remove invalid interfering text;

step 2.3.4: saving the cleaned text of each chapter to a txt file.

4. The automatic text information extraction method according to claim 1, wherein: in step 3.2, if one bidding document contains parameter information for a plurality of bid packages, a further rule is constructed to divide the technical parameter section extracted in step 3.2 by bid package, obtaining the technical parameter section content of each bid package; the regular expression for dividing the technical parameter section by bid package is: pattern= ", x (packet|term); and the technical parameter section of the specified target object is obtained by screening the divided per-package content.

5. The automatic text information extraction method according to claim 4, wherein: the technical parameter section to which the specified target object belongs is obtained by screening, and whether the current technical parameter section belongs to the specified target object is determined by checking whether the beginning of each bid package's technical parameter section content contains the name of the specified target object.

6. The method for automatically extracting text information according to claim 1, wherein the specific implementation of step 3.4 comprises the following sub-steps:

step 3.4.1: constructing specific rules and examining the lines of text in turn; judging whether the current line is a primary title; if yes, the title indicates the parameter type, the current parameter type is set to the title text after cleaning, and processing of this line ends; if not, go to step 3.4.2;

step 3.4.2: judging whether the line is a secondary title containing text in the format "parameter name: parameter value"; if yes, the text is a formatted description of a target object parameter, the parameter name and parameter value are extracted with a regular expression and stored in a dictionary, together with the current parameter type, as one parameter item, and processing of this line ends; if not, go to step 3.4.3;

step 3.4.3: judging whether the line is a secondary title; if yes, the text is a free-text description of a parameter, the whole text is taken as the parameter name, the parameter value is set to empty, the current parameter item is stored, and processing of this line ends; if not, go to step 3.4.4;

step 3.4.4: judging whether the line is a three-level title; if yes, the title numbering is removed from the line, the remaining text is appended to the parameter value of the previous parameter item, and processing of this line ends; if not, go to step 3.4.5;

step 3.4.5: if none of the above conditions is met, the line is a continuation of the previous parameter item and is appended directly to the parameter value of the previous parameter item;

step 3.4.6: repeating steps 3.4.1 to 3.4.5 until the text content of the bid package ends.

7. The automatic text information extraction method according to any one of claims 1 to 6, characterized in that: in step 4, the intermediate information in the above process is used to generate an extraction report, including the number of files, the total number of packets identified in the files, the number of packets of the specified object obtained by screening, whether the parameter content is successfully located, and the total number of extracted parameter items.