CN112597353B - Text information automatic extraction method - Google Patents


Info

Publication number
CN112597353B
Authority
CN
China
Prior art keywords
text
parameter
information
chapter
technical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011507003.6A
Other languages
Chinese (zh)
Other versions
CN112597353A (en)
Inventor
刘金硕
王晨阳
邓娟
黄朔
刘宁
唐浩洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN202011507003.6A
Publication of CN112597353A
Application granted
Publication of CN112597353B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/116 Details of conversion of file system types or formats
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9038 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic text information extraction method. At present, the parameter information of the target object in bidding documents is extracted manually, which consumes a great deal of labor and time. The invention uses natural language processing techniques to extract the parameter information of bidding texts automatically, and designs three stages: bidding text structuring, target object parameter extraction, and extraction report generation. Bidding text structuring extracts bookmark information with PyPDF2, recognizes the PDF bidding text with pdfplumber, cleans the text with regular expressions, and then parses the text into a structure by rule matching. Target object parameter extraction uses regular expressions to accurately identify and extract the technical parameter information of the target object from the structured text. Finally, an extraction report is built from the information gathered during the process and gives a direct overview of the whole extraction.

Description

Text information automatic extraction method
Technical Field
The invention belongs to the field of computer technology and relates to an automatic text information extraction method, in particular to a method for automatically extracting target object parameter information from bidding texts.
Background
With the continuing development of intelligent and automated information technology, daily life has become considerably more convenient: text can be converted automatically into images, and images back into text. However, some fields require highly specific information, and existing techniques struggle to extract it in a targeted way. Automatic extraction of target object parameter information from bidding texts is one such case.
The bidding document is the concentrated embodiment of procurement requirements, and its quality directly determines whether the bidding succeeds or fails. Compiling standard bidding documents from past bidding documents can unify bidding practice, improve bidding quality and management, raise the project success rate, and shorten the time needed to compile bidding documents. However, existing standard bidding documents are compiled manually, especially the target object parameters and technical requirements, which requires many skilled professionals to spend a great deal of time and effort extracting information.
A technique for extracting such specific information is therefore urgently needed.
Disclosure of Invention
To solve the above technical problem, the invention provides an automatic text information extraction method that automates the extraction of target object parameter information from bidding documents and replaces the current time-consuming and labor-intensive manual approach.
The technical scheme adopted by the invention is an automatic text information extraction method, characterized by comprising the following steps:
Step 1: batch-preprocess the input texts and convert them into PDF-format texts;
Step 2: structure the PDF-format texts;
Step 2 is implemented through the following sub-steps:
Step 2.1: input a batch of PDF-format texts;
Step 2.2: extract the bookmark information of each PDF-format text with PyPDF2, match the bookmark names against a constructed regular expression to obtain the chapter bookmarks, and save the name and page position of each matched bookmark;
the rule for extracting chapter bookmarks by bookmark name is: pattern= "(chapter|part)";
Step 2.3: split the PDF-format text using the chapter bookmark information extracted in step 2.2 to obtain the text of each chapter in the file;
Step 2.4: for each chapter text obtained in step 2.3, construct a specific regular expression to split the chapter text, obtaining the name and position of each section in the chapter;
the regular expression for extracting sections is: pattern= ", section";
Step 3: locate, identify, and extract the relevant information of the specified target object;
Step 4: build an extraction report from the intermediate information of the above steps and generate the extraction result.
Preferably, in step 1, the input texts are batch-preprocessed through the Windows API: the Python win32 library is used to call Word's underlying VBA interface and convert Word-format texts into PDF-format texts.
Preferably, step 2.3 is implemented through the following sub-steps:
Step 2.3.1: construct a regular expression from the bookmark names, locate each chapter, and cut the original PDF file using the page position information of each chapter;
Step 2.3.2: recognize the text of each cut chapter with pdfplumber;
Step 2.3.3: clean the text with natural language processing techniques, removing invalid interfering content such as spaces, line breaks, headers, footers, annotations, and page numbers;
Step 2.3.4: save the cleaned text of each chapter to a txt file.
Preferably, step 3 is implemented through the following sub-steps:
Step 3.1: construct a regular expression to locate the chapter containing the relevant information; the regular expression for the chapter containing the technical parameter content is: pattern= "(technical|parameter|requirement),"; regular expressions for other kinds of information are constructed on the same principle as the technical parameter expression;
Step 3.2: within the chapter obtained in step 3.1, use the section information obtained in step 2.4 to construct a rule that matches the specific technical parameter section; the regular expression for matching the technical parameter section is: pattern= "((technical|parameter) | requirement) | (..technical|parameter),";
Step 3.3: use regular expression matching to locate the exact position of the parameter text within the section; the regular expression for locating the parameter text is: pattern= "\W?\d+\W[\u4e00-\u9fa5](technical|parameter|requirement)[\u4e00-\u9fa5](：|:)?";
Step 3.4: starting from the content located in step 3.3, identify parameters line by line and extract the corresponding parameter types, parameter names, and parameter values;
Step 3.5: store the parameter names and parameter values extracted in step 3.4, together with the target object type and the extraction source (file name), as key-value pairs in a Python dictionary;
Step 3.6: save the parameters extracted from the batch of files to a json file for convenient subsequent processing.
Preferably, in step 3.2, because a single bidding document usually contains parameter information for several bid packages, accurate identification and extraction requires a further split: if the document contains parameter information for multiple bid packages, another rule is constructed to divide the technical parameter section extracted in step 3.2 into the technical parameter content of each bid package. The regular expression for dividing the technical parameter section by bid package is: pattern= ", x (packet|term); the technical parameter section of the specified target object is then selected from the divided bid packages.
Preferably, the screening decides whether a technical parameter section belongs to the specified target object by checking whether the beginning of each bid package's technical parameter content contains the name of that target object.
Preferably, step 3.4 is implemented through the following sub-steps:
Step 3.4.1: construct a specific rule and judge whether the current line is a first-level heading (such as "1. Parameter information"); if so, the heading indicates the parameter type, the current parameter type is set to the cleaned heading text, and processing of the line ends; if not, go to step 3.4.2;
the regular expression for judging whether a line is a first-level heading is: pattern= "\W?\d+\W[\u4e00-\u9fa5]+(：|:)?";
Step 3.4.2: judge whether the line is a second-level heading in the format "parameter name: parameter value" (e.g., "1.1 capacity: greater than 12000 t/h"); if so, the line is a formatted description of a target object parameter, the parameter name and parameter value are extracted with a regular expression and stored in a dictionary, together with the current parameter type, as one parameter item, and processing of the line ends; if not, go to step 3.4.3;
the regular expression for judging whether a line is a second-level heading containing "parameter name: parameter value" is: pattern= "\W?\d+\W+\d+\W[\u4e00-\u9fa5]+(：|:)+";
Step 3.4.3: judge whether the line is a second-level heading (such as "2.2 shearer valve components are fitted with filters"); if so, the line is a free-text description of a parameter, the whole text is taken as the parameter name, the parameter value is left empty, the parameter item is stored, and processing of the line ends; if not, go to step 3.4.4;
the regular expression for judging whether a line is a second-level heading is: pattern= "\W?\d+\W+\d+\W[\u4e00-\u9fa5]+(：|:)+";
Step 3.4.4: judge whether the line is a third-level heading; if so, strip the heading numbering, append the remaining text to the parameter value of the previous parameter item, and end processing of the line; if not, go to step 3.4.5;
the regular expression for judging whether a line is a third-level heading is: pattern= "\W?\d+(\W+\d+){2}\W";
Step 3.4.5: if none of the above conditions is met, the line continues the previous parameter item and is appended directly to that item's parameter value;
Step 3.4.6: repeat steps 3.4.1 to 3.4.5 until the content of the bid package ends.
Preferably, in step 4, the extraction report is generated from the intermediate information of the above process, including the number of files, the total number of bid packages identified in the files, the number of bid packages belonging to the specified target object after screening, whether the parameter content was located successfully, and the total number of extracted parameter items.
Compared with the current manual approach, the method structures bidding documents with natural language processing techniques and extracts target object parameter information automatically and efficiently. It saves considerable manpower and material resources and provides a solid basis for compiling standard bidding documents and for other data analysis.
Drawings
FIG. 1 is an overall flow chart of an embodiment of the present invention;
FIG. 2 shows the text structure of a bidding document in an embodiment of the present invention;
FIG. 3 is a text structuring flow chart of an embodiment of the present invention;
FIG. 4 is a flow chart of locating, identifying, and extracting the technical parameters of the target object according to an embodiment of the present invention;
FIG. 5 is a flow chart of line-by-line parameter extraction according to an embodiment of the present invention;
FIG. 6 shows an extraction report according to an embodiment of the present invention.
Detailed Description
To help those of ordinary skill in the art understand and practice the invention, it is described in further detail below with reference to the drawings and an embodiment. The embodiment described here serves only to illustrate and explain the invention and is not intended to limit it.
The embodiment explains the invention through the automatic extraction of target object parameter information from bidding texts.
Bidding is an important part of enterprise project management, and bidding documents follow relatively standardized writing requirements and content. Studying bidding documents as a corpus, and building functions such as management, application, feedback, and iterative updating of standard bidding documents, can therefore markedly improve the efficiency of staff in the field, improve bidding quality and risk control, push enterprise bidding management toward intelligent and electronic operation, and fill a gap in the application of computer technology to bidding.
To support the compilation of standard bidding documents, the invention focuses on structuring the bidding corpus and on a method for extracting the final target object parameter information, and achieves automatic extraction of that information with natural language processing techniques.
In this embodiment, FIG. 1 shows the flow of the automatic extraction method for target object parameter information in bidding texts: the original bidding text is processed with natural language processing techniques, and the corresponding technical parameter information is finally extracted. Automatic extraction depends on understanding the characteristics of bidding documents, which have developed a distinctive wording style and writing standard over time:
(1) Bidding texts contain many domain-specific expressions and technical terms and have strong domain characteristics.
(2) The structure of bidding texts is relatively fixed, and both the wording style and the text structure are fairly uniform.
Based on these characteristics, an automatic extraction strategy for the technical parameter information of the target object is proposed and the extraction flow is determined.
In this embodiment, the standard structure of a bidding text is shown in FIG. 2. Based on this text structure, the automatic extraction method for target object parameter information in bidding texts comprises the following steps:
Step 1: batch-preprocess the bidding texts through the Windows API so that all subsequent input files are in PDF format;
In this embodiment, most bidding documents are PDF files, but some are still in Word format. To allow uniform processing later, the Python win32 library is used to call Word's underlying VBA interface and convert all Word files to PDF files in a batch.
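The batch conversion in step 1 could be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes Windows with Microsoft Word and the `pywin32` package installed, and the helper names `plan_conversions` and `convert_batch` are hypothetical.

```python
from pathlib import Path

WD_FORMAT_PDF = 17  # value of Word's wdFormatPDF constant


def plan_conversions(paths):
    """Return (source, target) pairs for every Word file in `paths`.

    Files that are already PDFs (or anything else) are skipped.
    """
    jobs = []
    for p in map(Path, paths):
        if p.suffix.lower() in {".doc", ".docx"}:
            jobs.append((p, p.with_suffix(".pdf")))
    return jobs


def convert_batch(paths):
    """Convert Word files to PDF through Word's COM automation (Windows only)."""
    # Imported lazily so that plan_conversions() stays usable on any platform.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        for src, dst in plan_conversions(paths):
            doc = word.Documents.Open(str(src.resolve()))
            try:
                doc.SaveAs(str(dst.resolve()), FileFormat=WD_FORMAT_PDF)
            finally:
                doc.Close()
    finally:
        word.Quit()
```

Separating the planning step from the COM calls keeps the file selection testable without Word being present.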
Step 2: structure the bidding documents;
In this embodiment, FIG. 3 shows the text structuring flow. Based on the characteristics of the text structure, the text is divided using specific rules:
Step 2.1: input a batch of PDF files;
Step 2.2: extract the bookmark information of each PDF file with PyPDF2, match the bookmark names against a constructed regular expression to obtain the chapter bookmarks, and save the name and page position of each matched bookmark;
A regular expression (also called a rule expression) is a logical formula for operating on strings: predefined special characters and their combinations form a "rule string" that expresses filtering logic for strings.
A regular expression can be used to test whether a given string satisfies its filtering logic (known as "matching"), or to obtain a desired part of the string. Using these two capabilities of regular expressions, together with other natural language processing techniques, automatic extraction of target object parameter information from bidding texts can be achieved.
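The two uses of a regular expression described above, testing for a match and pulling out a specific part, can be illustrated with Python's `re` module. The rule below is a hypothetical English example written for this sketch, not one of the patent's patterns.

```python
import re

# Hypothetical rule: a numbered line of the form "<number> <name>: <value>".
RULE = re.compile(r"^(\d+(?:\.\d+)*)\s+([^:]+):\s*(.+)$")


def matches(line):
    """Use 1: does the line satisfy the filtering logic of the rule?"""
    return RULE.search(line) is not None


def extract(line):
    """Use 2: obtain the desired parts (name, value) from a matching line."""
    m = RULE.search(line)
    if m is None:
        return None
    return m.group(2).strip(), m.group(3).strip()
```

For example, `extract("1.1 capacity: greater than 12000 t/h")` yields the name `capacity` and the value `greater than 12000 t/h`, while a line without numbering and colon does not match.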
In this embodiment, the rule for extracting chapter bookmarks by bookmark name is:
pattern= "(chapter|part)";
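A minimal sketch of the chapter bookmark filter in step 2.2. It assumes the bookmarks have already been read from the PDF outline (for instance with PyPDF2) into `(title, page index)` pairs; the English pattern and the sample titles are stand-ins chosen for illustration, since the original rule operates on Chinese bookmark names.

```python
import re

# English stand-in for the chapter rule; an assumption for this sketch.
CHAPTER_RULE = re.compile(r"\b(chapter|part)\b", re.IGNORECASE)


def chapter_bookmarks(bookmarks):
    """Keep the (title, page) pairs whose title matches the chapter rule."""
    return [(title, page) for title, page in bookmarks if CHAPTER_RULE.search(title)]


# Hypothetical outline of one bidding document (zero-based page indices).
sample = [
    ("Cover", 0),
    ("Chapter 1 Invitation to Bid", 2),
    ("1.1 Project overview", 2),
    ("Chapter 5 Technical Requirements for Goods", 40),
    ("Part 2 Attachments", 85),
]
```

Only the chapter-level entries survive the filter; lower-level bookmarks such as "1.1 Project overview" are dropped, and the saved page indices are what later steps use to cut the file.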
Step 2.3: split the PDF file using the chapter bookmark information extracted in step 2.2 to obtain the content of each chapter in the file;
Step 2.3.1: construct a regular expression from the bookmark names, locate each chapter, and cut the original PDF file using the page position information of each chapter;
Step 2.3.2: recognize the text of each cut chapter with pdfplumber;
Step 2.3.3: clean the text with natural language processing techniques, removing invalid interfering content such as spaces, line breaks, headers, footers, annotations, and page numbers;
Step 2.3.4: save the cleaned text of each chapter to a txt file.
Step 2.4: for each chapter text processed in step 2.3, construct a specific regular expression to split the chapter content, obtaining the name and position of each section in the chapter;
the regular expression for extracting sections is:
pattern= ", section"
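Splitting a chapter into sections by a heading rule, as in step 2.4, can be sketched as follows. The heading pattern is an English stand-in assumed for illustration, and `split_sections` is a hypothetical name.

```python
import re

# English stand-in for the section rule: a line starting with "Section <n>".
SECTION_RULE = re.compile(r"^Section\s+\d+\b.*$", re.IGNORECASE | re.MULTILINE)


def split_sections(chapter_text):
    """Return (section name, start offset, section text) for each section."""
    heads = list(SECTION_RULE.finditer(chapter_text))
    sections = []
    for i, m in enumerate(heads):
        # A section runs from its heading to the next heading (or chapter end).
        end = heads[i + 1].start() if i + 1 < len(heads) else len(chapter_text)
        sections.append((m.group(0).strip(), m.start(), chapter_text[m.start():end].strip()))
    return sections
```

The start offsets play the role of the section "positions" mentioned in the step, so later rules can search within a single section instead of the whole chapter.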
In step 2.3 of this embodiment, the text is divided into chapters using the PDF bookmark information, each chapter is recognized with the pdfplumber library in Python, and the text is finally cleaned with natural language processing techniques, yielding relatively standard and clean chapter text.
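How chapter bookmarks translate into page ranges for cutting the file can be sketched like this. It assumes zero-based page indices sorted in ascending order; `chapter_page_ranges` is a hypothetical helper, and assigning a shared boundary page to the later chapter is a simplification made for this sketch.

```python
def chapter_page_ranges(chapters, total_pages):
    """Turn [(title, first_page), ...] into [(title, first_page, last_page), ...].

    Page indices are zero-based and inclusive. Each chapter ends on the page
    before the next chapter starts; the last chapter runs to the end of the file.
    """
    ranges = []
    for i, (title, first) in enumerate(chapters):
        next_first = chapters[i + 1][1] if i + 1 < len(chapters) else total_pages
        last = max(first, next_first - 1)  # never end before the chapter starts
        ranges.append((title, first, last))
    return ranges
```

With pdfplumber, the text of each range could then be read page by page via `pdf.pages[i].extract_text()` on a file opened with `pdfplumber.open(path)`.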
Step 3: construct the corresponding rules with regular expressions, then locate, identify, and extract the parameter information of the specified target object;
In this embodiment, FIG. 4 shows the flow for locating, identifying, and extracting the technical parameters of the target object. The parameter content of the specified target object is located by rules built from the line-level features of the parameter descriptions and the structural information obtained earlier; rules built from the line format of the parameter text are then used for accurate extraction. The locating, identification, and extraction steps are as follows:
Step 3.1: in this embodiment, as shown in FIG. 2, the chapter containing the technical parameters is named "Chapter 5 Technical Requirements for Goods", and the regular expression for locating the chapter containing the technical parameter content is:
pattern= "(technical|parameter|requirement),"
Step 3.2: within the chapter obtained in step 3.1, use the section information obtained in step 2.4 to construct a rule that matches the specific technical parameter section. In this embodiment, as shown in FIG. 2, the section containing the technical parameters is named "Section 1 Technical Requirements".
The regular expression for matching the technical parameter section is:
pattern= "((technical|parameter) | requirement) | (..technical|parameter),"
Step 3.3: because a single bidding document usually contains parameter information for several bid packages, another rule is constructed to divide the content by bid package, yielding the parameter content of each bid package, so that identification and extraction remain accurate. In this embodiment, as shown in FIG. 2, the bid packages are named, for example, "first package shearer" and "second package scraper conveyor";
the regular expression for matching each bid package is:
pattern= ", fifth (packet|term)," v "
Step 3.4: from the bid packages divided in the previous step, select the parameter content of the bid package to which the specified target object belongs.
In this embodiment, bid packages are screened as follows. The name of the target object to which a bid package belongs, such as "first package shearer" or "technical parameters of the shearer", is usually mentioned at the beginning of the package text or in its first few introductory lines. Whether a bid package belongs to the specified target object is therefore decided by checking whether the beginning of its text contains the name of that target object. In this embodiment, the first ten lines of each bid package's text are checked for the name of the specified target object.
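The screening rule in step 3.4 can be sketched as below. The ten-line window comes from the embodiment; ignoring blank lines and comparing case-insensitively are assumptions of this sketch, and the function names are hypothetical.

```python
def belongs_to_target(package_text, target_name, head_lines=10):
    """Check whether the target object's name appears in the first
    `head_lines` non-blank lines of a bid package's text."""
    head = [ln for ln in package_text.splitlines() if ln.strip()][:head_lines]
    needle = target_name.lower()
    return any(needle in ln.lower() for ln in head)


def screen_packages(packages, target_name):
    """Keep only the bid packages that belong to the specified target object."""
    return [p for p in packages if belongs_to_target(p, target_name)]
```

Restricting the check to the head of the package avoids false positives when another package merely mentions the target object somewhere in its body.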
Step 3.5: use regular expression matching to locate the exact starting position of the technical parameter content. Because the technical parameter content has a fixed structure, and often a specific position, in each bid package's text, it can be located precisely through its characteristic wording.
The regular expression for locating the starting position of the parameter content is:
pattern= "\W?\d+\W[\u4e00-\u9fa5](technical|parameter|requirement)[\u4e00-\u9fa5](：|:)?"
Step 3.6: starting from the content located in step 3.5, identify parameters line by line and extract the corresponding parameter types, parameter names, and parameter values;
In this embodiment, FIG. 5 shows the algorithm flow for line-by-line parameter identification and extraction. The heading structure determines the parameter type, and the line-level features of the parameter descriptions are used to extract parameter names and values accurately. The steps are as follows:
Step 3.6.1: construct a specific rule and judge whether the current line is a first-level heading (such as "1. Parameter information"); if so, the heading indicates the parameter type, the current parameter type is set to the cleaned heading text, and processing of the line ends; if not, go to step 3.6.2;
the regular expression for judging whether a line is a first-level heading is: pattern= "\W?\d+\W[\u4e00-\u9fa5]+(：|:)?";
Step 3.6.2: judge whether the line is a second-level heading in the format "parameter name: parameter value" (e.g., "1.1 capacity: greater than 12000 t/h"); if so, the line is a formatted description of a target object parameter, the parameter name and parameter value are extracted with a regular expression and stored in a dictionary, together with the current parameter type, as one parameter item, and processing of the line ends; if not, go to step 3.6.3;
the regular expression for judging whether a line is a second-level heading containing "parameter name: parameter value" is: pattern= "\W?\d+\W+\d+\W[\u4e00-\u9fa5]+(：|:)+";
Step 3.6.3: judge whether the line is a second-level heading (such as "2.2 shearer valve components are fitted with filters"); if so, the line is a free-text description of a parameter, the whole text is taken as the parameter name, the parameter value is left empty, the parameter item is stored, and processing of the line ends; if not, go to step 3.6.4;
the regular expression for judging whether a line is a second-level heading is: pattern= "\W?\d+\W+\d+\W[\u4e00-\u9fa5]+(：|:)+";
Step 3.6.4: judge whether the line is a third-level heading; if so, strip the heading numbering, append the remaining text to the parameter value of the previous parameter item, and end processing of the line; if not, go to step 3.6.5;
the regular expression for judging whether a line is a third-level heading is: pattern= "\W?\d+(\W+\d+){2}\W";
Step 3.6.5: if none of the above conditions is met, the line continues the previous parameter item and is appended directly to that item's parameter value;
Step 3.6.6: repeat steps 3.6.1 to 3.6.5 until the content of the bid package ends.
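The decision cascade of steps 3.6.1 to 3.6.6 can be sketched as a single loop. The heading patterns below are English stand-ins written for this sketch (numbering such as "1.", "1.1", "1.1.1"), not the patent's expressions, and stripping the numbering from the stored parameter name is an assumption.

```python
import re

LEVEL1 = re.compile(r"^\d+[.、](?!\d)\s*([^:：]+?)\s*[:：]?$")    # "1. Parameter information"
LEVEL2_KV = re.compile(r"^\d+\.\d+\.?\s+([^:：]+)[:：]\s*(.+)$")  # "1.1 name: value"
LEVEL2 = re.compile(r"^\d+\.\d+\.?\s+(.+)$")                      # "2.2 free-text description"
LEVEL3 = re.compile(r"^\d+\.\d+\.\d+\.?\s*(.*)$")                 # "2.2.1 detail"


def _append_to_last(items, text):
    """Append text to the value of the most recent parameter item, if any."""
    if items and text:
        items[-1]["value"] = (items[-1]["value"] + " " + text).strip()


def extract_parameters(lines):
    """Classify each line in the order of steps 3.6.1 to 3.6.5."""
    items, current_type = [], None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = LEVEL1.match(line)
        if m:  # 3.6.1: a first-level heading sets the parameter type
            current_type = m.group(1).strip()
            continue
        m = LEVEL2_KV.match(line)
        if m:  # 3.6.2: second-level heading holding "name: value"
            items.append({"type": current_type, "name": m.group(1).strip(),
                          "value": m.group(2).strip()})
            continue
        m = LEVEL2.match(line)
        if m:  # 3.6.3: second-level heading, free text becomes the name
            items.append({"type": current_type, "name": m.group(1).strip(), "value": ""})
            continue
        m = LEVEL3.match(line)
        if m:  # 3.6.4: third-level heading extends the previous item
            _append_to_last(items, m.group(1).strip())
            continue
        _append_to_last(items, line)  # 3.6.5: continuation line
    return items
```

The order of the checks matters: the "name: value" test runs before the plain second-level test, so formatted parameters are never swallowed as free text.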
Step 3.7: store the parameter names and parameter values extracted in step 3.6, together with the target object type and the extraction source (file name), as key-value pairs in a Python dictionary;
Step 3.8: save the parameters extracted from the batch of files to a json file for convenient subsequent processing.
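Steps 3.7 and 3.8 could be sketched as follows. The patent does not specify the dictionary layout, so the key names (`source`, `target`, `parameters`) and the function names are assumptions made for illustration.

```python
import json


def build_record(source_file, target_name, items):
    """Bundle the extracted parameter items with their source file and target object."""
    return {
        "source": source_file,
        "target": target_name,
        "parameters": [
            {"type": it["type"], "name": it["name"], "value": it["value"]}
            for it in items
        ],
    }


def to_json(records):
    """Serialize records; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def save_records(records, path):
    """Write the records of a whole batch to one json file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(to_json(records))
```

Keeping the source file name in every record makes it possible to trace each extracted parameter back to the bidding document it came from.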
Step 4: build an extraction report from the intermediate information of the above steps and generate the extraction result;
In this embodiment, as shown in FIG. 6, the parameter extraction report is generated mainly from the intermediate information of the above process, such as the number of files, the total number of bid packages identified in those files, the number of bid packages belonging to the specified target object after screening, whether the parameter content was located successfully, and the total number of extracted parameter items, and thus summarizes the whole extraction process.
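An extraction report aggregating the quantities listed above could be sketched like this. The per-file record `FileResult`, its field names, and the report keys are hypothetical; the patent only names the quantities that the report contains.

```python
from dataclasses import dataclass


@dataclass
class FileResult:
    """Intermediate information collected while processing one file."""
    file_name: str
    packages_found: int    # bid packages identified in the file
    target_packages: int   # packages kept after screening for the target object
    located: bool          # whether the parameter content was located
    parameter_items: int   # parameter items extracted from the file


def build_report(results):
    """Summarize the whole batch from the per-file intermediate information."""
    return {
        "files": len(results),
        "packages_found": sum(r.packages_found for r in results),
        "target_packages": sum(r.target_packages for r in results),
        "files_located": sum(1 for r in results if r.located),
        "files_not_located": [r.file_name for r in results if not r.located],
        "parameter_items": sum(r.parameter_items for r in results),
    }
```

Listing the files whose parameter content could not be located gives a reviewer a short list of documents to check by hand.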
The method is suitable not only for extracting parameters from bidding documents but also for other scenarios in which specific information must be extracted from text.
It should be understood that the foregoing description of the preferred embodiment is relatively detailed and is not to be taken as limiting the scope of protection of the invention, which is defined by the appended claims; those skilled in the art can make substitutions or modifications without departing from the scope protected by the claims.

Claims (7)

1. An automatic text information extraction method, characterized by comprising the following steps:
step 1: batch-preprocessing the input texts and converting them into pdf format texts;
step 2: structuring the pdf format texts;
the specific implementation of step 2 comprises the following sub-steps:
step 2.1: inputting a batch of pdf format texts;
step 2.2: extracting the bookmark information in the pdf format text by using PyPDF2, performing regular matching on the bookmark names with a constructed rule to obtain the matching chapter bookmarks, and saving the names and page position information of those bookmarks;
the rule for extracting chapter bookmarks according to the bookmark name is as follows: pattern= "(chapter|part)";
step 2.3: dividing the pdf format text based on the chapter bookmark information extracted in step 2.2 to obtain the text of each chapter in the file;
step 2.4: based on each chapter text obtained in step 2.3, constructing a specific regular rule to divide the chapter text and obtain the name and position of each section in the chapter;
the regular rule for extracting sections is as follows: pattern= ", section";
step 3: locating, identifying and extracting the related information of the specified target object;
the specific implementation of step 3 comprises the following sub-steps:
step 3.1: constructing a regular rule to locate the chapter in which the related information is located; the regular expression for the chapter containing the technical parameter content of the related information is as follows: pattern= "(technical |parameter|requirement),"; the regular expressions for other information are constructed on the same principle as the technical parameter regular expression;
step 3.2: within the chapter obtained in step 3.1, using the section information obtained in step 2.4 to construct a rule that matches the specific technical parameter section; the regular expression for matching the technical parameter section is as follows: pattern= "((technical|parameter) | requirement) | (..technical|parameter),";
step 3.3: precisely locating the position of the specific parameter text within the section by regular matching; the regular expression for locating the specific parameter text is as follows: pattern= "\W? D + \w [\u4e00-\u9fa5] (technical |parameter| requirement) [\u4e00-\u9fa5] (|: is? ";
step 3.4: starting from the content located in step 3.3, identifying parameters line by line and extracting the corresponding parameter types, parameter names and parameter values;
step 3.5: storing the parameter names and parameter values extracted in step 3.4, together with the target object type and the name of the file they were extracted from, in a python dictionary in key-value pair format;
step 3.6: storing the parameters extracted from the batch of files in a json file;
step 4: building an extraction report from the intermediate information of the above steps, and generating the extraction result.
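As an illustration of steps 2.2 and 2.3 only, the following minimal sketch filters bookmark titles with a regular expression and derives a page range for each chapter. It assumes the bookmarks have already been read from the pdf (for example with PyPDF2) into a list of (title, page) pairs. The English pattern and the function names select_chapter_bookmarks and chapter_page_ranges are hypothetical stand-ins, not the claimed rule.

```python
import re

# Hypothetical English stand-in for the chapter-bookmark rule of step 2.2.
CHAPTER_PATTERN = re.compile(r"(chapter|part)\b", re.IGNORECASE)


def select_chapter_bookmarks(bookmarks):
    """Keep the (title, page) pairs whose title starts like a chapter heading."""
    return [
        (title, page)
        for title, page in bookmarks
        if CHAPTER_PATTERN.match(title.strip())
    ]


def chapter_page_ranges(chapters, total_pages):
    """Map each chapter title to a (start, end) page range, end exclusive.

    A chapter is assumed to run until the next chapter bookmark, and the
    last chapter until the end of the file.
    """
    ranges = {}
    for i, (title, start) in enumerate(chapters):
        end = chapters[i + 1][1] if i + 1 < len(chapters) else total_pages
        ranges[title] = (start, end)
    return ranges


# Made-up bookmark data for a 40-page bidding document.
bookmarks = [
    ("Cover", 0),
    ("Chapter 1 Invitation to Bid", 2),
    ("1.1 Overview", 2),
    ("Chapter 2 Technical Requirements", 10),
    ("Part 3 Contract Terms", 25),
]
chapters = select_chapter_bookmarks(bookmarks)
ranges = chapter_page_ranges(chapters, total_pages=40)
```

The resulting page ranges are what a later step would use to cut each chapter out of the original file.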
2. The automatic text information extraction method according to claim 1, wherein: in step 1, the input texts are batch-preprocessed by using the WinAPI, and the underlying VBA of Word is called through the python win32 library to convert Word format texts into pdf format texts.
3. The automatic text information extraction method according to claim 1, wherein the specific implementation of step 2.3 comprises the following sub-steps:
step 2.3.1: constructing a regular rule according to the bookmark names in the bookmarks, locating each chapter, and cutting the corresponding pages out of the original pdf file by using the page position information of each chapter;
step 2.3.2: recognizing the text of each cut-out chapter by using pdfplumber;
step 2.3.3: cleaning the text by using natural language processing techniques to remove invalid interfering text;
step 2.3.4: saving the cleaned text of each chapter to a txt file.
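The cleaning in step 2.3.3 is not specified further in the claim. As a rough sketch only, the code below swaps in a simple rule-based cleaner instead of a natural language processing technique. It collapses whitespace and drops blank lines, bare page numbers and "Page N of M" footers. The noise patterns and the function name clean_chapter_text are assumptions made for illustration.

```python
import re

# Assumed noise: bare page numbers, "- 12 -" style markers, "Page N of M" footers.
NOISE = re.compile(
    r"^(\d+|-\s*\d+\s*-|page\s+\d+(\s+of\s+\d+)?)$",
    re.IGNORECASE,
)


def clean_chapter_text(text):
    """Return the text with whitespace normalized and noise lines removed."""
    kept = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if not line or NOISE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)


raw = (
    "Chapter 2  Technical   Requirements\n"
    "\n"
    "- 12 -\n"
    "1.1 Rated voltage: 220 V\n"
    "Page 13 of 40\n"
)
cleaned = clean_chapter_text(raw)
```

The cleaned string can then be written to a txt file per chapter, as in step 2.3.4.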
4. The automatic text information extraction method according to claim 1, wherein: in step 3.2, if one bidding document contains parameter information for a plurality of bid packages, a rule is further constructed to divide the technical parameter section extracted in step 3.2 by bid package, obtaining the technical parameter section content of each bid package; the regular expression for dividing the technical parameter section by bid package is as follows: pattern= ", x (packet|term); and, based on the divided technical parameter section content of each bid package, the technical parameter section of the specified target object is obtained by screening.
5. The automatic text information extraction method according to claim 4, wherein: the technical parameter section to which the specified target object belongs is obtained by screening as follows: whether the current technical parameter section belongs to the specified target object is judged by detecting whether the beginning of the technical parameter section content of each bid package contains the name of the specified target object.
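The division and screening described in claims 4 and 5 can be sketched as follows, under stated assumptions: each bid package is taken to start with a heading line such as "Package 1: ...", and "the beginning of the content" is taken to mean that heading line. The pattern and the function names split_packages and screen_packages are hypothetical.

```python
import re

# Hypothetical heading rule for English text, e.g. "Package 2: Cable".
PACKAGE_HEADING = re.compile(r"^Package\s+\d+\b.*$", re.MULTILINE)


def split_packages(section_text):
    """Split a technical parameter section into one text block per bid package."""
    starts = [m.start() for m in PACKAGE_HEADING.finditer(section_text)]
    if not starts:
        # No package headings: treat the whole section as a single package.
        return [section_text.strip()]
    bounds = starts + [len(section_text)]
    return [section_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def screen_packages(packages, target_name):
    """Keep the packages whose first line contains the target object name."""
    return [p for p in packages if target_name in p.split("\n", 1)[0]]


section = (
    "Package 1: Power transformer\n"
    "1 Electrical parameters\n"
    "1.1 Rated voltage: 10 kV\n"
    "Package 2: Cable\n"
    "1 Dimensions\n"
    "1.1 Length: 500 m\n"
)
packages = split_packages(section)
selected = screen_packages(packages, "Cable")
```

Checking only the first line keeps a target name that merely appears later in another package's parameter text from producing a false match.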
6. The automatic text information extraction method according to claim 1, wherein the specific implementation of step 3.4 comprises the following sub-steps:
step 3.4.1: constructing a specific rule and judging, line by line, whether the line of text is a primary title; if yes, the title indicates the parameter type, the current parameter type is set to the cleaned title text, and processing of this line ends; if not, go to step 3.4.2;
the regular expression for judging whether the text is a primary title is as follows: pattern= "\W? D + \w [\u4e00-\u9fa5] + (: |:)? ";
step 3.4.2: judging whether the line of text is a secondary title that contains text in the format "parameter name: parameter value"; if yes, the text is a formatted description of a target object parameter, the parameter name and parameter value are extracted by regular matching and stored in the dictionary, together with the current parameter type, as one parameter item, and processing of this line ends; if not, go to step 3.4.3;
wherein the regular expression for judging whether the text is a secondary title containing "parameter name: parameter value" is: pattern= "\W? D+w+d+w [\u4e00-\u9fa5] + (|:) + ";
step 3.4.3: judging whether the line of text is a secondary title; if yes, the text is a free-text description of a parameter, the whole text is taken directly as the parameter name, the parameter value is set to empty, the current parameter item is stored, and processing of this line ends; if not, go to step 3.4.4;
the regular expression for judging whether the text is a secondary title is as follows: pattern= "\W? D+w+d+w [\u4e00-\u9fa5] + (|:) + ";
step 3.4.4: judging whether the line of text is a three-level title; if yes, the title numbering is cleaned from the line, the remaining text is appended to the parameter value of the previous parameter item, and processing of this line ends; if not, go to step 3.4.5;
the regular expression for judging whether the text is a three-level title is as follows: pattern= "\W? D+ (\w+ \d+) {2} \w ";
step 3.4.5: if none of the above conditions is met, the line of text is a continuation of the previous parameter item and is appended directly to the parameter value of the previous parameter item;
step 3.4.6: repeating steps 3.4.1 to 3.4.5 until the end of the bid package text.
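The decision cascade of steps 3.4.1 to 3.4.6 can be illustrated with a short sketch. It assumes English text with decimal numbering ("1", "1.1", "1.1.1"); the four patterns below are hypothetical substitutes for the claimed expressions, which target Chinese headings, and the function name extract_parameters is likewise an assumption.

```python
import re

# Hypothetical patterns for decimal-numbered English headings.
LEVEL1 = re.compile(r"^\d+\s+(.+?)[:：]?$")                 # "1 Electrical parameters"
LEVEL2_KV = re.compile(r"^\d+\.\d+\s+([^:：]+)[:：]\s*(.+)$")  # "1.1 Name: value"
LEVEL2 = re.compile(r"^\d+\.\d+\s+(.+)$")                   # "1.2 Free text"
LEVEL3 = re.compile(r"^\d+\.\d+\.\d+\s+(.+)$")              # "1.2.1 Detail"


def extract_parameters(lines):
    """Classify each line in the order of steps 3.4.1 to 3.4.5."""
    items = []
    current_type = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = LEVEL1.match(line)
        if m:  # step 3.4.1: primary title sets the parameter type
            current_type = m.group(1)
            continue
        m = LEVEL2_KV.match(line)
        if m:  # step 3.4.2: secondary title with "name: value"
            items.append({"type": current_type,
                          "name": m.group(1).strip(),
                          "value": m.group(2).strip()})
            continue
        m = LEVEL2.match(line)
        if m:  # step 3.4.3: secondary title, whole text is the name
            items.append({"type": current_type, "name": m.group(1), "value": ""})
            continue
        m = LEVEL3.match(line)
        # step 3.4.4 strips the numbering; step 3.4.5 keeps the line as is.
        text = m.group(1) if m else line
        if items:
            items[-1]["value"] = (items[-1]["value"] + " " + text).strip()
    return items


sample = [
    "1 Electrical parameters",
    "1.1 Rated voltage: 220 V",
    "1.2 The housing shall be corrosion resistant",
    "1.2.1 Salt spray test of 48 h",
    "tested per batch",
    "2 Dimensions:",
    "2.1 Length: 1200 mm",
]
items = extract_parameters(sample)
```

The order of the checks matters: the "name: value" test runs before the plain secondary-title test, because every line matching the former would also match the latter.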
7. The automatic text information extraction method according to any one of claims 1 to 6, characterized in that: in step 4, the intermediate information from the above process is used to generate the extraction report, which includes the number of files, the total number of bid packages identified in the files, the number of bid packages of the specified target object obtained by screening, whether the parameter content was successfully located, and the total number of extracted parameter items.
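A report with the fields listed in claim 7 could be assembled as in the sketch below. The per-file record layout (keys file, packages_found, packages_matched, located, parameters) and the function name build_report are assumptions made for this example; the claim does not prescribe a data structure.

```python
import json


def build_report(file_results):
    """Aggregate per-file intermediate results into one extraction report."""
    return {
        "file_count": len(file_results),
        "total_packages": sum(r["packages_found"] for r in file_results),
        "target_packages": sum(r["packages_matched"] for r in file_results),
        # Whether the parameter content was located, recorded per file.
        "located": {r["file"]: r["located"] for r in file_results},
        "parameter_items": sum(len(r["parameters"]) for r in file_results),
    }


# Made-up intermediate results for two files.
results = [
    {
        "file": "a.pdf",
        "packages_found": 3,
        "packages_matched": 1,
        "located": True,
        "parameters": [
            {"name": "Rated voltage", "value": "220 V"},
            {"name": "Length", "value": "1200 mm"},
        ],
    },
    {
        "file": "b.pdf",
        "packages_found": 1,
        "packages_matched": 0,
        "located": False,
        "parameters": [],
    },
]
report = build_report(results)
# ensure_ascii=False keeps Chinese parameter names readable in the output.
report_json = json.dumps(report, ensure_ascii=False, indent=2)
```

The same json.dumps call would also serve for writing the extracted parameter dictionaries of step 3.6 to a json file.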
CN202011507003.6A 2020-12-18 2020-12-18 Text information automatic extraction method Active CN112597353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011507003.6A CN112597353B (en) 2020-12-18 2020-12-18 Text information automatic extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011507003.6A CN112597353B (en) 2020-12-18 2020-12-18 Text information automatic extraction method

Publications (2)

Publication Number Publication Date
CN112597353A CN112597353A (en) 2021-04-02
CN112597353B CN112597353B (en) 2024-03-08

Family

ID=75199447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011507003.6A Active CN112597353B (en) 2020-12-18 2020-12-18 Text information automatic extraction method

Country Status (1)

Country Link
CN (1) CN112597353B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590655A (en) * 2021-06-30 2021-11-02 中国神华国际工程有限公司 Method and device for extracting parameter information of object, storage medium and electronic equipment
CN113643077A (en) * 2021-10-14 2021-11-12 北京百炼智能科技有限公司 Object prediction processing method and system for label
CN114580348A (en) * 2022-02-24 2022-06-03 来也科技(北京)有限公司 Method, device, terminal and storage medium for acquiring bidding document by combining RPA and AI
CN115544974A (en) * 2022-11-28 2022-12-30 药融云数字科技(成都)有限公司 Text data extraction method, system, storage medium and terminal

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942335A (en) * 2014-05-07 2014-07-23 武汉大学 Construction method of uninterrupted crawler system oriented to web page structure change
JP2015194955A (en) * 2014-03-31 2015-11-05 株式会社ナビット Bid information search system
CN106776538A (en) * 2016-11-23 2017-05-31 国网福建省电力有限公司 The information extracting method of enterprise's noncanonical format document
CN106886509A (en) * 2017-03-06 2017-06-23 大连理工大学 A kind of academic dissertation form automatic testing method
CN108874771A (en) * 2018-05-25 2018-11-23 福州大学 A kind of information extraction method towards bid text
CN110287785A (en) * 2019-05-20 2019-09-27 深圳壹账通智能科技有限公司 Text structure information extracting method, server and storage medium
WO2019237540A1 (en) * 2018-06-12 2019-12-19 平安科技(深圳)有限公司 Method and device for acquiring financial data, terminal device, and medium
CN111241230A (en) * 2019-12-31 2020-06-05 中国南方电网有限责任公司 Method and system for identifying string mark risk based on text mining

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015194955A (en) * 2014-03-31 2015-11-05 株式会社ナビット Bid information search system
CN103942335A (en) * 2014-05-07 2014-07-23 武汉大学 Construction method of uninterrupted crawler system oriented to web page structure change
CN106776538A (en) * 2016-11-23 2017-05-31 国网福建省电力有限公司 The information extracting method of enterprise's noncanonical format document
CN106886509A (en) * 2017-03-06 2017-06-23 大连理工大学 A kind of academic dissertation form automatic testing method
CN108874771A (en) * 2018-05-25 2018-11-23 福州大学 A kind of information extraction method towards bid text
WO2019237540A1 (en) * 2018-06-12 2019-12-19 平安科技(深圳)有限公司 Method and device for acquiring financial data, terminal device, and medium
CN110287785A (en) * 2019-05-20 2019-09-27 深圳壹账通智能科技有限公司 Text structure information extracting method, server and storage medium
WO2020233332A1 (en) * 2019-05-20 2020-11-26 深圳壹账通智能科技有限公司 Text structured information extraction method, server and storage medium
CN111241230A (en) * 2019-12-31 2020-06-05 中国南方电网有限责任公司 Method and system for identifying string mark risk based on text mining

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
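Web-based text mining; Tang Jing, Zhang Qian, Chen Hongjie, Liu Ning, Yang Bingru; Computer Engineering and Applications; 20021101 (No. 21); full text *
A text sentiment orientation analysis method: bfsmPMI-SVM; Liu Jinshuo; Li Zhe; Ye Xin; Chen Jiamin; Deng Juan; Journal of Wuhan University (Natural Science Edition); 20170630; Vol. 63 (No. 03); full text *
A review of methods for extracting academic information from scientific papers; Hu Zhigang; Tian Wencan; Sun Tai'an; Hou Haiyan; Digital Library Forum; 20171025 (No. 10); full text *
Application of natural language in intelligent information retrieval; Liu Ning; Chai Yaling; Library and Information; 20060228 (No. 01); full text *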

Also Published As

Publication number Publication date
CN112597353A (en) 2021-04-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant