CN110609998A - Data extraction method of electronic document information, electronic equipment and storage medium

Info

Publication number: CN110609998A
Application number: CN201910725818.2A
Authority: CN
Inventors: 李宗蔚; 彭彬; 区旸; 黎毅; 吴淦浩; 孙风建; 李展鹏; 罗伟浩; 陈堉颖; 孙荣
Original assignee: Zhong Tong Clothing Construction Co Ltd
Current assignee: Zhong Tong Clothing Construction Co Ltd
Priority date: 2019-08-07
Filing date: 2019-08-07
Publication date: 2019-12-24

Abstract

The invention discloses a data extraction method for electronic document information. The method first extracts the body text from the acquired electronic document information; then preprocesses the extracted text, converts it into standardized text content, and divides the standardized text content into a plurality of short sentences; and then, according to a configuration file, sequentially performs theme matching, attribute name matching and attribute value matching on each short sentence by keyword to obtain the attribute names and the corresponding attribute values in the electronic document information, thereby extracting the information in the electronic document and addressing problems of the prior art such as low data extraction efficiency. The invention also discloses an electronic device and a storage medium.

Description

Data extraction method of electronic document information, electronic equipment and storage medium

Technical Field

The present invention relates to text information processing, and in particular, to a data extraction method for electronic document information, an electronic device, and a storage medium.

Background

Text information extraction is an important subject in pattern recognition and artificial intelligence. It is a text processing technology that extracts factual information of specified types, such as entities, relations and times, from natural language text and outputs it as structured data. Extraction is more than simple information retrieval: besides statistics and keyword matching, it can also use natural language analysis to classify, summarize and analyze the sentences and chapters of a text, and finally produce formatted information.

With the advent of the knowledge explosion era, and particularly with the wide popularization of the internet, the social demand for language information processing keeps increasing, and people urgently need automatic means of processing massive amounts of language information. However, owing to the limited development of the underlying theory and the complexity of Chinese, current research in China on computational linguistic theory and methods cannot yet provide enough support for developing Chinese information processing application systems. At present, the main information extraction methods are dictionary-based extraction models, rule-based extraction models, hidden Markov model-based extraction models and the like, but such models extract only certain keywords. In practice, Chinese differs from languages such as English in that there are no specific separators between words, so processing Chinese often requires handling a whole unit of text, such as a sentence or paragraph. Currently there is no processing means that can effectively extract information from Chinese text.

Disclosure of Invention

In order to overcome the defects of the prior art, an object of the present invention is to provide a data extraction method for electronic document information, which can solve the problems of low efficiency of extracting mass data and the like in the prior art.

Another object of the present invention is to provide an electronic device that can solve the problem of low efficiency of extracting mass data in the prior art.

It is a further object of the present invention to provide a computer-readable storage medium, which can solve the problems of low efficiency of extracting mass data, etc. in the prior art.

One of the purposes of the invention is realized by adopting the following technical scheme:

a data extraction method of electronic document information comprises the following steps:

a configuration file generation step: setting a configuration file according to the business requirement, wherein the business requirement defines the information to be extracted;

an acquisition step: extracting the body text content from the acquired electronic document information;

a preprocessing step: preprocessing the extracted text content and converting the preprocessed text content into standardized text content;

a short sentence dividing step: dividing the standardized text content into a plurality of short sentences according to a short sentence division rule;

a matching step: according to the configuration file, sequentially performing theme matching, attribute name matching and attribute value matching on each short sentence using a keyword matching algorithm to obtain the attribute value corresponding to each attribute name;

and a result judgment step: judging whether the attribute value of each attribute name meets the business requirement, and obtaining one or more extracted attributes according to the judgment result, wherein each attribute comprises an attribute name and an attribute value.

Further, the method also includes a normalization step: normalizing the attribute names and the attribute values according to the business requirement, and storing them in the corresponding data table as data table fields.

Further, the matching step specifically includes:

step S1: obtaining the theme related to the information to be extracted from the configuration file, and then matching each short sentence by keyword matching to obtain all short sentences related to the theme;

step S2: obtaining the attribute names related to the theme from the configuration file, and matching each short sentence related to the theme by keyword matching to obtain the short sentence corresponding to each attribute name;

step S3: extracting, from the context of the short sentence corresponding to each attribute name, one or more attribute values corresponding to that attribute name.
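
The three matching steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not the claimed implementation: the CONFIG layout, the function names, plain substring matching, treating every short sentence from the first theme mention onward as theme-related, and reading the value from the short sentence immediately after the attribute name are all hypothetical choices made for the example.

```python
# Hypothetical configuration; the keywords and aliases below are illustrative only.
CONFIG = {
    "theme_keywords": ["winning bid"],
    "attributes": {
        "name": ["name"],
        "telephone": ["telephone", "phone"],
    },
}


def contains_any(short_sentence, keywords):
    """Plain substring keyword matching, case-insensitive."""
    lowered = short_sentence.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def match_attributes(short_sentences, config):
    # S1: locate the short sentences that mention the theme. In this sketch,
    # everything from the first theme mention onward counts as theme-related.
    hits = [i for i, s in enumerate(short_sentences)
            if contains_any(s, config["theme_keywords"])]
    if not hits:
        return {}
    related = short_sentences[hits[0]:]

    extracted = {}
    for position, sentence in enumerate(related):
        for attribute, aliases in config["attributes"].items():
            # S2: find the short sentence that names the attribute.
            if attribute in extracted or not contains_any(sentence, aliases):
                continue
            # S3: read the value from the context; here, the next short sentence.
            if position + 1 < len(related):
                extracted[attribute] = related[position + 1].strip()
    return extracted
```

Called on the short sentences "Winning bid information", "Name", "XX Company", "Telephone", "13700000000", the sketch pairs name with XX Company and telephone with 13700000000; a text that never mentions the theme yields no attributes.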

Further, when the attribute values are matched, the credibility of each attribute value obtained by each attribute name is calculated according to a preset rule, and then one or more attribute values corresponding to each attribute name are screened according to the credibility.

Further, the preprocessing step specifically includes: firstly, removing labels used for layout in the text content, then converting the text content into a Markdown text, and finally converting the Markdown text into a plain text.

Further, when the text content includes a data table, the method further includes, before the conversion to standardized plain text: first numbering all data tables in the text content; then adding a specific identifier before the start tag and after the end tag of each data table, creating a Table object for the corresponding data table, and associating the specific identifiers and the Table object of each data table with the number of that data table.

Further, before the short sentence dividing step, the method further comprises: first confirming the theme of the information to be extracted according to the configuration file, and dividing the text content into a plurality of paragraphs according to the theme; then dividing the text content into a plurality of blocks according to block characteristics, wherein the block characteristics include: a title marks the boundary of a block; a blank line or a dividing line marks the end of a block; and the end of a data table marks the end of a block.
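
As a minimal sketch of block division by these characteristics, the following assumes the text has already been reduced to lines in which titles keep a leading "#" (as in Markdown), dividing lines are runs of "-", "*", "_" or "=", and the end of a data table is marked by a line starting with END_TABLE. The function name and these conventions are assumptions made for illustration.

```python
import re


def split_blocks(lines):
    """Group lines into blocks using title, blank/dividing line and table-end cues."""
    blocks, current = [], []

    def flush():
        # Close the block being built, if it has any content.
        if current:
            blocks.append(current.copy())
            current.clear()

    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#"):
            # A title ends the previous block and starts a new one.
            flush()
            current.append(stripped)
        elif stripped == "" or re.fullmatch(r"[-*_=]{3,}", stripped):
            # A blank line or a dividing line ends the current block.
            flush()
        elif stripped.startswith("END_TABLE"):
            # The end of a data table ends the current block.
            current.append(stripped)
            flush()
        else:
            current.append(stripped)
    flush()
    return blocks
```

Restricting later context lookups to a single block returned by split_blocks is one way to keep a contextual topic from leaking across block boundaries.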

Further, the short sentence dividing step is: dividing the text content into a plurality of short sentences according to short sentence separators, wherein the short sentence separators include the period, the comma and the colon.
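
A minimal sketch of this division, assuming both the full-width Chinese forms and the ASCII forms of the three separators should be recognized (the separator set and the function name are illustrative assumptions):

```python
import re

# Period, comma and colon, in full-width and ASCII forms (assumed set).
SEPARATORS = "。，：.,:"
SEPARATOR_RE = re.compile("[" + re.escape(SEPARATORS) + "]")


def split_short_sentences(text):
    """Split text at the separators and drop empty fragments."""
    return [part.strip() for part in SEPARATOR_RE.split(text) if part.strip()]
```

Note that splitting on the ASCII period and colon also cuts decimal numbers and clock times; a real rule set would need to guard such cases.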

The second purpose of the invention is realized by adopting the following technical scheme:

an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the data extraction method of electronic document information according to the first object of the invention.

The third purpose of the invention is realized by adopting the following technical scheme:

a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the data extraction method of electronic document information according to the first object of the invention.

Compared with the prior art, the invention has the beneficial effects that:

according to the invention, mass data captured regularly and quantitatively from web pages is preprocessed and then subjected to information extraction, so that the important information in the mass data is extracted, converted into a uniform data format and stored in a database in normalized form. This replaces the manual screening and analysis of mass data required in the prior art, greatly saves manpower and material resources, improves the effectiveness of processing mass data, and widens the range of applications of the mass data.

Drawings

FIG. 1 is a data flow diagram of a data extraction system for electronic document information according to the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.

The first embodiment is as follows:

With the advent of the big data age, big data is applied more and more widely. However, applying big data depends on acquiring and extracting the underlying data, and characteristics of the internet industry such as large data volume and high update speed make the acquisition, summarization, arrangement and analysis of mass data very difficult. Extracting the information purely by hand is clearly not feasible. The present invention is therefore made in view of the above problems, and provides a processing method that can efficiently extract information from mass data and store the extracted information in a standard format, so that the processed mass data can be applied in other fields.

In general, the invention uses a crawler to capture, regularly and quantitatively and according to a configuration file, the corresponding data from various everyday electronic documents, web pages and the like, and stores the data in electronic documents. It then classifies, summarizes, analyzes and extracts the text information of the electronic documents, turns the extracted data into standard formatted text content, and stores it in a database. When the mass data is later used, it no longer needs to be acquired from the web pages; the relevant information can be obtained directly from the extracted data. Because the data in the database tables has already been processed, its range of applications is greatly widened.

The invention provides a data extraction system of electronic document information, as shown in fig. 1, comprising the following parts:

the data acquisition part: using a crawler, the corresponding data is captured regularly and quantitatively from everyday electronic documents, web pages and the like, and stored in corresponding electronic documents.

Generally, mass data is captured from pages of websites, software APPs and the like. For example, a website page usually contains a header, a footer, advertisements and similar content that is completely unrelated to the body text of the electronic document. To simplify subsequent processing, the acquired data is first given simple processing that removes information irrelevant to the body text and extracts the body text of the electronic document.

For example, for a given type of announcement on a specific website, the body text can be extracted directly by a simple XPath selection or by matching the marker strings that precede and follow it.

The key point of the invention is extracting information from the body text of the electronic document. Most information extraction is performed in a semi-structured manner; that is, the extracted data is generally not a complete sentence conforming to ordinary grammar, but consists of some simple, key words, phrases and the like.

On the same principle, the invention only needs to extract the corresponding attribute names and attribute values in order to achieve information extraction. In addition, given the characteristics of the Chinese language, what a passage of text describes is generally tied to the nouns in it, because the nouns carry the key information of the passage, while the verbs, adverbs, adjectives and other words contribute little to understanding it. Therefore, when information is extracted, verbs, adverbs, adjectives and similar words need not be extracted; only the relevant nouns need to be extracted, and these nouns are defined as attributes, each comprising an attribute name and an attribute value. For example: attribute name: person name, attribute value: Zhang San; attribute name: place name, attribute value: No. XX, XX Street, Guangzhou; attribute name: organization name, attribute value: Zhongtong construction limited company; attribute name: time, attribute value: July 30, 2019; attribute name: amount of money, attribute value: 130W; attribute name: telephone number, attribute value: 130000000; and so on. To make the specific process of information extraction easier to describe, the invention provides the following definitions of terms:

(A) Definitions of terms:

1. Attribute: a piece of information to be extracted, comprising an attribute name and an attribute value.

The attribute name is the name of one piece of information to be extracted, such as bidding unit name, telephone number, address or item number. Each attribute may have several aliases. The attribute value is the value corresponding to the attribute name, such as XX company, 13700000000, No. XX, XX Street, Guangzhou, or 01.

2. Topic: attributes are usually not scattered but are organized into groups by category, and the corresponding category name is called a topic (theme). For example: supplier, purchasing unit, etc. Likewise, each topic may have several aliases.

3. Anonymous topic: some attributes are difficult to organize by category; they are placed in a separate category called the anonymous topic. For example, an attribute under the anonymous topic: display date.

4. Short sentence: within a paragraph, the content delimited by dividing punctuation marks is called a short sentence. That is, a short sentence is the part of a line between two dividing punctuation marks. The dividing punctuation marks here are the comma, the period and the colon. Enumeration commas, parentheses and the like are not used to separate short sentences.

In addition, in order to distinguish the content in parentheses, the phrase further includes a phrase main chain and a phrase branched chain.

Short sentence main chain: short sentences are usually obtained by splitting the sentences of the official document content directly at punctuation marks. Following the original order of the sentences, the resulting short sentences are naturally linked from front to back; that is, they form a short sentence main chain.

Short sentence branch chain: in an actual corpus of official documents, related information is often given in parentheses, and the content of the parentheses is frequently exactly the short sentence that is ultimately needed. To represent this case conveniently, clauses expressed in parentheses or similar forms are split into a short sentence chain of their own. This chain, which usually ends where the parentheses close, may be called a branch chain relative to the content of the entire document.
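
The relationship between the main chain and the branch chains can be sketched as follows. The representation is an assumption made for illustration: the main chain is a list of short sentences, and each branch chain is stored under the index of the main-chain short sentence in which its parentheses appeared. Nested parentheses are not handled, and the private-use character used as a placeholder is an arbitrary choice.

```python
import re

SEP = re.compile(r"[。，：.,:]")                      # dividing punctuation
PAREN = re.compile(r"[（(]([^（()）]*)[)）]")          # one non-nested parenthetical
MARK = "\uE000"                                      # placeholder delimiter (assumed unused in text)
MARK_RE = re.compile(MARK + r"(\d+)" + MARK)


def split_short_sentences(text):
    return [part.strip() for part in SEP.split(text) if part.strip()]


def build_chains(text):
    """Return (main_chain, branches); branches maps a main-chain index to a branch chain."""
    parentheticals = []

    def lift(match):
        # Take the parenthetical out of the main text and leave a numbered placeholder.
        parentheticals.append(match.group(1))
        return f"{MARK}{len(parentheticals) - 1}{MARK}"

    marked = PAREN.sub(lift, text)
    main_chain, branches = [], {}
    for raw in SEP.split(marked):
        ids = [int(n) for n in MARK_RE.findall(raw)]
        short_sentence = MARK_RE.sub("", raw).strip()
        if not short_sentence and not ids:
            continue
        main_chain.append(short_sentence)
        for n in ids:
            # Each parenthetical is itself split into short sentences: a branch chain.
            branches.setdefault(len(main_chain) - 1, []).extend(
                split_short_sentences(parentheticals[n]))
    return main_chain, branches
```

For the text "Bidding unit: XX Company (contact: Li Si), address: XX Street." the main chain is Bidding unit, XX Company, address, XX Street, and the branch chain contact, Li Si hangs off the second short sentence.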

5. Named entity: a named entity defines the result information of the information extraction and is used in defining the configuration file. Named entities can be set according to the business requirement, for example: person name, place name, organization name, time, amount of money, telephone number, product name, etc. They can also be used to judge whether an information extraction result meets the business requirement.

6. Business entity: an entity class in the relevant business field, used in defining the configuration file. In object-oriented terms, a named entity is an instance, and a business entity is the class to which the instance corresponds. For example, in the field of bidding, the domain entity classes include business entities such as bidding units, public recruitment under bidding, bidding project bidding sections, winning units, bidding agents and the like. For the business entity "bidding unit", a named entity is, for example, the name of the company.

Configuration file: the configuration file is used for performing keyword matching on each short sentence. It includes a theme related to the business requirement (namely, the theme to which the information to be extracted belongs) and one or more attribute names related to the theme; for each attribute name it also specifies, for example, the part of speech, labels, the type of the attribute value, and the like.

For example: the theme is: winning bid information;

attribute names: deal situation, purchase item, and evaluation result company.
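
For illustration only, such a configuration could be written down as a small data structure like the one below. The patent does not specify a file format; the key names (theme, aliases, part_of_speech, label, value_type) and the placeholder values are hypothetical, and only the theme and the three attribute names come from the example above.

```python
# Hypothetical in-memory form of a configuration file.
CONFIG = {
    "theme": {"name": "winning bid information", "aliases": []},
    "attributes": [
        {"name": "deal situation", "aliases": [], "part_of_speech": "noun",
         "label": None, "value_type": "text"},
        {"name": "purchase item", "aliases": [], "part_of_speech": "noun",
         "label": None, "value_type": "text"},
        {"name": "evaluation result company", "aliases": [], "part_of_speech": "noun",
         "label": None, "value_type": "organization name"},
    ],
}


def keywords_for(entry):
    """Keywords used to match a theme or attribute: its name plus any aliases."""
    return [entry["name"], *entry["aliases"]]
```

Keeping aliases next to each name lets the same keyword matching routine serve both theme matching and attribute name matching.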

9. Limiting and non-limiting titles: from an object-oriented perspective, attributes are always tied to an object. Therefore, when defining attribute names, it is not always necessary to define a complete name.

For example: bidding unit { name, address, telephone }, winning bid unit { name, address, telephone }. Although the attribute names in the bidding unit and the winning bid unit are identical here, the corresponding class name qualifies them, so the attribute names name, address and telephone are not ambiguous.

From another perspective, however, the semantics of the attribute names in the above example cannot be determined without the class name. Name, address and telephone are therefore all non-limiting names, that is, non-limiting titles. In other words, a non-limiting title must be combined with a contextual topic before its semantics can be determined.

By contrast, another kind of attribute name, for example { item number }, directly designates a business entity in the document corpus, and its semantics can be determined without additional information.

Put simply, non-limiting titles are meaningful only in conjunction with contextual topics, whereas limiting titles can be identified directly.

In addition, the body text of a document contains not only nouns but, in most cases, also verbs, adverbs, adjectives and other words that have little bearing on the result of information extraction. Moreover, documents are obtained from websites, software APPs and the like; in a web page, for example, the content is presented through various tags so that people can read it conveniently. These tags have nothing to do with the body text of the document. During information extraction, such words and tags all affect the extraction result and greatly reduce processing efficiency. In other words, the less irrelevant information the text content contains, the more efficient the information extraction and the more accurate the result.

Therefore, the electronic documents acquired by the crawler capturing mode need to be further processed, and the content irrelevant to the text content is removed, so that the processing complexity in information extraction is greatly reduced, the processing efficiency is improved, and the accuracy of the result is improved.

(II) Preprocessing part: the acquired electronic document is preprocessed to obtain its body text, and the body text is converted into a standard text document. Preprocessing removes most of the information irrelevant to the body text, which facilitates subsequent processing.

Generally, the data is a web page document obtained from a web page, a software APP or the like. To make reading and browsing easier for users, web page information is laid out and its text beautified by means of HTML tags, for example pictures, table-based typesetting, bold fonts, coloring, etc. In addition, by the nature of HTML, all text in an HTML web page must be placed inside HTML tags, so the content extracted from a web page includes not only the body text but also a large number of HTML tags. The HTML tags have nothing to do with the result of extracting information from the body text. Therefore, to facilitate subsequent processing, the electronic document is preprocessed to remove useless information such as HTML tags and obtain the body text.

Taking the case where the electronic document is an HTML document as an example:

1. Style and script tags, for example, leave content behind after the HTML document is converted to plain text, and this content interferes with the actual body text.

2. Table tags used for page layout and table tags of data tables that present data, for example, interfere with each other after being converted into plain text.

Therefore, before information extraction, interference information that is unrelated to the actual body text needs to be removed. Because electronic documents differ, and the types of interference information they contain differ as well, different algorithms are used to remove different types of interference information.

For example: the algorithm for clearing web page scripts and styles is mainly used for clearing the script and style tags of a web page; the algorithm for clearing layout table tags is mainly used for clearing table tags used for layout, so that they do not interfere with the table tags of real data tables; the algorithm for clearing empty content tags clears tags whose content is empty, since such blank content in an electronic document generally needs to be cleared along with its tags.

Of course, the algorithms for clearing interference information differ according to the source of the electronic document, and algorithms are added or removed according to the actual situation in order to eliminate the interference information.

The invention also illustrates the process of removing interference information:

For example, removing tags used for layout: script and style tags, empty content tags and other tag characters unrelated to the actual body text are cleared from the text content. In addition, since the text content may include the table tags of data tables, table tags used for layout must be distinguished from table tags of data tables that present tabular data when table tags are cleared, and the two must be processed separately.

Table tags for layout: some table tags may be present in a web page purely for layout. They usually contain complex cell combinations, and their content interferes with the subsequent conversion into Markdown text (a text document with lightweight markup) or PlainText text. Such tags should therefore be removed as far as possible during preprocessing so as not to interfere with subsequent processing.

Layout tables typically have one or more of the following characteristics: other tables are nested inside the table; the table uses merged cells marked by attributes such as colspan and rowspan; the number of cells per row in the table is not uniform; etc. Therefore, when removing the table tags of layout tables, these characteristics can be used to remove as many of the layout table tags as possible.
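
The three characteristics can be checked with a small heuristic such as the sketch below, which uses the HTMLParser class from the Python standard library. The class and function names are hypothetical, and treating any single characteristic as sufficient is an assumption; the text only says that layout tables typically show one or more of them.

```python
from html.parser import HTMLParser


class TableProfiler(HTMLParser):
    """Collect the layout-table cues for the outermost table in a fragment."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # current <table> nesting depth
        self.nested = False     # another table is nested inside
        self.merged = False     # colspan/rowspan is used
        self.row_sizes = []     # cell count of each row of the outer table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth > 1:
                self.nested = True
        elif self.depth == 1:
            if tag == "tr":
                self.row_sizes.append(0)
            elif tag in ("td", "th"):
                if self.row_sizes:
                    self.row_sizes[-1] += 1
                if {name for name, _ in attrs} & {"colspan", "rowspan"}:
                    self.merged = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def looks_like_layout_table(table_html):
    profiler = TableProfiler()
    profiler.feed(table_html)
    profiler.close()
    uneven_rows = len(set(profiler.row_sizes)) > 1
    return profiler.nested or profiler.merged or uneven_rows
```

A real data table that happens to use merged cells would also be flagged by this heuristic, so in practice the cues may need weighting or additional checks.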

In addition, since the rendering of a layout table follows the visual order from left to right and from top to bottom, a layout table can simply be converted in the following manner:

(1) taking out the table cell contents directly, i.e. removing the <td></td> tags and putting a comma in place of each original </td> to separate adjacent cells in a row; (2) removing the <tr></tr> tags and putting <br/> in place of each original </tr> to force a line break that represents a table row; (3) removing other tags such as <table> directly.
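
The three conversion rules can be sketched with regular expressions as below. This is a simplified illustration, not a robust HTML processor: it assumes reasonably well-formed markup, and handling th cells and tbody/thead/tfoot wrappers the same way as td and table is an added assumption.

```python
import re


def flatten_layout_table(html):
    """Apply rules (1) to (3) to the table markup in an HTML fragment."""
    # (1) drop opening cell tags; put a comma where each closing cell tag was
    html = re.sub(r"<t[dh]\b[^>]*>", "", html, flags=re.I)
    html = re.sub(r"</t[dh]\s*>", ",", html, flags=re.I)
    # (2) drop opening row tags; put <br/> where each closing row tag was
    html = re.sub(r"<tr\b[^>]*>", "", html, flags=re.I)
    html = re.sub(r"</tr\s*>", "<br/>", html, flags=re.I)
    # (3) remove the remaining table-level tags outright
    html = re.sub(r"</?(?:table|tbody|thead|tfoot)\b[^>]*>", "", html, flags=re.I)
    return html
```

Because a comma replaces every closing cell tag, each row ends with a trailing comma before the line break, which the later short sentence division simply discards as an empty fragment.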

In addition, the content inside the table tags of a data table differs from ordinary body text: if the table tags were simply removed as for body text and the table converted directly into plain text, the data in the data table would be scrambled. In HTML format the data in a data table is easier to divide and recognize, and converting it to plain text makes processing harder. However, a data table cannot be processed in isolation from the body text, because the content of the data table may also need context topic information, such as:

table one:

table two:

for the contents of the above table, the context is necessary to distinguish whether the information belongs to table one or table two.

Therefore, so that the cells of a data table can be identified and divided easily and the context information of the data table can be processed easily, data tables are processed as follows:

(1) before the text content is converted into plain text, the data tables are identified, all data tables in the text content are numbered, and specific identifiers <p>BEGIN_TABLE 201</p> … <p>END_TABLE</p> are added before the start tag and after the end tag of each data table, for example:

<p>BEGIN_TABLE 201</p><table>...</table><p>END_TABLE</p>

(2) each data table is parsed, a corresponding Table object is created, and the Table object is associated with the corresponding number.

For example, after the text content is converted into plain text, each data table is still surrounded by its special identifiers: start identifier: BEGIN_TABLE, end identifier: END_TABLE, number: 201.

A data table can therefore be identified easily in the subsequent text: only the number of the data table needs to be extracted to obtain the parsed result of the data table, and the table in text form does not need to be parsed.
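
A minimal sketch of the numbering and lookup idea follows. Several points are assumptions made for illustration: layout and nested tables are taken to have been removed already, so a non-greedy regular expression can find each data table; numbering starts at 201 only because the example uses that number; and the raw table markup stands in for the parsed Table object.

```python
import re

TABLE_RE = re.compile(r"<table\b.*?</table\s*>", re.I | re.S)


def mark_tables(html, first_number=201):
    """Wrap each data table in BEGIN_TABLE/END_TABLE identifiers and register it by number."""
    registry = {}

    def wrap(match):
        number = first_number + len(registry)
        registry[number] = match.group(0)   # stand-in for a parsed Table object
        return f"<p>BEGIN_TABLE {number}</p>{match.group(0)}<p>END_TABLE</p>"

    return TABLE_RE.sub(wrap, html), registry


def table_numbers(plain_text):
    """Recover the table numbers from text that still carries the identifiers."""
    return [int(n) for n in re.findall(r"BEGIN_TABLE\s+(\d+)", plain_text)]
```

After conversion to plain text, a number returned by table_numbers is enough to fetch the corresponding entry from the registry, so the flattened table text never has to be parsed again.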

After the interference information is removed through the algorithm, the text content with the interference information removed is converted into a standardized text document.

In addition, the invention also provides an implementation mode for converting the text content into a standardized text document, which specifically comprises the following steps:

First: the text in HTML format is converted into Markdown format, keeping the visual effect as consistent as possible. Markdown stores data as plain text and uses a small amount of markup, for formats such as titles, bold fonts and lists, to identify the corresponding data. Visually, therefore, the result stays close to the rendered HTML while the HTML tags are removed.

Then: simple identification and matching is performed on some of the formats in the Markdown text in order to divide the text further. For example, the text content may be divided into sections according to the article titles.

And finally: the Markdown text is further reduced to PlainText text, i.e. a standardized text document.
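
As a rough illustration of the last step, the sketch below strips a few common Markdown markers (heading markers, list bullets and double-asterisk bold) to leave plain text. It is an assumption-laden simplification rather than a full Markdown renderer, and the function name is hypothetical. Identifiers such as BEGIN_TABLE are left untouched.

```python
import re


def markdown_to_plain(markdown_text):
    """Strip a small subset of Markdown markup, line by line."""
    plain_lines = []
    for line in markdown_text.splitlines():
        line = re.sub(r"^\s{0,3}#{1,6}\s+", "", line)    # heading markers
        line = re.sub(r"^\s*[-*+]\s+", "", line)         # list bullets
        line = re.sub(r"\*\*(.+?)\*\*", r"\1", line)     # bold emphasis
        plain_lines.append(line.rstrip())
    return "\n".join(plain_lines)
```

Any division of the text by titles has to happen before this step, since the heading markers it relies on are removed here.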

In addition, following language conventions and everyday writing habits, an electronic document may contain several chapters, paragraphs and so on, and the ideas expressed in different chapters or paragraphs may be the same or different. Preprocessing therefore not only removes useless information unrelated to the actual body text, but also divides and merges the chapters or paragraphs of the electronic document accordingly. For example, a chapter merging algorithm, a chapter splitting algorithm, a block splitting algorithm and the like are combined to divide the article in the electronic document into several chapters, blocks and so on according to what each part expresses. When information is later extracted and matched, the corresponding chapters and blocks can then be searched and matched according to the topic to which the information to be extracted belongs, which raises the success rate of subsequent matching and greatly reduces wrong matches.

For example: the chapter splitting algorithm splits the text content into several chapters according to the main idea of each part of the text content in the electronic document.

The chapter merging algorithm merges two or more chapters according to the text content in the electronic document. Chapters can be merged and split using keywords: for example, based on the theme set in the configuration file defined by the business requirement, the whole text content is scanned and matched against the keywords related to the theme, so that chapters with the same theme are grouped together, chapters and paragraphs with different themes are split apart, and the results are labeled accordingly. Later theme matching can then be carried out directly on the relevant chapters, avoiding the large amount of useless work that full-text matching over the whole document would cause.
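
The keyword-based grouping described above can be sketched as follows. The function name, the dictionary of theme keywords and the rule that a chapter takes the first theme whose keyword it contains are assumptions made for illustration; chapters matching no theme are kept under the label None.

```python
def group_chapters_by_theme(chapters, theme_keywords):
    """Label each chapter with a theme and group chapters that share a label."""
    grouped = {}
    for chapter in chapters:
        label = next(
            (theme for theme, keywords in theme_keywords.items()
             if any(keyword in chapter for keyword in keywords)),
            None,  # no theme keyword found in this chapter
        )
        grouped.setdefault(label, []).append(chapter)
    return grouped
```

Later theme matching then only needs to look at grouped[theme] instead of scanning the whole document.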

Generally, attribute names rarely appear in isolation, so multiple attribute names are grouped into a topic by category. For example, the theme "bidding information" may include attribute names such as bidding unit, bidding amount, name of the bidding item, number of the bidding item and contact person, all of which belong to the theme of bidding information.

The block splitting algorithm splits the text content of the electronic document into blocks. For example, the content of each paragraph or chapter can be divided into several blocks, which avoids confusion when context information is matched. Blocks are also divided according to the topic.

For example, for bidding unit {name, address, telephone} and winning bid unit {name, address, telephone}, the same attribute names exist under different topics but refer to different things. That is, when a group of attribute names frequently appears together in one region of the corpus, the attribute names must be interpreted together with their context information. Such topics are also called contextual topics.

Therefore, to ensure correct information extraction, the text content of the electronic document must be divided into blocks, and the division must take the context information into account. How well the boundary of a contextual topic is identified has an obvious influence on the final result. For example, when bidding unit {name, address, telephone} and winning bid unit {name, address, telephone} are placed in two separate blocks, attribute matching can be carried out under the corresponding topic during topic matching without confusing the contexts.

However, it is not easy to determine the boundary of the context information within a whole article, and a contextual topic usually does not span very far. The text is therefore divided into one or more blocks according to features of the article structure, and the boundary of a contextual topic is limited to a single block, which makes the boundary of the contextual topic much more reasonable.

In general, the natural blocks of an article have the following characteristics:

a. A title can serve as a block boundary: a title line marks the end of the previous block and the beginning of a new block.

b. A clear visual separator, such as a blank line or a separation line, and especially several consecutive blank lines, marks the end of a block, which matches human reading and cognitive habits.

c. A table: a contextual topic may extend into the contents of a table but generally not beyond it, so the end of a table marks the end of a block. Because a special identifier is added to each data table before conversion to plain text, a new block is started after each data table when blocks are divided. After block division, therefore, a block either contains no data table or ends with a data table.
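The three block characteristics above can be sketched as a single pass over the lines of the preprocessed text. This is only an illustration under stated assumptions: the `[/TABLE]` end marker, the `#` title prefix, and the function names are hypothetical, since the source does not specify how titles or table identifiers look in the plain text.

```python
from typing import List

TABLE_END = "[/TABLE]"  # hypothetical identifier marking the end of a data table


def is_title(line: str) -> bool:
    # Assumption: title lines still carry a Markdown-style '#' prefix.
    return line.startswith("#")


def is_separator(line: str) -> bool:
    # A separation line such as '-----' or '====='.
    return len(line) >= 3 and set(line) <= set("-=_*")


def split_blocks(lines: List[str]) -> List[List[str]]:
    """Split lines into blocks at titles, blank or separator lines, and table ends."""
    blocks: List[List[str]] = []
    current: List[str] = []

    def flush() -> None:
        nonlocal current
        if current:
            blocks.append(current)
        current = []

    for line in lines:
        stripped = line.strip()
        if not stripped or is_separator(stripped):
            flush()                      # visual separator ends the block
        elif is_title(stripped):
            flush()                      # title ends the previous block ...
            current.append(stripped)     # ... and starts a new one
        else:
            current.append(stripped)
            if stripped == TABLE_END:
                flush()                  # the end of a table ends the block
    flush()
    return blocks
```

With this rule a block either contains no table or ends with the table end marker, which matches the property stated in item c.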

(III) Information extraction part: the preprocessed plain text is divided into several short sentences, information is extracted from each short sentence to obtain attribute pairs, and short sentence entities are obtained from the attribute pairs and the service requirements. A short sentence entity comprises attributes, attribute names, topics, parts of speech, analysis groups, and context information. An attribute pair comprises an attribute name and an attribute value.

The processing process of the invention during information extraction specifically comprises the following steps:

(1) Dividing short sentences

After preprocessing, a text is divided into several blocks, and because the blocks are divided by topic, each block represents a topic. Matching in the invention is keyword-based, so to improve matching efficiency the text content is first divided into short sentences. Short sentences are divided at division markers, which the invention defines as periods, commas, and colons. Other punctuation marks, such as the enumeration comma (、), are not used to divide short sentences. In addition, in Chinese writing the content inside parentheses usually explains the preceding word or sentence, so when parentheses are encountered, the content inside them is treated as a short-sentence branch and the content outside them as the short-sentence main chain.
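A minimal sketch of this division rule follows. It assumes that both full-width and ASCII forms of the period, comma, and colon count as division markers and that parentheses are not nested; the function name `split_short_sentences` is hypothetical.

```python
import re
from typing import List, Tuple

# Division markers: period, comma, colon (full-width and ASCII forms).
SPLIT_RE = re.compile(r"[。，：.,:]")
# Content inside one pair of (non-nested) parentheses.
PAREN_RE = re.compile(r"[（(]([^（()）]*)[)）]")


def split_short_sentences(text: str) -> Tuple[List[str], List[str]]:
    """Return (main chain, branches) for one piece of text."""
    # Parenthesized content becomes short-sentence branches.
    branches = [b.strip() for b in PAREN_RE.findall(text) if b.strip()]
    # Everything outside the parentheses forms the main chain.
    main_text = PAREN_RE.sub("", text)
    main = [s.strip() for s in SPLIT_RE.split(main_text) if s.strip()]
    return main, branches
```

Note that splitting on the ASCII period and comma would also break decimal numbers and thousands separators; a fuller implementation would need to protect those cases.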

(2) Information extraction

After short sentence division, information is extracted from each short sentence of the text content. What is extracted depends on the service requirement. The invention first sets the service entity class according to the service requirement and then sets the corresponding named entities, that is, the information finally to be extracted. For example, if the extraction requirement derived from the service requirement is the bidding unit, the corresponding named entities are the name of the bidding unit, the contact person of the bidding unit, the contact information of the bidding unit, and so on. The configuration file can then be defined from the named entities: it defines the topic to be extracted (bidding information) and, under that topic, the keywords of each attribute name to be extracted (such as organization name, contact person, and contact address), together with the aliases, labels, parts of speech, and similar keywords of each attribute name. For contact information, for example, the alias is defined as telephone number and the corresponding attribute value as a number. During extraction, the topic "bidding information" is identified first and all short sentences under that topic are found; each attribute name is then identified in those short sentences; finally, the attribute value corresponding to each attribute name is extracted.

When information is extracted from the short sentences, the topics to be extracted are first obtained from the configuration file, and a keyword matching algorithm finds all short sentences that correspond to each topic. All attribute names of the topic are then matched in the same way, and finally the attribute values are extracted from the context information according to the attribute names.

Topic matching: the topic of the information to be extracted is obtained from the configuration file, and each short sentence is queried against it. The query uses a keyword matching algorithm and finds all short sentences related to the topic.

For a data table, the corresponding topic is usually the topic of the short sentence that precedes the data table. In addition, because the blocks were already divided by topic, the blocks can first be matched against the topic and the short sentences then searched directly within the matching blocks, which greatly improves the efficiency of topic matching.

Matching attribute names:

After topic matching, all short sentences related to the topic are available. Each attribute name is obtained from the configuration file, and a keyword matching algorithm identifies which attribute name, if any, each short sentence corresponds to. That is, it can be determined whether a short sentence identifies an attribute name and which one. Not every short sentence necessarily corresponds to an attribute name.

Extracting attribute values:

For a short sentence in which an attribute name has been identified, the attribute value of that attribute name is extracted according to the attribute value extraction rule. Generally, the short sentence that follows it is extracted as the attribute value.

In addition, the configuration file also sets information such as the label of an attribute name and the type of its attribute value. For example, for the attribute name "contact information", it defines that the corresponding attribute value should contain a number. Therefore, after the following short sentence has been extracted as the attribute value, it must still be checked against the configuration file.

Furthermore, several candidate attribute values may be extracted for one attribute name, so a credibility measure is introduced into attribute value extraction.
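The three matching stages (topic, attribute name, attribute value) can be sketched as follows. The configuration layout, its key names, and the helper functions are assumptions made for illustration; the source does not give the actual format of the configuration file, and plain substring search stands in here for its keyword matching algorithm.

```python
import re
from typing import Dict, List, Optional

# Hypothetical configuration: topic keywords plus aliases and value type per attribute.
CONFIG = {
    "topic": "bidding information",
    "topic_keywords": ["tenderer", "bidding"],
    "attributes": {
        "contact information": {
            "aliases": ["telephone", "contact information"],
            "type": "number",
        },
        "contact person": {"aliases": ["contact person"], "type": "text"},
    },
}


def matches_topic(block: List[str], config: dict) -> bool:
    """Topic matching: does any topic keyword occur in the block?"""
    text = " ".join(block).lower()
    return any(keyword in text for keyword in config["topic_keywords"])


def match_attribute_name(sentence: str, attributes: dict) -> Optional[str]:
    """Attribute name matching: return the attribute whose alias occurs, if any."""
    lowered = sentence.lower()
    for name, spec in attributes.items():
        if any(alias in lowered for alias in spec["aliases"]):
            return name
    return None


def value_fits(value: str, value_type: str) -> bool:
    """Check the candidate value against the type set in the configuration."""
    if value_type == "number":
        return re.search(r"\d", value) is not None
    return bool(value.strip())


def extract_pairs(sentences: List[str], config: dict) -> Dict[str, str]:
    """Attribute value extraction: take the following short sentence as the value."""
    pairs: Dict[str, str] = {}
    for i, sentence in enumerate(sentences[:-1]):
        name = match_attribute_name(sentence, config["attributes"])
        if name is None:
            continue
        candidate = sentences[i + 1]
        if value_fits(candidate, config["attributes"][name]["type"]):
            pairs.setdefault(name, candidate)  # keep the first acceptable value
    return pairs
```

A caller would first test `matches_topic` on a block and only then run `extract_pairs` on the short sentences of that block.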

Here, credibility means expressing a judgment as a percentage that indicates the degree of confidence, rather than simply as true or false. For example, if a short sentence consists entirely of a company name, the credibility is 100% and the short sentence is considered to be an attribute value; if the short sentence contains no person name, place name, company name, or the like at all, the credibility is 0 and the short sentence is considered not to be an attribute value.

A matching credibility can also be given when attribute names are matched. For example, the two short sentences "winning bid unit name, amount, contact person, etc." and "contact person:" may both match the attribute name "contact person", but the latter match has a credibility of almost 100%, which is higher than that of the former.

Credibility:

Therefore, when attribute values are matched, credibility is introduced, the direction of the table is determined, the multiple matching results are de-duplicated, and the most probable result is extracted, which is clearer than existing matching logic.

In addition, the invention provides the following way of determining the credibility of an extracted attribute value:

A. When an attribute to be extracted is defined, an identifier for the range of its values is set, such as organization, person name, amount, or time. During extraction, semantic analysis determines whether the part of speech of the short sentence contains or equals that identifier; if not, the short sentence is not credible.

B. Whether the extracted value is really an attribute value is determined by using a dictionary and part-of-speech segmentation to judge whether the short sentence contains too much other information or is a title. A separate algorithm scores the value, and a low score means that the value is not credible. For example, when the value to be extracted is a money amount, a short sentence that is a place name or a business name is not credible, whereas a short sentence that is a number and contains a currency symbol, upper- or lower-case numerals, or the like has high credibility.

In addition, the invention also sets a threshold value for the credibility, and when the credibility of one or more attribute values obtained by matching does not exceed the threshold value, the attribute value searched by the attribute name is considered to be not in accordance with the system requirement.
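The scoring and threshold idea can be illustrated for money amounts with the sketch below. The individual scores, the length penalty, and the threshold of 0.6 are invented for illustration only; the source does not disclose concrete values or the scoring algorithm.

```python
import re
from typing import List, Optional

THRESHOLD = 0.6  # hypothetical credibility threshold


def amount_credibility(sentence: str) -> float:
    """Score between 0 and 1 for how likely a short sentence is a money amount."""
    if not re.search(r"\d", sentence):
        return 0.0                      # no digits at all: not an amount
    score = 0.5                         # a bare number is only moderately credible
    if re.search(r"[¥￥$]|RMB|yuan|元", sentence):
        score += 0.4                    # a currency marker raises the credibility
    # Too much other information lowers the credibility.
    extra = len(re.sub(r"[\d.,\s]", "", sentence))
    if extra > 20:
        score -= 0.3
    return max(0.0, min(1.0, score))


def select_value(candidates: List[str]) -> Optional[str]:
    """Return the most credible candidate, or None if none exceeds the threshold."""
    scored = [(amount_credibility(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda item: item[0], default=(0.0, None))
    return best if best_score > THRESHOLD else None
```

When `select_value` returns `None`, the caller could fall back to searching the remaining short sentences again or to the guessing rules described later.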

In addition, if after the short sentences have been matched an attribute name has no corresponding attribute value, or the value found does not meet the system requirement, the short sentences already searched can be excluded and the remaining short sentences of the text identified and searched again, until the credibility of the attribute value found exceeds the threshold. In this way each attribute name under the topic and its corresponding attribute value are obtained.

Named entity guessing:

After the short sentences have been matched, the attribute values of some attribute names may still not have been obtained. The invention therefore also guesses the attribute value of such an attribute name by introducing a named entity guessing method.

For example, winning bid unit: in general, if the tender is not divided into bid sections and the text has only one winning bid unit, the unit name that appears most often in the text is taken as the attribute value of the winning bid unit.

For another example, winning bid amount: if a value that is clearly a monetary amount, e.g. "RMB 12,000 yuan", appears in the full text of the document, that value can be taken as the attribute value of the winning bid amount. Alternatively, if several amounts appear and the largest of them is approximately the sum of the others (e.g., with an error of less than 5%), the largest amount is taken as the attribute value of the winning bid amount.
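The two guessing rules can be sketched as below. The sketch assumes the unit names and amounts have already been recognized elsewhere, and it measures the 5% error relative to the largest amount, which is one possible reading of the rule; the function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def guess_winning_unit(unit_mentions: List[str]) -> Optional[str]:
    """Guess the winning bid unit as the unit name mentioned most often."""
    if not unit_mentions:
        return None
    return Counter(unit_mentions).most_common(1)[0][0]


def guess_bid_amount(amounts: List[float], tolerance: float = 0.05) -> Optional[float]:
    """Guess the winning bid amount from the monetary amounts found in the text."""
    if not amounts:
        return None
    if len(amounts) == 1:
        return amounts[0]               # a single clear amount is taken directly
    largest = max(amounts)
    rest = sum(amounts) - largest       # sum of all the other amounts
    # The largest amount counts as the total if it is close to the sum of the rest.
    if largest > 0 and abs(largest - rest) / largest < tolerance:
        return largest
    return None
```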

Visual distance:

Owing to the characteristics of Chinese writing, an attribute name and its attribute value are generally described in the same paragraph or the same block. If, during attribute value matching, the extracted attribute name and the extracted attribute value are found at different positions in the text and far apart, the correlation between them is low, and the extracted value is not the attribute value corresponding to the attribute name. The invention therefore introduces the visual distance, which represents how far apart two text fragments appear to a reader and describes the degree of correlation between the two pieces of text. That is, content that is visually close together, with a small visual distance, is also related in meaning.

The invention also briefly provides a method for calculating the visual distance, which comprises the following steps:

(1) When the two parts are in the same row, the number of characters between the end of the first part and the beginning of the second part is the visual distance between them.

(2) If the two parts are not in the same row, the visual distance is a weighted average of the row difference and the column difference.

(3) A data table is not identified directly. However, when a value is taken from a data table, the attribute name corresponding to an attribute value should be one or more rows before the attribute value, and the two are basically aligned in the same column. Using the concept of visual distance, the visual distance between the attribute name and the attribute value is calculated, and when it is within an acceptable range, the two can be considered a key-value pair.

The formula for calculating the visual distance is not the inventive point of the present invention; it belongs to conventional text-processing technology, so only a brief introduction is given here.
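A minimal sketch of rules (1) and (2) follows. The `Span` type, the use of start columns for the column difference, and the weights 0.7 and 0.3 are assumptions; the source does not give the weights or the exact formula.

```python
from dataclasses import dataclass


@dataclass
class Span:
    row: int    # line number of the text fragment
    start: int  # column of its first character
    end: int    # column just past its last character


def visual_distance(a: Span, b: Span,
                    row_weight: float = 0.7, col_weight: float = 0.3) -> float:
    """Visual distance between two text fragments (weights are hypothetical)."""
    if a.row == b.row:
        # Rule (1): characters between the end of the first and the start of the second.
        first, second = (a, b) if a.start <= b.start else (b, a)
        return float(max(0, second.start - first.end))
    # Rule (2): weighted average of the row difference and the column difference.
    row_diff = abs(a.row - b.row)
    col_diff = abs(a.start - b.start)
    return (row_weight * row_diff + col_weight * col_diff) / (row_weight + col_weight)
```

For rule (3), a table header and a cell below it in the same column have a column difference of zero, so their distance depends only on how many rows separate them.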

Named entity extension:

Named entities identified by keyword matching are often incomplete. For example, "third division of Zhongtong construction Co., Ltd" is recognized only as "Zhongtong construction Co., Ltd". The invention therefore also introduces named entity extension into the processes of topic matching, attribute name matching, and attribute value matching. That is, the name of a named entity is extended to a longer and more complete form through simple bonded noun phrases. A bonded noun phrase is a sequence of lexically consecutive nouns linked by a modifying or defining relationship, from which a longer noun can be formed.

For example, after extension, the complete company name of "Zhongtong construction company, ltd, third division", including the division name, can be identified.

For another example, a query algorithm that matches "winning bid unit" typically identifies only standard unit names such as "XXX corporation" or "XXX unit". For further matching, such as matching a division, matching items must be added through data customization. For example, if an entry from the configured dictionary of branch company names is found directly after "Zhongtong construction limited company", the entity can be extended to include the branch name.
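The dictionary-based extension can be sketched as follows. The suffix dictionary `BRANCH_SUFFIXES` and the function `extend_entity` are hypothetical, and the sketch assumes that the branch name directly follows the recognized company name, as it does in Chinese company names.

```python
from typing import List

# Hypothetical dictionary of branch names that may follow a company name.
BRANCH_SUFFIXES: List[str] = ["third division", "branch"]


def extend_entity(text: str, entity: str,
                  suffixes: List[str] = BRANCH_SUFFIXES) -> str:
    """Extend a recognized entity with a dictionary suffix that directly follows it."""
    start = text.find(entity)
    if start < 0:
        return entity                   # entity not present: nothing to extend
    end = start + len(entity)
    rest = text[end:]
    stripped = rest.lstrip(" ,")        # skip spaces and commas before the suffix
    # Try longer suffixes first so the most complete name wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if stripped.startswith(suffix):
            consumed = len(rest) - len(stripped) + len(suffix)
            return text[start:end + consumed]
    return entity
```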

In addition, named entities such as addresses, person names, place names, and telephone numbers can interfere with identifying and matching attribute names. The named entities are therefore removed first, and the attribute names are identified and matched afterwards. This greatly simplifies the text content without affecting the information structure of the text, helps match the attributes to be extracted, and makes it possible to judge whether the extracted result is correct.

For example, take the following text content:

fifth, contact information

The tenderer: ******

Address: ******

The contact person: ******

Telephone: ******

E-mail: ******

The bidding agency: ******

Address: ******

The contact person: ******

Telephone: ******

E-mail: ******

The named entities are removed to obtain the following:

fifth, contact information

The tenderer: x

Address: x

The contact person: x

Telephone: x

E-mail: x

The bidding agency: x

Address: x

The contact person: x

Telephone: x

E-mail: x

After the named entities are removed, the text structure is unchanged and the text content is more concise, which helps identify and match the attribute names. That is, before the attribute names are matched, the text content can be simplified further according to the service entities defined in the service requirements.

In addition, the invention performs word segmentation on the text from which the named entities have been removed, so that subsequent matching works on words rather than on plain character strings and achieves an approximately semantic matching effect. For example, the sentence "the winning company is Zhongtong construction Limited company" is segmented as: the winning bid/v company/n is/h Zhongtong construction Limited company/nh. With word segmentation, the corresponding information can be extracted fairly accurately and the article can be understood to a certain degree. Word segmentation can also be applied to attribute value matching, where the credibility of an attribute value is calculated from the parts of speech of the segmented words.
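The removal step can be sketched as a simple masking pass. This illustration assumes that some named entities are already known from an earlier recognition step and that telephone numbers and e-mail addresses can be caught with rough regular expressions; the patterns, the placeholder `x`, and the function name are hypothetical.

```python
import re
from typing import Iterable

# Rough, illustrative patterns; real recognition would be more careful.
PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),  # e-mail address
    re.compile(r"\d[\d\- ]{5,}\d"),               # telephone number
]


def mask_entities(text: str, known_entities: Iterable[str] = (),
                  placeholder: str = "x") -> str:
    """Replace named entities with a placeholder while keeping the text structure."""
    # Replace longer entities first so shorter ones do not split them.
    for entity in sorted(known_entities, key=len, reverse=True):
        text = text.replace(entity, placeholder)
    for pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Line breaks and the labels before each colon are left untouched, so the attribute names remain in place for matching.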

(IV) Normalization part:

Matching and extraction on the short sentences finally yield one or more attributes, each comprising an attribute name and an attribute value. To facilitate later use of large volumes of data, the invention also stores the extracted results in the corresponding database table as data table fields according to a normalization method. The database table in which an attribute name and its attribute value are stored is determined by the specific application scenario. In addition, since an attribute name may have several aliases, attribute values can be stored in different data tables according to the alias of the attribute name.
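A minimal sketch of normalization and storage follows, using Python's standard `sqlite3` module. The alias mapping, the table name `bidding_info`, and its columns are hypothetical; for brevity the sketch maps every alias to a field of one table, whereas the text above also allows different tables per alias.

```python
import sqlite3
from typing import Dict

# Hypothetical mapping from attribute names and aliases to data table fields.
ALIAS_TO_FIELD = {
    "telephone": "contact_info",
    "contact information": "contact_info",
    "contact person": "contact_person",
}


def normalize(pairs: Dict[str, str]) -> Dict[str, str]:
    """Map extracted attribute names (or aliases) to normalized field names."""
    record: Dict[str, str] = {}
    for name, value in pairs.items():
        field = ALIAS_TO_FIELD.get(name.strip().lower())
        if field:
            record.setdefault(field, value.strip())
    return record


def store(conn: sqlite3.Connection, record: Dict[str, str]) -> None:
    """Store one normalized record as a row of the (hypothetical) bidding_info table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bidding_info (contact_person TEXT, contact_info TEXT)"
    )
    conn.execute(
        "INSERT INTO bidding_info (contact_person, contact_info) VALUES (?, ?)",
        (record.get("contact_person"), record.get("contact_info")),
    )
    conn.commit()
```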

Example two:

The invention provides a data extraction method of electronic document information, which specifically comprises the following steps:

Further, the matching step specifically includes:

Further, when the attribute names are matched, word segmentation processing is carried out on the text content.

Further, the acquiring step: removing information unrelated to the body content from the acquired electronic document information. The information unrelated to the body content includes but is not limited to headers, footers, and web page advertisements.

Example three:

The invention also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the data extraction method of electronic document information described herein.

Example four:

The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the data extraction method of electronic document information described herein.

The above embodiments are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby, and any insubstantial changes and substitutions made by those skilled in the art based on the present invention are within the protection scope of the present invention.

Claims

1. A data extraction method of electronic document information is characterized in that: the method comprises the following steps:

2. The method for extracting data of electronic document information according to claim 1, wherein: the method further comprises a normalization step: carrying out normalization processing on the attribute names and the attribute values according to the service requirements, and storing the attribute names and the attribute values in a corresponding data table in the form of data table fields.

3. The method for extracting data of electronic document information according to claim 1, wherein: the matching step specifically comprises:

4. The method for extracting data of electronic document information according to claim 3, wherein: when the attribute values are matched, calculating the credibility of each attribute value obtained by each attribute name according to a preset rule, and screening one or more attribute values corresponding to each attribute name according to the credibility.

5. The method for extracting data of electronic document information according to claim 1, wherein: the preprocessing step specifically comprises: firstly, removing the tags used for layout in the text content, then converting the text content into a Markdown text, and finally converting the Markdown text into a plain text.

6. The method for extracting data of electronic document information according to claim 1, wherein: when the body content includes a data table, before converting to a standard formatted plain text, the method further comprises: firstly, numbering all data tables in text content; and adding specific identifiers before and after the start tag and the end tag of each data Table, creating a Table object for the corresponding data Table, and associating the specific identifier and the Table object of each data Table with the number of the corresponding data Table.

7. The method for extracting data of electronic document information according to claim 1, wherein: before the step of dividing the short sentences, the method further comprises: firstly, confirming the topic of the information to be extracted according to a configuration file, and dividing the text content into a plurality of paragraphs according to the topic; then dividing the text content into a plurality of blocks according to block characteristics; wherein the block characteristics include: a title line marks the boundary of a block, a blank line or a dividing line marks the end of a block, and the end of a data table marks the end of a block.

8. The data extraction method of electronic document information according to claim 1, characterized in that: the short sentence dividing step is as follows: dividing the text content into a plurality of short sentences according to the short sentence separators; wherein the phrase separator includes a period, a comma, and a colon.

9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the processor, when executing the program, performs the steps of a method for extracting data of electronic document information as claimed in any one of claims 1 to 8.

10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program, when being executed by a processor, realizes the steps of a method for data extraction of electronic document information as claimed in any one of claims 1 to 8.