CN113743076A

CN113743076A - Data extraction method and system

Info

Publication number: CN113743076A
Application number: CN202111309972.5A
Authority: CN
Inventors: 于斌; 汤华; 贾晓光; 李圣亮; 寇志刚
Original assignee: Zhongguancun Technology Software Co ltd
Current assignee: Zhongguancun Technology Software Co ltd
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2021-12-03

Abstract

The invention discloses a data processing method and a system, wherein the method comprises the following steps: preprocessing a sample document uploaded by a user to obtain a visual page document, wherein the sample document is of a Word document type; selecting contents to be extracted from a visual page document through mouse dragging according to the selection of a user, judging the context relevance of the contents according to the extracted contents, and generating an extraction rule template; and according to the generated extraction rule template, performing batch data extraction on the uploaded documents with the same structure or similar structures, and storing the extracted data in a structured database according to a preset data corresponding relation. According to the method and the device, the document data are selected by dragging with the mouse, and the corresponding extraction rule is generated according to the relevance of the context of the selected document data and the characteristic keywords, so that the extraction rule can process documents with the same structure or similar structures in batch, and the processing efficiency of the document data is greatly improved.

Description

Data extraction method and system

Technical Field

The invention relates to the technical field of data processing, in particular to a data extraction method and a data extraction system, and more particularly to a method and a system for extracting Word document data based on visual mouse-drag selection.

Background

Word documents are widely used in traditional office work for statistical data and file reporting. When structured data and text content must be compiled from a large number of such documents, opening each document and copying and pasting the data by hand is laborious and time-consuming.

Business or government documents often follow a fixed structure or template, and some office assistant systems provide custom-developed batch entry for a particular document. Such a system is not general-purpose, however: it must be re-customized for each new Word document structure, and it does not meet users' needs for operability.

An effective solution to the problems in the related art has not been proposed yet.

Disclosure of Invention

In order to solve the above problems in the related art, the present invention provides a data processing method and system.

The technical scheme of the invention is realized as follows:

according to an aspect of the present invention, a data processing method is provided.

The data processing method comprises the following steps:

preprocessing a sample document uploaded by a user to obtain a visual page document;

selecting contents to be extracted from a visual page document through mouse dragging according to the selection of a user, judging the context relevance of the contents according to the extracted contents, and generating an extraction rule template;

and according to the generated extraction rule template, performing batch data extraction on the uploaded documents with the same structure or similar structures, and storing the extracted data in a structured database according to a preset data corresponding relation.

Preprocessing a sample document uploaded by a user to obtain a visual page document comprises: receiving or selecting a sample document uploaded by the user, and converting the sample document into a page document in HTML format to obtain the visual page document.

Selecting the content to be extracted in the visual page document through mouse dragging according to the selection of the user comprises: in the visual page document, triggering a click event from the mouse click start position, determining the position of the node containing the clicked text, and recording that node as the start node; triggering an end event when the mouse, after being dragged, is released, determining the position of the node at the mouse end position, and recording that node as the end node; and determining the content between the start node and the end node as the content to be extracted.

Judging the context relevance of the content according to the extracted content and generating an extraction rule template comprises: determining, according to the extracted content, the sibling paragraph tags and the parent paragraph tag corresponding to the selected content, and searching the sibling paragraph tags and the parent paragraph tag for paragraph identifiers from a preset lexicon, wherein the paragraph identifiers comprise paragraph start identifiers and paragraph end identifiers; when the search finds no paragraph identifier, using the beginning character and the ending character of the extracted content as tags to form a segment tag, and using the segment tag as the extraction rule; and when the search finds a paragraph identifier, using the paragraph start identifier and the paragraph end identifier as the extraction rule.

In addition, the data processing method further includes: determining feature characters related to the extracted content according to the extracted content, and taking the feature characters as extracted feature keywords; determining feature keyword extraction elements according to the extracted feature keywords, and combining the feature keyword extraction elements to form an extraction rule; wherein the keyword extraction element comprises at least one of: the starting position of the feature keyword, the ending position of the feature keyword, whether the feature keyword is contained, the number of matched feature keywords, and the sequence position.

The configuration mode of the pre-configured data corresponding relation comprises the following steps: reading table field information in a database, selecting corresponding document extracted information fields for each table field, and generating a one-to-one configuration relationship; wherein the table field information includes: field name information, field type information, and/or field length information.

According to another aspect of the invention, a data processing system is provided.

The data processing system includes:

the preprocessing module is used for preprocessing the sample document uploaded by the user to obtain a visual page document;

the extraction rule generation module is used for selecting contents to be extracted in the visual page document through mouse dragging according to the selection of a user, judging the context relevance of the contents according to the extracted contents and generating an extraction rule template;

the batch extraction module is used for extracting batch data of uploaded documents with the same structure or similar structures according to the generated extraction rule template;

and the storage module is used for storing the extracted data into the structured database according to the preset data corresponding relation.

When preprocessing the sample document uploaded by the user to obtain the visual page document, the preprocessing module receives or selects the sample document uploaded by the user and converts it into a page document in HTML format to obtain the visual page document.

When the content to be extracted is selected in the visual page document by mouse dragging according to the selection of a user, the extraction rule generation module triggers a click event from a mouse click start position in the visual page document, judges the position of a node where the clicked character is located, and records the node as a start node; after the mouse is dragged until the mouse is released, triggering an end event, judging the position of a node where the end position of the mouse is located, and recording the position as an end node; determining the content between the starting node and the ending node as the content to be extracted;

when the extraction rule generation module judges the context relevance of the content according to the extracted content and generates an extraction rule template, it determines, according to the extracted content, the sibling paragraph tags and the parent paragraph tag corresponding to the selected content, and searches the sibling paragraph tags and the parent paragraph tag for paragraph identifiers from a preset lexicon, wherein the paragraph identifiers comprise paragraph start identifiers and paragraph end identifiers; when the search finds no paragraph identifier, the beginning character and the ending character of the extracted content are used as tags to form a segment tag, and the segment tag is used as the extraction rule; when the search finds a paragraph identifier, the paragraph start identifier and the paragraph end identifier are used as the extraction rule;

the extraction rule generation module determines the characteristic characters related to the extracted content according to the extracted content and takes the characteristic characters as extracted characteristic keywords; determining feature keyword extraction elements according to the extracted feature keywords, and combining the feature keyword extraction elements to form an extraction rule; wherein the keyword extraction element comprises at least one of: the starting position of the feature keyword, the ending position of the feature keyword, whether the feature keyword is contained, the number of matched feature keywords, and the sequence position.

The configuration mode of the data corresponding relation pre-configured in the storage module comprises the following steps: reading table field information in a database, selecting corresponding document extracted information fields for each table field, and generating a one-to-one configuration relationship; wherein the table field information includes: field name information, field type information, and/or field length information.

Advantageous effects: the invention selects document data by mouse dragging and generates the corresponding extraction rule according to the context relevance of the selected document data and the feature keywords, so that the extraction rule can process documents with the same or similar structure in batches, which greatly improves the efficiency of processing document data.

In addition, because the extraction rule is generated from the document data selected by mouse dragging and from the relevance of that data, it can be generated conveniently and quickly from the uploaded document. This greatly reduces the time developers and business personnel spend extracting key content from documents, and the low-code approach lowers the learning cost. Batch processing further improves operating efficiency and allows one-click processing of large batches of documents.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow diagram of a data processing method according to an embodiment of the invention;

FIG. 2 is a block diagram of a data processing system according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of data extraction based on a word document according to an embodiment of the present invention;

FIG. 4 is a logical illustration of a drag portion of a visualization according to an embodiment of the invention;

FIG. 5 is a schematic diagram of batch execution of extraction information according to an embodiment of the invention;

FIG. 6 is a schematic diagram of the logic for extracting matches during data storage according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.

According to an embodiment of the present invention, there is provided a data processing method.

As shown in fig. 1, a data processing method according to an embodiment of the present invention includes:

step S101, preprocessing a sample document uploaded by a user to obtain a visual page document;

step S103, selecting contents to be extracted in a visual page document through mouse dragging according to the selection of a user, judging the context relevance of the contents according to the extracted contents, and generating an extraction rule template;

and step S105, performing batch data extraction on the uploaded documents with the same or similar structure according to the generated extraction rule template, and storing the extracted data in a structured database according to a preset data corresponding relation.

According to an embodiment of the present invention, a data processing system is provided.

As shown in fig. 2, a data processing system according to an embodiment of the present invention includes:

the preprocessing module 201 is configured to preprocess the sample document uploaded by the user to obtain a visual page document;

the extraction rule generation module 203 is used for selecting contents to be extracted from the visual page document through mouse dragging according to the selection of the user, judging the context relevance of the contents according to the extracted contents and generating an extraction rule template;

the batch extraction module 205 is configured to perform batch data extraction on uploaded documents with the same structure or similar structures according to the generated extraction rule template;

the storage module 207 is configured to store the extracted data in the structured database according to a pre-configured data correspondence.

When the preprocessing module 201 preprocesses the sample document uploaded by the user to obtain the visual page document, the sample document uploaded by the user is received or selected, and the sample document is converted into the page document in the HTML format to obtain the visual page document.

When content to be extracted is selected in a visual page document by mouse dragging according to the selection of a user, the extraction rule generation module 203 triggers a click event from a mouse click start position in the visual page document, judges the position of a node where a clicked character is located, and records the node as a start node; after the mouse is dragged until the mouse is released, triggering an end event, judging the position of a node where the end position of the mouse is located, and recording the position as an end node; determining the content between the starting node and the ending node as the content to be extracted;

when the extraction rule generation module 203 judges the context relevance of the content according to the extracted content and generates an extraction rule template, it determines, according to the extracted content, the sibling paragraph tags and the parent paragraph tag corresponding to the selected content, and searches the sibling paragraph tags and the parent paragraph tag for paragraph identifiers from a preset lexicon, wherein the paragraph identifiers include paragraph start identifiers and paragraph end identifiers; when the search finds no paragraph identifier, the beginning character and the ending character of the extracted content are used as tags to form a segment tag, and the segment tag is used as the extraction rule; when the search finds a paragraph identifier, the paragraph start identifier and the paragraph end identifier are used as the extraction rule;

the extraction rule generating module 203 determines feature words related to the extracted content according to the extracted content, and uses the feature words as extracted feature keywords; determining feature keyword extraction elements according to the extracted feature keywords, and combining the feature keyword extraction elements to form an extraction rule; wherein the keyword extraction element comprises at least one of: the starting position of the feature keyword, the ending position of the feature keyword, whether the feature keyword is contained, the number of matched feature keywords, and the sequence position.

The configuration mode of the data corresponding relationship configured in advance in the storage module 207 includes: reading table field information in a database, selecting corresponding document extracted information fields for each table field, and generating a one-to-one configuration relationship; wherein the table field information includes: field name information, field type information, and/or field length information.

For convenience of understanding the technical solutions of the present invention, the following takes a word document as an example to describe the technical solutions of the present invention in detail.

As shown in FIGS. 3-6, the processing of a Word document according to the present invention can be divided into the following parts:

1. Document visualization: the user uploads a sample document, and the sample Word document is converted into HTML format, so that a document that could otherwise be viewed only in office software can be displayed, with its styles, in an ordinary web page.

2. Extraction rule editing by mouse dragging: the converted HTML is displayed on a page so that the user can freely select the content to be extracted; the context association of the content is judged to generate an extraction rule, and the extraction rule is saved to a database.

3. Batch execution: Word documents with the same or similar structure are uploaded, and the batch execution module executes the extraction rules, applying the configured conditions in sequence as a logical AND and stripping the document layer by layer to locate the content to be extracted in each document.

4. Data entry: the extracted content is written into the fields of the database table according to the warehousing rule.
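The document visualization part can be illustrated with a minimal sketch. The embodiment does not name a converter, so the following is only an assumption: it treats the sample as a .docx file (a ZIP archive whose main part is word/document.xml), keeps paragraph text only, and discards styles. The function name docx_to_html is hypothetical.

```python
import html
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p (paragraph) and w:t (text) elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_html(docx_bytes: bytes) -> str:
    """Convert the paragraphs of a .docx file into a bare HTML page (text only)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more w:t elements.
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append(f"<p>{html.escape(text)}</p>")
    return "<html><body>" + "".join(paragraphs) + "</body></html>"
```

A production converter would also carry over tables, numbering, and styles, which this sketch deliberately omits.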

When the text or paragraph to be extracted is selected with the mouse, a click event is triggered at the mouse click start position, the DOM node containing the clicked text is determined, and that node is recorded as the start node. The mouse is then dragged until it is released; the DOM node at the position where the mouse is released is determined and recorded as the end node.
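The relation between the recorded start node, the recorded end node, and the selected content can be sketched as follows. The sketch assumes the page text has already been flattened into a list of text nodes in document order and that each recorded position is a (node index, character offset) pair; the names Position and selected_text are hypothetical and not part of the embodiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Position:
    node_index: int  # index of the text node in document order
    offset: int      # character offset inside that node


def selected_text(nodes: List[str], start: Position, end: Position) -> str:
    """Return the text between the start position and the end position."""
    # Swap the positions so that a right-to-left drag gives the same result.
    if (end.node_index, end.offset) < (start.node_index, start.offset):
        start, end = end, start
    if start.node_index == end.node_index:
        return nodes[start.node_index][start.offset:end.offset]
    parts = [nodes[start.node_index][start.offset:]]          # tail of the start node
    parts.extend(nodes[start.node_index + 1:end.node_index])  # whole nodes in between
    parts.append(nodes[end.node_index][:end.offset])          # head of the end node
    return "".join(parts)
```

In a browser the positions would come from the mouse press and release events; here they are passed in directly.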

When the context features of the selected text are determined, the feature conditions are divided into a start condition and an end condition. Each start condition and each end condition can specify which occurrence of the keyword is to be matched, which gives a unique match when the same keyword appears more than once in a paragraph. In addition, an option is provided for whether to include the start and end keywords, which tells the parser at parsing time whether the keywords before and after the content are kept in the extracted information.

After the selection is made by mouse dragging, the program automatically judges the context of the selected information, specifically as follows:
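A start condition and an end condition of this kind can be sketched as below. The sketch rests on two assumptions that the text does not state: occurrences are counted from 1, and the end keyword is searched only after the matched start keyword. The function names find_nth and extract_between are hypothetical.

```python
from typing import Optional


def find_nth(text: str, keyword: str, n: int, start: int = 0) -> int:
    """Index of the n-th (1-based) occurrence of keyword at or after start, or -1."""
    pos = start - 1
    for _ in range(n):
        pos = text.find(keyword, pos + 1)
        if pos == -1:
            return -1
    return pos


def extract_between(text: str, start_kw: str, end_kw: str,
                    start_n: int = 1, end_n: int = 1,
                    include_start: bool = False,
                    include_end: bool = False) -> Optional[str]:
    """Extract the content delimited by the chosen occurrences of two keywords."""
    s = find_nth(text, start_kw, start_n)
    if s == -1:
        return None
    content_from = s + len(start_kw)
    # Assumption: the end keyword is counted from just after the start keyword.
    e = find_nth(text, end_kw, end_n, content_from)
    if e == -1:
        return None
    begin = s if include_start else content_from
    finish = e + len(end_kw) if include_end else e
    return text[begin:finish]
```

For instance, with the text "No.: A-17; Name: Li; No.: B-42; Name: Wang;", matching the second occurrence of "No.: " and ending at the next ";" selects "B-42" rather than "A-17".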

Determining paragraph identifiers: the sibling tags and the parent tag of the selected text are searched for key paragraph identifiers from a preset lexicon (to which common paragraph start identifiers have been added). Key paragraph identifiers include paragraph start identifiers such as "(1)" or "one", and paragraph end identifiers such as the <p> tag. When a paragraph identifier is recorded in the rule and the end is the end of the paragraph, the end identifier is not enabled by default, and the parser automatically takes the content from the start character to the end of the paragraph. If no such obvious identifier exists, the beginning and ending characters of the selected text are used as the paragraph rule.

Rule for a single field: the start condition and the end condition form two groups of matching items, six items in total: which feature characters the content starts or ends with, which occurrence of the same feature characters within the paragraph is matched, and whether the feature characters themselves are included. These items are combined into the extraction rule for the field.
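The choice between a paragraph rule and a rule built from the selected text itself can be sketched as follows. The lexicon entries, the dictionary layout of the returned rule, and the name build_paragraph_rule are all illustrative assumptions; the embodiment does not list the contents of the preset lexicon.

```python
from typing import Dict, Iterable

# Hypothetical sample of paragraph start identifiers in the preset lexicon.
PRESET_LEXICON = ("(1)", "(2)", "1.", "2.")


def build_paragraph_rule(selected: str, sibling_texts: Iterable[str],
                         parent_text: str) -> Dict[str, str]:
    """Pick a paragraph rule if a lexicon identifier is found, else a segment rule."""
    if not selected:
        raise ValueError("selected text must not be empty")
    for text in [*sibling_texts, parent_text]:
        stripped = text.lstrip()
        for identifier in PRESET_LEXICON:
            if stripped.startswith(identifier):
                # Identifier found: start at it and run to the end of the paragraph.
                return {"type": "paragraph", "start": identifier, "end": "paragraph_end"}
    # No identifier found: fall back to the first and last characters of the selection.
    return {"type": "segment", "start": selected[0], "end": selected[-1]}
```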

For example, the following coding form is adopted:

{
    extracted field name,
    start condition:
    {
        which character it starts with,
        which occurrence to match,
        whether to include it
    },
    end condition:
    {
        which character it ends with,
        which occurrence to match,
        whether to include it
    }
}

Finally, the JSON extraction rules of all fields, generated from the tag rules found above, are combined according to the start-end pairing principle to form the information extraction rule.

For batch execution, the user first uploads multiple Word documents with the same or similar structure. The back end stores the documents on the server in directories split by task, converts them into HTML format, and stores the relative paths of the HTML files.

The back end reads the JSON extraction rules generated by visual mouse dragging and parses each Word document in turn. Each extracted field has a pair of rules, a start mark rule and an end mark rule; three kinds of start marks and four kinds of end marks are defined. The rules also carry a flag indicating whether the mark itself is included.
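One possible concrete encoding of the coding form above is sketched here. The key names (field_name, keyword, occurrence, include) are assumptions chosen for illustration; the text specifies only the meaning of each item, not its spelling.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Condition:
    keyword: str           # which characters the content starts or ends with
    occurrence: int = 1    # which occurrence of the keyword to match
    include: bool = False  # whether the keyword is kept in the extracted content


@dataclass
class FieldRule:
    field_name: str
    start: Condition
    end: Condition


def rules_to_json(rules: List[FieldRule]) -> str:
    """Combine the per-field rules into one JSON extraction rule document."""
    # asdict converts the nested Condition objects into plain dictionaries.
    return json.dumps([asdict(rule) for rule in rules], ensure_ascii=False)
```

Each field keeps its start condition and end condition together, which reflects the pairing of start and end described above.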

Take as an example a rule that starts at given text (not included) and ends at the end of the paragraph. The start mark is read first, the node containing the start mark is located in the whole document, and the node is recorded. Parent nodes are then searched upward from the current node until the paragraph end tag is found, which yields the content of the whole paragraph. The start mark from the rule is then located within the paragraph content, and the content from the start mark onward is taken. Finally, the rule's inclusion flag is checked and, because the start mark is not included in this example, the start mark is removed from the taken content.
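The example above can be sketched with the standard html.parser module. As a simplifying assumption, the sketch collects the text of every <p> element up front rather than walking upward from the node that contains the start mark; for paragraphs that are not nested the result is the same. The names ParagraphCollector and extract_to_paragraph_end are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional


class ParagraphCollector(HTMLParser):
    """Collect the plain text of each <p> element, ignoring inline tags."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._buffer: Optional[List[str]] = None  # None while outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None


def extract_to_paragraph_end(html_doc: str, start_mark: str,
                             include_start: bool = False) -> Optional[str]:
    """Return the text from the start mark to the end of its paragraph."""
    collector = ParagraphCollector()
    collector.feed(html_doc)
    collector.close()
    for paragraph in collector.paragraphs:
        index = paragraph.find(start_mark)
        if index != -1:
            # Drop the start mark itself unless the rule says to include it.
            return paragraph[index:] if include_start else paragraph[index + len(start_mark):]
    return None
```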

For the storage part, an interface for configuring the correspondence between extracted fields and database fields is provided before batch execution. The configuration proceeds as follows: the field information of the target database table (including field names, field types, field lengths and the like) is read; for each table field, the corresponding extracted document field is selected from a drop-down list; and the resulting one-to-one configuration is stored in the batch execution rule. After the content is extracted, the correspondence between extracted fields and database fields is read from the batch execution rule, and each extracted value is stored in its corresponding database field.

Storage and warehousing are important parts of data circulation, data structuring, and the practical use of data. By configuring the correspondence between extracted fields and database fields, the data can be transferred to any database table; all or only some of the extracted fields can be mapped, and functions such as configuring fixed content and filling blank fields are supported. During warehousing, cases in which the text does not fit the length or format of the corresponding field are detected, the state of the warehousing process is judged, and abnormal records are logged so that the execution can be traced.
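The configuration step can be sketched with SQLite standing in for the structured database, which is an assumption; the embodiment does not name a database product. The helper names read_table_fields and build_mapping are hypothetical, and the table name is assumed to come from trusted configuration because it is interpolated into the statement.

```python
import sqlite3
from typing import Dict, List


def read_table_fields(conn: sqlite3.Connection, table: str) -> List[Dict[str, str]]:
    """Read the field name and declared type of every column in the target table."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [{"name": row[1], "type": row[2]} for row in rows]


def build_mapping(table_fields: List[Dict[str, str]],
                  choices: Dict[str, str]) -> Dict[str, str]:
    """Validate the user's choices (table field -> extracted field) as one-to-one."""
    known = {field["name"] for field in table_fields}
    unknown = set(choices) - known
    if unknown:
        raise ValueError(f"unknown table fields: {sorted(unknown)}")
    if len(set(choices.values())) != len(choices):
        raise ValueError("an extracted field is mapped to more than one table field")
    return dict(choices)
```

In the described interface the choices would come from the drop-down lists; here they are passed in as a dictionary.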
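The warehousing checks can be sketched as follows, again with SQLite as an assumed stand-in for the target database. The function name store_record, the max_lengths argument, and the choice to skip an over-long record rather than truncate it are illustrative assumptions; the text says only that abnormal storage is logged.

```python
import logging
import sqlite3
from typing import Dict, Optional

logger = logging.getLogger("warehousing")


def store_record(conn: sqlite3.Connection, table: str,
                 mapping: Dict[str, str], extracted: Dict[str, str],
                 max_lengths: Dict[str, int],
                 fixed: Optional[Dict[str, str]] = None) -> bool:
    """Insert one extracted record; log and skip it if a value is too long."""
    row = dict(fixed or {})                 # fixed content configured by the user
    for column, field in mapping.items():
        value = extracted.get(field, "")    # blank fill for fields that were not found
        limit = max_lengths.get(column)
        if limit is not None and len(value) > limit:
            logger.warning("record skipped: %s exceeds %d characters", column, limit)
            return False
        row[column] = value
    columns = ", ".join(row)
    placeholders = ", ".join("?" for _ in row)
    # Table and column names are assumed to come from trusted configuration.
    conn.execute(f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
                 list(row.values()))
    return True
```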

In summary, according to the technical solution of the present invention, document data is selected by mouse dragging, and the corresponding extraction rule is generated according to the context relevance of the selected document data and the feature keywords, so that the extraction rule can process documents with the same or similar structure in batches, which greatly improves the efficiency of processing document data. In addition, because the extraction rule is generated from the document data selected by mouse dragging and from the relevance of that data, it can be generated conveniently and quickly from the uploaded document. This greatly reduces the time developers and business personnel spend extracting key content from documents, and the low-code approach lowers the learning cost. Batch processing further improves operating efficiency and allows one-click processing of large batches of documents.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A data processing method, comprising:

2. The data processing method of claim 1, wherein preprocessing the sample document uploaded by the user to obtain a visual page document comprises:

and receiving or selecting a sample document uploaded by a user, and converting the sample document into a page document in an HTML format to obtain a visual page document.

3. The data processing method of claim 2, wherein selecting content to be extracted in the visualized page document by mouse dragging according to the user's selection comprises:

in the visual page document, triggering a click event from a mouse click starting position, judging the node position of a clicked character, and recording the node position as a starting node;

triggering an end event when the mouse, after being dragged, is released, judging the node position of the end position of the mouse, and recording the node position as an end node;

and determining the content between the starting node and the ending node as the content to be extracted.

4. The data processing method of claim 3, wherein determining content context relevance based on the extracted content, and generating an extraction rule template comprises:

according to the extracted content, determining the sibling paragraph tags and the parent paragraph tag corresponding to the selected content, and searching the sibling paragraph tags and the parent paragraph tag for paragraph identifiers from a preset lexicon, wherein the paragraph identifiers comprise paragraph start identifiers and paragraph end identifiers;

when the search finds no paragraph identifier, using the beginning character and the ending character of the extracted content as tags to form a segment tag, and using the segment tag as the extraction rule;

and taking the paragraph beginning identifier and the paragraph ending identifier as extraction rules under the condition that the search result is that the paragraph identifiers exist.

5. The data processing method of claim 4, further comprising:

determining feature characters related to the extracted content according to the extracted content, and taking the feature characters as extracted feature keywords;

determining feature keyword extraction elements according to the extracted feature keywords, and combining the feature keyword extraction elements to form an extraction rule;

wherein the keyword extraction element comprises at least one of:

the starting position of the feature keyword, the ending position of the feature keyword, whether the feature keyword is contained, the number of matched feature keywords, and the sequence position.

6. The data processing method according to claim 5, wherein the pre-configured data correspondence is configured in a manner that includes:

reading table field information in a database, selecting corresponding document extracted information fields for each table field, and generating a one-to-one configuration relationship;

wherein the table field information includes: field name information, field type information, and/or field length information.

7. A data processing system, comprising:

8. The data processing system of claim 7, wherein, when preprocessing the sample document uploaded by the user to obtain the visual page document, the preprocessing module receives or selects the sample document uploaded by the user and converts the sample document into a page document in HTML format to obtain the visual page document.

9. The data processing system of claim 8, wherein, when selecting the content to be extracted in the visual page document by mouse dragging according to the selection of the user, the extraction rule generation module: triggers a click event from the mouse click start position in the visual page document, determines the position of the node containing the clicked text, and records that node as the start node; triggers an end event when the mouse, after being dragged, is released, determines the position of the node at the mouse end position, and records that node as the end node; and determines the content between the start node and the end node as the content to be extracted.

10. The data processing system of claim 9, wherein the pre-configured data mapping in the storage module is configured in a manner that includes: reading table field information in a database, selecting corresponding document extracted information fields for each table field, and generating a one-to-one configuration relationship; wherein the table field information includes: field name information, field type information, and/or field length information.