CN116483940A

CN116483940A - Method for extracting and structuring data of whole-flow type document

Info

Publication number: CN116483940A
Application number: CN202310461849.8A
Authority: CN
Inventors: 杨丽艳
Original assignee: Shenzhen Guofang Cloud Data Technology Service Co ltd
Current assignee: Shenzhen Guofang Cloud Data Technology Service Co ltd
Priority date: 2023-04-26
Filing date: 2023-04-26
Publication date: 2023-07-25

Abstract

The invention discloses a method for extracting and structuring data from full-process relocation documents, and belongs to the technical field of document data processing. The method comprises the following steps: S1, collecting documents: collecting all relevant relocation documents; S2, preprocessing documents: preprocessing the collected documents; S3, text recognition: recognizing the documents using OCR technology and extracting their text content, and, if a document contains handwriting, processing it with handwriting recognition technology. With this method, information can be extracted and organized rapidly and accurately and converted into structured data that is convenient for subsequent analysis and use. Working efficiency is improved and manual errors are reduced, so that the fairness, impartiality and transparency of relocation work are better ensured.

Description

Method for extracting and structuring data of whole-flow type document

Technical Field

The invention relates to the technical field of document data processing, and in particular to a method for extracting and structuring data from full-process relocation documents.

Background

The relocation planning process generates a large number of standard-format documents that contain a large amount of information. These documents cover a wide range of content, including policy files, planning schemes, relocation agreements and compensation schemes, and contain important information such as relocation sites, areas, compensation amounts and relocation progress. This information is very important to all parties, including the government, relocation companies and residents. In the prior art, workers extract and organize document information manually. However, because the number of documents is huge, their formats are not uniform and their content is complex, manual extraction and organization is very time-consuming and laborious, which increases the labor intensity of the workers. The excessive workload in turn causes workers to make errors, which reduces their working efficiency.

Based on this, the invention provides a method for extracting and structuring data from full-process relocation documents to solve the above problems.

Disclosure of Invention

1. Technical problem to be solved

The invention aims to provide a method for extracting and structuring data from full-process relocation documents, so as to solve the following problems identified in the background:

In the prior art, workers extract and organize document information manually. However, because the number of documents is huge, their formats are not uniform and their content is complex, manual extraction and organization is very time-consuming and laborious, which increases the labor intensity of the workers. The excessive workload in turn causes workers to make errors, which reduces their working efficiency.

2. Technical proposal

The method for extracting and structuring data from full-process relocation documents comprises the following steps:

S1, collecting documents: collecting all relevant relocation documents;

S2, preprocessing documents: preprocessing the collected documents;

S3, text recognition: recognizing the documents using OCR technology and extracting their text content, and, if a document contains handwriting, processing it with handwriting recognition technology;

S4, keyword extraction: using keyword extraction technology to extract keywords such as the places, relocation projects and relocation policies involved in the documents;

S5, entity recognition: using entity recognition technology to identify the person names, place names and organization names in the documents;

S6, data structuring: organizing the extracted keywords and entities according to a predetermined structure to form structured data;

S7, data verification: verifying the structured data;

S8, data storage: storing the structured data in a database.

Preferably, step S2 includes the following steps:

S2-1, document format conversion: converting the documents collected in step S1 into a unified format;

S2-2, de-duplication: where multiple identical documents exist, retaining only one of them, so as to reduce repeated processing and storage;

S2-3, denoising: removing irrelevant content from the documents;

S2-4, text segmentation: splitting the text in the documents according to predetermined rules;

S2-5, format standardization: standardizing the formatting within the documents;

S2-6, character set conversion: converting the character sets used in the documents into a unified character set;

S2-7, document compression: compressing all documents.
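Steps S2-5 to S2-7 can be illustrated with a minimal Python sketch. The source does not name a target character set, a normalization rule or a compression algorithm, so the choices below (GB18030 as the assumed input encoding, Unicode NFKC normalization, UTF-8 output and zlib compression) and all function names are assumptions made only for illustration.

```python
import unicodedata
import zlib


def to_unified_text(raw: bytes, source_encoding: str = "gb18030") -> str:
    """S2-6: decode bytes from an assumed source encoding into Unicode text."""
    return raw.decode(source_encoding, errors="replace")


def standardize_format(text: str) -> str:
    """S2-5: fold full-width characters (NFKC), collapse whitespace, drop empty lines."""
    text = unicodedata.normalize("NFKC", text)
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def compress_document(text: str) -> bytes:
    """S2-7: compress the UTF-8 encoded text (zlib is an assumed choice)."""
    return zlib.compress(text.encode("utf-8"))


def decompress_document(blob: bytes) -> str:
    """Inverse of compress_document, used to read a stored document back."""
    return zlib.decompress(blob).decode("utf-8")
```

In this sketch a document is decoded once, standardized, and then compressed for storage; decompress_document restores exactly the standardized text.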

Preferably, step S2-2 includes the following steps:

S2-21, text cleaning: cleaning the text data to remove useless information;

S2-22, text standardization: standardizing the text to ensure the consistency and comparability of the text data;

S2-23, feature extraction: extracting features from the text data and converting the text into vector form to facilitate subsequent comparison and calculation;

S2-24, similarity calculation: comparing the similarity between text data using a similarity algorithm;

S2-25, de-duplication processing: determining, from the results of the similarity calculation, which text data are similar and which are duplicates, and removing the duplicates.
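The de-duplication sub-steps can be sketched as follows. The source does not specify the feature representation or the similarity algorithm; this sketch assumes bag-of-words count vectors and cosine similarity, and the 0.9 threshold and all function names are hypothetical. Whitespace tokenization is used for brevity; Chinese text would first need a word segmenter.

```python
import math
import re
from collections import Counter


def clean_and_standardize(text: str) -> str:
    """S2-21/S2-22: lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def to_vector(text: str) -> Counter:
    """S2-23: bag-of-words count vector over whitespace-separated tokens."""
    return Counter(text.split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """S2-24: cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(docs: list[str], threshold: float = 0.9) -> list[str]:
    """S2-25: keep a document only if it is not too similar to one already kept."""
    kept, kept_vectors = [], []
    for doc in docs:
        vector = to_vector(clean_and_standardize(doc))
        if all(cosine_similarity(vector, other) < threshold for other in kept_vectors):
            kept.append(doc)
            kept_vectors.append(vector)
    return kept
```

The pairwise comparison is quadratic in the number of documents, which is acceptable for a sketch; a large corpus would likely need hashing or indexing to stay practical.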

Preferably, step S4 includes the following steps:

S4-1, word segmentation: splitting the text into words or phrases according to predetermined rules;

S4-2, stop-word removal: filtering out common words that carry no practical meaning from the segmented text;

S4-3, part-of-speech tagging: tagging each segmented word with its part of speech;

S4-4, extraction: extracting keywords from the part-of-speech-tagged text;

S4-5, keyword filtering and ranking: filtering and ranking the extracted keywords according to actual requirements to obtain a more accurate and useful keyword list.
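A simplified sketch of steps S4-1, S4-2 and S4-5 is given below. It assumes whitespace segmentation, a small hypothetical stop-word list and ranking by raw frequency; part-of-speech tagging (S4-3) is omitted because it requires a trained tagger, and the source does not name the segmentation or tagging tools to be used.

```python
from collections import Counter

# Hypothetical stop-word list; a real deployment would use a curated list.
STOP_WORDS = {"the", "of", "and", "for", "in", "a", "to"}


def segment(text: str) -> list[str]:
    """S4-1: split text into lowercase words (whitespace rule, an assumption)."""
    return text.lower().split()


def remove_stop_words(tokens: list[str]) -> list[str]:
    """S4-2: drop common words that carry no practical meaning."""
    return [token for token in tokens if token not in STOP_WORDS]


def rank_keywords(tokens: list[str], top_k: int = 3, min_length: int = 2) -> list[str]:
    """S4-5: filter out very short tokens, then rank by frequency (ties alphabetical)."""
    counts = Counter(token for token in tokens if len(token) >= min_length)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]
```

The alphabetical tie-break only makes the output deterministic; any other stable rule would serve equally well.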

Preferably, step S4-4 includes the following steps:

S4-41, calculating word frequency: counting each word in the segmented text to obtain the number of times it occurs in the text;

S4-42, calculating the TF value: for each word, dividing the number of times the word occurs in the text by the total number of words in the text to obtain the TF value of the word;

S4-43, calculating the inverse document frequency (IDF) value: for each word, counting the number of documents in the text set in which the word appears, dividing the total number of documents by that count, and taking the logarithm to obtain the IDF value of the word;

S4-44, calculating the TF-IDF value: multiplying the TF value of each word by its IDF value to obtain the TF-IDF value of the word;

S4-45, selecting keywords: setting a threshold according to actual requirements and selecting the words whose TF-IDF values are greater than or equal to the threshold as keywords.
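The TF-IDF calculation in steps S4-41 to S4-45 can be sketched in Python as below. The sketch assumes the documents are already segmented into token lists and uses the natural logarithm with no smoothing, since the source specifies neither the logarithm base nor any smoothing term; the function names and the threshold value in the usage note are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {word: TF-IDF} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears (for S4-43).
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)  # S4-41: raw word counts
        total = len(tokens)
        scores.append({
            # S4-42 (TF) * S4-43 (IDF) = S4-44 (TF-IDF)
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def select_keywords(score: dict[str, float], threshold: float) -> list[str]:
    """S4-45: keep words whose TF-IDF is at or above the threshold, best first."""
    chosen = [word for word, value in score.items() if value >= threshold]
    return sorted(chosen, key=lambda word: (-score[word], word))
```

With this unsmoothed form, a word that appears in every document has an IDF of zero and is never selected by a positive threshold, for example select_keywords(scores[0], 0.1).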

Preferably, step S6 includes the following steps:

S6-1, data cleaning: removing noise and erroneous data from the data;

S6-2, data integration: integrating and merging data from multiple sources into one data set to facilitate subsequent processing;

S6-3, data transformation: transforming the data so that it better meets the requirements of analysis and modeling;

S6-4, feature selection: selecting features that are useful for modeling and removing features that are useless or redundant;

S6-5, feature extraction: extracting features useful for modeling from the raw data;

S6-6, data modeling: selecting a suitable modeling method for the specific problem, and training and validating it;

S6-7, result evaluation: evaluating and analyzing the modeling results to judge the accuracy and reliability of the model;

S6-8, result presentation: visualizing and displaying the structured data and the modeling results to facilitate understanding and application.
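As an illustration of the cleaning and integration steps (S6-1 and S6-2), the sketch below merges extracted keywords and entities into one structured record. The record layout, the field names and the entity-type keys ("person", "place", "organization") are assumptions; the source only states that keywords and entities are organized according to a structure. The modeling, evaluation and presentation steps are not covered.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class DocumentRecord:
    """Hypothetical structured record for one relocation document."""
    doc_id: str
    keywords: list = field(default_factory=list)
    persons: list = field(default_factory=list)
    places: list = field(default_factory=list)
    organizations: list = field(default_factory=list)


def clean_values(values):
    """S6-1: strip whitespace, drop empty strings and duplicates, keep order."""
    seen = set()
    cleaned = []
    for value in values:
        value = value.strip()
        if value and value not in seen:
            seen.add(value)
            cleaned.append(value)
    return cleaned


def build_record(doc_id, keywords, entities):
    """S6-2: integrate keyword and entity extraction results into one record."""
    return DocumentRecord(
        doc_id=doc_id,
        keywords=clean_values(keywords),
        persons=clean_values(entities.get("person", [])),
        places=clean_values(entities.get("place", [])),
        organizations=clean_values(entities.get("organization", [])),
    )
```

Calling asdict on the resulting record yields a plain dictionary that can be serialized or passed to a verification step.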

Preferably, step S7 includes the following steps:

S7-1, checking data types: checking whether the type of the data is correct;

S7-2, checking data ranges: checking whether the data is within the expected range;

S7-3, checking data uniqueness: checking whether the data is unique;

S7-4, checking logical relationships: checking whether the logical relationships between data items meet expectations;

S7-5, checking data integrity: checking whether the data is complete;

S7-6, checking data consistency: checking whether the data is consistent across different data sources and data sets;

S7-7, checking outliers: checking whether outliers or unreasonable data exist.
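Several of the verification checks can be sketched as a single pass over the records. The field names (doc_id, place, compensation), the allowed compensation range and the problem messages are hypothetical; the source does not define a schema. Only the integrity (S7-5), type (S7-1), range (S7-2) and uniqueness (S7-3) checks are shown.

```python
from numbers import Real


def validate_records(records, required=("doc_id", "place", "compensation"),
                     compensation_range=(0, 1_000_000_000)):
    """Return (record_index, problem) tuples; an empty list means all checks passed."""
    problems = []
    seen_ids = set()
    low, high = compensation_range
    for index, record in enumerate(records):
        # S7-5 integrity: every required field must be present and non-empty.
        missing = [name for name in required if record.get(name) in (None, "")]
        if missing:
            problems.append((index, "missing: " + ", ".join(missing)))
            continue
        amount = record["compensation"]
        # S7-1 type: must be a real number (bool is rejected although it subclasses int).
        if isinstance(amount, bool) or not isinstance(amount, Real):
            problems.append((index, "compensation has wrong type"))
        # S7-2 range: the amount must lie inside the expected interval.
        elif not low <= amount <= high:
            problems.append((index, "compensation out of range"))
        # S7-3 uniqueness: each doc_id may appear only once.
        if record["doc_id"] in seen_ids:
            problems.append((index, "duplicate doc_id"))
        seen_ids.add(record["doc_id"])
    return problems
```

Returning a list of problems rather than raising on the first failure lets a reviewer see every defective record in one run.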

3. Advantageous effects

Compared with the prior art, the invention has the advantages that:

1) In the invention, preliminary processing of all original documents facilitates subsequent data extraction and analysis; in addition, de-duplicating the documents by determining which text data are similar and which are duplicates effectively reduces the storage space the documents require.

2) In the invention, keyword extraction on the original documents allows workers to quickly obtain information such as the places, relocation projects and relocation policies involved in a document, which improves their efficiency when extracting document data.

3) With the method for extracting and structuring data from full-process relocation documents, information can be extracted and organized rapidly and accurately and converted into structured data that is convenient for subsequent analysis and use. Working efficiency is improved and manual errors are reduced, so that the fairness, impartiality and transparency of relocation work are better ensured.

Drawings

FIG. 1 is a schematic overall flow chart of the present invention;

FIG. 2 is a schematic diagram of a document preprocessing flow in accordance with the present invention;

FIG. 3 is a schematic diagram of a keyword extraction process according to the present invention;

FIG. 4 is a flow chart of the data structuring process of the present invention;

FIG. 5 is a schematic diagram of a data verification process according to the present invention.

Detailed Description

In the description of the present invention, it should be understood that terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise" and "counterclockwise" indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be configured and operated in a specific orientation; they should therefore not be construed as limiting the present invention.

In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "configured to," "engaged with," "connected to," and the like are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

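Example: referring to FIG. 1, the method for extracting and structuring data from full-process relocation documents includes:

S1, collecting documents: collecting all relevant relocation documents, including policy documents, planning schemes, relocation agreements, compensation schemes and the like;

S2, preprocessing documents: preprocessing the collected documents, including document format conversion, de-duplication, denoising and the like;

S3, text recognition: recognizing the documents using OCR technology and extracting their text content, and, if a document contains handwriting, processing it with handwriting recognition technology;

S4, keyword extraction: using keyword extraction technology to extract keywords such as the places, relocation projects and relocation policies involved in the documents;

S5, entity recognition: using entity recognition technology to identify the person names, place names and organization names in the documents;

S6, data structuring: organizing the extracted keywords and entities according to a predetermined structure to form structured data;

S7, data verification: verifying the structured data to ensure its correctness and completeness;

S8, data storage: storing the structured data in a database to facilitate subsequent query and analysis.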
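The storage step S8 can be sketched with Python's built-in sqlite3 module. The source does not name a database system or schema, so SQLite, the table name documents and the JSON payload column are assumptions chosen only to keep the example self-contained.

```python
import json
import sqlite3


def store_records(conn: sqlite3.Connection, records: list[dict]) -> None:
    """S8: persist structured records, one JSON payload per document ID."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "doc_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO documents (doc_id, payload) VALUES (?, ?)",
            [(r["doc_id"], json.dumps(r, ensure_ascii=False)) for r in records],
        )


def load_record(conn: sqlite3.Connection, doc_id: str):
    """Fetch one stored record by ID, or None if it does not exist."""
    row = conn.execute(
        "SELECT payload FROM documents WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Storing the record as JSON keeps the sketch schema-free; a production system would more likely use typed columns so that fields such as place or compensation amount can be queried directly.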

Specifically, step S2 includes the following steps:

S2-1, document format conversion: converting the documents collected in step S1 into a unified format, for example converting PDF files into text format, to facilitate subsequent text processing;

S2-2, de-duplication: where multiple identical documents exist, retaining only one of them; this reduces repeated processing and storage and effectively lowers the repetition rate among documents, thereby reducing processing time and storage space;

S2-3, denoising: removing irrelevant content from the documents, so that excessive irrelevant content does not slow down extraction;

S2-4, text segmentation: splitting the text in the documents according to predetermined rules to facilitate subsequent text processing and extraction;

S2-5, format standardization: standardizing the formatting within the documents so that the text is easier to read and process;

S2-6, character set conversion: converting the character sets used in the documents into a unified character set to facilitate subsequent text processing and storage;

S2-7, document compression: compressing all documents to further reduce the storage space they occupy.

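Specifically, step S2-2 includes the following steps:

S2-21, text cleaning: cleaning the text data to remove useless information;

S2-22, text standardization: standardizing the text to ensure the consistency and comparability of the text data;

S2-23, feature extraction: extracting features from the text data and converting the text into vector form to facilitate subsequent comparison and calculation;

S2-24, similarity calculation: comparing the similarity between text data using a similarity algorithm;

S2-25, de-duplication processing: determining, from the results of the similarity calculation, which text data are similar and which are duplicates, and removing the duplicates.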

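Specifically, step S4 includes the following steps:

S4-1, word segmentation: splitting the text into words or phrases according to predetermined rules;

S4-2, stop-word removal: filtering out common words that carry no practical meaning from the segmented text;

S4-3, part-of-speech tagging: tagging each segmented word with its part of speech;

S4-4, extraction: extracting keywords from the part-of-speech-tagged text;

S4-5, keyword filtering and ranking: filtering and ranking the extracted keywords according to actual requirements to obtain a more accurate and useful keyword list.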

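Specifically, step S4-4 includes the following steps:

S4-41, calculating word frequency: counting each word in the segmented text to obtain the number of times it occurs in the text;

S4-42, calculating the TF value: for each word, dividing the number of times the word occurs in the text by the total number of words in the text to obtain the TF value of the word;

S4-43, calculating the inverse document frequency (IDF) value: for each word, counting the number of documents in the text set in which the word appears, dividing the total number of documents by that count, and taking the logarithm to obtain the IDF value of the word;

S4-44, calculating the TF-IDF value: multiplying the TF value of each word by its IDF value to obtain the TF-IDF value of the word;

S4-45, selecting keywords: setting a threshold according to actual requirements and selecting the words whose TF-IDF values are greater than or equal to the threshold as keywords.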

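Specifically, step S6 includes the following steps:

S6-1, data cleaning: removing noise and erroneous data from the data;

S6-2, data integration: integrating and merging data from multiple sources into one data set to facilitate subsequent processing;

S6-3, data transformation: transforming the data so that it better meets the requirements of analysis and modeling;

S6-4, feature selection: selecting features that are useful for modeling and removing features that are useless or redundant;

S6-5, feature extraction: extracting features useful for modeling from the raw data;

S6-6, data modeling: selecting a suitable modeling method for the specific problem, and training and validating it;

S6-7, result evaluation: evaluating and analyzing the modeling results to judge the accuracy and reliability of the model;

S6-8, result presentation: visualizing and displaying the structured data and the modeling results to facilitate understanding and application.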

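Specifically, step S7 includes the following steps:

S7-1, checking data types: checking whether the type of the data is correct;

S7-2, checking data ranges: checking whether the data is within the expected range;

S7-3, checking data uniqueness: checking whether the data is unique;

S7-4, checking logical relationships: checking whether the logical relationships between data items meet expectations;

S7-5, checking data integrity: checking whether the data is complete;

S7-6, checking data consistency: checking whether the data is consistent across different data sources and data sets;

S7-7, checking outliers: checking whether outliers or unreasonable data exist.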

The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third, etc. do not denote any order, but rather the terms first, second, third, etc. are used to interpret the terms as labels.

The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims

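1. A method for extracting and structuring data from full-process relocation documents, characterized by comprising the following steps:

S1, collecting documents: collecting all relevant relocation documents;

S2, preprocessing documents: preprocessing the collected documents;

S3, text recognition: recognizing the documents using OCR technology and extracting their text content, and, if a document contains handwriting, processing it with handwriting recognition technology;

S4, keyword extraction: using keyword extraction technology to extract keywords such as the places, relocation projects and relocation policies involved in the documents;

S5, entity recognition: using entity recognition technology to identify the person names, place names and organization names in the documents;

S6, data structuring: organizing the extracted keywords and entities according to a predetermined structure to form structured data;

S7, data verification: verifying the structured data;

S8, data storage: storing the structured data in a database.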

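2. The method for extracting and structuring data from full-process relocation documents according to claim 1, wherein step S2 comprises the following steps:

S2-1, document format conversion: converting the documents collected in step S1 into a unified format;

S2-2, de-duplication: where multiple identical documents exist, retaining only one of them, so as to reduce repeated processing and storage;

S2-3, denoising: removing irrelevant content from the documents;

S2-4, text segmentation: splitting the text in the documents according to predetermined rules;

S2-5, format standardization: standardizing the formatting within the documents;

S2-6, character set conversion: converting the character sets used in the documents into a unified character set;

S2-7, document compression: compressing all documents.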

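3. The method for extracting and structuring data from full-process relocation documents according to claim 2, wherein step S2-2 comprises the following steps:

S2-21, text cleaning: cleaning the text data to remove useless information;

S2-22, text standardization: standardizing the text to ensure the consistency and comparability of the text data;

S2-23, feature extraction: extracting features from the text data and converting the text into vector form to facilitate subsequent comparison and calculation;

S2-24, similarity calculation: comparing the similarity between text data using a similarity algorithm;

S2-25, de-duplication processing: determining, from the results of the similarity calculation, which text data are similar and which are duplicates, and removing the duplicates.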

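4. The method for extracting and structuring data from full-process relocation documents according to claim 1, wherein step S4 comprises the following steps:

S4-1, word segmentation: splitting the text into words or phrases according to predetermined rules;

S4-2, stop-word removal: filtering out common words that carry no practical meaning from the segmented text;

S4-3, part-of-speech tagging: tagging each segmented word with its part of speech;

S4-4, extraction: extracting keywords from the part-of-speech-tagged text;

S4-5, keyword filtering and ranking: filtering and ranking the extracted keywords according to actual requirements to obtain a more accurate and useful keyword list.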

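5. The method for extracting and structuring data from full-process relocation documents according to claim 4, wherein step S4-4 comprises the following steps:

S4-41, calculating word frequency: counting each word in the segmented text to obtain the number of times it occurs in the text;

S4-42, calculating the TF value: for each word, dividing the number of times the word occurs in the text by the total number of words in the text to obtain the TF value of the word;

S4-43, calculating the inverse document frequency (IDF) value: for each word, counting the number of documents in the text set in which the word appears, dividing the total number of documents by that count, and taking the logarithm to obtain the IDF value of the word;

S4-44, calculating the TF-IDF value: multiplying the TF value of each word by its IDF value to obtain the TF-IDF value of the word;

S4-45, selecting keywords: setting a threshold according to actual requirements and selecting the words whose TF-IDF values are greater than or equal to the threshold as keywords.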

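6. The method for extracting and structuring data from full-process relocation documents according to claim 1, wherein step S6 comprises the following steps:

S6-1, data cleaning: removing noise and erroneous data from the data;

S6-2, data integration: integrating and merging data from multiple sources into one data set to facilitate subsequent processing;

S6-3, data transformation: transforming the data so that it better meets the requirements of analysis and modeling;

S6-4, feature selection: selecting features that are useful for modeling and removing features that are useless or redundant;

S6-5, feature extraction: extracting features useful for modeling from the raw data;

S6-6, data modeling: selecting a suitable modeling method for the specific problem, and training and validating it;

S6-7, result evaluation: evaluating and analyzing the modeling results to judge the accuracy and reliability of the model;

S6-8, result presentation: visualizing and displaying the structured data and the modeling results to facilitate understanding and application.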

7. The method for extracting and structuring data from full-process relocation documents according to claim 1, wherein step S7 comprises the following steps:

S7-1, checking data types: checking whether the type of the data is correct;

S7-2, checking data ranges: checking whether the data is within the expected range;

S7-3, checking data uniqueness: checking whether the data is unique;

S7-4, checking logical relationships: checking whether the logical relationships between data items meet expectations;

S7-5, checking data integrity: checking whether the data is complete;

S7-6, checking data consistency: checking whether the data is consistent across different data sources and data sets;

S7-7, checking outliers: checking whether outliers or unreasonable data exist.