CN116483940A - Method for extracting and structuring data of whole-flow type document - Google Patents

Publication number
CN116483940A
Authority
CN
China
Prior art keywords
data
document
text
extracting
word
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310461849.8A
Other languages
Chinese (zh)
Inventor
杨丽艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Guofang Cloud Data Technology Service Co ltd
Original Assignee
Shenzhen Guofang Cloud Data Technology Service Co ltd
Application filed by Shenzhen Guofang Cloud Data Technology Service Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for extracting and structuring data of whole-process standard-format relocation documents, and belongs to the technical field of document data processing. The method comprises the following steps: S1, document collection: collecting all relevant relocation documents; S2, document preprocessing: preprocessing the collected documents; S3, text recognition: recognizing the documents using OCR technology and extracting their text content, with handwriting recognition technology used where a document contains handwriting. With this method, information can be extracted and organized quickly and accurately and converted into structured data that is convenient for subsequent analysis and use. Working efficiency is improved and manual errors are reduced, so that the fairness, justice and transparency of relocation work are better ensured.

Description

Method for extracting and structuring data of whole-flow type document
Technical Field
The invention relates to the technical field of document data processing, in particular to a method for extracting and structuring data of whole-process standard-format relocation documents.
Background
The relocation planning process generates a large number of standard-format documents containing a large amount of information. These documents cover a wide range of content, including policy files, planning schemes, relocation agreements and compensation schemes, and contain important information such as relocation locations, areas, compensation amounts and relocation progress. This information is very important to all parties, including the government, relocation companies and residents. In the prior art, the document information is extracted and organized manually by staff. However, because the number of documents is huge, their formats are not uniform and their content is complex, manual extraction and organization is very time-consuming and laborious, which increases the labor intensity of the staff; the excessive workload in turn leads to errors, which reduces working efficiency.
In view of this, the invention provides a method for extracting and structuring data of whole-process standard-format relocation documents to solve the above problems.
Disclosure of Invention
1. Technical problem to be solved
The invention aims to provide a method for extracting and structuring data of whole-process standard-format relocation documents, so as to solve the following problem described in the background:
In the prior art, the document information is extracted and organized manually by staff. However, because the number of documents is huge, their formats are not uniform and their content is complex, manual extraction and organization is very time-consuming and laborious, which increases the labor intensity of the staff; the excessive workload in turn leads to errors, which reduces working efficiency.
2. Technical solution
The method for extracting and structuring data of whole-process standard-format relocation documents comprises the following steps:
S1. Document collection: collect all relevant relocation documents;
S2. Document preprocessing: preprocess the collected documents;
S3. Text recognition: recognize the documents using OCR technology and extract their text content; if a document contains handwriting, process it using handwriting recognition technology;
S4. Keyword extraction: use keyword extraction technology to extract keywords for the places, relocation projects and relocation policies involved in the documents;
S5. Entity recognition: use entity recognition technology to identify person names, place names and organization names in the documents;
S6. Data structuring: organize the extracted keywords and entities according to a predefined structure to form structured data;
S7. Data verification: verify the structured data;
S8. Data storage: store the structured data in a database.
Preferably, step S2 comprises the following steps:
S2-1. Document format conversion: convert the documents collected in step S1 into a unified format;
S2-2. Deduplication: where several documents are identical, keep only one, reducing repeated processing and storage;
S2-3. Denoising: remove irrelevant content from the documents;
S2-4. Text splitting: split the text in the documents according to predefined rules;
S2-5. Format standardization: standardize the formatting within the documents;
S2-6. Character set conversion: convert the character sets used in the documents into a unified character set;
S2-7. Document compression: compress all documents.
Preferably, step S2-2 comprises the following steps:
S2-21. Text cleaning: clean the text data to remove useless information;
S2-22. Text standardization: standardize the text to ensure the consistency and comparability of the text data;
S2-23. Feature extraction: extract features from the text data and convert the text into vector form to facilitate subsequent comparison and calculation;
S2-24. Similarity calculation: compare the text data using a similarity algorithm;
S2-25. Deduplication: determine from the similarity results which text data are similar and which are duplicates, and remove the duplicates.
Preferably, step S4 comprises the following steps:
S4-1. Word segmentation: split the text into words or phrases according to predefined rules;
S4-2. Stop-word removal: filter out common words with no practical meaning from the segmented text;
S4-3. Part-of-speech tagging: tag each segmented word with its part of speech;
S4-4. Extraction: extract keywords from the part-of-speech-tagged text;
S4-5. Keyword filtering and sorting: filter and sort the extracted keywords according to actual requirements to obtain a more accurate and useful keyword list.
Preferably, step S4-4 comprises the following steps:
S4-41. Word frequency calculation: count each word in the segmented text to obtain the number of times it occurs in the text;
S4-42. TF calculation: for each word, divide its word frequency in the text by the total number of words in the text to obtain the TF value of the word;
S4-43. Inverse document frequency (IDF) calculation: for each word, count the number of documents in the text set in which the word appears, divide the total number of documents by this count, and take the logarithm to obtain the IDF value of the word;
S4-44. TF-IDF calculation: multiply the TF value of each word by its IDF value to obtain the TF-IDF value of the word;
S4-45. Keyword selection: select words with higher TF-IDF values as keywords, by setting a threshold according to actual requirements and selecting the words whose TF-IDF value is greater than or equal to the threshold.
Preferably, step S6 comprises the following steps:
S6-1. Data cleaning: remove noise and erroneous data;
S6-2. Data integration: integrate and merge data from multiple sources into one data set to facilitate subsequent processing;
S6-3. Data transformation: transform the data so that it better meets the requirements of analysis and modeling;
S6-4. Feature selection: select the features useful for modeling and remove useless or redundant features;
S6-5. Feature extraction: extract features useful for modeling from the raw data;
S6-6. Data modeling: select a suitable modeling method for the specific problem, then train and validate the model;
S6-7. Result evaluation: evaluate and analyze the modeling results to judge the accuracy and reliability of the model;
S6-8. Result presentation: visualize and present the structured data and modeling results to facilitate understanding and application.
Preferably, step S7 comprises the following steps:
S7-1. Data type check: check whether the data types are correct;
S7-2. Data range check: check whether the data falls within the expected range;
S7-3. Data uniqueness check: check whether the data is unique;
S7-4. Logical relationship check: check whether the logical relationships between data items meet expectations;
S7-5. Data integrity check: check whether the data is complete;
S7-6. Data consistency check: check whether the data is consistent across different data sources and data sets;
S7-7. Outlier check: check whether there are outliers or unreasonable data.
3. Advantageous effects
Compared with the prior art, the invention has the advantages that:
1) In the invention, all original documents are preprocessed, which facilitates subsequent data extraction and analysis. In addition, the documents are deduplicated by determining which text data are similar and which are duplicates, which effectively reduces the storage space the documents occupy.
2) In the invention, keywords are extracted from the original documents, so that staff can quickly obtain information such as the places, relocation projects and relocation policies involved in a document, which improves their working efficiency when extracting document data.
3) With the method for extracting and structuring data of whole-process standard-format relocation documents, information can be extracted and organized quickly and accurately and converted into structured data that is convenient for subsequent analysis and use. Working efficiency is improved and manual errors are reduced, so that the fairness, justice and transparency of relocation work are better ensured.
Drawings
FIG. 1 is a schematic overall flow chart of the present invention;
FIG. 2 is a schematic diagram of a document preprocessing flow in accordance with the present invention;
FIG. 3 is a schematic diagram of a keyword extraction process according to the present invention;
FIG. 4 is a flow chart of the data structuring process of the present invention;
FIG. 5 is a schematic diagram of a data verification process according to the present invention.
Detailed Description
In the description of the present invention, it should be understood that terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise" and "counterclockwise" indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be configured and operated in a specific orientation; they should therefore not be construed as limiting the present invention.
In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "configured to," "engaged with," "connected to," and the like are to be construed broadly: a connection may, for example, be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Embodiment: referring to FIG. 1, the method for extracting and structuring data of whole-process standard-format relocation documents comprises the following steps:
S1. Document collection: collect all relevant relocation documents, including policy documents, planning schemes, relocation agreements, compensation schemes and the like;
S2. Document preprocessing: preprocess the collected documents, including document format conversion, deduplication, denoising and the like;
S3. Text recognition: recognize the documents using OCR technology and extract their text content; if a document contains handwriting, process it using handwriting recognition technology;
S4. Keyword extraction: use keyword extraction technology to extract keywords for the places, relocation projects and relocation policies involved in the documents;
S5. Entity recognition: use entity recognition technology to identify person names, place names and organization names in the documents;
S6. Data structuring: organize the extracted keywords and entities according to a predefined structure to form structured data;
S7. Data verification: verify the structured data to ensure its correctness and completeness;
S8. Data storage: store the structured data in a database to facilitate subsequent queries and analysis.
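As an illustration of the final storage step (S8), the following is a minimal sketch using Python's built-in sqlite3 module. The patent does not name a database, a schema or field names; the table `relocation_record`, its columns and the sample rows are all hypothetical and chosen only to make the example concrete.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not taken from the patent.
SCHEMA = """
CREATE TABLE IF NOT EXISTS relocation_record (
    doc_id       TEXT PRIMARY KEY,
    place        TEXT NOT NULL,
    area_m2      REAL,
    compensation REAL
)
"""


def store_records(conn, records):
    """S8: insert structured records and return the number of rows stored."""
    conn.execute(SCHEMA)
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT OR REPLACE INTO relocation_record "
            "VALUES (:doc_id, :place, :area_m2, :compensation)",
            records,
        )
    return conn.execute("SELECT COUNT(*) FROM relocation_record").fetchone()[0]


# Made-up sample data, stored in an in-memory database.
conn = sqlite3.connect(":memory:")
records = [
    {"doc_id": "D001", "place": "District A", "area_m2": 85.5, "compensation": 1200000.0},
    {"doc_id": "D002", "place": "District B", "area_m2": 60.0, "compensation": 900000.0},
]
stored = store_records(conn, records)
rows = conn.execute(
    "SELECT doc_id, place FROM relocation_record ORDER BY doc_id"
).fetchall()
```

Using `doc_id` as the primary key means that storing the same document twice overwrites the earlier row instead of creating a duplicate; whether that is the desired behaviour depends on the deployment.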
Specifically, step S2 comprises the following steps:
S2-1. Document format conversion: convert the documents collected in step S1 into a unified format, for example converting PDF files into text format, to facilitate subsequent text processing;
S2-2. Deduplication: where several documents are identical, keep only one; this reduces repeated processing and storage and lowers the repetition rate among the documents, thereby reducing processing time and storage space;
S2-3. Denoising: remove irrelevant content from the documents so that excessive irrelevant content does not slow down extraction;
S2-4. Text splitting: split the text in the documents according to predefined rules to facilitate subsequent text processing and extraction;
S2-5. Format standardization: standardize the formatting within the documents so that the text is easier to read and process;
S2-6. Character set conversion: convert the character sets used in the documents into a unified character set to facilitate subsequent text processing and storage;
S2-7. Document compression: compress all documents to further reduce the storage space they occupy.
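Several of the preprocessing sub-steps above can be sketched with the Python standard library alone. The example below covers S2-4 to S2-7 under stated assumptions: the input is assumed to be GB18030-encoded bytes, the unified character set is assumed to be UTF-8, "format standardization" is interpreted as Unicode NFKC normalization plus whitespace cleanup, and the splitting rule is assumed to be one paragraph per line. None of these choices, nor the function names, come from the patent.

```python
import gzip
import re
import unicodedata


def to_text(raw, source_encoding="gb18030"):
    """S2-6: decode bytes from an assumed source encoding into a Python str."""
    return raw.decode(source_encoding, errors="replace")


def normalize_format(text):
    """S2-5: fold full-width characters to ASCII forms and tidy whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[ \t]+", " ", text)   # collapse runs of spaces and tabs
    text = re.sub(r"\n{2,}", "\n", text)  # drop blank lines
    return text.strip()


def split_paragraphs(text):
    """S2-4: split on line breaks (an assumed rule) and skip empty pieces."""
    return [p.strip() for p in text.split("\n") if p.strip()]


def compress(text):
    """S2-7: store the text as gzip-compressed UTF-8 bytes."""
    return gzip.compress(text.encode("utf-8"))


# Made-up input containing full-width punctuation and digits.
raw = "Compensation：１２００ yuan\n\n\nArea：  ８５ m2".encode("gb18030")
text = normalize_format(to_text(raw))
paragraphs = split_paragraphs(text)
packed = compress(text)
```

Format conversion (S2-1), such as PDF to text, normally needs a third-party library and is left out of this sketch.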
Specifically, step S2-2 comprises the following steps:
S2-21. Text cleaning: clean the text data to remove useless information;
S2-22. Text standardization: standardize the text to ensure the consistency and comparability of the text data;
S2-23. Feature extraction: extract features from the text data and convert the text into vector form to facilitate subsequent comparison and calculation;
S2-24. Similarity calculation: compare the text data using a similarity algorithm;
S2-25. Deduplication: determine from the similarity results which text data are similar and which are duplicates, and remove the duplicates.
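The deduplication sub-steps can be illustrated with a small sketch. The patent does not say which features or which similarity algorithm are used; the example below assumes bag-of-words term counts as the vector form and cosine similarity as the measure, with a hypothetical threshold of 0.9 deciding when two texts count as duplicates.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """S2-21 to S2-23: lowercase, drop punctuation, build a term-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """S2-24: cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def deduplicate(docs, threshold=0.9):
    """S2-25: keep a document only if it is not too similar to one already kept."""
    kept, kept_vectors = [], []
    for doc in docs:
        vec = vectorize(doc)
        if all(cosine(vec, other) < threshold for other in kept_vectors):
            kept.append(doc)
            kept_vectors.append(vec)
    return kept


# Made-up documents: the second differs from the first only in case and punctuation.
docs = [
    "Relocation agreement for Plot 7, compensation 1200 yuan.",
    "relocation agreement for plot 7; compensation 1200 yuan",
    "Planning scheme for the north district.",
]
unique = deduplicate(docs)
```

This pairwise comparison is quadratic in the number of documents, which is acceptable for a sketch; a large collection would more likely use hashing or an index to find candidate pairs.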
Specifically, step S4 comprises the following steps:
S4-1. Word segmentation: split the text into words or phrases according to predefined rules;
S4-2. Stop-word removal: filter out common words with no practical meaning from the segmented text;
S4-3. Part-of-speech tagging: tag each segmented word with its part of speech;
S4-4. Extraction: extract keywords from the part-of-speech-tagged text;
S4-5. Keyword filtering and sorting: filter and sort the extracted keywords according to actual requirements to obtain a more accurate and useful keyword list.
Specifically, step S4-4 comprises the following steps:
S4-41. Word frequency calculation: count each word in the segmented text to obtain the number of times it occurs in the text;
S4-42. TF calculation: for each word, divide its word frequency in the text by the total number of words in the text to obtain the TF value of the word;
S4-43. Inverse document frequency (IDF) calculation: for each word, count the number of documents in the text set in which the word appears, divide the total number of documents by this count, and take the logarithm to obtain the IDF value of the word;
S4-44. TF-IDF calculation: multiply the TF value of each word by its IDF value to obtain the TF-IDF value of the word;
S4-45. Keyword selection: select words with higher TF-IDF values as keywords, by setting a threshold according to actual requirements and selecting the words whose TF-IDF value is greater than or equal to the threshold.
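The TF-IDF steps S4-41 to S4-45 translate almost directly into code. The sketch below assumes the text has already been segmented and stripped of stop words (S4-1, S4-2), uses the natural logarithm for the IDF (the patent does not state a base), and applies no smoothing; the function names, the sample corpus and the threshold of 0.2 are illustrative only.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: TF-IDF} dict per document in a list of token lists."""
    n_docs = len(corpus)
    # S4-43: document frequency = number of documents containing the word.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)  # S4-41: raw word counts
        total = len(doc)
        scores.append({
            # S4-42 (TF) times S4-43 (IDF) gives S4-44 (TF-IDF).
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def select_keywords(doc_scores, threshold):
    """S4-45: keep words at or above the threshold, highest score first."""
    hits = [w for w, s in doc_scores.items() if s >= threshold]
    return sorted(hits, key=lambda w: (-doc_scores[w], w))


# Made-up, pre-segmented corpus; "plot" occurs in every document.
corpus = [
    ["compensation", "amount", "plot", "compensation"],
    ["planning", "scheme", "plot"],
    ["relocation", "policy", "plot"],
]
scores = tf_idf(corpus)
keywords = select_keywords(scores[0], threshold=0.2)
```

A word that occurs in every document, such as "plot" here, gets an IDF of log(1) = 0 and therefore a TF-IDF of 0, so it can never pass a positive threshold.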
Specifically, step S6 comprises the following steps:
S6-1. Data cleaning: remove noise and erroneous data;
S6-2. Data integration: integrate and merge data from multiple sources into one data set to facilitate subsequent processing;
S6-3. Data transformation: transform the data so that it better meets the requirements of analysis and modeling;
S6-4. Feature selection: select the features useful for modeling and remove useless or redundant features;
S6-5. Feature extraction: extract features useful for modeling from the raw data;
S6-6. Data modeling: select a suitable modeling method for the specific problem, then train and validate the model;
S6-7. Result evaluation: evaluate and analyze the modeling results to judge the accuracy and reliability of the model;
S6-8. Result presentation: visualize and present the structured data and modeling results to facilitate understanding and application.
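The first three structuring sub-steps (S6-1 to S6-3) can be sketched as follows. The patent leaves the target structure open, so the field names (`place`, `area_m2`, `compensation` and so on), the merge rule (the first non-empty value wins) and the number parsing are all assumptions made for illustration; the modeling sub-steps S6-4 to S6-8 are not covered.

```python
import json
import re


def clean(record):
    """S6-1: drop empty fields and strip stray whitespace."""
    return {k: v.strip() for k, v in record.items() if v and v.strip()}


def integrate(*sources):
    """S6-2: merge fields from several sources; the first non-empty value wins."""
    merged = {}
    for source in sources:
        for key, value in clean(source).items():
            merged.setdefault(key, value)
    return merged


def parse_number(text):
    """Extract the first number from text such as '1,200,000 yuan'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError("no number found in %r" % text)
    return float(match.group().replace(",", ""))


def transform(record, numeric_fields=("area_m2", "compensation")):
    """S6-3: convert selected text fields into typed numeric values."""
    out = dict(record)
    for field in numeric_fields:
        if field in out:
            out[field] = parse_number(out[field])
    return out


# Made-up outputs of keyword extraction (S4) and entity recognition (S5).
keywords = {"place": " District A ", "project": "Road widening", "area_m2": ""}
entities = {
    "person": "Zhang San",
    "organization": "Example Relocation Office",
    "area_m2": "85.5 m2",
    "compensation": "1,200,000 yuan",
}
record = transform(integrate(keywords, entities))
as_json = json.dumps(record, ensure_ascii=False, sort_keys=True)
```

Serializing the record to JSON is one possible "structured" form; the same dictionary could equally be mapped onto database columns.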
Specifically, step S7 comprises the following steps:
S7-1. Data type check: check whether the data types are correct;
S7-2. Data range check: check whether the data falls within the expected range;
S7-3. Data uniqueness check: check whether the data is unique;
S7-4. Logical relationship check: check whether the logical relationships between data items meet expectations;
S7-5. Data integrity check: check whether the data is complete;
S7-6. Data consistency check: check whether the data is consistent across different data sources and data sets;
S7-7. Outlier check: check whether there are outliers or unreasonable data.
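The verification checks can be sketched as a single pass over the structured records. The required fields, the uniqueness key (`doc_id`) and the plausibility limit on compensation per square metre are hypothetical choices, not values from the patent, and the cross-source consistency check (S7-6) is omitted because it needs a second data set.

```python
# Hypothetical required fields for a structured relocation record.
REQUIRED = ("doc_id", "place", "area_m2", "compensation")


def validate(records, max_unit_price=200000.0):
    """Return a list of (doc_id, problem) tuples; an empty list means no findings."""
    problems = []
    seen = set()
    for rec in records:
        doc_id = rec.get("doc_id", "<missing>")
        missing = [f for f in REQUIRED if rec.get(f) is None]
        if missing:  # S7-5: integrity
            problems.append((doc_id, "missing: " + ", ".join(missing)))
            continue
        if doc_id in seen:  # S7-3: uniqueness
            problems.append((doc_id, "duplicate doc_id"))
        seen.add(doc_id)
        area, comp = rec["area_m2"], rec["compensation"]
        if not all(isinstance(v, (int, float)) for v in (area, comp)):  # S7-1: type
            problems.append((doc_id, "non-numeric area or compensation"))
            continue
        if area <= 0 or comp < 0:  # S7-2: range
            problems.append((doc_id, "value out of range"))
        elif comp / area > max_unit_price:  # S7-4 and S7-7: relation and outlier
            problems.append((doc_id, "implausible compensation per m2"))
    return problems


# Made-up records, each but the first triggering one check.
records = [
    {"doc_id": "D001", "place": "District A", "area_m2": 85.5, "compensation": 1200000.0},
    {"doc_id": "D001", "place": "District A", "area_m2": 85.5, "compensation": 1200000.0},
    {"doc_id": "D002", "place": "District B", "area_m2": "60", "compensation": 900000.0},
    {"doc_id": "D003", "place": "District B", "area_m2": -5, "compensation": 100.0},
    {"doc_id": "D004", "place": None, "area_m2": 50, "compensation": 1.0},
]
problems = validate(records)
```

Collecting findings instead of raising on the first error lets a reviewer see every problem in a batch at once, which suits a workflow where flagged records are corrected manually.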
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third, etc. do not denote any order, but rather the terms first, second, third, etc. are used to interpret the terms as labels.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (7)

1. A method for extracting and structuring data of whole-process standard-format relocation documents, characterized by comprising the following steps:
S1. Document collection: collect all relevant relocation documents;
S2. Document preprocessing: preprocess the collected documents;
S3. Text recognition: recognize the documents using OCR technology and extract their text content; if a document contains handwriting, process it using handwriting recognition technology;
S4. Keyword extraction: use keyword extraction technology to extract keywords for the places, relocation projects and relocation policies involved in the documents;
S5. Entity recognition: use entity recognition technology to identify person names, place names and organization names in the documents;
S6. Data structuring: organize the extracted keywords and entities according to a predefined structure to form structured data;
S7. Data verification: verify the structured data;
S8. Data storage: store the structured data in a database.
2. The method for extracting and structuring data of whole-process standard-format relocation documents according to claim 1, characterized in that step S2 comprises the following steps:
S2-1. Document format conversion: convert the documents collected in step S1 into a unified format;
S2-2. Deduplication: where several documents are identical, keep only one, reducing repeated processing and storage;
S2-3. Denoising: remove irrelevant content from the documents;
S2-4. Text splitting: split the text in the documents according to predefined rules;
S2-5. Format standardization: standardize the formatting within the documents;
S2-6. Character set conversion: convert the character sets used in the documents into a unified character set;
S2-7. Document compression: compress all documents.
3. The method for extracting and structuring data of whole-process standard-format relocation documents according to claim 2, characterized in that step S2-2 comprises the following steps:
S2-21. Text cleaning: clean the text data to remove useless information;
S2-22. Text standardization: standardize the text to ensure the consistency and comparability of the text data;
S2-23. Feature extraction: extract features from the text data and convert the text into vector form to facilitate subsequent comparison and calculation;
S2-24. Similarity calculation: compare the text data using a similarity algorithm;
S2-25. Deduplication: determine from the similarity results which text data are similar and which are duplicates, and remove the duplicates.
4. The method for extracting and structuring data of whole-process standard-format relocation documents according to claim 1, characterized in that step S4 comprises the following steps:
S4-1. Word segmentation: split the text into words or phrases according to predefined rules;
S4-2. Stop-word removal: filter out common words with no practical meaning from the segmented text;
S4-3. Part-of-speech tagging: tag each segmented word with its part of speech;
S4-4. Extraction: extract keywords from the part-of-speech-tagged text;
S4-5. Keyword filtering and sorting: filter and sort the extracted keywords according to actual requirements to obtain a more accurate and useful keyword list.
5. The method for extracting and structuring data of whole-process standard-format relocation documents according to claim 4, characterized in that step S4-4 comprises the following steps:
S4-41. Word frequency calculation: count each word in the segmented text to obtain the number of times it occurs in the text;
S4-42. TF calculation: for each word, divide its word frequency in the text by the total number of words in the text to obtain the TF value of the word;
S4-43. Inverse document frequency (IDF) calculation: for each word, count the number of documents in the text set in which the word appears, divide the total number of documents by this count, and take the logarithm to obtain the IDF value of the word;
S4-44. TF-IDF calculation: multiply the TF value of each word by its IDF value to obtain the TF-IDF value of the word;
S4-45. Keyword selection: select words with higher TF-IDF values as keywords, by setting a threshold according to actual requirements and selecting the words whose TF-IDF value is greater than or equal to the threshold.
6. The method for extracting and structuring data of a whole-flow type document according to claim 1, wherein step S6 comprises the following steps:
S6-1, data cleaning: removing noise and erroneous data from the data;
S6-2, data integration: integrating and merging data from multiple sources into one data set to facilitate subsequent processing;
S6-3, data transformation: transforming the data so that it better meets the requirements of analysis and modeling;
S6-4, feature selection: selecting features that are useful for modeling, and removing features that are useless or redundant for modeling;
S6-5, feature extraction: extracting features useful for modeling from the raw data;
S6-6, data modeling: selecting a suitable modeling method for the specific problem, then training and validating the model;
S6-7, result evaluation: evaluating and analyzing the modeling results to judge the accuracy and reliability of the model;
S6-8, result presentation: visualizing and displaying the structured data and the modeling results to facilitate understanding and application.
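The first three steps (S6-1 to S6-3) can be illustrated with a short sketch; the later feature, modeling, and presentation steps depend on the specific problem and are not sketched. The record schema (`doc_id`, `amount`), the "first record wins" merge rule, and min-max scaling as the transformation are all assumptions chosen for illustration.

```python
def clean(records):
    """S6-1: drop records with a missing id or a missing/non-numeric amount."""
    cleaned = []
    for rec in records:
        try:
            item = {"doc_id": rec["doc_id"], "amount": float(rec["amount"])}
        except (KeyError, TypeError, ValueError):
            continue  # noisy or erroneous record
        cleaned.append(item)
    return cleaned


def integrate(*sources):
    """S6-2: merge several sources, keeping the first record per doc_id."""
    merged = {}
    for source in sources:
        for rec in source:
            merged.setdefault(rec["doc_id"], rec)
    return list(merged.values())


def min_max_scale(records, field="amount"):
    """S6-3: rescale a numeric field to the range [0, 1]."""
    if not records:
        return []
    values = [rec[field] for rec in records]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero for constant fields
    return [{**rec, field: (rec[field] - low) / span} for rec in records]


source_a = [{"doc_id": "A1", "amount": "100"}, {"doc_id": "A2", "amount": "n/a"}]
source_b = [{"doc_id": "A1", "amount": "100"}, {"doc_id": "B7", "amount": 300}]
dataset = min_max_scale(integrate(clean(source_a), clean(source_b)))
```

Record `A2` is discarded during cleaning because its amount cannot be parsed, and the duplicate `A1` from the second source is ignored during integration.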
7. The method for extracting and structuring data of a whole-flow type document according to claim 1, wherein step S7 comprises the following steps:
S7-1, checking data types: checking whether the type of the data is correct;
S7-2, checking data ranges: checking whether the data is within the expected range;
S7-3, checking data uniqueness: checking whether the data is unique;
S7-4, checking logical relationships: checking whether the logical relationships between data items meet expectations;
S7-5, checking data integrity: checking whether the data is complete;
S7-6, checking data consistency: checking whether the data is consistent across different data sources and data sets;
S7-7, checking for outliers: checking whether the data contains outliers or unreasonable values.
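The seven checks of step S7 can be sketched as a single validation pass. Everything concrete here is an assumption for illustration: the record schema (`doc_id`, `amount`, `start`, `end` with ISO-format date strings), the `reference` mapping that plays the role of a second data source, the range limit, and the median-based outlier rule.

```python
from statistics import median

REQUIRED = ("doc_id", "amount", "start", "end")  # hypothetical schema


def validate(records, reference=None, max_amount=1_000_000, outlier_factor=10):
    """Return (doc_id, problem) pairs found by the S7-1 to S7-7 checks."""
    reference = reference or {}
    problems = []
    seen = set()
    checked = []  # (doc_id, amount) pairs that passed the type check
    for rec in records:
        doc_id = rec.get("doc_id")
        # S7-5 integrity: every required field must be present.
        if any(rec.get(field) is None for field in REQUIRED):
            problems.append((doc_id, "incomplete"))
            continue
        amount = rec["amount"]
        # S7-1 type: amount must be a number (bool is excluded on purpose).
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            problems.append((doc_id, "wrong type"))
            continue
        # S7-2 range: amount must lie within the expected bounds.
        if not 0 <= amount <= max_amount:
            problems.append((doc_id, "out of range"))
        # S7-3 uniqueness: each doc_id may appear only once.
        if doc_id in seen:
            problems.append((doc_id, "duplicate id"))
        seen.add(doc_id)
        # S7-4 logic: ISO-format date strings compare correctly as text.
        if rec["start"] > rec["end"]:
            problems.append((doc_id, "start after end"))
        # S7-6 consistency: amount must match the other source, if listed.
        if doc_id in reference and reference[doc_id] != amount:
            problems.append((doc_id, "inconsistent with reference"))
        checked.append((doc_id, amount))
    # S7-7 outliers: flag amounts far above the median amount.
    if checked:
        mid = median(amount for _, amount in checked)
        for doc_id, amount in checked:
            if mid > 0 and amount > outlier_factor * mid:
                problems.append((doc_id, "outlier"))
    return problems


records = [
    {"doc_id": "D1", "amount": 100, "start": "2023-01-01", "end": "2023-02-01"},
    {"doc_id": "D2", "amount": "120", "start": "2023-01-01", "end": "2023-02-01"},
    {"doc_id": "D3", "amount": 90, "start": "2023-03-01", "end": "2023-02-01"},
    {"doc_id": "D1", "amount": 110, "start": "2023-01-01", "end": "2023-02-01"},
    {"doc_id": "D4", "amount": 5000, "start": "2023-01-01", "end": None},
]
problems = validate(records, reference={"D3": 95})
```

Returning a list of findings rather than raising on the first failure lets all seven checks report in one pass, which suits a verification step that precedes storage.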
CN202310461849.8A 2023-04-26 2023-04-26 Method for extracting and structuring data of whole-flow type document Pending CN116483940A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310461849.8A CN116483940A (en) 2023-04-26 2023-04-26 Method for extracting and structuring data of whole-flow type document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310461849.8A CN116483940A (en) 2023-04-26 2023-04-26 Method for extracting and structuring data of whole-flow type document

Publications (1)

Publication Number Publication Date
CN116483940A true CN116483940A (en) 2023-07-25

Family

ID=87215292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310461849.8A Pending CN116483940A (en) 2023-04-26 2023-04-26 Method for extracting and structuring data of whole-flow type document

Country Status (1)

Country Link
CN (1) CN116483940A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050289182A1 (en) * 2004-06-15 2005-12-29 Sand Hill Systems Inc. Document management system with enhanced intelligent document recognition capabilities
CN105183869A (en) * 2015-09-16 2015-12-23 分众(中国)信息技术有限公司 Building knowledge mapping database and construction method thereof
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis
CN110457696A (en) * 2019-07-31 2019-11-15 福州数据技术研究院有限公司 A kind of talent towards file data and policy intelligent Matching system and method
CN111353004A (en) * 2020-05-25 2020-06-30 浙江明度智控科技有限公司 Data association analysis method and system for drug document
CN113220672A (en) * 2021-04-26 2021-08-06 中国人民解放军军事科学院国防科技创新研究院 Military and civil fusion policy information database system
CN114817448A (en) * 2022-05-10 2022-07-29 桂林电子科技大学 Method for constructing hydrogen storage material database based on artificial intelligence technology
CN114997167A (en) * 2022-06-17 2022-09-02 北京金山数字娱乐科技有限公司 Resume content extraction method and device
CN115062117A (en) * 2022-07-11 2022-09-16 北京四方智汇信息科技有限公司 Method for automatically generating and classifying documents based on natural language processing technology

Similar Documents

Publication Publication Date Title
AU2022235604B2 (en) Massive scale heterogeneous data ingestion and user resolution
CN109753909B (en) Resume analysis method based on content blocking and BilSTM model
CN114117171B (en) Intelligent project file collecting method and system based on energized thinking
US20210366055A1 (en) Systems and methods for generating accurate transaction data and manipulation
CN114003791B (en) Depth map matching-based automatic classification method and system for medical data elements
CN112926299B (en) Text comparison method, contract review method and auditing system
CN115630621A (en) PDF financial data report form-based data acquisition and processing method and system
CN109389050B (en) Method for identifying connection relation of flow chart
CN117648093A (en) RPA flow automatic generation method based on large model and self-customized demand template
CN111091090A (en) Bank report OCR recognition method, device, platform and terminal
CN116483940A (en) Method for extracting and structuring data of whole-flow type document
CN114722159B (en) Multi-source heterogeneous data processing method and system for numerical control machine tool manufacturing resources
CN105573984A (en) Socio-economic indicator identification method and device
CN112348022B (en) Free-form document identification method based on deep learning
CN110688445A (en) Digital archive construction method
CN112989827A (en) Text data set quality evaluation method based on multi-source heterogeneous characteristics
CN112598503A (en) OCR recognition system and method based on credit investigation recognition
CN111275409A (en) Power grid overhaul audit data processing system and processing method
CN111191291A (en) Database attribute sensitivity quantification method based on attack probability
CN114637849B (en) Legal relation cognition method and system based on artificial intelligence
CN116126790B (en) Railway engineering archive archiving method and device, electronic equipment and storage medium
CN117332761B (en) PDF document intelligent identification marking system
LU504881B1 (en) Intelligent collection method and system for engineering archives based on enabling thinking
CN114691919A (en) Text format auditing module for financial long text rechecking system
CN112927115A (en) Intelligent data management system suitable for government departments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination