CN116502601A - Automatic indexing method for small sample format document based on feature learning - Google Patents

Info

Publication number
CN116502601A
Authority
CN
China
Prior art keywords
feature
features
learning
elements
line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211646469.3A
Other languages
Chinese (zh)
Inventor
李愿军
赵兰
吴涛
张镔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongfang Knowledge Network Digital Publishing Technology Co ltd
Original Assignee
Tongfang Knowledge Network Digital Publishing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongfang Knowledge Network Digital Publishing Technology Co ltd filed Critical Tongfang Knowledge Network Digital Publishing Technology Co ltd
Priority to CN202211646469.3A priority Critical patent/CN116502601A/en
Publication of CN116502601A publication Critical patent/CN116502601A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G06F40/12 Use of codes for handling textual entities
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a small sample format document automatic indexing method based on feature learning, comprising the following steps: sample preparation; extraction of the basic elements of the format document; feature model learning; feature-based automatic indexing; and post-processing and output of the indexing result. Based on the visual and semantic features of format documents, the invention provides an overall approach to automatic indexing of format documents together with specific solutions for the key steps.

Description

Automatic indexing method for small sample format document based on feature learning
Technical Field
The invention relates to the field of format document processing, in particular to a small sample format document automatic indexing method based on feature learning.
Background
With the long-range objectives outline calling for "accelerating the pace of building a digital society", the digital transformation and upgrading of government agencies, enterprises and public institutions and the trend toward data informatization have been promoted. Format documents are widely used across these organizations, which creates broad demand for processing documents of many types and structures.
Government agencies, enterprises and public institutions hold many types of format document data whose structures are essentially consistent but whose contents differ. With traditional rule-based automatic indexing, a separate program must be written for each document structure according to the structural rules of that style, and rules for different structures may even conflict with one another, so rule-based indexing is inefficient for documents of many types and structures. Deep learning cannot fully meet user needs either: in most organizations the number of documents sharing a similar style and structure is too small to satisfy the training requirements of deep learning, and at its current stage of development the quality of deep-learning-based document processing is still relatively low.
The small sample format document automatic indexing method based on feature learning extracts visual features and semantic features from a small set of already-produced sample data through statistical analysis. Automatic indexing then uses the resulting feature model to process, fully automatically, format documents of the same kind that conform to those features.
Disclosure of Invention
To solve the above technical problems, the invention aims to provide a small sample format document automatic indexing method based on feature learning, realized by the following technical scheme:
sample preparation;
extraction of the basic elements of the format document;
feature model learning;
feature-based automatic indexing;
post-processing and output of the indexing result.
In the above technical scheme, sample preparation means that digitization staff select samples and label the elements to be processed block by block.
In the above technical scheme, extraction of the basic elements of the format document means processing the document page by page: extracting the characters of each page together with their fonts, font sizes, styles (italic, bold and color) and positions, and establishing a character set; extracting horizontal and vertical lines and establishing a line set; and extracting graphics and establishing a graphics set.
In the above technical scheme, feature model learning means extracting from the sample data the layout features, visual features and semantic features of each labeled document element, together with their weights and the relationship features between elements; establishing a feature model for this class of document from them; and iteratively updating the model through supplementary learning when sample data is added.
In the above technical scheme, feature-based automatic indexing is the process of clustering and merging the basic elements of the format document into element blocks according to the visual and semantic features of each element in the feature model.
In the above technical scheme, post-processing and output of the indexing result means extracting the content of the element blocks obtained by automatic indexing, normalizing it (handling spaces and other special characters, normalizing full-width/half-width characters, and joining lines), and storing the normalized content as an XML document.
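The normalization named in the last step can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the use of Unicode NFKC folding for full-width/half-width normalization, the line-joining rule and the XML tag names (`document`, `element`) are all assumptions made for illustration.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET


def normalize_line(text: str) -> str:
    """Fold full-width forms to half-width and collapse whitespace."""
    # NFKC maps full-width letters and digits, and the ideographic
    # space U+3000, to their ASCII counterparts (assumed policy).
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def join_lines(lines: list[str]) -> str:
    """Join the lines of one element block (assumed joining rule)."""
    out = ""
    for line in map(normalize_line, lines):
        if not line:
            continue
        if out.endswith("-") and line[0].isascii() and line[0].isalpha():
            out = out[:-1] + line      # undo English hyphenation
        elif out and out[-1].isascii() and line[0].isascii():
            out += " " + line          # English words need a space
        else:
            out += line                # Chinese text joins directly
    return out


def to_xml(blocks: dict[str, list[str]]) -> str:
    """Store each normalized element block under a hypothetical schema."""
    root = ET.Element("document")
    for name, lines in blocks.items():
        ET.SubElement(root, "element", name=name).text = join_lines(lines)
    return ET.tostring(root, encoding="unicode")
```

NFKC also folds full-width punctuation to ASCII, which may or may not be wanted for Chinese text; a production system would likely use a narrower mapping.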
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention
FIG. 2 is a flow chart of feature model learning of the present invention
FIG. 3 is a flow chart of the present invention for automatic indexing using feature models
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the small sample format document automatic indexing method based on feature learning comprises:
1. Sample preparation
Digitization staff select several documents whose semantic and visual features are relatively comprehensive, label the elements to be processed block by block, and store the labeling results as XML documents.
2. Format document basic element extraction
Basic elements are extracted page by page. For each page, the characters are extracted together with their fonts, font sizes, styles (italic, bold and color) and position information; the characters are aggregated into a row set according to their positions; and each row's font information, position information, text content and typesetting mode are statistically analyzed. Horizontal and vertical lines are extracted and aggregated into a line set by a line aggregation algorithm. Graphics are extracted and a graphics set is established.
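As an illustration of aggregating characters into rows by position, the sketch below groups characters whose vertical positions lie within a tolerance. The `Char` record, the `y_tol` threshold and the assumption of horizontal typesetting are hypothetical; the source does not specify the aggregation rule.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x: float      # left edge of the glyph
    y: float      # vertical position on the page
    font: str
    size: float


def group_into_rows(chars: list[Char], y_tol: float = 2.0) -> list[list[Char]]:
    """Cluster characters whose y positions lie within y_tol of the row's first character."""
    rows: list[list[Char]] = []
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        if rows and abs(ch.y - rows[-1][0].y) <= y_tol:
            rows[-1].append(ch)
        else:
            rows.append([ch])
    for row in rows:
        row.sort(key=lambda c: c.x)   # reading order within the row
    return rows


def row_text(row: list[Char]) -> str:
    """Concatenate the characters of one row."""
    return "".join(c.text for c in row)
```

Row-level statistics such as the dominant font or the left and right extents can then be computed per row from the grouped `Char` records.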
3. Feature model learning
As shown in FIG. 2, feature model learning comprises learning the document layout features, extracting the visual and semantic features of the labeled elements, and statistically learning the feature weights.
Learning of document layout features: the page width and page height of every page are extracted from the sample data and statistically analyzed to obtain the range and the highest-weight value of the page width and page height. The layout position of each labeled element on its page is extracted, and the type-area (page core) feature of the page, namely the four margins from the page boundary, is computed from these positions. The type-area features of all pages of all sample data are then statistically analyzed to obtain the range and the highest-weight value of the type-area feature.
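A minimal sketch of the layout statistics follows. It assumes that the "highest-weight value" is the most frequent observed value after rounding, which the source does not state explicitly; the function names and the `(left, top, right, bottom)` box convention are likewise illustrative.

```python
from collections import Counter


def range_and_mode(values: list[float], ndigits: int = 0) -> dict:
    """Return the observed range and the most frequent (rounded) value."""
    rounded = [round(v, ndigits) for v in values]
    mode, count = Counter(rounded).most_common(1)[0]
    return {"min": min(values), "max": max(values),
            "mode": mode, "weight": count / len(values)}


def margins(page_w: float, page_h: float, box: tuple) -> dict:
    """Four distances from a labeled box to the page boundary.

    box = (left, top, right, bottom) in page coordinates (assumed).
    """
    left, top, right, bottom = box
    return {"left": left, "top": top,
            "right": page_w - right, "bottom": page_h - bottom}
```

Applying `range_and_mode` to each margin across all sample pages yields the kind of range plus dominant value that the type-area feature is described as holding.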
Learning of the visual features of labeled elements: the visual features of each labeled element are extracted from the sample data. They include features of the element itself, such as its label name, the pages, lines and space it occupies, its typeset font, color, width and height, whether it is bold or italic, its visual alignment (left-aligned, centered or right-aligned), the line spacing inside the element, and its layout feature, namely its four distances from the page boundaries. They also include relationship features between the element and other labeled elements, such as the order in which the element appears, the elements that may precede and follow it, and the distance between the element and the element that follows it.
Learning of the semantic features of labeled elements: semantic features of the labeled elements, such as keyword features, semantic features of the first line, and semantic features of times and dates, are extracted from the sample data.
Statistical learning of feature weights: the visual and semantic feature values of the labeled elements in all sample data are statistically analyzed to obtain the range and the highest-weight value of each feature value; for non-numeric features, the weight ratio of each distinct value is calculated.
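The weight statistics can be sketched as below, under the assumption that a value's weight is its relative frequency in the samples. The function names and the returned dictionary layout are hypothetical.

```python
from collections import Counter


def value_weights(values: list) -> dict:
    """Relative frequency of each distinct value of a non-numeric feature."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: n / total for v, n in counts.most_common()}


def learn_feature(values: list) -> dict:
    """Summarize one feature across all labeled samples."""
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                  for v in values)
    if numeric:
        top, n = Counter(values).most_common(1)[0]
        return {"min": min(values), "max": max(values),
                "top": top, "top_weight": n / len(values)}
    return {"weights": value_weights(values)}
```

For example, an alignment feature observed as centered in three of four samples would receive the weights 0.75 for "center" and 0.25 for the other value.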
After the above features have been learned from the sample data, a feature model of the sample data is constructed from them. If sample data is added, the statistics and learning are repeated and the feature model is iteratively updated. The feature model supports serialization, and can be stored to and read back from XML files.
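The XML round trip of a feature model might look like the following sketch, which uses Python's standard `xml.etree.ElementTree`. The schema (`featureModel`, `element`, `feature`) and the flat string-valued model are assumptions; the source does not describe the actual file layout.

```python
import xml.etree.ElementTree as ET


def model_to_xml(model: dict[str, dict[str, str]]) -> str:
    """Serialize {element name: {feature name: value}} to an XML string."""
    root = ET.Element("featureModel")
    for element_name, features in model.items():
        node = ET.SubElement(root, "element", name=element_name)
        for key, value in features.items():
            ET.SubElement(node, "feature", name=key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def model_from_xml(xml_text: str) -> dict[str, dict[str, str]]:
    """Read the model back; all values come back as strings."""
    root = ET.fromstring(xml_text)
    return {
        el.get("name"): {f.get("name"): f.text or ""
                         for f in el.findall("feature")}
        for el in root.findall("element")
    }
```

Re-learning after new samples are added would then amount to rebuilding the dictionary and writing the file again.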
4. Feature-based automatic indexing
As shown in FIG. 3, feature-based automatic indexing mainly comprises the following steps:
Step 1: Load the feature model into an in-memory structure.
Step 2: Extract the basic elements of the format document.
Step 3: Statistically analyze the basic elements extracted in step 2: compute the main, average, maximum and minimum row spacing of the document, and compute the main, average, maximum and minimum font information together with the frequency with which each font appears in the document.
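The document statistics of step 3 can be sketched as follows, assuming "main" means the most frequent value and that each row is summarized by a vertical position and a font size. Both function names and the input shapes are hypothetical.

```python
from collections import Counter
from statistics import mean


def spacing_stats(row_tops: list[float]) -> dict:
    """Row-spacing statistics from the vertical positions of consecutive rows.

    Requires at least two rows.
    """
    gaps = [round(b - a, 1) for a, b in zip(row_tops, row_tops[1:])]
    main, _ = Counter(gaps).most_common(1)[0]
    return {"main": main, "avg": mean(gaps),
            "min": min(gaps), "max": max(gaps)}


def font_stats(row_sizes: list[float]) -> dict:
    """Font-size statistics and the frequency of each size."""
    freq = Counter(row_sizes)
    return {"main": freq.most_common(1)[0][0], "avg": mean(row_sizes),
            "min": min(row_sizes), "max": max(row_sizes),
            "freq": dict(freq)}
```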
Step 4: Locate the start position of each element based on semantic features. The row set of the document is traversed page by page, and the content of each row is matched against the semantic features of each element in the feature model using the formula Index = Reg(Text, MetaFeature, SecurityFeature, Rule), where MetaFeature is the feature of the current element, SecurityFeature is the semantic feature of the current element, and Rule is the keyword expression in that semantic feature. Index >= 0 indicates that the current row satisfies the semantic feature of the corresponding element; it is then determined whether that semantic feature is a feature keyword. If it is, the current row is confirmed as the start position of the matched element and (Index, MetaFeatureName) is recorded, where MetaFeatureName is the element name. If it is not, the row is further checked against the visual features of the current element: if they are satisfied, the current row is judged to be the start position of the matched element and (Index, MetaFeatureName) is recorded; otherwise the features of the other elements are checked. After the features of all elements have been traversed, if the records matched for the current row contain the positions of several elements, the row is split according to Index. The processed (Index, MetaFeatureName) records are stored in the interaction data set of the processing procedure, and the Flag of the corresponding element is set.
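The matching logic of step 4 can be sketched in Python as below. This is an assumption-laden illustration: `reg` takes a simplified two-argument form rather than the four-argument Reg of the source, the keyword expression is treated as a regular expression, and `SemanticFeature`, `locate_starts` and the caller-supplied `visual_ok` check are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class SemanticFeature:
    rule: str          # keyword expression, treated here as a regex
    is_keyword: bool   # True if a match alone fixes the start position


def reg(text: str, feature: SemanticFeature) -> int:
    """Return the match offset in the row text, or -1 if there is no match."""
    m = re.search(feature.rule, text)
    return m.start() if m else -1


def locate_starts(rows, model, visual_ok):
    """Record (Index, element name) pairs for every row that starts an element.

    rows: list of row strings; model: {element name: [SemanticFeature]};
    visual_ok(row_no, element_name) -> bool is supplied by the caller.
    """
    hits = {}
    for row_no, text in enumerate(rows):
        for name, features in model.items():
            for feat in features:
                index = reg(text, feat)
                if index < 0:
                    continue
                if feat.is_keyword or visual_ok(row_no, name):
                    hits.setdefault(row_no, []).append((index, name))
                    break
    # Sorting by Index gives the cut points when one row holds several elements.
    return {row: sorted(found) for row, found in hits.items()}
```

A row that yields more than one pair would be split at the recorded offsets, each piece becoming the start of its own element.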
Step 5: Aggregate elements based on the visual feature order. The elements are traversed in feature order, and the interaction data is searched to see whether the start position of each element was determined in step 4. If not, the traversal continues. If so, the current row is defined as A, the set of possible next elements of the current element is obtained from the feature model, and the interaction data is searched for these elements. If one is found, its row is defined as B, the rows between A and B that satisfy the type-area feature of the document are aggregated into the interaction data corresponding to A, the Flag of the corresponding element is set, and the traversal continues.
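A rough sketch of the order-based aggregation in step 5 follows. The data shapes are assumptions: `starts` maps each located element name to its start row, `next_elements` maps an element to the names that may follow it, and `in_type_area` is a caller-supplied predicate. Choosing the nearest located successor as B is also an assumption.

```python
def aggregate_by_order(starts, order, next_elements, in_type_area):
    """Assign the rows between an element's start (A) and its successor's start (B)."""
    labels = {row: name for name, row in starts.items()}
    for name in order:
        if name not in starts:
            continue                      # start was not located in step 4
        a = starts[name]
        following = [starts[n] for n in next_elements.get(name, ())
                     if n in starts and starts[n] > a]
        if not following:
            continue                      # no located successor; left to step 6
        b = min(following)
        for row in range(a + 1, b):
            if in_type_area(row):
                labels[row] = name
    return labels
```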
Step 6: Aggregate elements based on visual feature information. The row set of the document is traversed page by page to find the first row with Flag > 0, whose index is denoted i; the traversal then continues from i with index j. If Flag_j > 0 and Flag_i != Flag_j, row j is skipped and the traversal continues with the other rows. The specific aggregation method is as follows:
In the formula, RowInDocRange computes whether a row satisfies the type-area feature in the feature model; MatchRowSpace computes whether the spacing between two rows satisfies the row-spacing feature of the current element in the feature model; matchpoint computes whether the font features of a row satisfy the font features of the current element in the feature model; and MatchAlign computes whether the current row and the first row share paragraph features such as indentation and left/right alignment.
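Because the aggregation formula itself is not reproduced in the text, the following is only one plausible reading of step 6: an unlabeled row inherits the Flag of the preceding labeled block when all four checks pass. The four predicates are passed in as callables; `match_font` stands in for the check the source calls matchpoint, and every signature here is an assumption.

```python
from typing import Callable


def aggregate_rows(flags: list[int],
                   row_in_doc_range: Callable[[int], bool],
                   match_row_space: Callable[[int, int], bool],
                   match_font: Callable[[int, int], bool],
                   match_align: Callable[[int, int], bool]) -> list[int]:
    """Propagate each positive Flag to the unlabeled rows that follow it."""
    flags = list(flags)                   # do not mutate the caller's list
    i = 0
    while i < len(flags):
        if flags[i] <= 0:
            i += 1
            continue
        j = i + 1
        # Stop at the first row that carries a different positive Flag.
        while j < len(flags) and flags[j] in (0, flags[i]):
            if flags[j] == 0:
                ok = (row_in_doc_range(j)
                      and match_row_space(j - 1, j)
                      and match_font(j, flags[i])
                      and match_align(j, i))
                if not ok:
                    break
                flags[j] = flags[i]
            j += 1
        i = j
    return flags
```

Rows that fail any check stay at Flag 0, so they remain available to the visual-feature extraction of step 7.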
Step 7: Extract elements based on visual features. Each element in the feature model is traversed in feature order. If the current element is already in the interaction data, it is skipped; if it is not but the element has semantic features, it is also skipped. Once an element to be extracted is found, the rows are traversed page by page and three conditions are checked: (1) if the element's features include the maximum-font feature, whether the row uses the maximum font according to the statistical information; otherwise, whether the font of the current row lies in [FontMin, FontMax]; (2) whether the alignment of the current row satisfies the alignment feature of the element in the feature model; (3) whether the current row satisfies the type-area feature. If all three conditions are met, the current row is confirmed as the start row of the element, and step 6 is repeated to aggregate the rows that conform to the visual features into the element.
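The three-condition start-row test of step 7 can be sketched as below. The `Row` and `ElementFeature` records and their field names are hypothetical, and treating "maximum font" as "at least the largest font size found in step 3" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Row:
    font_size: float
    align: str            # "left", "center" or "right"
    in_type_area: bool


@dataclass
class ElementFeature:
    is_max_font: bool     # element is expected to use the largest font
    font_min: float
    font_max: float
    align: str


def is_start_row(row: Row, feat: ElementFeature, doc_max_font: float) -> bool:
    """Check font, alignment and type-area conditions for one row."""
    if feat.is_max_font:
        font_ok = row.font_size >= doc_max_font
    else:
        font_ok = feat.font_min <= row.font_size <= feat.font_max
    return font_ok and row.align == feat.align and row.in_type_area
```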
5. Post-processing and outputting of indexing results
The post-processing and output of the indexing result normalizes the content of each element identified in the interaction data: (1) handling of spaces and other special characters; (2) normalization of full-width/half-width characters; and (3) joining of line contents for Chinese and English text. Finally, the normalized content is stored as XML.
Although embodiments of the present invention are described above, they are provided only to facilitate understanding of the invention and are not intended to limit it. Any person skilled in the art may make modifications and variations in form and detail without departing from the spirit and scope of the present disclosure, but the scope of protection remains subject to the scope of the appended claims.

Claims (3)

1. An automatic indexing method for small sample format documents based on feature learning, characterized by comprising the following steps: sample preparation; extraction of the basic elements of the format document; feature model learning; feature-based automatic indexing; and post-processing and output of the indexing result;
the sample preparation comprises selecting samples and labeling the elements to be processed block by block;
the extraction of the basic elements of the format document comprises: extracting the characters of the format document and their font and style information; extracting horizontal lines and vertical lines and establishing a line set; and extracting graphics and establishing a graphics set;
the feature model learning comprises: extracting the layout features, visual features and semantic features of the document elements, their weights and the relationship features between elements; establishing a feature model of the document based on these features; and iteratively updating the model through supplementary learning of sample data;
the feature-based automatic indexing is the process of extracting document elements according to the visual features, semantic features and inter-element relationships in the feature model, and obtaining element blocks by clustering and merging the elements;
the post-processing and output of the indexing result comprises extracting the content of the element blocks obtained by automatic indexing, normalizing the content (handling spaces and other special characters, normalizing full-width/half-width characters, and joining lines), and storing the normalized content as an XML document.
2. The automatic indexing method for small sample format documents based on feature learning according to claim 1, characterized in that a small sample feature learning method is adopted;
the small sample feature learning comprises: learning of the visual features of labeled elements, learning of the semantic features of labeled elements, and statistical learning of feature weights;
the visual features of labeled elements comprise: label name, occupied pages, occupied lines, occupied space, typeset font, color, width and height, whether bold or italic, the visual alignment of the element (left-aligned, centered or right-aligned), the line spacing inside the element, the layout feature of the element, namely its four distances from the page boundaries, and element order features;
the semantic features of labeled elements comprise: keyword features of the element block, semantic features of the first line of the element block, and semantic features of times and dates;
the statistical learning of feature weights means statistically analyzing the visual and semantic feature values of the labeled elements in all sample data to obtain the range and the highest-weight value of each feature value; for non-numeric features, the weight ratio of each distinct value is calculated.
3. The automatic indexing method for small sample format documents based on feature learning according to claim 1, characterized in that the start positions of elements are located based on semantic features, visual features and the feature order, and unlabeled rows in the document are aggregated into element blocks conforming to the features according to the visual features and the feature order, wherein the specific aggregation algorithm is as follows:
where i denotes the row from which aggregation starts, Row_k denotes the k-th row, and Flag is the element label of a row; RowInDocRange computes whether a row satisfies the type-area feature in the feature model; MatchRowSpace computes whether the spacing between two rows satisfies the row-spacing feature of the current element in the feature model; matchpoint computes whether the font features of a row satisfy the font features of the current element in the feature model; and MatchAlign computes whether the current row and the first row share paragraph features such as indentation and left/right alignment.
CN202211646469.3A 2022-12-21 2022-12-21 Automatic indexing method for small sample format document based on feature learning Pending CN116502601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211646469.3A CN116502601A (en) 2022-12-21 2022-12-21 Automatic indexing method for small sample format document based on feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211646469.3A CN116502601A (en) 2022-12-21 2022-12-21 Automatic indexing method for small sample format document based on feature learning

Publications (1)

Publication Number Publication Date
CN116502601A true CN116502601A (en) 2023-07-28

Family

ID=87323680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211646469.3A Pending CN116502601A (en) 2022-12-21 2022-12-21 Automatic indexing method for small sample format document based on feature learning

Country Status (1)

Country Link
CN (1) CN116502601A (en)

Similar Documents

Publication Publication Date Title
US20160055376A1 (en) Method and system for identification and extraction of data from structured documents
DE102011079443A1 (en) Learning weights of typed font fonts in handwriting keyword retrieval
US8620079B1 (en) System and method for extracting information from documents
EP2544099A1 (en) Method for creating an enrichment file associated with a page of an electronic document
CN111274239A (en) Test paper structuralization processing method, device and equipment
CN108710671B (en) Method and device for extracting company name in text
CN112395851A (en) Text comparison method and device, computer equipment and readable storage medium
CN112036144A (en) Data analysis method and device, computer equipment and readable storage medium
CN113761202A (en) Optimization system for mapping unstructured financial Excel table to database
CN114528413A (en) Knowledge graph updating method, system and readable storage medium supported by crowdsourced marking
CN115936624A (en) Basic level data management method and device
CN114782965A (en) Visual rich document information extraction method, system and medium based on layout relevance
CN114004221A (en) Method and device for correcting table content
CN109472020B (en) Feature alignment Chinese word segmentation method
CN111783416B (en) Method for constructing document image data set by using priori knowledge
CN117034948B (en) Paragraph identification method, system and storage medium based on multi-feature self-adaptive fusion
CN116502601A (en) Automatic indexing method for small sample format document based on feature learning
CN114579796B (en) Machine reading understanding method and device
CN113779218B (en) Question-answer pair construction method, question-answer pair construction device, computer equipment and storage medium
CN116403233A (en) Image positioning and identifying method based on digitized archives
CN107145947B (en) Information processing method and device and electronic equipment
CN113127595B (en) Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract
CN112651725B (en) Electronic invoice parsing method and device
CN114417820A (en) Content filtering method for target object
CN114418014A (en) Test paper generation system for avoiding test question similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination