CN116502601A

CN116502601A - Automatic indexing method for small sample format document based on feature learning

Info

Publication number: CN116502601A
Application number: CN202211646469.3A
Authority: CN
Inventors: 李愿军; 赵兰; 吴涛; 张镔
Original assignee: Tongfang Knowledge Network Digital Publishing Technology Co ltd
Current assignee: Tongfang Knowledge Network Digital Publishing Technology Co ltd
Priority date: 2022-12-21
Filing date: 2022-12-21
Publication date: 2023-07-28

Abstract

The invention discloses an automatic indexing method for small-sample layout documents based on feature learning, comprising the following steps: sample preparation; extraction of the basic elements of the layout document; feature model learning; feature-based automatic indexing; and post-processing and output of the indexing result. Based on the visual and semantic features of layout documents, the invention provides an overall approach to automatic indexing of layout documents and specific solutions for its key steps.

Description

Automatic indexing method for small sample format document based on feature learning

Technical Field

The invention relates to the field of layout document processing, and in particular to an automatic indexing method for small-sample layout documents based on feature learning.

Background

With the goal of "accelerating the pace of building a digital society" put forward in the long-range objectives outline, the digital transformation and upgrading of government bodies, enterprises, and public institutions, and the trend toward data informatization, have been promoted. Layout documents are now widely used by all kinds of organizations, which creates a broad demand for processing documents of many types and structures.

Government bodies, enterprises, and public institutions hold many types of layout documents whose structures are basically consistent but whose contents differ. If automatic indexing based on traditional rules is adopted, a program must be written for each document structure according to the structural rules of that style, and for documents with many different structures some rules may even conflict with one another. Rule-based automatic indexing is therefore inefficient for documents of many types and structures. Deep learning cannot fully meet users' requirements either: in most organizations the number of documents with a similar style and structure is too small to train a deep learning model, and at its current stage of development the quality of deep-learning-based document processing is still relatively low.

The feature-learning-based automatic indexing method for small-sample layout documents extracts visual and semantic features from already-produced small-sample data through statistical analysis. Automatic indexing then uses the extracted feature model to process, fully automatically, layout documents that are similar to the samples and conform to those features.

Disclosure of Invention

In order to solve the above technical problems, the invention aims to provide an automatic indexing method for small-sample layout documents based on feature learning, implemented by the following technical scheme:

sample preparation;

extracting basic elements of the format document;

learning a feature model;

automatic indexing based on features;

post-processing and outputting an indexing result.

In the above technical scheme, sample preparation means that digitization staff select samples and annotate the elements to be processed block by block.

In the above technical scheme, extraction of the basic elements of the layout document means processing the document page by page: extracting the characters on each page together with their fonts, font sizes, styles (italic, bold, and color), and positions, and establishing a character set; extracting horizontal and vertical lines and establishing a line set; and extracting graphics and establishing a graphics set.

In the above technical scheme, feature model learning means extracting from the sample data the layout features, visual features, and semantic features of each annotated document element, together with their weights and the relationship features between elements; establishing a feature model for this class of documents from the extracted features; and iteratively updating the model through supplementary learning on additional sample data.

In the above technical scheme, feature-based automatic indexing refers to the process of obtaining element blocks by clustering and merging the basic elements of the layout document according to the visual and semantic features of each element in the feature model.

In the above technical scheme, post-processing and output of the indexing result means extracting the content of the element blocks obtained by automatic indexing, normalizing it (handling spaces and other special characters, normalizing full-width and half-width characters, and joining lines), and storing the normalized content as an XML document.
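The normalization described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`to_half_width`, `join_lines`, `normalize_block`, `to_xml`), the XML tag names, and the rule of joining Chinese lines without a separator and other lines with a single space are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET


def to_half_width(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic
    space (U+3000) to their half-width equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def join_lines(lines):
    """Join physical lines (assumed rule): no separator between two CJK
    characters, a single space in every other case."""
    result = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if result and not (is_cjk(result[-1]) and is_cjk(line[0])):
            result += " "
        result += line
    return result


def normalize_block(lines):
    """Normalize character width, join lines, and collapse runs of whitespace."""
    text = join_lines(to_half_width(line) for line in lines)
    return re.sub(r"\s+", " ", text).strip()


def to_xml(blocks):
    """Serialize {element name: list of lines} to an XML string.
    The tag and attribute names are hypothetical."""
    root = ET.Element("document")
    for name, lines in blocks.items():
        element = ET.SubElement(root, "element", name=name)
        element.text = normalize_block(lines)
    return ET.tostring(root, encoding="unicode")
```

For example, `to_xml({"title": ["自动", "标引"]})` joins the two lines without a space before writing the element. Hyphenated English words split across lines are not handled by this sketch.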

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention

FIG. 2 is a flow chart of feature model learning of the present invention

FIG. 3 is a flow chart of the present invention for automatic indexing using feature models

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail with reference to the accompanying drawings.

An automatic indexing method for small-sample layout documents based on feature learning, as shown in FIG. 1, comprises:

1. Sample preparation

Digitization staff select several documents whose semantic and visual features are relatively comprehensive, annotate the elements to be processed block by block, and store the annotation results as XML documents.

2. Layout document basic element extraction

Extraction of the basic elements of the layout document proceeds page by page. The characters on each page are extracted together with their fonts, font sizes, styles (italic, bold, and color), and position information; the characters are aggregated into a row set (text rows) according to their positions, and the font information, position information, text content, and typesetting mode of each row are statistically analyzed. Horizontal and vertical lines are extracted and aggregated into a line set by a line aggregation algorithm. Graphics are extracted and a graphics set is established.
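The aggregation of characters into rows can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the invention: the `Char` record, the `y_tolerance` parameter, and the use of the most frequent (font, size) pair as the row font are all hypothetical choices.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x: float      # left edge of the character
    y: float      # vertical position of the character on the page
    font: str
    size: float


def group_into_rows(chars, y_tolerance=2.0):
    """Aggregate characters into rows: a character joins the current row when
    its y differs from the row's first character by at most y_tolerance."""
    rows = []
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        if rows and abs(ch.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(ch)
        else:
            rows.append([ch])
    for row in rows:
        row.sort(key=lambda c: c.x)   # reading order inside the row
    return rows


def row_summary(row):
    """Per-row statistics: text content, dominant font and size, position."""
    (font, size), _ = Counter((c.font, c.size) for c in row).most_common(1)[0]
    return {
        "text": "".join(c.text for c in row),
        "font": font,
        "size": size,
        "left": min(c.x for c in row),
        "y": row[0].y,
    }
```

A real extractor would obtain the characters from a PDF or OFD parsing library; that step is outside the scope of this sketch.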

3. Feature model learning

Feature model learning comprises learning of the document layout features, extraction of the visual and semantic features of the annotated elements, and statistical learning of the feature weights, as shown in FIG. 2.

Learning of document layout features: the page width and page height of every page of all sample data are extracted and statistically analyzed to obtain the range and the highest-weight value of the page width and page height. For each page, the layout positions of the annotated elements on that page are extracted and used to compute the page's type-area feature, that is, the four margins between the content and the page boundaries. The type-area features of all pages of all sample data are then statistically analyzed to obtain the range and the highest-weight value of the type-area feature.

Learning of the visual features of annotated elements: the visual features of each annotated element are extracted from the sample data. They include features of the element itself, such as its annotation name, the pages, rows, and space it occupies, its typeset font, color, width and height, whether it is bold or italic, its visual alignment (left-aligned, centered, or right-aligned), the line spacing inside the element, and its layout feature (the four margins from the page boundaries); and features of its relationship to other annotated elements, such as the order in which it appears, the elements that may precede or follow it, and the spacing between it and the following element.

Learning of the semantic features of annotated elements: semantic features of the annotated elements, such as keyword features, semantic features of the first line, and semantic features of times and dates, are extracted from the sample data.

Statistical learning of feature weights: the visual and semantic feature values of the annotated elements in all sample data are statistically analyzed to obtain the range and the highest-weight value of each feature. For non-numeric features, the weight ratio of each distinct value of the feature is calculated.

After these features have been learned from the sample data, a feature model of the sample data is constructed from them. If sample data is added, the statistics and learning are repeated and the feature model is updated iteratively. The feature model supports serialization and can be saved to and loaded from XML files.
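The layout statistics above can be sketched in a few lines of Python. The sketch assumes that the "highest-weight value" is the most frequently observed value and that the four edge distances are measured from the union of the annotated element boxes to the page boundary; both are interpretations, and the names `type_area_margins` and `range_and_mode` are hypothetical.

```python
from collections import Counter


def type_area_margins(page_width, page_height, boxes):
    """boxes: (left, top, right, bottom) of each annotated element on a page.
    Returns the (left, top, right, bottom) distances between the union of
    the boxes and the page boundary."""
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[2] for b in boxes)
    bottom = max(b[3] for b in boxes)
    return (left, top, page_width - right, page_height - bottom)


def range_and_mode(values, ndigits=0):
    """Range of the observed values plus the most frequent (rounded) value,
    used here as a stand-in for the highest-weight value."""
    rounded = [round(v, ndigits) for v in values]
    mode, _ = Counter(rounded).most_common(1)[0]
    return {"min": min(values), "max": max(values), "mode": mode}
```

Applying `range_and_mode` to each of the four margins collected over all sample pages yields one range and one dominant value per margin.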
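A minimal sketch of this statistical step, assuming each annotated element is represented as a dictionary of feature values and that the weight of a value is its relative frequency in the samples. The function name `learn_feature_stats` and the feature names in the usage example are hypothetical.

```python
from collections import Counter
from numbers import Number


def learn_feature_stats(samples):
    """samples: one dict per annotated element, mapping feature name -> value.
    Numeric features get a range and a most frequent value; non-numeric
    features (including booleans) get the ratio of each distinct value."""
    by_feature = {}
    for sample in samples:
        for name, value in sample.items():
            by_feature.setdefault(name, []).append(value)

    stats = {}
    for name, values in by_feature.items():
        counts = Counter(values)
        numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        if numeric:
            stats[name] = {
                "min": min(values),
                "max": max(values),
                "top": counts.most_common(1)[0][0],
            }
        else:
            total = len(values)
            stats[name] = {"ratios": {v: c / total for v, c in counts.items()}}
    return stats
```

With four title samples of which three are centered, the `align` feature would yield a ratio of 0.75 for `"center"`, which a matcher could use as the weight of that value.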
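The XML round trip of a feature model might look like the sketch below. The XML schema is not given in the source, so the tag names (`FeatureModel`, `MetaFeature`, `Feature`) and the flat name/value layout are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def model_to_xml(model):
    """model: {element name: {feature name: value}} -> XML string."""
    root = ET.Element("FeatureModel")
    for element_name, features in model.items():
        meta = ET.SubElement(root, "MetaFeature", name=element_name)
        for feature_name, value in features.items():
            ET.SubElement(meta, "Feature", name=feature_name, value=str(value))
    return ET.tostring(root, encoding="unicode")


def model_from_xml(xml_text):
    """Inverse of model_to_xml; all values come back as strings."""
    root = ET.fromstring(xml_text)
    return {
        meta.get("name"): {
            feature.get("name"): feature.get("value")
            for feature in meta.findall("Feature")
        }
        for meta in root.findall("MetaFeature")
    }
```

Because values are stored as attribute strings, a caller that needs numbers has to convert them after loading.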

4. Feature-based automatic indexing

Feature-based automatic indexing, as shown in FIG. 3, mainly comprises the following steps:

Step 1: Load the feature model into an in-memory structure.

Step 2: Extract the basic elements of the layout document.

Step 3: Statistically analyze the basic elements of the document extracted in Step 2: compute the document's main line spacing, average line spacing, and maximum and minimum line spacing; and compute the main font, the average font, the maximum and minimum font, and the frequency with which each font appears in the document.
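The document-level statistics of Step 3 can be sketched as follows, assuming that "main" means the most frequent value, that fonts are compared by size, and that the rows of one page are given by their vertical positions. The function names and the rounding to one decimal are hypothetical choices.

```python
from collections import Counter
from statistics import mean


def spacing_stats(row_tops):
    """Line-spacing statistics from the vertical positions of the rows on one page."""
    tops = sorted(row_tops)
    gaps = [round(b - a, 1) for a, b in zip(tops, tops[1:])]
    if not gaps:
        return None
    return {
        "main": Counter(gaps).most_common(1)[0][0],
        "average": mean(gaps),
        "min": min(gaps),
        "max": max(gaps),
    }


def font_stats(row_font_sizes):
    """Font-size statistics over all rows, including how often each size occurs."""
    counts = Counter(row_font_sizes)
    return {
        "main": counts.most_common(1)[0][0],
        "average": mean(row_font_sizes),
        "min": min(row_font_sizes),
        "max": max(row_font_sizes),
        "frequency": dict(counts),
    }
```

These values provide the document-specific baseline against which the learned font and line-spacing features are compared in the later steps.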

Step 4: Locate the start position of each element based on semantic features. The specific method is as follows: traverse the row set of the document page by page and match the content of each row against the semantic features of each element in the feature model, using the matching formula Index = Reg(Text, MetaFeature, SemanticFeature, Rule), where MetaFeature is the current element feature, SemanticFeature is a semantic feature of the current element, and Rule is the keyword expression in that semantic feature. When Index >= 0, the current row satisfies the semantic feature of the corresponding element, and it is then determined whether the current semantic feature is a feature keyword. If so, the current row is confirmed as the start position of the currently matched element and (Index, MetaFeatureName) is recorded, where MetaFeatureName is the element name. If not, the row is further checked against the visual features of the current element: if the visual features are satisfied, the current row is judged to be the start position of the currently matched element and (Index, MetaFeatureName) is recorded; otherwise, the features of the other elements are checked. After the features of all elements have been traversed, if the position records matched for the current row contain the positions of several elements, the row is split according to Index. The processed (Index, MetaFeatureName) records are stored in the interaction data set of the processing procedure, and the Flag of the corresponding element is set.
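The matching in Step 4 can be sketched as below. The source does not define the matching function beyond its arguments, so this sketch assumes it behaves like a regular-expression search that returns the match offset or -1; the model layout (`rule`, `is_keyword`, `visual_ok`) and the function names are hypothetical.

```python
import re


def reg(text, rule):
    """Assumed behavior of the matching formula: offset of the first match
    of the keyword expression in the row text, or -1 if there is none."""
    match = re.search(rule, text)
    return match.start() if match else -1


def locate_element_starts(rows, model):
    """rows: list of row texts.
    model: {element name: {"rule": regex, "is_keyword": bool,
                           "visual_ok": callable(row_number) -> bool}}.
    Returns (row number, Index, element name) records; several records for
    the same row are ordered by Index so the row can be split there."""
    records = []
    for row_number, text in enumerate(rows):
        row_hits = []
        for name, feature in model.items():
            index = reg(text, feature["rule"])
            if index < 0:
                continue
            # A keyword match is accepted directly; any other semantic match
            # must also pass the element's visual check.
            if feature["is_keyword"] or feature["visual_ok"](row_number):
                row_hits.append((row_number, index, name))
        records.extend(sorted(row_hits, key=lambda hit: hit[1]))
    return records
```

Sorting the hits of one row by offset gives the cut points when a single row contains the start of more than one element.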

Step 5: Element aggregation based on the visual feature order. Traverse the elements in feature order and look up in the interaction data whether the start position of the element was determined in Step 4. If not, continue traversing. If so, define that row as A, obtain from the feature model the set of possible next elements of the current element, and check whether any of them exists in the interaction data. If so, define that row as B, aggregate the rows between A and B that satisfy the document's type-area feature into the interaction data corresponding to A, set the Flag of the corresponding element, and continue traversing.

Step 6: Element aggregation based on visual feature information. The specific steps are as follows: traverse the row set of the document page by page to find the first row with Flag > 0, whose index is i; continue traversing the row set from i with index j; if the row at j has Flag_j > 0 and Flag_i != Flag_j, skip it and go on to traverse the other rows. The specific aggregation method is as follows:

In the formula, RowInDocRange determines whether a row satisfies the type-area feature in the feature model; MatchRowSpace determines whether the spacing between two rows satisfies the line-spacing feature of the current element in the feature model; MatchFont determines whether the font features of the row satisfy the font features of the current element in the feature model; MatchAlign determines whether the current row and the first row share paragraph features such as indentation and left/right alignment.
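The aggregation formula itself is not reproduced in this text, so the following Python sketch is only one plausible reading of Step 6. It assumes that an unlabeled row is absorbed into the preceding labeled element when all four predicates hold at once, and that aggregation stops at the first row that fails or already carries a label. The function and parameter names are hypothetical stand-ins for the predicates described above.

```python
def aggregate_rows(rows, flags, row_in_doc_range, match_row_space,
                   match_font, match_align):
    """Extend each labeled row downward over the unlabeled rows that follow it.

    flags[k] > 0 is the element tag of row k; 0 means "not yet labeled".
    Returns a new list of flags; the input list is not modified.
    """
    flags = list(flags)
    i = 0
    while i < len(rows):
        if flags[i] <= 0:
            i += 1
            continue
        j = i + 1
        while j < len(rows) and flags[j] == 0:
            accepted = (
                row_in_doc_range(rows[j])
                and match_row_space(rows[j - 1], rows[j], flags[i])
                and match_font(rows[j], flags[i])
                and match_align(rows[i], rows[j], flags[i])
            )
            if not accepted:
                break
            flags[j] = flags[i]   # row j joins the element started at row i
            j += 1
        i = j
    return flags
```

Passing the predicates as callables keeps the loop independent of how the feature model stores its type-area, line-spacing, font, and alignment features.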

Step 7: Element extraction based on visual features. The specific method is as follows: traverse each element in the feature model in feature order; skip the element if it is already in the interaction data; also skip it if it is not in the interaction data but has semantic features; otherwise the element is one to be extracted, and the rows are traversed page by page. (1) If the element's features include a maximum-font feature, determine from the statistical information whether the row uses the largest font, in which case the row satisfies that feature; otherwise, determine whether the font feature of the current row lies within [FontMin, FontMax]. (2) Determine whether the alignment of the current row satisfies the alignment feature of the element in the feature model. (3) Determine whether the current row satisfies the type-area feature. If all three conditions are met, the current row is confirmed as the start row of the element, and Step 6 is repeated to aggregate the rows that conform to the visual features into the element.
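The three checks of Step 7 can be sketched as a single search for the start row. This is an illustration under assumptions: rows and features are plain dictionaries with hypothetical keys (`size`, `align`, `flag`, `max_font`, `font_min`, `font_max`, `aligns`), the font is compared by size, and the page-layout check is supplied by the caller.

```python
def find_start_row(rows, feature, doc_stats, in_layout_range):
    """Return the index of the first unlabeled row that passes the font,
    alignment, and page-layout checks, or None if no row qualifies."""
    for index, row in enumerate(rows):
        if row.get("flag", 0) > 0:
            continue  # row already belongs to another element
        # (1) Font check: largest font in the document, or a size range.
        if feature.get("max_font"):
            font_ok = row["size"] == doc_stats["max_font_size"]
        else:
            font_ok = feature["font_min"] <= row["size"] <= feature["font_max"]
        # (2) Alignment check against the element's learned alignments.
        align_ok = row["align"] in feature["aligns"]
        # (3) Page-layout check, delegated to the caller.
        if font_ok and align_ok and in_layout_range(row):
            return index
    return None
```

Once a start row is found, the rows below it would be collected by the same aggregation loop used in Step 6.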

5. Post-processing and outputting of indexing results

Post-processing and output of the indexing result means normalizing the content of each element identified in the interaction data: (1) handling spaces and other special characters; (2) normalizing full-width and half-width characters; and (3) joining the line contents of Chinese and English text. Finally, the normalized content is stored as XML.

Although the embodiments of the present invention are described above, they are provided only to facilitate understanding of the invention and are not intended to limit it. Any person skilled in the art may make modifications and variations in form and detail without departing from the spirit and scope of the present disclosure, but the scope of protection remains subject to the scope of the appended claims.

Claims

1. An automatic indexing method for small sample format documents based on feature learning, which is characterized by comprising the following steps: sample preparation; extracting basic elements of the format document; learning a feature model; automatic indexing based on features; post-processing and outputting an indexing result;

the sample preparation is to select samples and mark elements to be processed according to blocks;

the extraction of the layout document basic elements comprises the following steps: extracting layout document characters and character font style information, extracting horizontal lines and vertical lines, and establishing a line set; extracting a graph and establishing a graph set;

the feature model learning includes: extracting document element layout characteristics, visual characteristics, semantic characteristics, weights thereof and mutual relation characteristics, establishing a characteristic model of the document based on the document element layout characteristics, visual characteristics, semantic characteristics and mutual relation characteristics, and carrying out iterative updating on the model through supplementary learning of sample data;

the automatic indexing based on the characteristics refers to a process of extracting document elements according to visual characteristics, semantic characteristics and interrelationships among the elements in a characteristic model and acquiring element blocks by clustering and fusing the elements;

the post-processing and outputting of the indexing result means extracting the content of the element blocks obtained by automatic indexing and analysis, normalizing the content (handling spaces and other special characters, normalizing full-width and half-width characters, and joining lines), and storing the normalized content as an XML document.

2. The automatic indexing method for small sample layout documents based on feature learning as claimed in claim 1, wherein the method for small sample feature learning is adopted;

the small sample feature learning includes: learning of the visual characteristics of the labeling elements, learning of the semantic characteristics of the labeling elements and statistical analysis learning of the characteristic weights;

the labeling element visual features include: labeling names, occupied pages, occupied lines, occupied space, typeset fonts, colors, width and height, whether bold and italic, visual alignment modes of elements (including left alignment, centering and right alignment), line spacing inside the elements, layout characteristics of the elements, namely four-edge distances from page boundaries and element sequence characteristics;

the labeling element semantic features include: element block keyword features, element block first line semantic features, and semantic features of time and date;

the statistical analysis learning of the feature weights refers to statistically analyzing the visual and semantic feature values of the labeling elements in all sample data to obtain the range and the highest-weight value of each feature; for non-numeric features, the weight ratio of each distinct value of the feature is calculated.

3. The automatic indexing method for small sample layout documents based on feature learning according to claim 1, wherein the method is characterized in that the initial positions of elements are positioned based on semantic features, visual features and feature sequences, and lines without marks in the documents are aggregated into element blocks conforming to the features according to the visual features and feature sequences, wherein a specific aggregation algorithm is as follows:

where i indicates that aggregation starts from the current row i, Row_k denotes the k-th row, and Flag is the element tag corresponding to a row; RowInDocRange determines whether a row satisfies the type-area feature in the feature model; MatchRowSpace determines whether the spacing between two rows satisfies the line-spacing feature of the current element in the feature model; MatchFont determines whether the font features of the row satisfy the font features of the current element in the feature model; MatchAlign determines whether the current row and the first row share paragraph features such as indentation and left/right alignment.