CN109635681B

CN109635681B - Document processing method and device

Info

Publication number: CN109635681B
Application number: CN201811419695.1A
Authority: CN
Inventors: 孟晓静; 高宝庆; 王战波
Original assignee: Hanwang Technology Co Ltd
Current assignee: Hanwang Technology Co Ltd
Priority date: 2018-11-26
Filing date: 2018-11-26
Publication date: 2021-11-26
Anticipated expiration: 2038-11-26
Also published as: CN109635681A

Abstract

The application provides a document processing method in the field of document processing, which addresses the low efficiency of document data processing in the prior art. The method comprises: acquiring a feature template that expresses the stylistic features of a target document; performing text recognition on a text file describing the target document according to the feature template, and determining feature values of the business features of the target document; and outputting document information of the target document in a preset format according to the determined feature values of the business features and the feature template. Because the method extracts document data based on the feature template, it does not require semantic recognition over large volumes of data, which reduces the amount of computation and helps improve the efficiency of document data extraction.

Description

Document processing method and device

Technical Field

The present application relates to the field of document processing, and in particular, to a document processing method and apparatus.

Background

Ancient books and documents are an important basis for studying the natural, social, political, economic and cultural aspects of a given period and/or region. For example, a local gazetteer is a kind of document that comprehensively records the natural, social, political, economic and cultural conditions of a region at a given time. To facilitate research into and consultation of document information, structuring ancient documents is important. The common approach to structuring ancient documents is to obtain the text of the fragmented documents through scanning and recognition, and then to perform semantic recognition on that text so that the fragmented document content can be classified, sorted or indexed.

The prior-art document processing method therefore requires semantic recognition over large volumes of data and suffers from low document processing efficiency.

Disclosure of Invention

The embodiments of the present application provide a document processing method and apparatus that identify and match document data through a feature template, so as to solve the problem of low document data processing efficiency.

In a first aspect, an embodiment of the present application provides a document processing method, including:

obtaining a feature template for expressing stylistic features of a target document, the feature template comprising business features;

performing text recognition on a text file describing the target document according to the feature template, and determining feature values of the business features of the target document; and

outputting document information of the target document in a preset format according to the determined feature values of the business features and the feature template.

Optionally, before the step of obtaining a feature template for expressing the stylistic features of the target document, the method further includes:

constructing a feature template with a stylistic hierarchy according to the business features of the target document and the order and repetition pattern in which the business features appear in the target document.

Optionally, the feature template includes the format feature and business feature of each stylistic level, and the step of performing text recognition on the text file describing the target document according to the feature template to determine the feature values of the business features of the target document includes:

identifying, in order from front to back in the text file describing the target document and from the highest stylistic level to the lowest, the text that matches the format feature and business feature of each stylistic level in the feature template; and

determining the feature value of the business feature of each stylistic level of the target document from the recognized text.

Optionally, the text file records the text blocks of the target document in order, and the step of identifying, in order from front to back in the text file describing the target document and from the highest stylistic level to the lowest, the text that matches the format feature and business feature of each stylistic level in the feature template includes:

an initial condition determining sub-step, for determining the highest stylistic level in the feature template as the specified stylistic level and determining the first text block of the text file as the specified text block;

a layer-by-layer matching sub-step, for taking, in order from high to low, the format feature and business feature of the specified stylistic level and of each stylistic level below it in the feature template as the current format feature and current business feature, and, each time the current format feature and current business feature are determined, performing the operation of traversing the text file from front to back starting from the specified text block in the text file describing the target document and determining the first text block after the specified text block that matches the current format feature and the current business feature, until the text file has been fully traversed or until the first text block matching the format feature and business feature of the lowest stylistic level of the feature template has been successfully determined;

wherein, when the operation of traversing the text file from front to back starting from the specified text block and determining the first text block that matches the current format feature and the current business feature is executed again, the specified text block is the text block following the previously determined first text block that matched the current format feature and the current business feature.

Optionally, the step of identifying, in order from front to back in the text file describing the target document and from the highest stylistic level to the lowest, the text that matches the format feature and business feature of each stylistic level in the feature template further includes:

after the first text block matching the format feature and business feature of the lowest stylistic level of the feature template has been successfully determined, performing the following operations:

determining the text block following that first text block as the specified text block;

determining the format feature that matches the format of the specified text block;

judging whether a format feature matching the format of the specified text block was successfully determined;

if so, determining the stylistic level to which the matching format feature belongs as the specified stylistic level, and jumping to the layer-by-layer matching sub-step; otherwise, determining the text block following the specified text block as the specified text block, and jumping to the sub-step of determining the format feature that matches the format of the specified text block.

Optionally, the step of determining the format feature that matches the format of the specified text block includes:

determining the matching format feature according to the format of the specified text block alone, or according to the formats of the specified text block and its neighboring (context) text blocks.

Optionally, the step of traversing the text file from front to back starting from the specified text block in the text file describing the target document and determining the first text block after the specified text block that matches the current format feature and the current business feature includes:

traversing the text file from front to back starting from the specified text block in the text file describing the target document, and determining the first text block after the specified text block that matches the current format feature;

judging whether a corresponding business feature dictionary is configured for the current business feature;

if so, verifying the text content of the determined first text block matching the current format feature against the corresponding business feature dictionary;

if the verification succeeds, determining that the first text block matching the current format feature is the first text block matching both the current format feature and the current business feature;

if the verification fails and the first text block matching the current format feature is not the last text block in the text file, determining the text block following it as the specified text block, and jumping back to the step of traversing the text file from front to back starting from the specified text block, so as to re-determine the first text block matching the current format feature.

In a second aspect, an embodiment of the present application further provides a document processing apparatus, including:

a feature template obtaining module, configured to obtain a feature template for expressing stylistic features of a target document, the feature template comprising business features;

a text recognition module, configured to perform text recognition on the text file describing the target document according to the feature template and determine the feature values of the business features of the target document; and

a document information output module, configured to output document information of the target document in a preset format according to the determined feature values of the business features and the feature template.

Optionally, the apparatus further includes the following module, which operates before the feature template for expressing the stylistic features of the target document is obtained:

a feature template construction module, configured to construct a feature template with a stylistic hierarchy according to the business features of the target document and the order and repetition pattern in which the business features appear in the target document.

Optionally, the feature template includes the format feature and business feature of each stylistic level, and when performing text recognition on the text file describing the target document according to the feature template and determining the feature values of the business features of the target document, the text recognition module is further configured to:

Optionally, the text file records the text blocks of the target document in order, and identifying, in order from front to back in the text file describing the target document and from the highest stylistic level to the lowest, the text that matches the format feature and business feature of each stylistic level in the feature template includes:

Optionally, identifying, in order from front to back in the text file describing the target document and from the highest stylistic level to the lowest, the text that matches the format feature and business feature of each stylistic level in the feature template further includes:

determining the format feature that matches the format of the specified text block;

Optionally, traversing the text file from front to back starting from the specified text block in the text file describing the target document, and determining the first text block after the specified text block that matches the current format feature and the current business feature, includes:

In a third aspect, an embodiment of the present application further discloses an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the document processing method according to the embodiment of the present application is implemented.

In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the document processing method disclosed in the present application.

In this way, the document processing method disclosed in the embodiments of the present application acquires a feature template expressing the stylistic features of the target document, performs text recognition on the text file describing the target document according to the feature template to determine the feature values of the business features of the target document, and finally outputs document information of the target document in a preset format according to the determined feature values and the feature template, thereby solving the prior-art problem of low document processing efficiency. Because the method extracts document data based on the feature template, it does not require semantic recognition over large volumes of data, which reduces the amount of computation and helps improve the efficiency of document data extraction.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a document processing method according to a first embodiment of the present application;

FIG. 2 is a schematic view of a document processed by the document processing method of the first embodiment of the present application;

FIG. 3 is a schematic diagram of document data output by the document processing method according to the first embodiment of the present application;

FIG. 4 is a second schematic diagram of document data output by the document processing method according to the first embodiment of the present application;

FIG. 5 is a schematic flow chart illustrating a matching step in a document processing method according to a first embodiment of the present application;

FIG. 6 is a flow chart illustrating the matching sub-step in the document processing method according to the first embodiment of the present application;

FIG. 7 is a schematic structural diagram of a document processing apparatus according to a second embodiment of the present application;

FIG. 8 is another schematic structural diagram of a document processing apparatus according to the second embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Embodiment 1:

The present embodiment provides a document processing method which, as shown in FIG. 1, includes steps 100 to 120.

Step 100: acquire a feature template for expressing the stylistic features of the target document.

The feature template includes at least business features.

The document data processed by the present application is generated by scanning and recognizing documents that have specific stylistic features, such as local gazetteers and ancient books. The document data records the text blocks from front to back, in the order of their positions in the document, together with the format of each text block.

The stylistic features described in the embodiments of the present application refer to the writing conventions of a document and include format features and business features. Format features are, for example, flush-top large characters, reverse-white large characters, and framed large characters; business features are, for example, dynasty, category, person name, place name, and description. The two parts together form a complete feature. For example, the format feature of "Jin" in the document shown in FIG. 2 is "flush-top large characters".

In the embodiments of the present application, the stylistic features of the text blocks in a document are determined by analyzing the format of the document and the business information in its content. A text block may be one column of text, several columns of text, or several pages of text. A feature template expressing the stylistic features of the document is then constructed from the determined stylistic features of the text blocks.

Take the local gazetteer shown in FIG. 2 as an example. Analysis shows that it mainly records information about persons, and the business features to be extracted include a main category (e.g., "shi", shown as 210 in FIG. 2), a sub-category (e.g., "literary", shown as 220 in FIG. 2), a dynasty (e.g., "Jin", shown as 230 in FIG. 2), a person name (e.g., "white ", shown as 240 in FIG. 2), and a description (e.g., "south mandshunning duhuizhou shanghua in the minghai officer", shown as 250 in FIG. 2).

Next, the format feature corresponding to each business feature is determined.

For example: the format feature of "Shidi" (the main category, shown as 210 in FIG. 2) is two-cell-indented large characters; the format feature of "literary" (the sub-category, shown as 220 in FIG. 2) is two-cell-indented large characters; the format feature of "Jin" (the dynasty, shown as 230 in FIG. 2) is flush-top large characters; the format feature of "white " (the person name, shown as 240 in FIG. 2) is one-cell-indented large characters; and the format feature of the description shown as 250 in FIG. 2 ("the light field south China Yangyi Yan/seven sun/ancient child food bag") is small characters.

Combining each business feature in the document with its corresponding format feature gives the stylistic feature of the corresponding text block, and the stylistic features of all text blocks together make up the stylistic features of the document. In a specific implementation, a stylistic feature can be expressed in the form "format feature + business feature". For example, the stylistic features of the document shown in FIG. 2 include: "two-cell-indented large characters + main category", "two-cell-indented large characters + sub-category", "flush-top large characters + dynasty", "one-cell-indented large characters + person name", and "small characters + description".
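As an illustration only, the "format feature + business feature" pairing could be held in a small record type. The class name `StylisticFeature`, its field names, and the English renderings of the format names are assumptions of this sketch; the source does not prescribe any data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StylisticFeature:
    """One 'format feature + business feature' pair (hypothetical type)."""
    format_feature: str
    business_feature: str

    def label(self) -> str:
        # Render the pair in the "format feature + business feature" form.
        return f"{self.format_feature} + {self.business_feature}"


# The five pairs listed for the FIG. 2 document, in the order given in the text.
FIG2_FEATURES = [
    StylisticFeature("two-cell-indented large characters", "main category"),
    StylisticFeature("two-cell-indented large characters", "sub-category"),
    StylisticFeature("flush-top large characters", "dynasty"),
    StylisticFeature("one-cell-indented large characters", "person name"),
    StylisticFeature("small characters", "description"),
]

labels = [feature.label() for feature in FIG2_FEATURES]
```

Keeping the two parts as separate fields, rather than one string, lets later matching test the format and the business value independently.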

Different text blocks in a document describe different content, so a business feature in a stylistic feature may take multiple values. In a specific implementation, the value range of each business feature can be determined by analyzing the document. For example, the value range of the business feature "dynasty" may be: Jin, Song, Han, Qing, and so on.

In some embodiments of the present application, a feature template with a stylistic hierarchy is first constructed according to the business features of the target document and the order and repetition pattern in which those business features appear in the target document. For example, the levels of the feature template are determined from the order and repetition pattern of the business features across the text blocks of the document; the feature template with a stylistic hierarchy is then constructed from the determined levels, the business features, and the format features corresponding to the business features, so that the corresponding business feature data can be extracted from the document data.

Still taking the document shown in FIG. 2 as an example: because the document is read from right to left and from top to bottom, the business features appear in the order main category 210, sub-category 220, dynasty 230, person name 240, description 250, … After the person name 240 and the description 250, the business features begin to repeat, so the repetition pattern shows that the lowest level of the feature template is the level corresponding to the business feature "description". It can thus be determined that the feature template of the document shown in FIG. 2 has 5 stylistic levels. The stylistic level of each business feature is determined by its order of appearance in the document: the business feature that appears first has the highest level, and the one that appears last has the lowest. The feature template of the document shown in FIG. 2 can be expressed as: "two-cell-indented large characters + main category" → "two-cell-indented large characters + sub-category" → "flush-top large characters + dynasty" → "one-cell-indented large characters + person name" → "small characters + description".
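The ordering rule described above can be sketched as follows. This is a minimal illustration that assumes the business feature of each text block has already been labeled by hand; the function names `derive_levels` and `repeating_features` are hypothetical, and the source does not say how the order and repetition pattern are computed in practice.

```python
from collections import Counter


def derive_levels(observed):
    """Order business features by first appearance: first seen = highest level."""
    levels = []
    for feature in observed:
        if feature not in levels:
            levels.append(feature)
    return levels


def repeating_features(observed):
    """Business features that occur more than once, in level order."""
    counts = Counter(observed)
    return [f for f in derive_levels(observed) if counts[f] > 1]


# Hand-labeled order of appearance, modeled on the FIG. 2 discussion
# (the second name/description pair stands for the repetition the text mentions).
observed = [
    "main category", "sub-category", "dynasty",
    "person name", "description",
    "person name", "description",
]

levels = derive_levels(observed)
repeats = repeating_features(observed)
lowest_level = levels[-1]
```

With this input, `levels` has five entries from "main category" down to "description", and the repeated features are the person name and the description.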

The feature template expresses not only the features of the document data on this page, but also the features of the document data on all pages of this category in the local gazetteer.

Step 110: perform text recognition on the text file describing the target document according to the feature template, and determine the feature values of the business features of the target document.

In the embodiments of the present application, the technical solution is illustrated using a text file in which the document data is recorded in a scripting language. The text file generally records the text blocks from front to back, in the order of their positions in the document, together with the format of each text block. In the text file, the format information of the characters in the document is represented by format attributes. For example, character size is represented by the font_type attribute, where font_type="0" denotes small characters and font_type="1" denotes large characters; indentation is represented by the integer value of the space_count attribute; and whether a block is flush-top is represented by the head attribute, where head="0" denotes not flush-top and head="1" denotes flush-top. Each format feature in the feature template is a combination of one or more formats, so whether a piece of text conforms to a format feature can be judged from the format attributes of the text.

For automatic recognition, the format features in the feature template are first encoded. For example, "two-cell-indented large characters" is represented by the code "0, 1, 2", where 0 denotes not flush-top (head="0"), 1 denotes the character size (font_type="1"), and 2 denotes the position (space_count="2"). The encoding rule for format features is the same as the format encoding rule for text blocks in the text file of the target document. For example, "two-cell-indented large characters" can be recognized by matching head="0", space_count="2", and font_type="1".
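The encoding and comparison described above can be sketched as a three-field code compared against a text block's attributes. The attribute names `head`, `font_type` and `space_count` come from the text; the `FormatCode` type, the helper names, and the choice to treat a missing attribute as 0 are assumptions of this sketch.

```python
from typing import Mapping, NamedTuple


class FormatCode(NamedTuple):
    """Encoded format feature in the order (head, font_type, space_count)."""
    head: int
    font_type: int
    space_count: int


# "two-cell-indented large characters" encoded as "0, 1, 2".
TWO_CELL_INDENTED_LARGE = FormatCode(head=0, font_type=1, space_count=2)


def block_format(attrs: Mapping[str, str]) -> FormatCode:
    """Encode a text block's string attributes with the same rule.

    Assumption: a missing attribute is treated as 0.
    """
    return FormatCode(
        head=int(attrs.get("head", "0")),
        font_type=int(attrs.get("font_type", "0")),
        space_count=int(attrs.get("space_count", "0")),
    )


def matches_format(attrs: Mapping[str, str], code: FormatCode) -> bool:
    """A block matches a format feature when the two codes are equal."""
    return block_format(attrs) == code


indented_large = {"head": "0", "font_type": "1", "space_count": "2"}
flush_top_large = {"head": "1", "font_type": "1", "space_count": "0"}
```

Because both sides use one encoding rule, format matching reduces to a tuple comparison and needs no semantic analysis of the text.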

In this embodiment, it is assumed that the text file is a text file in an XML format generated by a script language, and data of the text file is as follows:

In the above text file, the content and format of each text block are identified by preset symbols. For example, a <text … /text> element between <text_column> and </text_column> marks the data of one text block, in which the string font_type indicates the character size of the text block and the string head indicates its flush-top information; the character size and the position information together constitute the format of the text block. The text blocks in the text file and the format of each text block can be determined through these preset symbols.
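Reading such a file into an ordered list of text blocks might look like the sketch below. The sample XML is entirely hypothetical: the root element name `document`, the placeholder text, and placing the attributes directly on `<text>` are assumptions made for illustration, not the patent's actual data. Only the element names `text_column` and `text` and the attribute names are taken from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; placeholder strings stand in for the document text.
SAMPLE_XML = """\
<document>
  <text_column>
    <text head="1" font_type="1" space_count="0">TITLE</text>
  </text_column>
  <text_column>
    <text head="0" font_type="1" space_count="2">CATEGORY-A</text>
  </text_column>
  <text_column>
    <text head="0" font_type="0" space_count="0">DESCRIPTION-A</text>
  </text_column>
</document>
"""


def read_text_blocks(xml_source: str) -> list:
    """Return the text blocks in document order with their format attributes."""
    root = ET.fromstring(xml_source)
    blocks = []
    # Element.iter() walks the tree in document order, so the front-to-back
    # order of the text blocks is preserved.
    for element in root.iter("text"):
        blocks.append({
            "content": (element.text or "").strip(),
            "format": dict(element.attrib),
        })
    return blocks


blocks = read_text_blocks(SAMPLE_XML)
```

The resulting list keeps the front-to-back order that the matching steps rely on.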

In a specific implementation, the feature template includes the format feature and business feature of each stylistic level, and the step of performing text recognition on the text file describing the target document according to the feature template and determining the feature values of the business features of the target document includes: identifying, in order from front to back in the text file describing the target document and from the highest stylistic level to the lowest, the text that matches the format feature and business feature of each stylistic level in the feature template; and determining the feature value of the business feature of each stylistic level of the target document from the recognized text.

Further, as shown in FIG. 5, the text file records the text blocks of the target document in order, and the step of identifying, in order from front to back in the text file and from the highest stylistic level to the lowest, the text that matches the format feature and business feature of each stylistic level in the feature template includes sub-steps 510 to 530.

Sub-step 510 (initial condition determination): determine the highest stylistic level in the feature template as the specified stylistic level, and determine the first text block of the text file as the specified text block.

Sub-step 520 (layer-by-layer matching): in order from high to low, take the format feature and business feature of the specified stylistic level, and then of each stylistic level below it in the feature template, as the current format feature and current business feature. Each time the current format feature and current business feature are determined, traverse the text file from front to back starting from the specified text block in the text file describing the target document, and determine the first text block after the specified text block that matches both the current format feature and the current business feature. Repeat until the text file has been fully traversed, in which case jump to sub-step 530, or until the first text block matching the format feature and business feature of the lowest stylistic level of the feature template has been successfully determined. The first time this operation is executed, the specified text block is the first text block of the text file; each subsequent time, the specified text block is the text block following the previously determined first text block that matched the current format feature and the current business feature.

Sub-step 530: matching of the text file ends.

Specifically, in the present embodiment, the process of identifying the text file is as follows.

First, the highest stylistic level in the feature template is determined as the specified stylistic level. That is, the format feature of the highest stylistic level is taken as the current format feature, the business feature of the highest stylistic level is taken as the current business feature, and the first text block in the text file is determined as the specified text block.

Taking the above feature template as an example, the format feature "two-cell-indented large characters" of the highest level is taken as the current format feature, the business feature "main category" of the highest level is taken as the current business feature, and the first text block "taigu zhisugao four" in the text file is determined as the specified text block.

In a specific implementation, the layer-by-layer matching sub-step 520 further comprises sub-steps 5201 to 5205.

Sub-step 5201: determine the format feature and business feature of the specified stylistic level in the feature template as the current format feature and the current business feature.

If text block matching is being performed for the first time, the specified stylistic level is the highest stylistic level of the feature template; otherwise, the specified stylistic level is determined from the format matching result of the text block.

Next, sub-step 5202 is performed: traverse the text file from front to back starting from the specified text block in the text file describing the target document, and determine the first text block after the specified text block that matches the current format feature and the current business feature.

In a specific implementation, the business feature of each level in the feature template may correspond to a pre-established business feature dictionary, which makes it convenient to identify and verify the business features in the document data. For example, the business feature dictionary corresponding to the business feature "dynasty" includes feature values such as Jin, Song, Han, and Qing.

Then, in the above text file, the text blocks are traversed from front to back starting from the specified text block "taigu log volume four", and the first text block matching both the current format feature and the current business feature is determined. For example, the text blocks "Taigu zhi quan", "Shi-chi", "Wen-Shi-chi", … are traversed in turn until the first text block matching the stylistic feature of the current level, "two-cell-indented large characters + main category", is determined. In this embodiment, the value range of the business feature "main category" includes "shi di-shi", and the format feature "two-cell-indented large characters" of "main category" is the same as the format font_type="1" space_count="2" of the text block "shi di-shi" in the text file, so the first text block matching both the current format feature and the current business feature is the 2nd text block.
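The search in sub-step 5202 can be sketched as a forward scan that checks the format first and then, if a dictionary is configured, verifies the text against it. This sketch assumes each text block is a `(text, format_code)` pair, that verification is a simple membership test, and that the scan begins at the specified block itself; the function name and the placeholder texts are hypothetical.

```python
from typing import Optional


def find_first_match(blocks, start, fmt, dictionary=None) -> Optional[int]:
    """Index of the first block at or after `start` matching format and business feature.

    A block whose format matches but whose text fails dictionary verification
    is skipped and the scan resumes from the following block.
    """
    for index in range(start, len(blocks)):
        text, block_fmt = blocks[index]
        if block_fmt != fmt:
            continue
        if dictionary is None or text in dictionary:
            return index
    return None  # the text file was traversed without a match


# Placeholder blocks: (text, (head, font_type, space_count)).
blocks = [
    ("TITLE", (1, 1, 0)),
    ("CATEGORY-A", (0, 1, 2)),
    ("SUBCATEGORY-A", (0, 1, 2)),
]

main_hit = find_first_match(blocks, 0, (0, 1, 2), {"CATEGORY-A"})
sub_hit = find_first_match(blocks, 0, (0, 1, 2), {"SUBCATEGORY-A"})
no_hit = find_first_match(blocks, 0, (0, 0, 0))
```

In this toy input the main category is found at the second block (index 1), mirroring the worked example. The dictionary is what separates the two levels that share one format, so the sub-category search skips index 1 and returns index 2.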

Sub-step 5203: determine whether the first text block matching the current format feature and the current service feature has been successfully determined; if so, perform sub-step 5204; otherwise, end the traversal.

After sub-step 5202 is performed, it is determined whether a first text block matching the current format feature and the current service feature was found after the specified text block. If one was found and the text file has not been fully traversed, text blocks conforming to the body example features of the lower levels are identified next. If none was found, that is, the text file has been fully traversed, the traversal ends.

Sub-step 5204: determine whether the service feature of the lowest level in the feature template has been identified; if so, end the traversal; otherwise, perform sub-step 5205.

Sub-step 5205: take the text block following the determined first text block (the one matching the current format feature and the current service feature) as the specified text block, take the level immediately below the currently specified body example level as the specified body example level, and jump to sub-step 5201 to continue matching text blocks.

It is then determined whether the service feature of the lowest level in the feature template has been identified, or whether the text file of the target document has been fully traversed. Specifically, in this embodiment, it is determined whether the format feature and service feature of the lowest level in the feature template, "small characters + description", have been identified. If they have not been identified and the text file has not been fully traversed, the process goes to sub-step 5201 to continue identifying the service feature of the next lower body example level. If they have been identified and the text file has not been fully traversed, identification continues with the format features and service features of the following text blocks in the text file.

When identification continues with the format feature and service feature of the next lower body example level, the text block following the previously determined first text block (the one matching the current format feature and the current service feature) is taken as the specified text block.

For example, when identification continues with the format feature and service feature of the next highest body example level, namely "large characters indented two cells + sub-category", the text block "literary" that follows the text block "literary" in the text file is taken as the specified text block. The text file is then traversed from front to back starting from that specified text block, and the first text block after it that matches the current format feature and the current service feature is determined. At this point, the current format feature is "large characters indented two cells" and the current service feature is "sub-category".
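The loop formed by sub-steps 5201 to 5205 could be sketched roughly as below. This is a simplified illustration under stated assumptions, not the claimed implementation: the function name `match_levels` is hypothetical, each level of the template is reduced to a single predicate that stands in for "format feature plus service feature", and the search is assumed to include the specified text block itself.

```python
from typing import Callable, List, Sequence, Tuple


def match_levels(
    blocks: Sequence[str],
    template: Sequence[Callable[[str], bool]],
    start_index: int = 0,
    start_level: int = 0,
) -> List[Tuple[int, int]]:
    """Match template levels in order, highest level (index 0) first.

    Returns (level, block_index) pairs for every level that was matched.
    """
    matches: List[Tuple[int, int]] = []
    index = start_index
    # Sub-step 5201: take each level from the specified level downward.
    for level in range(start_level, len(template)):
        is_match = template[level]
        # Sub-step 5202: first block from the specified block onward that matches.
        found = next(
            (i for i in range(index, len(blocks)) if is_match(blocks[i])), None
        )
        # Sub-step 5203: no match means the file was traversed; stop.
        if found is None:
            break
        matches.append((level, found))
        # Sub-step 5205: the block after the match becomes the specified block.
        # Sub-step 5204 is the loop bound: the loop ends after the lowest level.
        index = found + 1
    return matches
```

For instance, with blocks tagged by a level prefix and predicates that test that prefix, the function returns one block index per level until either the lowest level is matched or no further match exists.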

In specific implementation, as shown in fig. 6, determining in sub-step 5202 the first text block that matches the current format feature and the current service feature, by traversing the text file from front to back starting from the specified text block, further includes sub-step 610 through sub-step 670.

Sub-step 610: starting from the specified text block in the text file describing the target document, traverse the text file from front to back and determine the first text block after the specified text block that matches the current format feature.

Sub-step 620: determine whether a corresponding service feature dictionary is configured for the current service feature; if so, perform sub-step 630; otherwise, determine that the matching has succeeded and jump to sub-step 670 to end the matching for the currently specified text block.

Sub-step 630: verify the text content of the determined first text block matching the current format feature against the corresponding service feature dictionary.

Sub-step 640: determine whether the content verification succeeded; if so, determine that the first text block matching the current format feature is the first text block matching both the current format feature and the current service feature, and jump to sub-step 670; if the verification failed, perform sub-step 650.

Sub-step 650: determine whether the text file has been fully traversed; if it has not, that is, the first text block matching the current format feature is not the last text block in the text file, perform sub-step 660; otherwise, jump to sub-step 670.

Sub-step 660: take the text block following the first text block matching the current format feature as the specified text block and return to sub-step 610, so that the text file is again traversed from front to back from the specified text block and the first text block matching the current format feature is re-determined.

Sub-step 670: end the matching for the currently specified text block.

The matching for the currently specified text block ends either when the text file has been fully traversed, that is, the first text block matching the current format feature is the last text block in the text file, or when the matching has succeeded.

When identifying a text block that conforms to the current format feature and the current service feature, the current format feature is matched first. When the format of a text block matches the current format feature, the text content of the text block is then verified against the preset service feature dictionary corresponding to the current service feature. In specific implementation, some service features need to be verified and some do not. For a service feature that needs to be verified, a corresponding service feature dictionary must be preset, and a text block is determined to be the first text block matching the current format feature and the current service feature only when its format matches the current format feature and its text content matches that dictionary. Otherwise, the text blocks after it continue to be traversed to find a text block whose format matches the current format feature and whose text content matches the current service feature. For a service feature that does not need to be verified, the body example feature matching of a text block is determined to be successful as long as the format feature matches.
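The format-first, dictionary-second check of sub-steps 610 to 670 can be sketched as a single search function. The sketch is illustrative only: `find_first_match` is a hypothetical name, text blocks are assumed to be plain dicts with a "text" key plus format attributes, and the search is assumed to start at the specified block itself.

```python
from typing import Callable, Optional, Sequence, Set


def find_first_match(
    blocks: Sequence[dict],
    start: int,
    format_ok: Callable[[dict], bool],
    dictionary: Optional[Set[str]] = None,
) -> Optional[int]:
    """Return the index of the first block that matches, or None.

    `dictionary` is the service feature dictionary; None means that no
    dictionary is configured, so the format match alone decides.
    """
    index = start
    while index < len(blocks):
        # Sub-step 610: move on until a block has the current format.
        if not format_ok(blocks[index]):
            index += 1
            continue
        # Sub-step 620: without a dictionary the match already succeeds.
        if dictionary is None:
            return index
        # Sub-steps 630 and 640: verify the text against the dictionary.
        if blocks[index]["text"] in dictionary:
            return index
        # Sub-steps 650 and 660: verification failed, retry from the next block.
        index += 1
    # Sub-step 670 reached because the file was fully traversed.
    return None
```

Keeping the cheap format comparison ahead of the dictionary lookup mirrors the order described above, in which the dictionary is consulted only for blocks that already look right.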

In some embodiments of the present application, the text file of the target document may contain text blocks matching the body example features of X levels of the feature template that occur repeatedly, all of which need to be extracted, where X is a natural number less than or equal to the number of levels of the feature template.

In specific implementation, as shown in fig. 5, the step of sequentially identifying, from front to back in the text file describing the target document, the texts matching the format features and service features of each body example level in the feature template, in order from the highest body example level to the lowest, further includes the following operations, performed after the first text block matching the format feature and service feature of the lowest body example level of the feature template has been successfully determined:

sub-step 540: take the text block following the first text block that matches the format feature and service feature of the lowest body example level of the feature template as the specified text block;

sub-step 550: determine the format feature in the feature template that matches the format of the specified text block;

sub-step 560: determine whether a format feature matching the format of the specified text block was found in the feature template; if so, perform sub-step 570; otherwise, perform sub-step 580;

sub-step 570: take the body example level to which the matching format feature belongs as the specified body example level, and jump to the layer-by-layer matching of sub-step 520;

sub-step 580: take the text block following the specified text block as the specified text block, and return to sub-step 550.

In specific implementation, the step in sub-step 550 of determining the format feature that matches the format of the specified text block includes: determining the matching format feature according to the format of the specified text block alone, or according to the formats of the specified text block and its context text blocks.
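Sub-steps 540 to 580 amount to scanning forward for the next text block whose format belongs to some level of the template. A minimal sketch follows, assuming that every level has a distinct format (so no context is needed), that a format is the pair (font_type, space_count), and that blocks are dicts; `find_resume_point` is a hypothetical name.

```python
from typing import Optional, Sequence, Tuple

Format = Tuple[int, int]  # assumed encoding: (font_type, space_count)


def find_resume_point(
    blocks: Sequence[dict],
    start: int,
    level_formats: Sequence[Format],
) -> Optional[Tuple[int, int]]:
    """Return (block_index, level) where layer-by-layer matching resumes.

    `level_formats` lists the format of each level, highest level first.
    """
    for index in range(start, len(blocks)):  # sub-steps 540 and 580
        block_format = (blocks[index]["font_type"], blocks[index]["space_count"])
        # Sub-steps 550 and 560: look for a level with this format.
        for level, level_format in enumerate(level_formats):
            if block_format == level_format:
                # Sub-step 570: this level becomes the specified level.
                return index, level
    return None  # the file was traversed without finding a usable block
```

The returned pair can be handed to whatever routine performs the layer-by-layer matching, which then continues from that block and that level.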

For example, in the document shown in fig. 2, text blocks with the features "large characters indented one cell + person name" and "small characters + description" recur one after another. After the first text block matching the format feature and service feature of the lowest body example level of the feature template, "by lovely officials of love in china, ancient city, both south china and south china, and ancient royal purple and gold fish pocket", has been determined, the following text block, font_type="1" head="0" space_count="1">wenty<, is taken as the specified text block. It is then determined that the format feature in the feature template matching the format of the specified text block (that is, font_type="1", head="0", space_count="1" in the text file above) is "large characters indented one cell". Since the body example level of the format feature "large characters indented one cell" is the next lowest body example level, the next lowest body example level is taken as the specified body example level. Then, starting from the text block "wenbi", the next lowest and lowest body example levels are matched layer by layer, and the feature values of the service features of these two levels are determined in turn as "wenbi" and "tezhou Shang Ji and yuyi".

In some embodiments of the present application, several body example levels of the feature template may share the same format feature in the text file. For example, the format features of the highest and next highest body example levels in the foregoing feature template are both "large characters indented two cells". In that case, the format feature matching the format of a specified text block may be determined according to the format of the specified text block together with the formats of its context text blocks.

For example, when the format of the specified text block is "large characters indented two cells", the matching format feature belongs to two body example levels, so the level to which the specified text block corresponds cannot be determined from its format alone. The formats of the context text blocks of the specified text block can then be used to determine which level's format feature the specified text block matches.

Taking the aforementioned feature template as an example, if the format of the text block preceding the specified text block is "large characters indented two cells", it may be determined that the format of the specified text block matches the format feature of the next highest body example level.

Alternatively, if the format of the text block following the specified text block is "large characters indented two cells", it may be determined that the format of the specified text block matches the format feature of the highest body example level; if the format of the following text block is "top case large word", it may be determined that the format of the specified text block matches the format feature of the next highest body example level. In specific implementation, the context text blocks of the specified text block are the M text blocks preceding it and the N text blocks following it, where M and N are integers greater than or equal to 0 and neither exceeds the number of levels of the feature template.
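Two of the context rules above can be sketched as a small function, taking M = N = 1. Everything here is an assumption made for illustration: the name `resolve_shared_format_level`, the encoding of a format as (font_type, space_count), the numbering of levels from 0, and the choice to check the preceding block before the following one.

```python
from typing import Optional, Tuple

Format = Tuple[int, int]  # assumed encoding: (font_type, space_count)

# Format shared by the highest and next highest levels in the example template.
SHARED_FORMAT: Format = (1, 2)
HIGHEST_LEVEL = 0
NEXT_HIGHEST_LEVEL = 1


def resolve_shared_format_level(
    previous: Optional[Format],
    following: Optional[Format],
) -> Optional[int]:
    """Decide which level a block with the shared format belongs to.

    `previous` and `following` are the formats of the neighbouring blocks,
    or None when there is no such neighbour.
    """
    # A block directly after another shared-format block is the lower of the two.
    if previous == SHARED_FORMAT:
        return NEXT_HIGHEST_LEVEL
    # A block directly before another shared-format block is the higher one.
    if following == SHARED_FORMAT:
        return HIGHEST_LEVEL
    # The context does not settle the question under these two rules.
    return None
```

Returning None for the undecided case leaves room for further rules, such as inspecting more than one neighbouring block.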

The levels of the feature template are determined according to the order of appearance and the repetition rule of the service features, and text blocks with the body example features of any level may appear repeatedly in the document. Therefore, after the first text block matching the body example features of the lowest level of the feature template has been identified, the subsequent text blocks matching the body example features of each level need to be identified as well, until the text file of the target document has been fully traversed.

Still taking the document shown in fig. 2 as an example, the text file obtained from the document contains multiple consecutive, repeated occurrences of text blocks with the body example features "large characters indented one cell + person name" and "small characters + description". After the first text block conforming to the body example feature "small characters + description" has been determined, the subsequent text blocks conforming to "large characters indented one cell + person name" and "small characters + description" continue to be identified, so that the text information conforming to the feature template of the target document is extracted more completely.

With this method, the format features and service features of each body example level in the feature template are identified in sequence in the text file of the target document, until the format features and service features of every body example level have been determined or the text file has been fully traversed.

In specific implementation, if the body example features of some body example level in the feature template have still not been identified when the text file of the target document has been fully traversed, the preset format document information of the target document cannot be output. For example, when identifying the format feature and service feature of the next highest body example level, namely "large characters indented two cells + sub-category", if no text block is found whose format is "large characters indented two cells" and whose text content matches a feature value in the service feature dictionary corresponding to the service feature "sub-category" by the time the text file has been fully traversed, the document data processing fails and the preset format document information of the target document cannot be output.

Then, the content of each identified text block that matches the format feature and service feature of a body example level is taken as the feature value of the service feature of the corresponding body example level of the target document.

Step 120: output preset format document information of the target document according to the determined feature values of the service features and the feature template.

After the text blocks conforming to the feature template have been identified, the feature value of the service feature of each corresponding body example level is determined from the identification result. The feature values of the service features are then organized according to a preset format, and the document data is output in that format.

In specific implementation, when the text matching the format feature and service feature of the highest body example level is identified, a root node corresponding to the highest body example level is created at the same time; the root node stores the identified text matching the format feature and service feature of the highest body example level.

Correspondingly, when the text matching the format feature and service feature of the next highest body example level is identified, a child node of the root node is created; this child node corresponds to the next highest body example level and stores the identified text matching the format feature and service feature of that level.

By analogy, as the text blocks in the text file of the target document are identified in order from the highest body example level to the lowest, a node is created for the current body example level at each step as a child of the node corresponding to the next higher body example level, so that a tree structure, as shown in fig. 3, is formed once matching against the feature template is complete.

If text matching the format feature and service feature of a given body example level is identified multiple times, multiple child nodes are created under the node of the next higher body example level, one for each identified text matching the format feature and service feature of that level.
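The node creation described above might look like the following sketch. It assumes that the recognition stage yields (level, text) pairs in document order with level 0 as the highest level, and that the highest level occurs exactly once; `Node` and `build_tree` are hypothetical names, not part of the application.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class Node:
    level: int
    text: str
    children: List["Node"] = field(default_factory=list)


def build_tree(matches: Iterable[Tuple[int, str]]) -> Optional[Node]:
    """Build the tree from (level, text) pairs given in document order."""
    root: Optional[Node] = None
    path: List[Node] = []  # path[k] is the most recent node at level k
    for level, text in matches:
        node = Node(level, text)
        if level == 0:
            if root is not None:
                raise ValueError("this sketch assumes a single root node")
            root = node
            path = [node]
            continue
        if level > len(path):
            raise ValueError("a level was skipped before level %d" % level)
        # The parent is the latest node of the next higher level.
        path[level - 1].children.append(node)
        # Nodes at this level and below are no longer current.
        del path[level:]
        path.append(node)
    return root
```

A repeated level simply adds another child under the same parent, which is how repeated person-name and description blocks would end up as sibling branches.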

When the preset format is a tree structure, the document information as shown in fig. 3 may be directly output.

When the preset format is a table, each column of the table corresponds to the service feature of one body example level of the feature template. In specific implementation, the data of each column in each row of the table can be determined in sequence by traversing each branch of the tree structure, so as to obtain the document data shown in fig. 4.
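Turning the tree into table rows by walking each branch can be sketched as a short recursive function. The representation of a node as a (text, children) tuple and the name `tree_to_rows` are assumptions chosen to keep the example self-contained.

```python
from typing import List, Tuple

# Assumed node shape: (text, list of child nodes of the same shape).
TreeNode = Tuple[str, list]


def tree_to_rows(node: TreeNode, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return one row per root-to-leaf branch, one column per level."""
    text, children = node
    path = prefix + (text,)
    if not children:
        # A leaf closes a branch, so the accumulated path is a full row.
        return [path]
    rows: List[Tuple[str, ...]] = []
    for child in children:
        rows.extend(tree_to_rows(child, path))
    return rows
```

Values of the higher levels are repeated in every row that shares them, which gives each row of the table a value for every column.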

The document processing method disclosed in this embodiment of the application acquires a feature template expressing the body example features of the target document, performs text recognition on the text file describing the target document according to the feature template to determine the feature values of the business features of the target document, and finally outputs document information of the target document in a preset format according to the determined feature values and the feature template, thereby solving the problem of low document processing efficiency in the prior art.

The document processing method disclosed in this embodiment of the application extracts document data based on the feature template and does not require semantic recognition over large volumes of data, which effectively reduces the amount of computation and helps improve the efficiency of document data extraction. Furthermore, when document data is extracted based on a feature template with a hierarchical relationship, the hierarchy expresses the order in which texts appear in the document, so that the processed document data has a clearer, more reasonable and hierarchical structure.

The document processing method disclosed by the embodiment of the application can improve the accuracy of document data processing by performing text recognition based on the format characteristics and the service characteristics, performing preliminary judgment through the format characteristics, and then verifying the text data through the service characteristics.

Example two:

Accordingly, the present application also discloses a document processing apparatus, as shown in fig. 7, comprising:

a feature template obtaining module 710, configured to obtain a feature template used for expressing body example features of a target document, where the feature template includes: a service characteristic;

the text recognition module 720 is configured to perform text recognition on a text file describing a target document according to the feature template, and determine a feature value of a business feature of the target document;

and a document information output module 730, configured to output document information in a preset format of the target document according to the determined feature value of the service feature and the feature template.
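How the three modules hand data to one another can be illustrated with a small composition sketch. The class name `DocumentProcessor`, the callable-based wiring and all parameter names are hypothetical; the application describes the modules functionally and does not prescribe such an interface.

```python
from typing import Any, Callable, Sequence


class DocumentProcessor:
    """Chains stand-ins for modules 710, 720 and 730 in order."""

    def __init__(
        self,
        get_template: Callable[[], Any],               # stands in for module 710
        recognize: Callable[[Sequence[str], Any], Any],  # stands in for module 720
        output: Callable[[Any, Any], Any],             # stands in for module 730
    ) -> None:
        self.get_template = get_template
        self.recognize = recognize
        self.output = output

    def process(self, text_blocks: Sequence[str]) -> Any:
        # Module 710: obtain the feature template.
        template = self.get_template()
        # Module 720: determine feature values from the text file.
        feature_values = self.recognize(text_blocks, template)
        # Module 730: organize the values into the preset format.
        return self.output(feature_values, template)
```

Passing the template to both later stages reflects that recognition and output are each described as depending on the feature template.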

Optionally, as shown in fig. 8, the apparatus further includes the following module, which operates before the feature template for expressing the body example features of the target document is obtained:

the feature template construction module 700 is configured to construct a feature template having a body case hierarchical relationship according to the business features of the target document and the sequence and repetition rule of the business features appearing in the target document.

Optionally, the feature template includes a format feature and a business feature of each body example level, and when performing text recognition on a text file describing the target document according to the feature template and determining a feature value of the business feature of the target document, the text recognition module 720 is further configured to:

sequentially identifying, from front to back in the text file describing the target document, the texts matching the format features and business features of each body example level in the feature template, in order from the highest body example level to the lowest.

Optionally, the text blocks of the target document are recorded in sequence in the text file, and sequentially identifying, from front to back in the text file describing the target document, the texts matching the format features and business features of each body example level in the feature template, in order from the highest body example level to the lowest, includes:

a layer-by-layer matching sub-step: in order from the highest body example level to the lowest, taking the format feature and business feature of each body example level from the specified body example level downward in the feature template in turn as the current format feature and current business feature; and, each time the current format feature and current business feature have been determined, performing the operation of traversing the text file from front to back starting from the specified text block in the text file describing the target document and determining the first text block after the specified text block that matches the current format feature and the current business feature, until the text file has been fully traversed or the first text block matching the format feature and business feature of the lowest body example level of the feature template has been successfully determined;

wherein, each time this operation is performed again, the specified text block is the text block following the first text block matching the current format feature and the current business feature that was determined the previous time.

Optionally, sequentially identifying, from front to back, texts matched with the format features and the business features of each body example level in the feature template in a text file describing the target document according to the sequence of body example level relationships from high to low, and further including:

after the first text block matching the format feature and business feature of the lowest body example level of the feature template has been successfully determined, performing the following operations: taking the text block following that first text block as the specified text block; determining the format feature in the feature template that matches the format of the specified text block; determining whether such a matching format feature was found; if so, taking the body example level to which the matching format feature belongs as the specified body example level, jumping to the layer-by-layer matching sub-step, and performing the layer-by-layer matching sub-step again; otherwise, taking the text block following the specified text block as the specified text block, jumping to the sub-step of determining the format feature that matches the format of the specified text block, and performing that sub-step again.

In specific implementation, the step of determining the body case level to which the format feature matched with the format of the specified text block belongs as the specified body case level includes: determining format characteristics matched with the format of the specified text block according to the format of the specified text block or the format of the specified text block and the context text block thereof; and determining the body example level of the format feature matched with the specified text block format as the specified body example level.

Optionally, traversing the text file from front to back from a specified text block in the text file describing the target document, and determining a first text block after the specified text block in the text file, which is matched with the current format characteristic and the current service characteristic, includes:

traversing the text file from front to back from a specified text block in the text file describing the target document, and determining a first text block which is matched with the current format characteristic after the specified text block in the text file;

if verification of the text content of that text block against the service feature dictionary corresponding to the current service feature fails, and the first text block matching the current format feature is not the last text block in the text file, taking the text block following the first text block matching the current format feature as the specified text block, and returning to the step of traversing the text file from front to back starting from the specified text block in the text file describing the target document and determining the first text block after the specified text block that matches the current format feature, so as to re-determine the first text block matching the current format feature.

The document processing device disclosed in this embodiment of the application acquires a feature template expressing the body example features of the target document, performs text recognition on the text file describing the target document according to the feature template to determine the feature values of the business features of the target document, and finally outputs document information of the target document in a preset format according to the determined feature values and the feature template, thereby solving the problem of low document processing efficiency in the prior art. The device extracts document data based on the feature template and does not require semantic recognition over large volumes of data, which effectively reduces the amount of computation and helps improve the efficiency of document data extraction. Furthermore, when document data is extracted based on a feature template with a hierarchical relationship, the hierarchy expresses the order in which texts appear in the document, so that the processed document data has a clearer, more reasonable and hierarchical structure.

The document processing device disclosed in the embodiment of the application performs text recognition based on the format characteristics and the service characteristics, performs preliminary judgment through the format characteristics, and then verifies the text data through the service characteristics, so that the accuracy of document data processing can be improved.

Correspondingly, the present application also discloses an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the document processing method according to the first embodiment of the present application is implemented. The electronic device can be a PC, a mobile terminal, a personal digital assistant, a tablet computer and the like.

The present application also discloses a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the document processing method according to the first embodiment of the present application.

The apparatus embodiment, the electronic device embodiment and the storage medium embodiment of the present application correspond to the method embodiment; for the specific implementation of each module and unit in the apparatus embodiment, reference may be made to the method embodiment, and details are not repeated here.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It will be appreciated by those of ordinary skill in the art that in the embodiments provided herein, the units described as separate components may or may not be physically separate, may be located in one place, or may be distributed across multiple network elements. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

The above describes only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive, without inventive effort, within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A document processing method, comprising:

obtaining a feature template for expressing body example features of a target document, the feature template comprising: a business feature; the body example features comprise business features and format features corresponding to the business features; the business features comprise at least one of dynasty, category, name of person, name of place and description; the target document is an ancient book document; the body example features of the target document are a combination of the body example features of a plurality of text blocks in the target document;

outputting preset-format document information of the target document according to the determined feature value of the business feature and the feature template;

before the step of obtaining a feature template for expressing the body example features of the target document, the method further comprises the following steps:

determining the level of the feature template according to the order and the repetition pattern in which the business features appear in the text blocks of the target document;

and constructing a feature template with a body example hierarchical relationship according to the level of the feature template, the business features, and the format features corresponding to the business features.
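
By way of non-limiting illustration only, the hierarchical feature template and the output step of claim 1 might be represented as in the following minimal Python sketch. It is not part of the claims. The names `TemplateLevel`, `build_template`, and `to_document_info`, the use of an indentation depth as the format feature, and the choice of JSON as the preset format are all assumptions introduced here for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateLevel:
    """One level of a hierarchical feature template (hypothetical layout)."""
    level: int              # 1 is the highest body example level
    business_feature: str   # e.g. "dynasty", "category", "name of person"
    format_feature: dict    # assumption: a dict such as {"indent": 0}


def build_template(features_by_level):
    """Build a template from (business_feature, format_feature) pairs.

    The pairs are assumed to be given from the highest level to the lowest.
    """
    return [
        TemplateLevel(i, name, fmt)
        for i, (name, fmt) in enumerate(features_by_level, start=1)
    ]


def to_document_info(template, feature_values):
    """Emit the feature values as JSON, ordered by template level.

    Business features without a determined value are emitted as null.
    """
    ordered = {
        lv.business_feature: feature_values.get(lv.business_feature)
        for lv in template
    }
    return json.dumps(ordered, ensure_ascii=False)


template = build_template([
    ("dynasty", {"indent": 0}),
    ("category", {"indent": 1}),
    ("name of person", {"indent": 2}),
])
info = to_document_info(template, {"name of person": "Li Bai", "dynasty": "Tang"})
```

In this sketch the template, rather than the order in which values were recognized, fixes the order of the output fields, which is one simple way for the output to depend on both the feature values and the feature template.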

2. The method according to claim 1, wherein the feature template comprises a format feature and a business feature of each body example level, and the step of performing text recognition on the text file describing the target document according to the feature template to determine a feature value of the business feature of the target document comprises:

3. The method according to claim 2, wherein the text file sequentially records the text blocks in the target document, and the step of sequentially identifying, from front to back in the text file describing the target document and in order of body example level from high to low, the text matching the format feature and the business feature of each body example level in the feature template comprises:

4. The method according to claim 3, wherein the step of sequentially identifying, from front to back in the text file describing the target document and in order of body example level from high to low, the text matching the format feature and the business feature of each body example level in the feature template further comprises:

determining whether a format feature matching the format of the specified text block exists;

if so, determining that the body example level to which the matching format feature belongs is the specified body example level, and jumping to the layer-by-layer matching substep;

otherwise, taking the text block following the specified text block as the specified text block, and jumping back to the substep of determining whether a format feature matching the format of the specified text block exists.
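
By way of non-limiting illustration only, the substeps of claim 4 can be read as a forward scan over the text blocks that stops at the first block whose format matches some level of the feature template. The following minimal Python sketch is not part of the claims. The function names `match_format` and `find_start`, the representation of a text block as a dict with a `"format"` key, and the use of plain equality to compare formats are assumptions introduced here.

```python
def match_format(block_format, template):
    """Return the index of the template level whose format feature equals
    block_format, or None if no level matches.

    `template` is assumed to be a list of (business_feature, format_feature)
    pairs ordered from the highest level to the lowest.
    """
    for idx, (_business_feature, format_feature) in enumerate(template):
        if format_feature == block_format:
            return idx
    return None


def find_start(blocks, template, start=0):
    """Scan blocks from `start` until one matches a format feature.

    Returns (block_index, level_index) so that layer-by-layer matching can
    begin there, or None if no remaining block matches any level.
    """
    i = start
    while i < len(blocks):
        level = match_format(blocks[i]["format"], template)
        if level is not None:
            return i, level  # the "if so" branch: hand off to layer-by-layer matching
        i += 1               # the "otherwise" branch: next block becomes the specified block
    return None


template = [("dynasty", {"indent": 0}), ("category", {"indent": 1})]
blocks = [
    {"text": "preface", "format": {"indent": 5}},
    {"text": "Officials", "format": {"indent": 1}},
    {"text": "Tang", "format": {"indent": 0}},
]
start = find_start(blocks, template)
```

Here the first block matches no level and is skipped, and the second block matches the second template level, so matching would start from that block at that level.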

5. The method of claim 4, wherein the step of determining the format characteristic that matches the specified text block format comprises:

6. The method according to claim 3 or 4, wherein the step of traversing the text file from front to back from a specified text block in the text file describing the target document to determine a first text block in the text file after the specified text block matching the current format characteristic and the current business characteristic comprises:

7. A document processing apparatus, comprising:

a feature template obtaining module, configured to obtain a feature template for expressing body example features of a target document, wherein the feature template comprises: a business feature; the body example features comprise business features and format features corresponding to the business features; the business features comprise at least one of dynasty, category, name of person, name of place and description; the target document is an ancient book document; the body example features of the target document are a combination of the body example features of a plurality of text blocks in the target document;

a document information output module, configured to output preset-format document information of the target document according to the determined feature value of the business feature and the feature template;

the apparatus further comprises a feature template construction module, configured to: determine the level of the feature template according to the order and the repetition pattern in which the business features appear in the text blocks of the target document; and construct a feature template with a body example hierarchical relationship according to the level of the feature template, the business features, and the format features corresponding to the business features.
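
By way of non-limiting illustration only, one possible way for a feature template construction module to derive levels from the order and repetition of business features is sketched below in Python. The claims do not specify the rule. The heuristic used here (a feature that repeats less often is placed at a higher level, with ties broken by first appearance) and the function name `infer_levels` are assumptions introduced for illustration.

```python
from collections import Counter


def infer_levels(feature_sequence):
    """Assign a level to each business feature (1 is the highest level).

    Assumed heuristic: features that occur fewer times rank higher; among
    features with equal counts, the one that appears first ranks higher.
    """
    counts = Counter(feature_sequence)
    first_seen = {}
    for position, name in enumerate(feature_sequence):
        first_seen.setdefault(name, position)
    ordered = sorted(counts, key=lambda name: (counts[name], first_seen[name]))
    return {name: level for level, name in enumerate(ordered, start=1)}


# Business features in the order they are observed across text blocks.
sequence = [
    "dynasty",
    "category", "name of person", "name of person",
    "category", "name of person",
]
levels = infer_levels(sequence)
```

Under this assumed heuristic, "dynasty" (seen once) is placed above "category" (seen twice), which in turn is placed above "name of person" (seen three times).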

8. The apparatus of claim 7, wherein the feature template comprises a format feature and a business feature of each body example level, and when performing text recognition on a text file describing the target document according to the feature template to determine a feature value of the business feature of the target document, the text recognition module is further configured to:

9. The apparatus according to claim 8, wherein the text file sequentially records the text blocks in the target document, and sequentially identifying, from front to back in the text file describing the target document and in order of body example level from high to low, the text matching the format feature and the business feature of each body example level in the feature template comprises:

10. The apparatus according to claim 9, wherein sequentially identifying, from front to back in the text file describing the target document and in order of body example level from high to low, the text matching the format feature and the business feature of each body example level in the feature template further comprises:

determining the format features that match the specified text block format;

11. The apparatus of claim 10, wherein the step of determining the format characteristic that matches the specified text block format comprises:

12. The apparatus according to claim 10 or 11, wherein traversing the text file from front to back from a specified text block in the text file describing the target document, determining a first text block in the text file after the specified text block that matches the current format feature and the current business feature comprises:

13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the document processing method of any one of claims 1 to 6 when executing the computer program.

14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the document processing method of any one of claims 1 to 6.