CN110826488B

CN110826488B - Image identification method and device for electronic document and storage equipment

Info

Publication number: CN110826488B
Application number: CN201911075895.4A
Authority: CN
Inventors: 李程
Original assignee: Sipic Technology Co Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2019-11-06
Filing date: 2019-11-06
Publication date: 2022-07-26
Anticipated expiration: 2039-11-06
Also published as: CN110826488A

Abstract

The invention discloses an image identification method, a device and a computer storage device for an electronic document, which are characterized in that firstly, a document page image is subjected to block segmentation to obtain a plurality of page image blocks; then, respectively extracting the features of the plurality of page image blocks to obtain a plurality of corresponding feature vectors; further performing labeling data sampling according to the plurality of feature vectors; performing model training according to the plurality of feature vectors and the sampled labeled data to obtain a prediction model; and finally, carrying out image identification on the plurality of page image blocks by using the prediction model to obtain a target prediction image.

Description

Image identification method and device for electronic document and storage equipment

Technical Field

The present invention relates to electronic document application technologies, and in particular, to an image recognition method and apparatus for an electronic document, and a computer storage device.

Background

With the rapid development of computer and network technologies, electronic documents are increasingly widely used, especially fixed-layout documents such as PDF. At present, image recognition for document pages mainly relies on two approaches: the XY-tree-based global recursive cutting algorithm and deep-learning-based object detection algorithms.

However, these approaches currently have the following defects: 1) the XY-tree-based global recursive cutting algorithm cannot effectively cut specific layout elements (such as tables and dividing lines); 2) it segments mainly according to the blank intervals of the projection, and the interval width used as the segmentation threshold depends entirely on experience, so block segmentation accuracy is low and the image identification problem remains unsolved; 3) deep-learning-based object detection algorithms (such as the YOLO algorithm) require a large amount of training data, for which labeled data is scarce, and are computationally expensive.

Disclosure of Invention

The embodiment of the invention provides an image identification method and device for an electronic document and a computer storage device, aiming at effectively overcoming various problems existing in the prior image identification for a document page.

According to a first aspect of the embodiments of the present invention, there is provided an image recognition method for an electronic document, the method including: performing block segmentation on the document page image to obtain a plurality of page image blocks; respectively extracting the features of the page image blocks to obtain a plurality of corresponding feature vectors; performing labeling data sampling according to the plurality of feature vectors; performing model training according to the plurality of feature vectors and the sampled labeled data to obtain a prediction model; and carrying out image identification on the plurality of page image blocks by using the prediction model to obtain a target prediction image.

According to an embodiment of the present invention, the block segmentation of the document page image to obtain a plurality of page image blocks includes: a first operation of obtaining two projection arrays of the document page image, wherein the two projection arrays respectively comprise projection values of the document page image on an X axis and a Y axis; a second operation of preprocessing the two projection arrays; and a third operation of performing block segmentation on the document page image according to a blank position in the middle of the two preprocessed projection arrays to obtain a plurality of page image blocks after one segmentation pass.

According to an embodiment of the present invention, the block segmentation of the document page image according to a blank position between the two preprocessed projection arrays includes: detecting whether a blank exists between the two preprocessed projection arrays or not; if the blank exists, block segmentation is carried out on the document page image according to the blank position in the middle of the two preprocessed projection arrays; if no blank exists, the block division flow is ended.

According to an embodiment of the present invention, the block dividing of the document page image to obtain a plurality of page image blocks further includes: and repeating the first operation to the third operation for each page image block in the plurality of page image blocks to obtain a plurality of N-time segmented page image blocks, wherein the value of N is a positive integer greater than 1.

According to an embodiment of the present invention, the preprocessing the two projection arrays includes: determining a mode of all array elements in each projection array; subtracting the mode from the value corresponding to each array element in the projected array; and if the array element corresponding value obtained by subtracting the mode value is a negative value, marking the array element corresponding value as zero.

According to an embodiment of the present invention, the performing feature extraction on the plurality of page image blocks respectively to obtain a plurality of corresponding feature vectors includes: performing feature extraction on each page image block in the plurality of page image blocks to obtain the length and the width of the page image block and position coordinates in the document page image; and determining the length and the width of the page image block and the position coordinates in the document page image as corresponding characteristic vectors.

According to a second aspect of the embodiments of the present invention, there is also provided an image recognition apparatus for an electronic document, the apparatus including: a block segmentation module, used for performing block segmentation on the document page image to obtain a plurality of page image blocks; a feature extraction module, used for performing feature extraction on each of the page image blocks to obtain a plurality of corresponding feature vectors; a sampling module, used for performing labeled data sampling according to the plurality of feature vectors; a training module, used for performing model training according to the plurality of feature vectors and the sampled labeled data to obtain a prediction model; and a prediction identification module, used for performing image identification on the plurality of page image blocks using the prediction model to obtain a target prediction image.

According to an embodiment of the present invention, the block division module includes: the device comprises a first unit, a second unit and a third unit, wherein the first unit is used for acquiring two projection arrays of a document page image, and the two projection arrays respectively comprise projection values of the document page image on an X axis and a Y axis; the second unit is used for preprocessing the two projection arrays; and the third unit is used for carrying out block segmentation on the document page image according to the blank position between the two preprocessed projection arrays to obtain a plurality of page image blocks after primary segmentation.

According to an embodiment of the present invention, the third unit includes: the detection subunit is used for detecting whether a blank exists between the two preprocessed projection arrays; the block segmentation subunit is used for carrying out block segmentation on the document page image according to a blank position in the middle of the two preprocessed projection arrays if blanks exist; if no blank exists, the block division flow is ended.

According to an embodiment of the present invention, the block division module is further configured to repeat operations of the first unit to the third unit for each of the plurality of page image blocks to obtain a plurality of N divided page image blocks, where a value of N is a positive integer greater than 1.

According to an embodiment of the present invention, the second unit is specifically configured to determine a mode of all array elements in each projection array; subtracting the mode from the value corresponding to each array element in the projected array; and if the array element corresponding value obtained by subtracting the mode value is a negative value, marking the array element corresponding value as zero.

According to an embodiment of the present invention, the feature extraction module is specifically configured to perform feature extraction on each of the plurality of page image blocks to obtain a length and a width of the page image block and a position coordinate in the document page image; and determining the length and the width of the page image block and the position coordinates in the document page image as corresponding characteristic vectors.

According to a third aspect of embodiments of the present invention, there is provided a computer storage device comprising a set of computer executable instructions for performing any of the image recognition methods for electronic documents described above when executed.

According to the image identification method, device and computer storage device for an electronic document of the embodiments of the invention, block segmentation is first performed on a document page image to obtain a plurality of page image blocks; feature extraction is then performed on each of the page image blocks to obtain a plurality of corresponding feature vectors; labeled data sampling is performed according to the plurality of feature vectors; model training is then performed according to the plurality of feature vectors and the sampled labeled data to obtain a prediction model; and finally, image identification is performed on the plurality of page image blocks using the prediction model to obtain a target prediction image. In this way, on the basis of block segmentation of the document page image, the invention performs image recognition by sampling labeled data and training a model on the feature vectors of the page image blocks. Compared with the YOLO algorithm, the model has a simple calculation process, does not need much labeled data, and greatly reduces the cost of image prediction.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

in the drawings, like or corresponding reference characters designate like or corresponding parts.

FIG. 1 is a first flowchart illustrating an implementation of an image recognition method for an electronic document according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a second specific implementation of the image recognition method for an electronic document according to an embodiment of the present invention;

FIG. 3 is a schematic diagram showing the component structure of an image recognition apparatus for an electronic document according to an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

FIG. 1 is a first flowchart illustrating an implementation of an image recognition method for an electronic document according to an embodiment of the present invention; FIG. 2 is a flowchart illustrating a specific implementation of the image recognition method for an electronic document according to the embodiment of the present invention.

Referring to fig. 1, an image recognition method for an electronic document according to an embodiment of the present invention includes: an operation 101, performing block segmentation on a document page image to obtain a plurality of page image blocks; operation 102, performing feature extraction on the plurality of page image blocks respectively to obtain a plurality of corresponding feature vectors; operation 103, performing label data sampling according to the plurality of feature vectors; operation 104, performing model training according to the plurality of feature vectors and the sampled labeled data to obtain a prediction model; and operation 105, performing image identification on the multiple page image blocks by using the prediction model to obtain a target prediction image.

In operation 101, referring to FIG. 2, block segmentation is performed on the document page image through a recursive algorithm to obtain a plurality of page image blocks, which specifically includes: a first operation 1011 of obtaining two projection arrays of the document page image, where the two projection arrays respectively include projection values of the document page image on an X axis and a Y axis; a second operation 1012 of preprocessing the two projection arrays; and a third operation 1013 of performing block segmentation on the document page image according to a blank position in the middle of the two preprocessed projection arrays, so as to obtain a plurality of page image blocks after one segmentation pass.

Specifically, before the first operation 1011 is performed, the document page image is first converted to grayscale to obtain a corresponding grayscale image; the grayscale image is then binarized, that is, pixels whose gray value is greater than the gray threshold are set to 0 and all other pixels are set to 1; the binarized image is then projected in the X-axis direction and the Y-axis direction respectively to obtain two projection arrays holding the projection values in the X direction and the Y direction. In practical applications, the gray threshold may be 200.
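The binarization and projection described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the grayscale image is already available as a list of pixel rows, and it assumes that the X-axis projection counts the ink pixels in each column while the Y-axis projection counts them in each row. The names `binarize` and `projections` are hypothetical.

```python
from typing import List, Tuple

GRAY_THRESHOLD = 200  # threshold value mentioned in the description


def binarize(gray: List[List[int]], threshold: int = GRAY_THRESHOLD) -> List[List[int]]:
    """Set pixels brighter than the threshold to 0 and all other pixels to 1."""
    return [[0 if px > threshold else 1 for px in row] for row in gray]


def projections(binary: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Return (x_proj, y_proj): the count of 1-pixels per column and per row."""
    x_proj = [sum(col) for col in zip(*binary)]  # one value per column
    y_proj = [sum(row) for row in binary]        # one value per row
    return x_proj, y_proj


# A 4x4 toy "page": bright background (255) with a few dark pixels.
page = [
    [255, 255, 255, 255],
    [255,  10,  20, 255],
    [255, 255, 255, 255],
    [255,  30, 255, 255],
]
binary = binarize(page)
x_proj, y_proj = projections(binary)
print(x_proj)  # [0, 2, 1, 0]
print(y_proj)  # [0, 2, 0, 1]
```

Rows 0 and 2 contain no dark pixels, so their Y-projection values are zero; these zero entries are the blanks that the later segmentation step looks for.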

In the second operation 1012, the mode of all array elements in each projection array is determined; the mode is subtracted from the value of each array element in the projection array; and if the value of an array element after subtracting the mode is negative, it is recorded as zero (that is, any value less than 0 is set to 0). The mode is usually 0; if the page contains a table or a frame, the mode should instead be the width of the table lines or frame lines. The invention can therefore effectively cut specific layout elements (such as tables and dividing lines), thereby improving block segmentation performance.

In the third operation 1013, it is detected whether a blank exists in the middle of the two preprocessed projection arrays; if a blank exists, block segmentation is performed on the document page image according to the blank position in the middle of the two preprocessed projection arrays; if no blank exists, the block segmentation flow is ended.
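The mode-subtraction preprocessing can be sketched as below. The function name `subtract_mode` is hypothetical, and the tie-breaking rule (when several values are equally frequent, the one seen first is used) is an assumption, since the description does not say how ties are resolved.

```python
from collections import Counter
from typing import List


def subtract_mode(proj: List[int]) -> List[int]:
    """Subtract the most frequent value from every element, clamping negatives to 0."""
    if not proj:
        return []
    # most_common(1) returns the value with the highest count; among equally
    # frequent values, the one encountered first wins (an assumption here).
    mode = Counter(proj).most_common(1)[0][0]
    return [max(value - mode, 0) for value in proj]


# Toy projection of a region enclosed by a frame: the two border lines
# contribute 2 to every position, and content adds to that baseline.
framed = [2, 2, 9, 2, 2, 7, 2]
print(subtract_mode(framed))  # [0, 0, 7, 0, 0, 5, 0]
```

After the subtraction, positions covered only by the frame lines become zero, so they are treated as blank in the same way as empty space on an unframed page.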

Specifically, if a continuous blank exists in the middle of a projection array, the position of the continuous blank is recorded; if N runs of continuous blank exist in the projection array, the projection array is divided into N-1 sections, and the original image (i.e., the document page image) is block-segmented at the blank positions to obtain N-1 page image blocks. The blank interval used for block segmentation therefore no longer depends on experience; instead, effective block segmentation follows the page layout itself (such as line spacing), which greatly improves block segmentation performance.
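A minimal sketch of locating the continuous blanks in a preprocessed projection array follows. The helper name `runs` and the half-open span convention are assumptions made for illustration. In the toy array, blank runs bound the content at both ends, so three blank runs leave two content sections, matching the N to N-1 relationship stated above.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open index range [start, end)


def runs(proj: List[int], blank: bool) -> List[Span]:
    """Return maximal runs of zero values (blank=True) or non-zero values (blank=False)."""
    spans: List[Span] = []
    start: Optional[int] = None
    for i, value in enumerate(proj):
        if (value == 0) == blank:
            if start is None:
                start = i          # a new run begins here
        elif start is not None:
            spans.append((start, i))  # the current run ended just before i
            start = None
    if start is not None:
        spans.append((start, len(proj)))  # run extends to the end of the array
    return spans


proj = [0, 0, 4, 5, 0, 0, 3, 0]
blank_runs = runs(proj, blank=True)
content_runs = runs(proj, blank=False)
print(blank_runs)    # [(0, 2), (4, 6), (7, 8)]
print(content_runs)  # [(2, 4), (6, 7)]
```

Each content span can then be used to slice the corresponding rows or columns out of the page image to form one page image block.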

Further, in operation 101, referring to fig. 2, the method further includes: the first operation 1011 to the third operation 1013 are repeated for each of the plurality of page image blocks to obtain a plurality of N-time divided page image blocks, where a value of N is a positive integer greater than 1.

In this way, operations 1011 to 1013 within operation 101 are iterated until the original image (i.e., the document page image) has been cut into page image blocks that cannot be divided further. The page image blocks cut by the recursive algorithm form a tree structure. The classic rule-based XY-tree global recursive cutting algorithm of the prior art generates a binary tree, whereas each node produced by the present recursive algorithm may have two or more subtrees, so the performance and computational efficiency of block segmentation can be effectively improved.
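The recursion can be sketched on an already binarized page as follows. Several choices here are assumptions rather than details from the description: the Y projection is tried before the X projection, a block is split only when an axis yields at least two content runs, and the names `Node`, `content_runs`, `segment` and `leaves` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel ranges


@dataclass
class Node:
    box: Box
    children: List["Node"] = field(default_factory=list)


def content_runs(proj: List[int]) -> List[Tuple[int, int]]:
    """Subtract the mode, clamp at zero, and return the runs of non-zero values."""
    if not proj:
        return []
    mode = Counter(proj).most_common(1)[0][0]
    found, start = [], None
    for i, value in enumerate(proj):
        if value - mode > 0:
            if start is None:
                start = i
        elif start is not None:
            found.append((start, i))
            start = None
    if start is not None:
        found.append((start, len(proj)))
    return found


def segment(binary: List[List[int]], box: Box) -> Node:
    """Recursively split a block at blanks; a node may get two or more children."""
    x0, y0, x1, y1 = box
    node = Node(box)
    y_runs = content_runs([sum(binary[y][x0:x1]) for y in range(y0, y1)])
    if len(y_runs) > 1:  # split into horizontal bands (assumed to be tried first)
        boxes = [(x0, y0 + a, x1, y0 + b) for a, b in y_runs]
    else:
        x_proj = [sum(binary[y][x] for y in range(y0, y1)) for x in range(x0, x1)]
        x_runs = content_runs(x_proj)
        if len(x_runs) <= 1:
            return node  # no blank separates content: this block is a leaf
        boxes = [(x0 + a, y0, x0 + b, y1) for a, b in x_runs]
    node.children = [segment(binary, b) for b in boxes]
    return node


def leaves(node: Node) -> List[Node]:
    """Collect the blocks that could not be divided further."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


page = [
    [0] * 10,
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    [0] * 10,
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [0] * 10,
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
root = segment(page, (0, 0, 10, 6))
print(len(root.children))             # 3
print([n.box for n in leaves(root)])
```

In this toy page the root is split into three bands in a single pass, which illustrates a node with more than two children; the first band is then split again along the X axis. Every split produces strictly smaller boxes, so the recursion terminates.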

Further, after block segmentation by the recursive algorithm, operation 102 is performed: first, feature extraction is performed on each page image block of the plurality of page image blocks to obtain the length and width of the page image block and its position coordinates in the document page image; then, the length and width of the page image block and its position coordinates in the document page image are determined as the corresponding feature vector.
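A feature vector of this kind can be sketched from a block's bounding box. The description does not specify which extent is the "length", which is the "width", or which point the position coordinates refer to; the sketch below assumes horizontal extent, vertical extent, and the top-left corner respectively. `feature_vector` is a hypothetical name.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open, in page pixels


def feature_vector(box: Box) -> List[int]:
    """Return [length, width, x, y] for one page image block."""
    x0, y0, x1, y1 = box
    length = x1 - x0  # horizontal extent (assumed meaning of "length")
    width = y1 - y0   # vertical extent (assumed meaning of "width")
    return [length, width, x0, y0]  # position = top-left corner (assumption)


blocks = [(0, 1, 2, 2), (7, 1, 8, 2), (0, 3, 10, 4)]
features = [feature_vector(b) for b in blocks]
print(features)  # [[2, 1, 0, 1], [1, 1, 7, 1], [10, 1, 0, 3]]
```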

In operation 103, labeled data sampling is performed according to the feature vectors; specifically, a sampled image block that meets the condition is labeled 1, and otherwise it is labeled 0.

In operations 104 and 105, the plurality of feature vectors and the sampled labeled data are fitted using a random forest algorithm to train a model; the trained prediction model is then used to perform image recognition on the page image blocks obtained in operation 101, yielding the finally desired figure, i.e., the target prediction image.
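To illustrate the fit-then-predict flow without third-party dependencies, the sketch below uses a deliberately simplified stand-in for a random forest: an ensemble of one-split decision stumps, each trained on a bootstrap sample with one randomly chosen feature, combined by majority vote. It is not a full random forest and not the patented model; in practice a library implementation would normally be used. The toy feature vectors, the labels (1 for a figure block, 0 otherwise, assumed to come from manual annotation of a sample), and all function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Stump = Tuple[int, int, int, int]  # (feature index, threshold, label if <=, label if >)


def majority(labels: Sequence[int]) -> int:
    """Return the more frequent of the labels 0 and 1 (ties go to 1)."""
    return 1 if 2 * sum(labels) >= len(labels) else 0


def fit_stump(X: Sequence[Sequence[int]], y: Sequence[int], feature: int) -> Stump:
    """Choose the threshold on one feature that misclassifies the fewest samples."""
    best_errors, best_stump = None, None
    for t in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= t]
        right = [lab for row, lab in zip(X, y) if row[feature] > t]
        l_lab = majority(left)
        r_lab = majority(right) if right else l_lab
        errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
        if best_errors is None or errors < best_errors:
            best_errors, best_stump = errors, (feature, t, l_lab, r_lab)
    return best_stump


def fit_forest(X, y, n_trees: int = 25, seed: int = 0) -> List[Stump]:
    """Train stumps on bootstrap samples, each restricted to one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sample rows with replacement
        feature = rng.randrange(len(X[0]))        # random feature for this stump
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict(forest: List[Stump], row: Sequence[int]) -> int:
    """Majority vote over all stumps."""
    return majority([l if row[f] <= t else r for f, t, l, r in forest])


# Toy feature vectors [length, width, x, y]; label 1 = figure block, 0 = other.
X = [[300, 200, 50, 100], [280, 240, 60, 400], [320, 180, 40, 700],
     [500, 14, 50, 80], [480, 14, 50, 310], [510, 16, 50, 330], [120, 14, 50, 620]]
y = [1, 1, 1, 0, 0, 0, 0]

forest = fit_forest(X, y)
predictions = [predict(forest, row) for row in X]
print(fit_stump(X, y, feature=1))  # (1, 16, 0, 1)
```

On the full toy data, a stump on the vertical extent separates the two classes at a threshold of 16. The blocks predicted as 1 would be the candidates for the target prediction image.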

According to the image identification method for an electronic document of the embodiment of the invention, block segmentation is first performed on a document page image to obtain a plurality of page image blocks; feature extraction is then performed on each of the page image blocks to obtain a plurality of corresponding feature vectors; labeled data sampling is performed according to the plurality of feature vectors; model training is then performed according to the plurality of feature vectors and the sampled labeled data to obtain a prediction model; and finally, image identification is performed on the plurality of page image blocks using the prediction model to obtain a target prediction image. In this way, on the basis of block segmentation of the document page image, the invention performs image recognition by sampling labeled data and training a model on the feature vectors of the page image blocks. In practical applications, YOLO requires labeled data on the order of hundreds of thousands of samples, whereas the method of the invention requires on the order of hundreds to thousands. Compared with the YOLO algorithm, the model has a simple calculation process, does not need much labeled data, and greatly reduces the cost of image prediction. The efficient and accurate image recognition provided by the invention can lay a good figure-extraction foundation for constructing an image search engine.

Also, based on the image recognition method for an electronic document as described above, an embodiment of the present invention provides a computer-readable storage medium storing a program that, when executed by a processor, causes the processor to perform at least the following operation steps: operation 101, performing block segmentation on a document page image to obtain a plurality of page image blocks; operation 102, performing feature extraction on the plurality of page image blocks respectively to obtain a plurality of corresponding feature vectors; operation 103, performing labeled data sampling according to the plurality of feature vectors; operation 104, performing model training according to the plurality of feature vectors and the sampled labeled data to obtain a prediction model; and operation 105, performing image recognition on the plurality of page image blocks by using the prediction model to obtain a target prediction image.

Further, based on the image recognition method for electronic documents as described above, the present invention also provides an image recognition apparatus for electronic documents. As shown in FIG. 3, the apparatus 30 includes: a block segmentation module 301, configured to perform block segmentation on the document page image to obtain a plurality of page image blocks; a feature extraction module 302, configured to perform feature extraction on the plurality of page image blocks respectively to obtain a plurality of corresponding feature vectors; a sampling module 303, configured to perform labeled data sampling according to the plurality of feature vectors; a training module 304, configured to perform model training according to the plurality of feature vectors and the sampled labeled data to obtain a prediction model; and a prediction identification module 305, configured to perform image identification on the plurality of page image blocks by using the prediction model, so as to obtain a target prediction image.

According to an embodiment of the present invention, the block division module 301 includes: the device comprises a first unit, a second unit and a third unit, wherein the first unit is used for acquiring two projection arrays of a document page image, and the two projection arrays respectively comprise projection values of the document page image on an X axis and a Y axis; the second unit is used for preprocessing the two projection arrays; and the third unit is used for carrying out block segmentation on the document page image according to the blank position in the middle of the two preprocessed projection arrays to obtain a plurality of page image blocks after primary segmentation.

According to an embodiment of the present invention, the third unit includes: the detection subunit is used for detecting whether a blank exists between the two preprocessed projection arrays; the block segmentation subunit is used for carrying out block segmentation on the document page image according to a blank position between the two preprocessed projection arrays if a blank exists; if no blank exists, the block division flow is ended.

According to an embodiment of the present invention, the block segmentation module 301 is further configured to repeat the operations of the first unit to the third unit for each page image block of the plurality of page image blocks, so as to obtain a plurality of N times of segmented page image blocks, where a value of N is a positive integer greater than 1.

According to an embodiment of the present invention, the second unit is specifically configured to determine a mode of all array elements in each projection array; subtracting the mode from the corresponding value of each array element in the projected array; and if the corresponding value of the array element obtained by subtracting the mode value is a negative value, marking the corresponding value of the array element as zero.

According to an embodiment of the present invention, the feature extraction module 302 is specifically configured to perform feature extraction on each page image block of the plurality of page image blocks to obtain a length and a width of the page image block and a position coordinate of the page image block in the document page image; and determining the length and the width of the page image block and the position coordinates in the document page image as corresponding characteristic vectors.

It should be noted here that the above description of the image recognition apparatus embodiment is similar to the description of the method embodiment shown in FIGS. 1 and 2 and has similar beneficial effects, and is therefore not repeated. For technical details not disclosed in the image recognition apparatus embodiment of the present invention, refer to the description of the method embodiment shown in FIGS. 1 and 2.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another device, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.

Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.

Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: various media that can store program code, such as removable storage devices, ROMs, magnetic or optical disks, etc.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An image recognition method for an electronic document, the method comprising:

performing block segmentation on the document page image to obtain a plurality of page image blocks, comprising:

a first operation of obtaining two projection arrays of the document page image, wherein the two projection arrays respectively comprise projection values of the document page image on an X axis and a Y axis;

a second operation of preprocessing the two projection arrays;

a third operation of performing block segmentation on the document page image according to a blank position in the middle of the two preprocessed projection arrays to obtain a plurality of page image blocks after one segmentation pass;

repeating the first operation to the third operation for each page image block in the plurality of page image blocks to obtain a plurality of N divided page image blocks, wherein the value of N is a positive integer greater than 1;

the preprocessing the two projection arrays comprises:

determining a mode of all array elements in each projection array;

subtracting the mode from the value corresponding to each array element in the projected array;

if the corresponding value of the array element after subtracting the mode is a negative value, recording the corresponding value of the array element as zero;

performing feature extraction on each of the plurality of page image blocks to obtain a plurality of corresponding feature vectors, which comprises:

performing feature extraction on each page image block in the plurality of page image blocks to obtain the length and the width of the page image block and its position coordinates in the document page image;

determining the length and the width of the page image block and its position coordinates in the document page image as the corresponding feature vector;

performing labeled data sampling according to the plurality of feature vectors;

performing model training according to the plurality of feature vectors and the sampled labeled data to obtain a prediction model;

and performing image identification on the plurality of page image blocks by using the prediction model to obtain a target prediction image.
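The projection and preprocessing operations of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `projection_arrays` and `preprocess` are hypothetical, the page is assumed to be a binary image stored as a list of rows (1 for ink, 0 for background), and the projection value is assumed to be a simple pixel count, which the claim does not specify.

```python
from collections import Counter
from typing import List, Tuple


def projection_arrays(page: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Return (x_proj, y_proj) for a binary page image.

    Assumption: each projection value is the number of ink pixels in a
    column (X axis) or in a row (Y axis).
    """
    x_proj = [sum(col) for col in zip(*page)]  # one value per column
    y_proj = [sum(row) for row in page]        # one value per row
    return x_proj, y_proj


def preprocess(proj: List[int]) -> List[int]:
    """Subtract the mode from every element and record negatives as zero.

    Assumption: when several values tie for most frequent, whichever one
    Counter reports first is used; the claim does not cover ties.
    """
    mode = Counter(proj).most_common(1)[0][0]
    return [max(value - mode, 0) for value in proj]


# Toy page: two ink columns joined by a full-width line in the bottom row.
page = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
x_proj, y_proj = projection_arrays(page)
x_clean = preprocess(x_proj)
y_clean = preprocess(y_proj)
```

In this toy input the raw X projection has no zero entries because the bottom line touches every column. After mode subtraction, the three middle columns become zero and form a blank run in the middle of the array.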

2. The method of claim 1, wherein performing block segmentation on the document page image according to the blank position in the middle of the two preprocessed projection arrays comprises:

detecting whether a blank exists in the middle of the two preprocessed projection arrays;

if a blank exists, performing block segmentation on the document page image according to the blank position in the middle of the two preprocessed projection arrays;

if no blank exists, ending the block segmentation process.
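The blank detection of claim 2, combined with the repeated segmentation of claim 1, might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: a "blank in the middle" is taken to mean a run of zeros with nonzero values on both sides, the Y axis is tried before the X axis, and the names `middle_blank`, `segment` and `rounds_left` are hypothetical. The input page is assumed to be a non-empty binary image stored as a list of rows.

```python
from collections import Counter
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in page coordinates


def preprocess(proj: List[int]) -> List[int]:
    """Subtract the mode from every element and record negatives as zero."""
    mode = Counter(proj).most_common(1)[0][0]
    return [max(value - mode, 0) for value in proj]


def middle_blank(proj: List[int]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first zero run with content on both sides."""
    n = len(proj)
    i = 0
    while i < n and proj[i] == 0:  # leading zeros are a margin, not a gap
        i += 1
    while i < n:
        if proj[i] == 0:
            j = i
            while j < n and proj[j] == 0:
                j += 1
            return (i, j) if j < n else None  # trailing zeros: margin only
        i += 1
    return None


def segment(page: List[List[int]], x0: int = 0, y0: int = 0,
            rounds_left: int = 8) -> List[Box]:
    """Split a block at middle blanks until none remain or rounds run out."""
    if rounds_left > 0:
        y_gap = middle_blank(preprocess([sum(row) for row in page]))
        if y_gap:
            a, b = y_gap  # rows a..b-1 are blank: cut into top and bottom
            return (segment(page[:a], x0, y0, rounds_left - 1)
                    + segment(page[b:], x0, y0 + b, rounds_left - 1))
        x_gap = middle_blank(preprocess([sum(col) for col in zip(*page)]))
        if x_gap:
            a, b = x_gap  # columns a..b-1 are blank: cut into left and right
            return (segment([r[:a] for r in page], x0, y0, rounds_left - 1)
                    + segment([r[b:] for r in page], x0 + b, y0, rounds_left - 1))
    return [(x0, y0, len(page[0]), len(page))]  # no blank: segmentation ends


# Toy page: two short text lines separated by three blank rows.
page = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
blocks = segment(page)  # [(0, 0, 6, 1), (0, 4, 6, 1)]
```

The sketch keeps trailing margins inside each block instead of trimming them, so both resulting blocks span the full page width. The `rounds_left` cap stands in for the N rounds of segmentation and is an arbitrary choice.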

3. An image recognition apparatus for an electronic document, the apparatus comprising:

a block segmentation module, configured to perform block segmentation on a document page image to obtain a plurality of page image blocks;

wherein the block segmentation module comprises:

a first unit, configured to obtain two projection arrays of the document page image, wherein the two projection arrays respectively comprise projection values of the document page image on an X axis and a Y axis;

a second unit, configured to preprocess the two projection arrays;

a third unit, configured to perform block segmentation on the document page image according to a blank position in the middle of the two preprocessed projection arrays to obtain a plurality of page image blocks after a first segmentation;

the block segmentation module being further configured to repeat the operations of the first unit to the third unit for each page image block in the plurality of page image blocks to obtain a plurality of page image blocks after N segmentations, wherein N is a positive integer greater than 1;

the second unit being specifically configured to: determine the mode of all array elements in each projection array; subtract the mode from the value of each array element in the projection array; and, if the value of an array element after subtracting the mode is negative, record the value of that array element as zero;

a feature extraction module, configured to perform feature extraction on each of the plurality of page image blocks to obtain a plurality of corresponding feature vectors;

the feature extraction module being specifically configured to: perform feature extraction on each page image block in the plurality of page image blocks to obtain the length and the width of the page image block and its position coordinates in the document page image; and determine the length and the width of the page image block and its position coordinates in the document page image as the corresponding feature vector;

a sampling module, configured to perform labeled data sampling according to the plurality of feature vectors;

a training module, configured to perform model training according to the plurality of feature vectors and the sampled labeled data to obtain a prediction model;

and a prediction identification module, configured to perform image identification on the plurality of page image blocks by using the prediction model to obtain a target prediction image.
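The feature vector produced by the feature extraction module might look like the following minimal sketch. The `Block` type, the function names, and the element order are hypothetical. "Length" is assumed here to mean the vertical extent of the block, and the position coordinates are assumed to be those of its top-left corner; the claim fixes neither.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """A segmented page image block in document page coordinates."""
    x: int       # left edge (assumed reference point: top-left corner)
    y: int       # top edge
    width: int   # horizontal extent
    height: int  # vertical extent, treated as the "length" (assumption)


def feature_vector(block: Block) -> List[float]:
    """Build [length, width, x, y] for one block (element order assumed)."""
    return [float(block.height), float(block.width),
            float(block.x), float(block.y)]


def feature_vectors(blocks: List[Block]) -> List[List[float]]:
    """Build one feature vector per page image block."""
    return [feature_vector(b) for b in blocks]


vectors = feature_vectors([Block(x=0, y=0, width=6, height=1),
                           Block(x=4, y=10, width=4, height=2)])
```

Because each vector holds only geometry, it is cheap to compute and independent of the pixel content of the block. The claims do not specify how the sampling and training modules consume these vectors.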

4. The apparatus of claim 3, wherein the third unit comprises:

a detection subunit, configured to detect whether a blank exists in the middle of the two preprocessed projection arrays;

a block segmentation subunit, configured to perform block segmentation on the document page image according to the blank position in the middle of the two preprocessed projection arrays if a blank exists, and to end the block segmentation process if no blank exists.

5. A computer storage medium comprising a set of computer-executable instructions which, when executed, perform the image recognition method for an electronic document according to claim 1 or 2.