CN110991279A - Document image analysis and recognition method and system - Google Patents

Document image analysis and recognition method and system

Info

Publication number: CN110991279A (granted as CN110991279B)
Application number: CN201911143272.6A
Authority: CN (China)
Prior art keywords: document image, network, recognition, prediction, terminal
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: 豆浩斌, 陈博, 朱风云, 庞在虎
Original and current assignee: Beijing Lingban Future Technology Co ltd
Application filed by Beijing Lingban Future Technology Co ltd; priority to CN201911143272.6A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40: Document-oriented image-based pattern recognition
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a document image analysis and recognition system, which comprises: a user operation end, an interaction center, a process control end, a machine engine management end, a manual labeling management end, a machine terminal cluster and a manual terminal cluster. The user operation end, the process control end, the machine engine management end and the manual labeling management end are respectively connected to the interaction center; the machine engine management end is connected with the machine terminal cluster; and the manual labeling management end is connected with the manual terminal cluster. In addition, the invention also discloses a document image analysis and recognition method. The system combines the efficiency of the machine with the accuracy of manual work, providing users with simple operation steps and reliable processing results; at the same time, the man-machine coupling mode teaches the machine over continuous iterations, so that machine performance is gradually enhanced and the degree of manual participation is reduced.

Description

Document image analysis and recognition method and system
Technical Field
The invention relates to the technical field of document image analysis and recognition, and in particular to a document image analysis and recognition method and system.
Background
Optical Character Recognition (OCR) is a technology that optically scans the characters of a paper document into a pixel-lattice image file and then converts the characters in the image into a text format through recognition software, for further editing and processing by word-processing software.
Document Image Analysis and Recognition (DIAR) is a technology that analyzes the physical and logical structure of a document image by computer vision methods, and locates and recognizes each element inside the document (such as text, tables, images and graphs), thereby forming a complete description of the document.
A distributed software system is a software system that supports distributed processing, executing tasks on multiple processors interconnected by a communication network.
In the prior art, the prototype of document image analysis and recognition technology is traditional optical character recognition, which mainly processes and recognizes the text portions of a document image. With the gradual improvement of computer software and hardware capabilities and people's demand for higher-level, more comprehensive document image processing, more related technologies have been studied in depth, such as page segmentation, layout analysis and chart analysis, realizing complete analysis and description of a document image at different levels and better supporting high-level functions such as document retrieval, abstract generation and knowledge extraction. Current document image analysis and recognition systems typically include the following processing steps:
1. Image preprocessing, including noise removal and distortion correction, to obtain a regular, easy-to-process document image;
2. Page segmentation, i.e., dividing the document image into a number of homogeneous regions such as text, graphics, images and tables;
3. Analysis of the hierarchical structure of the document image, including the relative positions and spatial layout of the physical layer, and the semantic labels of the logical layer, such as headers, footers, titles, chapters, paragraphs and icons;
4. Chart analysis: a chart is a structured, visual and information-dense mode of presentation; chart analysis extracts the structural information a chart presents by analyzing its internal structure;
5. Text localization and recognition, i.e., determining the position and textual content of text in the document; depending on the processing algorithm, this can be divided into localization and recognition of text lines versus single characters;
6. Structural description and format conversion of the document: the parsed document structure is described, stored and transmitted in a specific format, and can be converted into common document formats such as MS Word, PDF and HTML.
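The six steps above form a pipeline in which each stage consumes the previous stage's output. A minimal sketch, with stub stages and an assumed dict-based document model (none of these names come from the patent):

```python
# Illustrative sketch of the six-stage processing pipeline described above.
# Stage names and the dict-based document model are assumptions for
# demonstration; the patent does not prescribe an API.

def preprocess(image):
    # 1. Noise removal / distortion correction (stub: pass through).
    return {"image": image, "regions": [], "structure": {}, "text": []}

def segment_pages(doc):
    # 2. Split the page into homogeneous regions (text / figure / table).
    doc["regions"] = [{"type": "text", "bbox": (0, 0, 100, 40)},
                      {"type": "table", "bbox": (0, 50, 100, 90)}]
    return doc

def analyze_structure(doc):
    # 3. Physical layout plus logical labels (title, paragraph, ...).
    doc["structure"] = {"title": doc["regions"][0]}
    return doc

def analyze_charts(doc):
    # 4. Extract the internal structure of chart/table regions.
    doc["tables"] = [r for r in doc["regions"] if r["type"] == "table"]
    return doc

def detect_and_recognize_text(doc):
    # 5. Locate text lines and recognize their content (stub result).
    doc["text"] = ["example line"]
    return doc

def export(doc, fmt="html"):
    # 6. Describe the parsed structure in a target format.
    return {"format": fmt, "content": doc["text"]}

def run_pipeline(image):
    doc = preprocess(image)
    for stage in (segment_pages, analyze_structure,
                  analyze_charts, detect_and_recognize_text):
        doc = stage(doc)
    return export(doc)
```

Each stage's final result becomes the initial condition of the next, which is exactly the property the man-machine coupled flow later exploits.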
However, the inventors have found through research that document image analysis and recognition systems in the prior art mainly have the following problems: first, incomplete functionality, providing only a few of the document image analysis and recognition functions and recognizing only a few types of objects, so that a complete description of the hierarchical structure of the document image cannot be formed; second, low accuracy, with no guarantee of high recognition accuracy on document images of poor quality and complex layout; and third, a lack of mature manual proofreading tools and services, resulting in a poor user experience.
In addition, due to cost or efficiency constraints, the prior art cannot provide a complete processing flow and lacks some processing steps, so a complete description of the document information is difficult to obtain; moreover, prior art schemes only provide a software tool and lack functions such as subsequent proofreading and verification, requiring the user to solve these problems separately and increasing the difficulty of use.
Disclosure of Invention
In view of the above problems, the present invention provides a complete solution for document image analysis and recognition, generating a complete description of the hierarchical structure of the document image and ensuring the efficiency and accuracy of the whole processing flow through a distributed man-machine coupling approach.
Based on this, a document image analysis and recognition method is especially provided, which comprises:
step 1, a message communication end of a document image analysis and recognition system receives a task initiating message sent by a user operation end, and the document image analysis and recognition system starts a document image analysis and recognition processing task;
step 2, acquiring a document image to be processed, inputting the document image to be processed into the document image analysis and recognition system, and acquiring basic information of the document to be processed;
step 3, performing page segmentation on the document image to be processed, simultaneously generating segmentation tasks of all page images in the document image to be processed in a message queue mode, sending the segmentation tasks to a machine engine terminal for executing the tasks through a machine engine management terminal, forwarding an initial page segmentation result obtained after the preprocessing of the machine engine terminal to a manual annotation terminal, and returning a final page segmentation result after the manual annotation to a process control terminal of the document image analysis and identification system for updating the page segmentation result;
step 4, obtaining initial information of table analysis processing after page segmentation processing is completed, simultaneously generating all table analysis tasks of the document image to be processed by adopting a message queue method, sending the table analysis tasks to a machine engine terminal for executing the tasks through a machine engine management terminal, forwarding an initial table analysis result obtained after the preprocessing of the machine engine terminal to a manual labeling terminal, and returning a final table analysis result after manual correction by a manual labeling operator to a process control terminal of the document image analysis and identification system for updating the table analysis result;
step 5, obtaining initial information of text detection after page segmentation processing and form analysis processing, simultaneously generating all text detection tasks of a document image to be processed in a message queue mode, sending the text detection tasks to a machine engine terminal executing the tasks through a machine engine management terminal, forwarding an initial text detection result obtained after preprocessing of the machine engine terminal to a manual labeling terminal, and returning a final text detection result after manual proofreading by a manual labeling person to a process control terminal of the document image analysis and recognition system for updating the text detection result;
step 6, obtaining initial information of text recognition after text detection is completed, generating all text recognition tasks of the document image to be processed in a message queue mode, sending the text recognition tasks to a machine engine terminal executing the tasks through a machine engine management terminal, forwarding an initial text recognition result obtained after the preprocessing of the machine engine terminal to a manual labeling terminal, and returning a final text recognition result after the manual correction of the manual labeling operator to a process control terminal of the document image analysis and recognition system for updating the text recognition result;
and 7, when the tasks of page segmentation, table analysis, text detection and text recognition of the document image to be processed are all completed, integrating the labeling results of different levels by the document image analysis and recognition system, and exporting the electronic document file.
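Steps 3 through 6 all follow the same pattern: tasks are generated in a message queue, a machine engine terminal produces an initial result, a manual labeling terminal corrects it, and the final result returns to the process control end. A minimal sketch of that loop with Python's standard `queue` module (all names and the confidence field are illustrative, not from the patent):

```python
import queue

def machine_engine(task):
    # Machine pre-processing: produce an initial (possibly imperfect) result.
    return {"task": task, "result": f"initial:{task}", "confidence": 0.7}

def manual_annotation(initial):
    # Human annotator reviews and corrects the machine's initial result.
    corrected = initial["result"].replace("initial", "final")
    return {"task": initial["task"], "result": corrected}

def process_step(task_names):
    # Process control end: enqueue all tasks for this step, route machine
    # output to manual correction, and collect the final results.
    tasks = queue.Queue()
    for name in task_names:
        tasks.put(name)
    finals = []
    while not tasks.empty():
        task = tasks.get()
        initial = machine_engine(task)      # machine engine terminal
        final = manual_annotation(initial)  # manual labeling terminal
        finals.append(final)                # returned to process control end
    return finals
```

In the real system the queue, engine and annotator live on different network nodes; here they are collapsed into one process purely to show the data flow.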
In one embodiment, the acquiring the image of the document to be processed comprises acquiring a page image of the document to be processed by scanning or photographing, and recording basic information of the document to be processed, wherein the basic information comprises a name, an author, a publishing institution and a publishing date.
In one embodiment, the machine engine terminal runs a document image analysis and recognition model based on a deep neural network, determines the output of the model required to be called according to the current document image analysis and recognition processing steps, and returns the processing result to the machine engine management terminal.
In one embodiment, the deep neural network-based document image analysis and recognition model comprises an input layer, a feature extraction network, a multitask prediction network and a multitask output layer; the input layer is connected to the feature extraction network, the feature extraction network is connected to the multitask prediction network, and the multitask prediction network is connected to the multitask output layer;
the input layer receives an input page image, wherein the input page image is a page image in a document to be processed currently; the feature extraction network is a stacked multilayer convolutional neural network; the multi-task prediction network is a multi-layer prediction network which is specially used for corresponding tasks and is respectively constructed aiming at different prediction tasks; and the multitask output layer outputs output results of different prediction networks.
In one embodiment, the feature extraction network is a stacked multilayer convolutional neural network in which each convolutional layer applies a nonlinear mapping to the output of the previous layer; through multiple nonlinear mappings, the input page image is represented and described, and the representative features are extracted and output. The representative features of the page image acquired through the feature extraction network are shared features, common to multiple prediction tasks; the multiple prediction tasks comprise page segmentation, table analysis, text detection and text recognition;
the multi-task prediction network comprises a page segmentation prediction network, a table analysis prediction network, a text detection prediction network and a text recognition prediction network, and is respectively used for realizing the prediction tasks of page segmentation, table analysis, text detection and text recognition; the page segmentation prediction network, the table analysis prediction network, the text detection prediction network and the text recognition prediction network share input features, namely different prediction networks share representation features output by a feature extraction network; the multi-task prediction network determines different prediction network structures according to different prediction tasks respectively;
the multitask output layer comprises a page segmentation result output by the page segmentation prediction network, a table analysis result output by the table analysis prediction network, a text detection result output by the text detection prediction network and a text recognition result output by the text recognition prediction network.
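The shared-feature idea, run the backbone once and feed the same representation to all four task heads, can be sketched in plain Python (the feature extractor and head implementations are stubs; only the data flow is illustrated):

```python
def extract_features(page_image):
    # Shared feature extractor (stands in for the stacked CNN backbone):
    # computed once per page, then reused by all four prediction heads.
    return [sum(row) for row in page_image]

def page_segmentation_head(feats):
    return {"task": "page_segmentation", "n": len(feats)}

def table_analysis_head(feats):
    return {"task": "table_analysis", "n": len(feats)}

def text_detection_head(feats):
    return {"task": "text_detection", "n": len(feats)}

def text_recognition_head(feats):
    return {"task": "text_recognition", "n": len(feats)}

HEADS = (page_segmentation_head, table_analysis_head,
         text_detection_head, text_recognition_head)

def multitask_forward(page_image):
    feats = extract_features(page_image)  # one backbone pass, shared
    return [head(feats) for head in HEADS]
```

Because the backbone runs once rather than four times, adding a task costs only its head, which is the efficiency argument the patent makes for shared features.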
In addition, in order to solve the technical problems in the prior art, a document image analysis and recognition system is particularly provided, comprising a user operation end, an interaction center, a process control end, a machine engine management end, a manual labeling management end, a machine terminal cluster and a manual terminal cluster;
the user operation end, the process control end, the machine engine management end and the manual labeling management end are respectively connected to the interaction center; the machine engine management end is connected with the machine terminal cluster; the manual labeling management end is connected with the manual terminal cluster;
the interaction center comprises a data storage end and a message communication end; the data storage terminal is used for storing data uploaded by a user, a result after document image analysis and identification processing and data required by interaction between different modules and terminals in the document image analysis and identification system; the message communication terminal is used for establishing and completing message communication among all modules and terminals in the document image analysis and recognition system;
the user operation end is used for the user to perform system login, data uploading, task initiation, progress checking, result downloading and recharging payment operations;
the process control end is used for controlling a man-machine coupled document image analysis and recognition processing process and storing key data in the document image analysis and recognition processing process;
the machine terminal cluster comprises a plurality of machine engine terminals; the machine engine management end is used for managing and scheduling the machine engine terminals, determining operation steps by receiving information sent by the process control end and issuing operation tasks to corresponding execution terminals according to the running state of each current machine engine terminal; after the operation is finished, returning a corresponding message to the process control end;
the manual terminal cluster comprises a plurality of manual labeling terminals; the manual labeling management end is used for managing and scheduling the manual labeling terminals, determining operation steps by receiving information sent by the process control end, and issuing operation tasks to the corresponding execution terminals according to the running states of the current manual labeling terminals; when the operation is finished, a corresponding message is returned to the process control end.
In one embodiment, the man-machine coupled document image analysis and recognition processing flow comprises that the flow control terminal receives a task initiation message of a user operation terminal through the message communication terminal, so as to start a document image analysis and recognition processing flow; the process control end acquires the completed step of the current task, determines the next processing step and sends the step to a machine engine management end or a manual labeling management end through a message communication end; and after the processing flow is finished, the flow control end sends a message to the user operation end, so that the user operation end updates the current task finishing state.
In one embodiment, the machine terminal cluster includes a plurality of machine engine terminals; all machine engine terminals in the system are numbered uniformly, and are managed and allocated uniformly by a machine engine management end; the machine engine terminal runs a document image analysis and recognition model based on a deep neural network, determines the output of the model to be called according to the current document image analysis and recognition processing step, and returns the processing result to the machine engine management end;
the manual terminal cluster comprises a plurality of manual marking terminals, the manual marking terminals correspond to manual marking personnel, all the manual marking personnel and the corresponding manual marking terminals are numbered uniformly, and the manual marking terminals are managed and allocated uniformly by a manual marking management end; and the manual labeling operator checks and modifies the current labeling result in the manual labeling terminal, determines an operation page required to be called by the manual labeling terminal according to the current processing step, and returns the labeling result to the manual labeling management terminal.
In one embodiment, the deep neural network-based document image analysis and recognition model comprises an input layer, a feature extraction network, a multitask prediction network and a multitask output layer; the input layer is connected to the feature extraction network, the feature extraction network is connected to the multitask prediction network, and the multitask prediction network is connected to the multitask output layer;
the input layer receives an input page image, wherein the input page image is a page image in a document to be processed currently; the feature extraction network is a stacked multilayer convolutional neural network; the multi-task prediction network is a multi-layer prediction network which is specially used for corresponding tasks and is respectively constructed aiming at different prediction tasks; and the multitask output layer outputs output results of different prediction networks.
In one embodiment, the feature extraction network is a stacked multilayer convolutional neural network in which each convolutional layer applies a nonlinear mapping to the output of the previous layer; through multiple nonlinear mappings, the input page image is represented and described, and the representative features are extracted and output. The representative features of the page image acquired through the feature extraction network are shared features, common to multiple prediction tasks; the multiple prediction tasks comprise page segmentation, table analysis, text detection and text recognition;
the multi-task prediction network comprises a page segmentation prediction network, a table analysis prediction network, a text detection prediction network and a text recognition prediction network, and is respectively used for realizing the prediction tasks of page segmentation, table analysis, text detection and text recognition; the page segmentation prediction network, the table analysis prediction network, the text detection prediction network and the text recognition prediction network share input features, namely different prediction networks share representation features output by a feature extraction network; the multi-task prediction network determines different prediction network structures according to different prediction tasks respectively;
the multitask output layer comprises a page segmentation result output by the page segmentation prediction network, a table analysis result output by the table analysis prediction network, a text detection result output by the text detection prediction network and a text recognition result output by the text recognition prediction network.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
In the document image analysis and recognition method disclosed by the invention, each step has a process of machine initial judgment followed by manual verification, and the final result of each step serves as the initial condition of the next step, so that the whole processing system combines the efficiency of the machine with the accuracy of manual work. In the document image analysis and recognition system disclosed by the invention, human annotators and machines are distributed across a plurality of network nodes, organically integrated and communicating through the process control end, the data storage end and the message communication end, and the system is finally provided to users as a distributed network service, giving them simple operation steps and reliable processing results. The man-machine coupling mode also has a teaching effect on the machine over continuous iterations, so that machine performance is gradually enhanced and the degree of manual participation is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other schematic diagrams can be obtained according to the drawings without creative efforts;
wherein:
FIG. 1 is a schematic diagram of a deep neural network-based document image analysis and recognition model in the present invention;
FIG. 2 is a schematic diagram of a man-machine depth-coupled distributed document image analysis and recognition system according to the present invention;
FIG. 3 is a flowchart of a document image analysis and recognition method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the invention, a document image analysis and recognition model based on a deep neural network is first constructed; the model has multitask output and can simultaneously output the results of several different processing stages. To avoid the increase in model complexity and computation caused by multitask output, the model adopts a shared-feature approach to improve operating efficiency.
The deep neural network comprises fully connected layers, convolutional layers, recurrent connection layers, pooling layers and normalization layers.
the full connection layer is used for connecting each output node with all input nodes, so that the integral transformation of input characteristics is realized; the all-connection layer can be expressed as Fc (x; 'ch _ i,' ch _ o, g (∙)) = g (Wx + b), where x ∈ R [ ("ch _ i × 1) is the input feature vector, W ∈ R [ (" ch _ o × "ch _ i) is the weight, b ∈ R [ (" ch _ o × 1) is the offset, ch _ i is the number of channels that are input features, ch _ o is the number of channels that are output features, g (∙) is the activation function; the types of the activation function comprise five types of Linear, Sigmoid, Tanh, ReLU and SoftMax.
The convolutional layer realizes a local transformation of the input features through locally shared connections. It can be expressed as Conv(x; h_k, w_k, ch_i, ch_o, sx_k, sy_k, g(·)) = g(W ∗ x + b), where ∗ is the convolution operator, x ∈ R^(h_i × w_i × ch_i) is the input feature map, W ∈ R^(h_k × w_k × ch_i × ch_o) is the convolution kernel weight, b is the bias, ch_i is the number of channels of the input features, ch_o is the number of channels of the output features, h_k and w_k are the kernel height and width, sx_k is the horizontal stride of the convolution kernel, sy_k is the vertical stride, and g(·) is the activation function.
The recurrent connection layer feeds the output of the network back as its input, realizing feature extraction and transformation of serialized signals. It can be expressed as Rnn(x_t, h_(t-1); ch_i, ch_o, g(·)) = g(Wx_t + Uh_(t-1) + b), where x_t ∈ R^(ch_i × 1) is the input feature vector at time t, h_(t-1) ∈ R^(ch_o × 1) is the output feature vector at the previous time step, W ∈ R^(ch_o × ch_i) and U ∈ R^(ch_o × ch_o) are the mapping weights of the current input feature and the previous output feature respectively, b ∈ R^(ch_o × 1) is the bias, ch_i is the number of channels of the input features, ch_o is the number of channels of the output features, and g(·) is the activation function.
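A single step of the recurrent formula, h_t = g(W x_t + U h_(t-1) + b), can be sketched in NumPy as follows; tanh is chosen here as an illustrative default activation:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b, g=np.tanh):
    # One recurrent step: h_t = g(W x_t + U h_{t-1} + b).
    # Shapes: x_t is (ch_i,), h_prev is (ch_o,),
    # W is (ch_o, ch_i), U is (ch_o, ch_o), b is (ch_o,).
    return g(W @ x_t + U @ h_prev + b)
```

Calling this in a loop over t, feeding each output back as h_prev, gives the serialized-signal processing the text describes.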
The pooling layer comes in two types: the maximum pooling layer and the average pooling layer. A pooling layer can be expressed as Pool(x; h_k, w_k, sx_k, sy_k) and mainly realizes the aggregation of local regions of the input features: the maximum pooling layer takes the maximum value of each local region, and the average pooling layer takes its average. Here h_k is the height of the pooled local region, w_k is its width, sx_k is the horizontal stride of the pooling window, and sy_k is the vertical stride.
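Pool(x; h_k, w_k, sx_k, sy_k) can be sketched with explicit loops in NumPy; the valid-windows-only (no padding) convention below is an assumption, since the patent does not state one:

```python
import numpy as np

def pool2d(x, h_k, w_k, sx_k, sy_k, mode="max"):
    # Slide an h_k x w_k window over x with strides (sy_k, sx_k),
    # reducing each local region to its max or its mean.
    H, W = x.shape
    out_h = (H - h_k) // sy_k + 1
    out_w = (W - w_k) // sx_k + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            win = x[i * sy_k:i * sy_k + h_k, j * sx_k:j * sx_k + w_k]
            out[i, j] = reduce_fn(win)
    return out
```

On the 2 x 2 input [[1, 2], [3, 4]] a single 2 x 2 window yields 4 under max pooling and 2.5 under average pooling.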
The normalization layer realizes a normalized transformation of the input features by methods such as mean-variance normalization or batch normalization; in particular, batch normalization can be adopted to realize the normalized transformation of the input features.
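A minimal sketch of batch normalization over a batch of feature vectors; gamma, beta and eps are the usual learned scale, shift and numerical-stability constant, and the default values below are illustrative:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature channel to zero mean and unit variance
    # over the batch axis, then apply scale (gamma) and shift (beta).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

Mean-variance normalization is the same transform without the learned gamma/beta parameters.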
As shown in FIG. 1, the deep neural network-based document image analysis and recognition model comprises an input layer 11, a feature extraction network 12, a multitask prediction network 13 and a multitask output layer 14; the input layer 11 is connected to the feature extraction network 12, the feature extraction network 12 is connected to the multitask prediction network 13, and the multitask prediction network 13 is connected to the multitask output layer 14;
the input layer 11 receives an input page image, wherein the input page image is a page image in a document to be processed currently;
the feature extraction network 12 is a multilayer convolutional neural network in which each layer applies a nonlinear mapping to the output of the previous layer; through multiple nonlinear mappings, an effective representation and description of the input page image is realized, and the shared features are extracted and output;
specifically, the feature extraction network 12 employs a convolutional neural network comprising 13 convolutional layers and 1 pooling layer; the first layer of the feature extraction network is a convolutional layer with parameters Conv(5,5,1,16,1,1,ReLU), followed by a max pooling layer with parameters Pool(3,3,2,2); subsequently connected are 6 residual modules, each of which adds a cross-layer connection directly from input to output on top of the normal sequential layer connections; each residual module is composed of 2 convolutional layers, where the 2 convolutional layers of the first residual module have parameters Conv(3,3,16,32,2,2,ReLU) and Conv(3,3,32,32,1,1,ReLU), those of the second residual module Conv(3,3,32,64,2,2,ReLU) and Conv(3,3,64,64,1,1,ReLU), those of the third residual module Conv(3,3,64,128,2,2,ReLU) and Conv(3,3,128,128,1,1,ReLU), those of the fourth residual module Conv(3,3,128,256,2,2,ReLU) and Conv(3,3,256,256,1,1,ReLU), those of the fifth residual module Conv(3,3,256,512,2,2,ReLU) and Conv(3,3,512,512,1,1,ReLU), and those of the sixth residual module Conv(3,3,512,1024,2,2,ReLU) and Conv(3,3,1024,1024,1,1,ReLU); the shared features are the representative features of the page image acquired through the feature extraction network 12, and these representative features are shared by multiple prediction tasks; the multiple prediction tasks comprise page segmentation, table analysis, text detection and text recognition;
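The backbone above can be traced without any deep-learning framework by computing feature-map sizes layer by layer. The sketch below assumes 'same' padding (the patent does not state the padding scheme), so each stride-2 layer halves the spatial size while the residual modules double the channel count:

```python
def conv_out(size, k, s, pad_same=True):
    """Output size of a convolution/pooling step with kernel k and stride s
    under the assumed 'same' padding: ceil(size / s)."""
    return (size + s - 1) // s if pad_same else (size - k) // s + 1

def feature_shapes(h, w):
    """Trace (height, width, channels) through Conv(5,5,1,16,1,1) ->
    Pool(3,3,2,2) -> 6 residual modules, whose first convolution has stride 2."""
    h, w = conv_out(h, 5, 1), conv_out(w, 5, 1)      # first convolution, stride 1
    h, w = conv_out(h, 3, 2), conv_out(w, 3, 2)      # max pooling, stride 2
    ch, shapes = 16, []
    for _ in range(6):                               # six residual modules
        h, w, ch = conv_out(h, 3, 2), conv_out(w, 3, 2), ch * 2
        shapes.append((h, w, ch))
    return shapes

print(feature_shapes(512, 512))
# [(128, 128, 32), (64, 64, 64), (32, 32, 128), (16, 16, 256), (8, 8, 512), (4, 4, 1024)]
```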
in particular, in order to realize a multi-scale feature description of the input image, the output features of different layers of the feature extraction network are scaled to a uniform size by image interpolation, and a convolutional layer with parameters Conv(1,1,ch_i,ch_o,1,1,ReLU) is added to convert the number of output channels to a specific size, where ch_i is the number of input feature channels and ch_o is the number of output feature channels; for the output features of the 6 residual modules in the feature extraction network, the output of each residual module is converted to ch_o=32 output channels, and the results are then concatenated along the channel dimension to give 6×32=192 channels, finally yielding the shared features;
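A toy NumPy sketch of this fusion step (the random feature maps, nearest-neighbour interpolation, and 64×64 target size are illustrative assumptions; the patent only specifies "image interpolation" and the 1×1 convolution):

```python
import numpy as np

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour interpolation of a (C, H, W) feature map."""
    _, H, W = x.shape
    rows = np.arange(out_h) * H // out_h
    cols = np.arange(out_w) * W // out_w
    return x[:, rows][:, :, cols]

def project_1x1(x, weight):
    """1x1 convolution: a per-pixel linear map from ch_i to ch_o channels."""
    return np.einsum("oc,chw->ohw", weight, x)

rng = np.random.default_rng(0)
# Toy stand-ins for the 6 residual-module outputs at successively halved scales
feats = [rng.standard_normal((ch, 64 // 2**i, 64 // 2**i))
         for i, ch in enumerate([32, 64, 128, 256, 512, 1024])]
target_h, target_w = 64, 64
fused = np.concatenate(
    [project_1x1(resize_nearest(f, target_h, target_w),
                 rng.standard_normal((32, f.shape[0])))   # ch_o = 32 per scale
     for f in feats], axis=0)
print(fused.shape)   # (192, 64, 64) -- 6 scales x 32 channels
```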
the multi-task prediction network 13 is a task-specific multi-layer prediction network which is respectively constructed for different tasks, and comprises a page segmentation prediction network, a table analysis prediction network, a text detection prediction network and a text recognition prediction network which are respectively used for realizing prediction tasks of page segmentation, table analysis, text detection and text recognition; the page segmentation prediction network, the table analysis prediction network, the text detection prediction network and the text recognition prediction network share the same input features, namely share the shared features output by the feature extraction network; determining different prediction network structures of the tasks according to respective characteristics of the tasks;
in particular, the multi-layer predictive network of the multitasking predictive network 13 shares the same input features.
Wherein the page segmentation prediction network comprises one convolutional layer Conv(3,3,192,5,1,1,SoftMax), where ch_o=5 indicates that page regions comprise 5 classes: background, text, image, table and division line;
the table analysis prediction network predicts the position and orientation of table lines, adopting one convolutional layer with structure Conv(3,3,192,2,1,1,Sigmoid), where ch_o=2 represents the 2 predicted values for table-line position and orientation;
the text detection prediction network predicts the position and orientation information of text lines; the adopted convolutional layer can be expressed as Conv(3,3,192,6,1,1,Sigmoid), where ch_o=6 represents 6 predicted values: the probability score, the four border positions (top, bottom, left and right) and the overall orientation of the text line.
The text recognition prediction network first converts the features of the corresponding text line regions into sequence features of uniform dimension through a spatial transformer network, according to the position and orientation information of the text lines; the sequence relationship is then modeled by adding a recurrent network Rnn(192,256,Tanh), and finally a convolutional layer with structure Conv(1,1,256,CharNum,1,1,SoftMax) is added to obtain the final prediction result, where ch_o=CharNum is the number of character categories to be recognized;
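The four task heads described above can be summarized in a small configuration table; the tuple layout (kh, kw, ch_i, ch_o, sx, sy, activation) follows the Conv notation used in this document, and the dictionary itself is a hypothetical sketch, not code from the patent:

```python
# The four prediction heads, each consuming the 192-channel shared features
# (the text recognition head runs after Rnn(192, 256, Tanh)).
HEADS = {
    #                    (kh, kw, ch_i, ch_o,      sx, sy, activation)
    "page_segmentation": (3,  3,  192,  5,         1,  1,  "SoftMax"),  # 5 region classes
    "table_parsing":     (3,  3,  192,  2,         1,  1,  "Sigmoid"),  # line position + orientation
    "text_detection":    (3,  3,  192,  6,         1,  1,  "Sigmoid"),  # score, 4 borders, orientation
    "text_recognition":  (1,  1,  256,  "CharNum", 1,  1,  "SoftMax"),  # character classes
}

def output_channels(head):
    """ch_o of a head, i.e. how many values it predicts per location."""
    return HEADS[head][3]

print(output_channels("page_segmentation"))   # 5
```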
wherein the multi-task output layer 14 outputs the results of the different task prediction networks, that is, the page segmentation prediction network outputs the page segmentation result, the table parsing prediction network outputs the table parsing result, the text detection prediction network outputs the text detection result, and the text recognition prediction network outputs the text recognition result; an output result is either a final result or an intermediate result of the prediction task, an intermediate result being post-processed to obtain the final result; constraint relationships exist between the different output results.
During operation, the shared features only need to be calculated once and are cached in the data storage end. Given the shared features, the different task prediction networks are relatively independent; the task prediction network to be run is determined according to the task message from the process control end, and the corresponding prediction result is obtained.
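The compute-once-and-cache behaviour can be sketched as follows (class and function names are hypothetical; the real system stores the features in the data storage end 221 rather than in memory):

```python
class SharedFeatureCache:
    """Run the expensive feature extraction once per page; reuse it for every task."""
    def __init__(self, extract_fn):
        self.extract_fn = extract_fn
        self._cache = {}
        self.extractions = 0          # counts actual feature-extraction runs

    def features(self, page_id, image):
        if page_id not in self._cache:
            self._cache[page_id] = self.extract_fn(image)
            self.extractions += 1
        return self._cache[page_id]

cache = SharedFeatureCache(extract_fn=lambda img: sum(img))  # stand-in extractor
for task in ["segment", "parse_table", "detect_text", "recognize_text"]:
    feats = cache.features("page-1", [1, 2, 3])              # all tasks share one page
print(cache.extractions)   # 1 -- computed once, reused by all four tasks
```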
In order to sufficiently train the deep neural network-based document image analysis and recognition model, an automatic document image generation method based on program synthesis is adopted, which generates images together with the annotation information for the multiple outputs required by model training. The document image analysis and recognition model constructed with this automatic generation method, and the machine engine running the model, can quickly and completely analyze the input document image to be processed.
In the invention, a distributed document image analysis and recognition system based on man-machine depth coupling is shown in FIG. 2.
The document image analysis and recognition system comprises a user operation end 21, an interaction center 22, a process control end 23, a machine engine management end 24, a manual labeling management end 25, a machine terminal cluster 26 and a manual terminal cluster 27;
the user operation end 21, the process control end 23, the machine engine management end 24, and the manual labeling management end 25 are respectively connected to the interaction center 22; the machine engine management end 24 is connected with the machine terminal cluster 26; the manual labeling management terminal 25 is connected with the manual terminal cluster 27;
the interaction center 22 comprises a data storage end 221 and a message communication end 222; the data storage end 221 is used for storing data uploaded by a user, processed results and data required by interaction between different modules and terminals; the message communication terminal 222 is used for establishing and completing message communication between each module and terminal in the document image analysis and recognition system;
the user operation end 21 is used for the user to perform operations of system login, data uploading, task initiation, progress checking, result downloading, recharging and paying;
the process control end 23 is a processing center of the document image analysis and recognition system, and is used for controlling a human-computer coupled document image analysis and recognition processing process and storing key data in the processing process;
specifically, the process control end 23 receives a task initiation message from the user operation end 21 through the message communication end 222, so as to start a document image analysis and identification processing flow; the process control end 23 obtains the completion step of the current task, determines the next processing step, and sends the step to the machine engine management end 24 or the manual labeling management end 25 through the message communication end 222; after the whole processing flow is completed, the flow control end 23 sends a message to the user operation end 21, so that the user operation end 21 updates the current task completion state;
wherein the machine terminal cluster 26 comprises a plurality of machine engine terminals; all machine engine terminals are numbered uniformly and are managed and allocated uniformly by the machine engine management end 24; each machine engine terminal runs the deep neural network-based document image analysis and recognition model, determines which model output to call according to the current processing step, and returns the processing result to the machine engine management end;
the machine engine management end 24 is configured to manage and schedule the machine engine terminals, determine the next operation by receiving information sent by the process control end 23, and issue operation tasks to the corresponding execution terminals according to the current running state of each machine engine terminal; when an operation is finished, a corresponding message is returned to the process control end 23;
wherein the artificial terminal cluster 27 comprises a plurality of manual labeling terminals; the manual labeling terminals correspond to manual labeling personnel, and all labeling personnel and their corresponding terminals are numbered uniformly and are managed and allocated uniformly by the manual labeling management end 25; the labeling personnel check and modify the current labeling result in the manual labeling terminal, the operation page to be called is determined according to the current processing step, and the labeling result is returned to the manual labeling management end 25;
the manual labeling management end 25 is configured to manage and schedule the manual labeling terminals, determine the next operation by receiving information sent by the process control end 23, and issue operation tasks to the corresponding execution terminals according to the current running state of each manual labeling terminal; when an operation is completed, a corresponding message is returned to the process control end 23.
As shown in fig. 3, in order to improve the stability of the system, the present invention further provides a deeply human-machine-coupled document image analysis and recognition processing method, enabling the system to combine the efficiency of the machine with the accuracy of the human. The document image analysis and recognition processing method includes the following steps:
step 1, a message communication end receives a task initiating message of a user operation end, so that a document image analysis and recognition processing task is started;
step 2, acquiring a document image to be processed, inputting the document image to be processed into the document image analysis and recognition system, and acquiring basic information of the document to be processed;
specifically, acquiring an image of a document to be processed includes acquiring a page image of the document to be processed by scanning or photographing, and recording basic information of the document to be processed, where the basic information includes a name, an author, a publishing organization, and a publishing date;
step 3, performing page segmentation on the document to be processed: segmentation tasks for all page images are generated in a message queue and sent to the machine engine; the initial result produced by the machine engine is forwarded to the manual labeling system, and the manually proofread result is returned to the process control end to update the page segmentation result;
step 4, after page segmentation is completed, obtaining the initial information for table analysis: all table analysis tasks are generated using the message queue method, preprocessed by the machine engine and then manually proofread in sequence, and the final result is returned to update the table analysis result;
step 5, after the page segmentation and the form analysis are completed, initial information of text detection can be obtained, all text detection tasks are generated in a message queue mode, and are subjected to machine engine preprocessing and manual proofreading in sequence, and a final result is returned to update a text detection result;
step 6, obtaining initial information of text recognition after text detection is completed, generating all text recognition tasks in a message queue mode, sequentially performing machine engine preprocessing and manual proofreading, and returning a final result to update a text recognition result;
step 7, when all processing tasks of the document image to be processed are finished, integrating the labeling results of the different levels and exporting an electronic document file in a specific format.
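Steps 3 through 6 follow the same pattern: tasks are generated in a message queue, a machine pass produces an initial result, a human pass verifies it, and the verified result becomes the initial condition for the next stage. A minimal sketch of this loop (all names hypothetical):

```python
from collections import deque

STAGES = ["page_segmentation", "table_parsing", "text_detection", "text_recognition"]

def machine_predict(stage, page, prior):
    # prior holds the verified result of the previous stage (the initial condition)
    return f"{stage}:machine({page})"

def human_verify(result):
    # manual proofreading of the machine's initial result
    return result + ":verified"

def process_document(pages):
    results = {page: None for page in pages}
    for stage in STAGES:                   # stages run in the fixed order above
        queue = deque(pages)               # message-queue style task generation
        while queue:
            page = queue.popleft()
            initial = machine_predict(stage, page, results[page])
            results[page] = human_verify(initial)
    return results

out = process_document(["p1", "p2"])
print(out["p1"])   # text_recognition:machine(p1):verified
```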
The process control end receives a task initiating message of a user operation end through the message communication end, so that a document image analysis and recognition processing process is started; the process control end acquires the completed step of the current task, determines the next processing step and sends the step to the machine engine management end 24 or the manual labeling management end 25 through the message communication end; after the processing flow is completed, the flow control end sends a message to the user operation end 21, so that the user operation end 21 updates the current task completion state.
Wherein the machine terminal cluster 26 comprises a plurality of machine engine terminals; all machine engine terminals in the system are numbered uniformly, and are managed and allocated uniformly by a machine engine management end 24; the machine engine terminal runs a document image analysis and recognition model based on a deep neural network, determines the output of the model to be called according to the current document image analysis and recognition processing steps, and returns the processing result to the machine engine management terminal 24;
the manual terminal cluster 27 comprises a plurality of manual labeling terminals, the manual labeling terminals correspond to manual labeling personnel, all the manual labeling personnel and the corresponding manual labeling terminals are numbered in a unified manner, and the manual labeling management terminal 25 manages and allocates the manual labeling terminals in a unified manner; the manual labeling operator checks and modifies the current labeling result in the manual labeling terminal, determines an operation page required to be called by the manual labeling terminal according to the current processing step, and returns the labeling result to the manual labeling management terminal 25.
As shown in fig. 1, the deep neural network-based document image analysis and recognition model includes an input layer 11, a feature extraction network 12, a multitask prediction network 13 and a multitask output layer 14; the input layer 11 is connected to the feature extraction network 12, the feature extraction network 12 is connected to the multitask prediction network 13, and the multitask prediction network 13 is connected to the multitask output layer 14;
the input layer 11 receives an input page image, wherein the input page image is a page image in a document to be processed currently; the feature extraction network is a stacked multilayer convolutional neural network; the multi-task prediction network is a multi-layer prediction network which is specially used for corresponding tasks and is respectively constructed aiming at different prediction tasks; and the multitask output layer outputs output results of different prediction networks.
The feature extraction network 12 is a stacked multilayer convolutional neural network, each layer of convolutional neural network is a nonlinear mapping output by a previous layer of convolutional neural network, the input page image is represented and described through multiple times of nonlinear mapping, and the representation features are extracted and output; the representation characteristics of the page images acquired through the characteristic extraction network are shared characteristics which are shared by various prediction tasks; the multiple prediction tasks comprise page segmentation, table analysis, text detection and text recognition.
The multi-task prediction network 13 comprises a page segmentation prediction network, a table analysis prediction network, a text detection prediction network and a text recognition prediction network, and is respectively used for realizing the prediction tasks of page segmentation, table analysis, text detection and text recognition; the page segmentation prediction network, the table analysis prediction network, the text detection prediction network and the text recognition prediction network share input features, namely different prediction networks share representation features output by a feature extraction network; the multi-task prediction network 13 determines its different prediction network structures according to different prediction tasks, respectively.
The multitask output layer 14 includes a page partition result output by the page partition prediction network, a table parsing result output by the table parsing prediction network, a text detection result output by the text detection prediction network, and a text recognition result output by the text recognition prediction network.
In the human-machine deeply coupled document image analysis and recognition method provided by the invention, human-machine deep coupling means that every step of the document image analysis and recognition process consists of an initial machine judgment followed by manual verification, and the final processing result of each step serves as the initial condition of the next step.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
the invention adopts a deep neural network model based on multi-task shared features as the machine engine of the document image analysis and recognition system; through its multiple outputs, the model can provide analysis and recognition results for every hierarchy of the document image. In actual operation, the shared features of a page image are cached in the system, and all hierarchical tasks executed on that page image can load the cached features directly, without repeated calculation.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A document image analysis and recognition method is characterized by comprising the following steps:
step 1, a message communication end of a document image analysis and recognition system receives a task initiating message sent by a user operation end, and the document image analysis and recognition system starts a document image analysis and recognition processing task;
step 2, acquiring a document image to be processed, inputting the document image to be processed into the document image analysis and recognition system, and acquiring basic information of the document to be processed;
step 3, performing page segmentation on the document image to be processed, simultaneously generating segmentation tasks of all page images in the document image to be processed in a message queue mode, sending the segmentation tasks to a machine engine terminal for executing the tasks through a machine engine management terminal, forwarding an initial page segmentation result obtained after the preprocessing of the machine engine terminal to a manual annotation terminal, and returning a final page segmentation result after the manual annotation to a process control terminal of the document image analysis and identification system for updating the page segmentation result;
step 4, obtaining initial information of table analysis processing after page segmentation processing is completed, simultaneously generating all table analysis tasks of the document image to be processed by adopting a message queue method, sending the table analysis tasks to a machine engine terminal for executing the tasks through a machine engine management terminal, forwarding an initial table analysis result obtained after the preprocessing of the machine engine terminal to a manual labeling terminal, and returning a final table analysis result after manual correction by a manual labeling operator to a process control terminal of the document image analysis and identification system for updating the table analysis result;
step 5, obtaining initial information of text detection after page segmentation processing and form analysis processing, simultaneously generating all text detection tasks of a document image to be processed in a message queue mode, sending the text detection tasks to a machine engine terminal executing the tasks through a machine engine management terminal, forwarding an initial text detection result obtained after preprocessing of the machine engine terminal to a manual labeling terminal, and returning a final text detection result after manual proofreading by a manual labeling person to a process control terminal of the document image analysis and recognition system for updating the text detection result;
step 6, obtaining initial information of text recognition after text detection is completed, generating all text recognition tasks of the document image to be processed in a message queue mode, sending the text recognition tasks to a machine engine terminal executing the tasks through a machine engine management terminal, forwarding an initial text recognition result obtained after the preprocessing of the machine engine terminal to a manual labeling terminal, and returning a final text recognition result after the manual correction of the manual labeling operator to a process control terminal of the document image analysis and recognition system for updating the text recognition result;
step 7, when the tasks of page segmentation, table analysis, text detection and text recognition of the document image to be processed are all completed, integrating the labeling results of different levels by the document image analysis and recognition system, and exporting the electronic document file.
2. The document image analysis and recognition method of claim 1,
acquiring the image of the document to be processed comprises acquiring a page image of the document to be processed in a scanning or photographing mode, and recording basic information of the document to be processed, wherein the basic information comprises a name, an author, a publishing organization and a publishing date.
3. The document image analysis and recognition method of claim 1,
the machine engine terminal runs a document image analysis and recognition model based on a deep neural network, determines the output of the model required to be called according to the current document image analysis and recognition processing steps, and returns the processing result to the machine engine management end.
4. The document image analysis and recognition method of claim 3,
the document image analysis and recognition model based on the deep neural network comprises an input layer, a feature extraction network, a multitask prediction network and a multitask output layer; the input layer is connected to the feature extraction network, the feature extraction network is connected to the multitask prediction network, and the multitask prediction network is connected to the multitask output layer;
the input layer receives an input page image, wherein the input page image is a page image in a document to be processed currently; the feature extraction network is a stacked multilayer convolutional neural network; the multi-task prediction network is a multi-layer prediction network which is specially used for corresponding tasks and is respectively constructed aiming at different prediction tasks; and the multitask output layer outputs output results of different prediction networks.
5. The document image analysis and recognition method of claim 4,
the feature extraction network is a plurality of layers of superposed convolutional neural networks, each layer of convolutional neural network is a nonlinear mapping output by the previous layer of convolutional neural network, the input page image is represented and described through the nonlinear mapping for a plurality of times, and the representation features are extracted and output; the representation characteristics of the page images acquired through the characteristic extraction network are shared characteristics which are shared by various prediction tasks; the multiple prediction tasks comprise page segmentation, table analysis, text detection and text identification;
the multi-task prediction network comprises a page segmentation prediction network, a table analysis prediction network, a text detection prediction network and a text recognition prediction network, and is respectively used for realizing the prediction tasks of page segmentation, table analysis, text detection and text recognition; the page segmentation prediction network, the table analysis prediction network, the text detection prediction network and the text recognition prediction network share input features, namely different prediction networks share representation features output by a feature extraction network; the multi-task prediction network determines different prediction network structures according to different prediction tasks respectively;
the multitask output layer comprises a page segmentation result output by the page segmentation prediction network, a table analysis result output by the table analysis prediction network, a text detection result output by the text detection prediction network and a text recognition result output by the text recognition prediction network.
6. A document image analysis and recognition system is characterized by comprising a user operation end, an interaction center, a process control end, a machine engine management end, a manual labeling management end, a machine terminal cluster and a manual terminal cluster;
the user operation end, the process control end, the machine engine management end and the manual labeling management end are respectively connected to the interaction center; the machine engine management end is connected with the machine terminal cluster; the manual marking management end is connected with the manual terminal cluster;
the interaction center comprises a data storage end and a message communication end; the data storage terminal is used for storing data uploaded by a user, a result after document image analysis and identification processing and data required by interaction between different modules and terminals in the document image analysis and identification system; the message communication terminal is used for establishing and completing message communication among all modules and terminals in the document image analysis and recognition system;
the user operation end is used for the user to perform system login, data uploading, task initiation, progress checking, result downloading and recharging payment operations;
the process control end is used for controlling a man-machine coupled document image analysis and recognition processing process and storing key data in the document image analysis and recognition processing process;
the machine terminal cluster comprises a plurality of machine engine terminals; the machine engine management end is used for managing and scheduling the machine engine terminals, determining operation steps by receiving information sent by the process control end and issuing operation tasks to corresponding execution terminals according to the running state of each current machine engine terminal; after the operation is finished, returning a corresponding message to the process control end;
the artificial terminal cluster comprises a plurality of manual labeling terminals; the manual labeling management end is used for managing and scheduling the manual labeling terminals, determining operation steps by receiving information sent by the process control end, and issuing operation tasks to corresponding execution terminals according to the running states of the current manual labeling terminals; and when the operation is finished, returning a corresponding message to the process control end.
7. The document image analysis and recognition system of claim 6,
the man-machine coupled document image analysis and recognition processing flow comprises that the flow control end receives a task initiating message of a user operation end through the message communication end, so as to start a document image analysis and recognition processing flow; the process control end acquires the completed step of the current task, determines the next processing step and sends the step to a machine engine management end or a manual labeling management end through a message communication end; and after the processing flow is finished, the flow control end sends a message to the user operation end, so that the user operation end updates the current task finishing state.
8. The document image analysis and recognition system of claim 6,
the machine terminal cluster comprises a plurality of machine engine terminals; all machine engine terminals in the system are numbered uniformly, and are managed and allocated uniformly by a machine engine management end; the machine engine terminal runs a document image analysis and recognition model based on a deep neural network, determines the output of the model to be called according to the current document image analysis and recognition processing step, and returns the processing result to the machine engine management end;
the manual terminal cluster comprises a plurality of manual marking terminals, the manual marking terminals correspond to manual marking personnel, all the manual marking personnel and the corresponding manual marking terminals are numbered uniformly, and the manual marking terminals are managed and allocated uniformly by a manual marking management end; and the manual labeling operator checks and modifies the current labeling result in the manual labeling terminal, determines an operation page required to be called by the manual labeling terminal according to the current processing step, and returns the labeling result to the manual labeling management terminal.
9. The document image analysis and recognition system of claim 8,
the document image analysis and recognition model based on the deep neural network comprises an input layer, a feature extraction network, a multi-task prediction network and a multi-task output layer; the input layer is connected to the feature extraction network, the feature extraction network is connected to the multi-task prediction network, and the multi-task prediction network is connected to the multi-task output layer;
the input layer receives an input page image, wherein the input page image is a page image in the document currently being processed; the feature extraction network is a stacked multi-layer convolutional neural network; the multi-task prediction network comprises task-specific prediction networks, one constructed for each prediction task; and the multi-task output layer outputs the results of the different prediction networks.
10. The document image analysis and recognition system of claim 9,
the feature extraction network is a stack of multiple convolutional layers, where each convolutional layer applies a nonlinear mapping to the output of the previous layer; the input page image is represented and described through these repeated nonlinear mappings, and the resulting representation features are extracted and output; the representation features of the page image obtained through the feature extraction network are shared features common to the multiple prediction tasks; the multiple prediction tasks comprise page segmentation, table analysis, text detection and text recognition;
the multi-task prediction network comprises a page segmentation prediction network, a table analysis prediction network, a text detection prediction network and a text recognition prediction network, which respectively realize the prediction tasks of page segmentation, table analysis, text detection and text recognition; these prediction networks share input features, that is, the different prediction networks share the representation features output by the feature extraction network; and the multi-task prediction network adopts a different prediction network structure for each prediction task;
the multi-task output layer comprises the page segmentation result output by the page segmentation prediction network, the table analysis result output by the table analysis prediction network, the text detection result output by the text detection prediction network and the text recognition result output by the text recognition prediction network.
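The shared-backbone architecture of claims 9 and 10 can be sketched as follows: features are computed once by a stacked nonlinear extractor and reused by four task-specific heads. Dense layers stand in for the stacked convolutional layers, and all layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class SharedBackboneModel:
    """Toy sketch of the claimed model: input layer -> shared feature
    extraction network -> multi-task prediction network -> multi-task
    output layer. Dense layers replace real convolutions for brevity."""

    def __init__(self, in_dim=64, feat_dim=32):
        # "Feature extraction network": two stacked nonlinear mappings,
        # each applied to the output of the previous layer.
        self.w1 = rng.standard_normal((in_dim, feat_dim)) * 0.1
        self.w2 = rng.standard_normal((feat_dim, feat_dim)) * 0.1
        # Task-specific prediction heads; output sizes are arbitrary here.
        self.heads = {
            "page_segmentation": rng.standard_normal((feat_dim, 4)) * 0.1,
            "table_analysis":    rng.standard_normal((feat_dim, 2)) * 0.1,
            "text_detection":    rng.standard_normal((feat_dim, 5)) * 0.1,
            "text_recognition":  rng.standard_normal((feat_dim, 10)) * 0.1,
        }

    def forward(self, page_image_vec):
        # Shared representation features, computed once...
        feats = relu(relu(page_image_vec @ self.w1) @ self.w2)
        # ...and reused by every prediction network; the returned dict
        # plays the role of the multi-task output layer.
        return {task: feats @ w for task, w in self.heads.items()}

model = SharedBackboneModel()
outputs = model.forward(rng.standard_normal(64))
```

Because the heads consume the same feature vector, one forward pass through the backbone serves all four prediction tasks, which is the efficiency the shared-feature design buys.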
CN201911143272.6A 2019-11-20 2019-11-20 Document Image Analysis and Recognition Method and System Active CN110991279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911143272.6A CN110991279B (en) 2019-11-20 2019-11-20 Document Image Analysis and Recognition Method and System


Publications (2)

Publication Number Publication Date
CN110991279A true CN110991279A (en) 2020-04-10
CN110991279B CN110991279B (en) 2023-08-22

Family

ID=70085461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911143272.6A Active CN110991279B (en) 2019-11-20 2019-11-20 Document Image Analysis and Recognition Method and System

Country Status (1)

Country Link
CN (1) CN110991279B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898411A (en) * 2020-06-16 2020-11-06 华南理工大学 Text image labeling system, method, computer device and storage medium
CN114881992A (en) * 2022-05-24 2022-08-09 北京安德医智科技有限公司 Skull fracture detection method and device and storage medium

Citations (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04124785A (en) * 1990-09-17 1992-04-24 Hitachi Ltd Confirmation and correction method for ocr recognition result
GB0031596D0 (en) * 2000-12-22 2001-02-07 Barbara Justin S A system and method for improving accuracy of signal interpretation
JP2002024758A (en) * 2000-07-07 2002-01-25 Hitachi Ltd Input data determining method in document handling device
GB0318214D0 (en) * 2002-08-23 2003-09-03 Hewlett Packard Development Co Systems and methods for processing text-based electronic documents
AU2005201754A1 (en) * 2005-04-27 2006-11-16 Canon Kabushiki Kaisha Method of extracting data from documents
US20060288279A1 (en) * 2005-06-15 2006-12-21 Sherif Yacoub Computer assisted document modification
US20070033118A1 (en) * 2005-08-02 2007-02-08 Taxscan Technologies, Llc Document Scanning and Data Derivation Architecture.
CN101013440A (en) * 2007-01-12 2007-08-08 王宏源 Method for constructing digital library based on book knowledge element
CN101118597A (en) * 2006-07-31 2008-02-06 富士通株式会社 Form processing method, form processing device, and computer product
US20080118110A1 (en) * 2006-11-22 2008-05-22 Rutger Simonsson Apparatus and method for analyzing image identifications generated by an ocr device
CN101441713A (en) * 2007-11-19 2009-05-27 汉王科技股份有限公司 Optical character recognition method and apparatus of PDF document
CN101539929A (en) * 2009-04-17 2009-09-23 无锡天脉聚源传媒科技有限公司 Method for indexing TV news by utilizing computer system
CN101542504A (en) * 2006-09-08 2009-09-23 谷歌公司 Shape clustering in post optical character recognition processing
CN102289667A (en) * 2010-05-17 2011-12-21 微软公司 User correction of errors arising in a textual document undergoing optical character recognition (OCR) process
CN102663138A (en) * 2012-05-03 2012-09-12 北京大学 Method and device for inputting formula query terms
US20130330008A1 (en) * 2011-09-24 2013-12-12 Lotfi A. Zadeh Methods and Systems for Applications for Z-numbers
CN104123550A (en) * 2013-04-25 2014-10-29 魏昊 Cloud computing-based text scanning identification method
CN105159870A (en) * 2015-06-26 2015-12-16 徐信 Processing system for precisely completing continuous natural speech textualization and method for precisely completing continuous natural speech textualization
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
CN107943937A (en) * 2017-11-23 2018-04-20 杭州源诚科技有限公司 A kind of debtors assets monitoring method and system based on trial open information analysis
CN108170658A (en) * 2018-01-12 2018-06-15 山西同方知网数字出版技术有限公司 A kind of flexibly configurable, the Text region flexibly defined adapt critique system
CN108537146A (en) * 2018-03-22 2018-09-14 五邑大学 A kind of block letter mixes line of text extraction system with handwritten form
CN109165293A (en) * 2018-08-08 2019-01-08 上海宝尊电子商务有限公司 A kind of expert data mask method and program towards fashion world
CN109255113A (en) * 2018-09-04 2019-01-22 郑州信大壹密科技有限公司 Intelligent critique system
CN109543614A (en) * 2018-11-22 2019-03-29 厦门商集网络科技有限责任公司 A kind of this difference of full text comparison method and equipment
CN109685052A (en) * 2018-12-06 2019-04-26 泰康保险集团股份有限公司 Method for processing text images, device, electronic equipment and computer-readable medium
CN109840519A (en) * 2019-01-25 2019-06-04 青岛盈智科技有限公司 A kind of adaptive intelligent form recognition input device and its application method
CN109934227A (en) * 2019-03-12 2019-06-25 上海兑观信息科技技术有限公司 System for recognizing characters from image and method
CN110298032A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Text classification corpus labeling training system
CN110321875A (en) * 2019-07-19 2019-10-11 东莞理工学院 A kind of resume identification and intelligent classification screening system based on deep learning
CN110378332A (en) * 2019-06-14 2019-10-25 上海咪啰信息科技有限公司 A kind of container terminal case number (CN) and Train number recognition method and system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张文国: "OCR Digitization Processing System Successfully Developed, Providing Advanced Technical Means for Digitizing Books, Archives and Literature", no. 04 *
张文国: "Advanced Technology for Digitizing Books, Archives and Literature: the OCR Digitization Processing System", no. 09 *


Also Published As

Publication number Publication date
CN110991279B (en) 2023-08-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant