CN112733658B - Electronic document filing method and device - Google Patents

Info

Publication number
CN112733658B
Authority
CN
China
Prior art keywords
electronic document
layout
information
layout structure
structure information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011619714.2A
Other languages
Chinese (zh)
Other versions
CN112733658A (en)
Inventor
贺敏
赵岳
朱相宇
黄福林
刘明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Thunisoft Information Technology Co ltd
Original Assignee
Beijing Thunisoft Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Thunisoft Information Technology Co ltd
Priority to CN202011619714.2A
Publication of CN112733658A
Application granted
Publication of CN112733658B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/113 Details of archiving
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image

Abstract

The application discloses an electronic document filing method and device. The method comprises the following steps: receiving an electronic document to be archived; analyzing the electronic document to be archived by adopting an image segmentation algorithm to obtain layout structure information; and archiving the electronic document to be archived according to the layout structure information. In this method, the layout of the electronic document to be archived is analyzed with an image segmentation algorithm, and OCR recognition is performed only on the regions that contain key information, as determined by the layout structure, so that the electronic document is classified and catalogued. The method avoids the resource consumption and data redundancy caused by large amounts of OCR recognition, and thereby improves the precision and efficiency of electronic document filing.

Description

Electronic document filing method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for archiving an electronic document.
Background
Filing and collating electronic documents has long been done manually. With the development of artificial intelligence technology, many automatic document classification and cataloguing methods have appeared in recent years, but these products rely heavily on optical character recognition (OCR). OCR places high demands on computing resources, and performing OCR on every scanned page of an electronic document inevitably reduces performance. To address this, some researchers use deep learning to classify pages as first pages or last pages before OCR, then perform OCR and text analysis only on the first page of each item in order to classify it. However, when the document layout is disordered or the key information is unevenly distributed, performing OCR on every first page requires many rules to process the recognition results, and recognizing the whole page easily introduces redundant decision information.
Therefore, there is a need for a more efficient method of identifying and categorizing electronic documents.
Disclosure of Invention
The method uses deep learning. An image segmentation algorithm analyzes the layout of the electronic document to be archived, and the electronic document is classified and catalogued by recognizing and analyzing the contents of key layout areas such as the title, text, header, and footer according to the layout structure. This effectively improves both file classification efficiency and the utilization of computing resources.
The application provides an electronic document filing method, which comprises the following steps:
receiving an electronic document to be archived;
analyzing the electronic document to be archived by adopting an image segmentation algorithm to obtain layout structure information;
and archiving the electronic document to be archived according to the layout structure information.
Further, in a preferred embodiment provided in the present application, analyzing the electronic document to be archived by adopting an image segmentation algorithm to obtain layout structure information specifically includes:
acquiring the page element distribution characteristics of the electronic document to be archived and standardizing them;
segmenting the electronic document page into a plurality of layout areas by adopting an image segmentation algorithm, according to the standardized page element distribution characteristics;
aggregating the mapping relations between the element distribution characteristics of the layout areas and the layout types, and determining the mapping relation between the element distribution characteristic sample space and the layout type sample space;
acquiring layout structure information according to the page element distribution characteristics of the electronic document to be archived, wherein the layout structure information comprises layout types and coordinate information;
wherein the image segmentation algorithm comprises at least one of Mask R-CNN, Fast R-CNN, and U-Net.
Further, in a preferred embodiment provided by the present application, the layout structure information includes a layout type and coordinate information, and the layout type includes at least one of a background, a title, a text, a picture, a table, a header, and a footer.
Further, in a preferred embodiment provided in the present application, the method for archiving an electronic document to be archived according to layout structure information further includes:
determining a first region of the layout according to the layout structure information;
performing OCR recognition on a first area of the layout to generate first classification information;
and archiving the electronic document to be archived according to the first classification information.
Further, in a preferred embodiment provided by the present application, the first area of the layout is a title area.
Further, in a preferred embodiment provided in the present application, the method for archiving an electronic document to be archived according to layout structure information further includes:
inputting the layout structure information into a file classifier to generate second classification information;
and archiving the electronic document to be archived according to the second classification information.
Further, in a preferred embodiment provided herein, the file classifier is constructed and optimized by at least one of SVM, random forest, and linear regression.
Further, in a preferred embodiment provided by the present application, when obtaining the layout structure information fails, performing OCR recognition on an electronic document to be archived to generate a first intermediate document;
determining third classification information of the first intermediate document according to the first intermediate document;
and archiving the electronic document to be archived according to the third classification information.
Further, in a preferred embodiment provided herein, the method is used for archiving judicial case files.
The present application further provides an electronic document filing apparatus, comprising:
the receiving module is used for receiving the electronic document to be archived;
the image segmentation module is used for analyzing the electronic document to be archived by adopting an image segmentation algorithm to acquire layout structure information;
and the document filing module is used for filing the electronic document to be filed according to the layout structure information.
In the electronic document filing method, the layout of the electronic document to be archived is analyzed with an image segmentation algorithm, and OCR recognition is performed only on the areas that contain key information, as determined by the layout structure, so that the electronic document is classified and catalogued. The method avoids the resource consumption and data redundancy caused by large amounts of OCR recognition, and thereby improves the precision and efficiency of electronic document filing.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
Fig. 1 is a flowchart of an electronic document archiving method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of an electronic document filing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
Referring to fig. 1, an electronic document filing method provided in the embodiment of the present application specifically includes the following steps:
S100: Receiving an electronic document to be archived.
The electronic documents to be archived can be of many kinds, including official documents, judgment documents, administrative decisions, personnel information registration forms, hospital diagnosis and treatment records, academic papers, and any other electronic documents that need to be archived.
S200: Analyzing the electronic document to be archived by adopting an image segmentation algorithm to obtain layout structure information.
Image segmentation is an important research direction in computer vision and an important part of semantic image understanding. It refers to dividing an image into several regions with similar properties; from a mathematical point of view, it partitions an image into mutually disjoint regions. In recent years, deep learning has driven rapid progress in image segmentation. With different image segmentation algorithms, an electronic document can be segmented according to its elements and their distribution. Electronic documents to be archived usually have a certain layout structure. For example, form data generated by filling in forms online usually follows certain standards, and the official documents of government agencies typically follow strict formatting specifications. Academic papers and literature also follow certain format standards. In addition, documents produced in specific fields such as medical care, justice, and law enforcement have fixed formats. Therefore, the layout structure can be analyzed with an image segmentation algorithm to obtain layout structure information.
Further, in a preferred embodiment provided by the present application, analyzing the electronic document to be archived by adopting an image segmentation algorithm to obtain layout structure information specifically includes:
acquiring the page element distribution characteristics of the electronic document to be archived and standardizing them;
segmenting the electronic document page into a plurality of layout areas by adopting an image segmentation algorithm, according to the standardized page element distribution characteristics;
aggregating the mapping relations between the element distribution characteristics of the layout areas and the layout types, and determining the mapping relation between the element distribution characteristic sample space and the layout type sample space;
acquiring layout structure information according to the page element distribution characteristics of the electronic document to be archived, wherein the layout structure information comprises layout types and coordinate information;
wherein the image segmentation algorithm comprises at least one of Mask R-CNN, Fast R-CNN, and U-Net.
Specifically, different electronic documents often have different layout structures, and different layouts arrange text and other content differently, so they show different page element distribution characteristics. For example, the title is usually located in the top area of an electronic document, the signature block and date are usually located in the bottom area, and different content elements (e.g., text, pictures, charts, tables) show different element distribution characteristics on the page.
The page element distribution characteristics of the electronic document to be archived are acquired and standardized. The purpose of standardization is to convert the page element distribution characteristics into features that an algorithm can recognize. For example, the spatial position and size of the area occupied by a page element, together with the position and size of the specific content within that area, are standardized into an associative array, which identifies a specific element block that can be segmented and recognized. The electronic document is then segmented with the Mask R-CNN, Fast R-CNN, or U-Net algorithm, which divides it pixel by pixel into different layout parts. Layout areas are divided on the principle that each area has standardized, uniform element characteristics and content information. Different element distribution characteristics have different spatial positions (coordinate information) and layout content, and correspond to different layout types. By aggregating the mapping relations between the element distribution characteristics and the layout types of a large number of layout areas, the mapping relation between the element distribution characteristic sample space and the layout type sample space can be determined. The element distribution characteristics of the page to be archived are then analyzed according to this mapping relation to acquire the layout structure information, which includes a layout type and coordinate information.
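As an illustration of how a pixel-level segmentation result might be turned into layout structure information (layout type plus coordinates), the following minimal sketch collapses a per-pixel label grid into one normalized bounding box per layout type. The label ids, the `LayoutRegion` class, and the one-box-per-type simplification are assumptions made for this example; the application does not specify them, and the segmentation model itself is not shown.

```python
from dataclasses import dataclass

# Hypothetical label ids; the source lists these layout types but not their encoding.
LAYOUT_TYPES = {0: "background", 1: "title", 2: "text", 3: "picture",
                4: "table", 5: "header", 6: "footer"}


@dataclass(frozen=True)
class LayoutRegion:
    layout_type: str
    # Normalized coordinates in [0, 1]: (left, top, right, bottom).
    box: tuple


def regions_from_mask(mask):
    """Collapse a per-pixel label grid into one bounding box per layout type.

    Simplification: all pixels of one type are merged into a single box,
    so two separate tables on a page would share one region.
    """
    height, width = len(mask), len(mask[0])
    extents = {}  # label -> [min_x, min_y, max_x, max_y] in pixels
    for y, row in enumerate(mask):
        for x, label in enumerate(row):
            if label == 0:  # background is not reported as a region here
                continue
            e = extents.setdefault(label, [x, y, x, y])
            e[0], e[1] = min(e[0], x), min(e[1], y)
            e[2], e[3] = max(e[2], x), max(e[3], y)
    regions = []
    for label, (x0, y0, x1, y1) in sorted(extents.items()):
        # Dividing by page size standardizes coordinates across page resolutions.
        box = (x0 / width, y0 / height, (x1 + 1) / width, (y1 + 1) / height)
        regions.append(LayoutRegion(LAYOUT_TYPES[label], box))
    return regions


# Toy 4x4 "page": a title strip on top, body text in the lower half.
mask = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
]
layout = regions_from_mask(mask)
```

Normalizing by page width and height is one simple way to make coordinates comparable between scans of different resolutions; a real system would likely keep one region per connected instance rather than per type.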
Mask R-CNN, Fast R-CNN, and U-Net are all existing image segmentation algorithms whose specific methods are described in the relevant literature, so they are not described in detail here.
Further, in a preferred embodiment provided by the present application, the layout structure information includes a layout type and coordinate information, where the layout type includes at least one of a background, a title, a text, a picture, a table, a header, and a footer.
It is understood that electronic documents usually have layout settings such as a background, title, text, header, and footer, and contain elements such as text, pictures, and tables. Different layout areas and positions correspond to different layout types and have different element distribution characteristics. The electronic document is divided according to these elements, and the layout structure information can be acquired by recognizing the image information and position coordinates of the different layout areas. For example, the background usually covers the entire page area, the title is usually located at the top center of the page, the body text is usually relatively regular paragraphs, and a table has distinct, regular borders. By analyzing the background, title, text, pictures, tables, header, and footer with an image segmentation algorithm, the layout type and coordinate information of each area can be obtained.
S300: Archiving the electronic document to be archived according to the layout structure information.
It will be appreciated that different types of electronic documents to be archived typically have different layouts. For example, legal documents usually have a title, a body, and a signature block; official documents have a title, a receiving unit, a body, and a signature block; a personnel information registration form is usually a fixed-format table; and academic papers usually have the article title or chapter title in the header and footnotes at the bottom of the page. A preliminary classification of a document can be determined from its layout structure information, and the document is then archived according to that classification. For documents that cannot be classified directly from the layout structure information, the layout structure information can be processed further according to the archiving requirements and principles to obtain the specific contents of the relevant layout areas, and the documents are then archived accordingly.
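A preliminary classification from layout structure alone could, for instance, compare the set of layout types found on a page with an expected set per document class. The sketch below is only one possible reading: the signature table, the class names, and the use of Jaccard similarity are all assumptions for illustration and are not taken from the application.

```python
# Hypothetical signatures: the layout types each document class is expected to show.
LAYOUT_SIGNATURES = {
    "registration form": {"title", "table"},
    "academic paper": {"header", "title", "text", "footer"},
    "official document": {"title", "text"},
}


def preliminary_class(layout_types):
    """Return the best-matching class, or None when layout alone cannot decide."""
    present = set(layout_types) - {"background"}
    scored = []
    for name, signature in LAYOUT_SIGNATURES.items():
        # Jaccard similarity between observed and expected layout types.
        score = len(present & signature) / len(present | signature)
        scored.append((score, name))
    scored.sort(reverse=True)
    best, runner_up = scored[0], scored[1]
    if best[0] == 0 or best[0] == runner_up[0]:
        return None  # ambiguous: the document needs further processing
    return best[1]


form_class = preliminary_class(["background", "title", "table"])
letter_class = preliminary_class(["title", "text"])
unknown_class = preliminary_class(["picture"])
```

Returning `None` for ambiguous pages mirrors the text's distinction between documents that can be classified directly and those that need the contents of a layout area to be examined.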
Further, in a preferred embodiment provided in the present application, the archiving the electronic document to be archived according to the layout structure information further includes:
determining a first region of the layout according to the layout structure information;
performing OCR recognition on a first area of the layout to generate first classification information;
and archiving the electronic document to be archived according to the first classification information.
Specifically, according to the layout structure information and the archiving principle, the layout area that contains the specific information determining the archiving type of the electronic document is determined as the first area. For example, if electronic documents need to be archived by date, the signature block at the bottom of the page is typically determined as the first area of the layout. If electronic documents are to be archived by document type, the title at the top of the page is typically determined as the first area of the layout.
After the first area of the layout is determined, it is recognized with OCR to obtain the content of that area, and the first classification information is generated from this content. For example, when the signature block at the bottom of the page is the first area, the date extracted from it after recognition is the first classification information. Alternatively, when the title at the top of the page is the first area, the key fields of the title that relate to the document type (official documents such as notices and decisions, and legal documents such as civil judgments, criminal judgments, administrative judgments, and the like) are recognized and extracted to generate the first classification information.
According to the first classification information (date, document type, and so on), the electronic document to be archived is placed in the corresponding category or directory in line with the archiving principle.
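The selection of a first area followed by OCR on that area only can be sketched as follows. This is a hedged illustration: the region dictionaries, the `by_date` and `by_type` principle names, the date pattern, the keyword list, and the injected `ocr` callable are hypothetical stand-ins, and no real OCR engine is called.

```python
import re

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed date format
TYPE_KEYWORDS = ("notice", "decision", "judgment")  # illustrative only


def pick_first_region(regions, principle):
    """Choose the single region whose content decides the archiving category."""
    if principle == "by_type":
        titles = [r for r in regions if r["type"] == "title"]
        return min(titles, key=lambda r: r["box"][1], default=None)  # topmost title
    if principle == "by_date":
        texts = [r for r in regions if r["type"] == "text"]
        return max(texts, key=lambda r: r["box"][3], default=None)  # lowest text block
    raise ValueError(f"unknown archiving principle: {principle}")


def first_classification(regions, principle, ocr):
    region = pick_first_region(regions, principle)
    if region is None:
        return None
    text = ocr(region["box"])  # OCR runs on this one region, not the whole page
    if principle == "by_date":
        match = DATE_PATTERN.search(text)
        return match.group(0) if match else None
    lowered = text.lower()
    return next((kw for kw in TYPE_KEYWORDS if kw in lowered), None)


# Fake OCR backed by a lookup table, standing in for a real engine.
regions = [
    {"type": "title", "box": (0.2, 0.05, 0.8, 0.12)},
    {"type": "text", "box": (0.1, 0.2, 0.9, 0.7)},
    {"type": "text", "box": (0.5, 0.8, 0.9, 0.9)},
]
fake_text = {
    (0.2, 0.05, 0.8, 0.12): "Civil Judgment",
    (0.5, 0.8, 0.9, 0.9): "Signed 2020-12-30",
}


def fake_ocr(box):
    return fake_text.get(box, "")


doc_type = first_classification(regions, "by_type", fake_ocr)
doc_date = first_classification(regions, "by_date", fake_ocr)
```

Passing the OCR function in as a parameter keeps the sketch independent of any particular OCR library and makes the "one region only" cost saving explicit.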
Further, in a preferred embodiment provided by the present application, the first area of the layout is a title area.
It will be appreciated that the title area usually carries the most concentrated and essential information for archiving an electronic document. Therefore, by determining the title area as the first area of the layout and performing OCR recognition on it, the key information related to document classification can be acquired effectively, the first classification information can be generated, and the electronic document to be archived can be placed in the corresponding category or directory according to the first classification information.
Further, in a preferred embodiment provided in the present application, the archiving the electronic document to be archived according to the layout structure information further includes:
inputting the layout structure information into a file classifier to generate second classification information;
and archiving the electronic document to be archived according to the second classification information.
Specifically, a file classifier can be constructed with a machine learning algorithm to optimize the archiving of electronic documents. Classification is an important method in data mining. The idea is to learn a classification function or build a classification model from existing data; the model maps data records to one of a set of categories and can then be applied to prediction. "Classifier" is a general term for methods of classifying samples in data mining, and includes algorithms such as decision trees, logistic regression, naive Bayes, and neural networks.
Because different electronic documents to be archived have different layout structure information, a certain amount of layout structure information and the corresponding file categories is collected, and an algorithm is selected to construct and train a file classifier that determines the document category from the layout types and coordinate information. The layout structure information of the current electronic document to be archived is input into the file classifier, which returns its file category; second classification information is generated from that category, and the electronic document is placed in the corresponding category or directory according to the second classification information. For example, litigation usually involves identity documents such as identity cards and business licenses, and the front and back of an identity card are usually photocopied onto a single page, so such documents can be recognized quickly by the file classifier and placed in the corresponding category or directory.
Further, in a preferred embodiment provided by the present application, the inputting the layout structure information into a file classifier, and generating second classification information specifically includes:
determining the category of the electronic document, acquiring the layout structure information of the electronic document, and establishing a mapping relation between the layout structure information and the category of the electronic document;
aggregating the statistical mapping relation between the layout structure information and the category of the electronic document, and determining the mapping relation between the layout structure sample space and the category sample space of the electronic document;
constructing a file classifier according to a mapping relation between a layout structure sample space and an electronic document category sample space;
inputting the layout structure information of the electronic document to be archived into a file classifier to acquire the category of the electronic document to be archived;
generating second classification information according to the classification of the electronic document to be archived;
the file classifier is constructed and optimized through at least one of SVM, random forest and linear regression.
Specifically, a variety of electronic documents to be archived are collected and their document categories are labeled manually. The pages of each electronic document are analyzed with an image segmentation algorithm to obtain the layout types and coordinate information of the different areas, and a mapping relation between the layout structure information and the category of the electronic document is established. These mapping relations are aggregated: the layout types, coordinate information, and document categories are input into a classifier for training, which determines the mapping relation between the layout structure sample space and the electronic document category sample space. The classifier model is obtained by training on a certain amount of data, for example 1000 batches. During use, the classifier is continuously adjusted and optimized according to its results in specific cases, which improves its accuracy.
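To make the training step concrete, the sketch below encodes layout structure information as a fixed-length feature vector and fits a very small classifier on labeled samples. Note the substitution: the text names SVM, random forest, and linear regression, whereas this example uses a nearest-centroid classifier purely so that it runs with the standard library alone. The feature encoding (total normalized area per layout type), the sample data, and the category names are likewise assumptions.

```python
from collections import defaultdict

LAYOUT_TYPES = ["title", "text", "picture", "table", "header", "footer"]


def encode(regions):
    """Encode (layout_type, box) pairs as the total normalized area per layout type."""
    areas = dict.fromkeys(LAYOUT_TYPES, 0.0)
    for layout_type, (x0, y0, x1, y1) in regions:
        if layout_type in areas:
            areas[layout_type] += (x1 - x0) * (y1 - y0)
    return [areas[t] for t in LAYOUT_TYPES]


def train(samples):
    """samples: list of (regions, category). Returns one mean vector per category."""
    grouped = defaultdict(list)
    for regions, category in samples:
        grouped[category].append(encode(regions))
    return {category: [sum(col) / len(vectors) for col in zip(*vectors)]
            for category, vectors in grouped.items()}


def predict(centroids, regions):
    """Return the category whose centroid is closest to the encoded page."""
    vector = encode(regions)

    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(vector, center))

    return min(centroids, key=lambda category: squared_distance(centroids[category]))


# Manually labeled toy samples (hypothetical).
samples = [
    ([("title", (0.2, 0.0, 0.8, 0.1)), ("table", (0.1, 0.2, 0.9, 0.9))], "registration form"),
    ([("title", (0.2, 0.0, 0.8, 0.1)), ("text", (0.1, 0.2, 0.9, 0.9))], "judgment"),
    ([("picture", (0.1, 0.1, 0.9, 0.45)), ("picture", (0.1, 0.55, 0.9, 0.9))], "identity document"),
]
centroids = train(samples)

# A page with two stacked pictures, like a photocopy of both sides of an identity card.
new_page = [("picture", (0.15, 0.1, 0.85, 0.4)), ("picture", (0.15, 0.6, 0.85, 0.9))]
predicted = predict(centroids, new_page)
```

The same `encode` output could be passed to an SVM or random forest implementation instead; only the fitting and prediction functions would change.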
SVM, random forest, and linear regression are all existing general-purpose machine learning algorithms whose specific methods are described in the relevant literature, so they are not described in detail here.
Further, in a preferred embodiment provided in the present application, the electronic document filing method further includes:
when the acquisition of the layout structure information fails, performing OCR (optical character recognition) on an electronic document to be archived to generate a first intermediate document;
determining third classification information of the first intermediate document according to the first intermediate document;
and archiving the electronic document to be archived according to the third classification information.
Specifically, when the electronic document to be archived has no layout structure that is easy to segment, or when the segmentation algorithm cannot obtain layout structure information, OCR recognition is performed on the electronic document to obtain the content of the corresponding page. A first intermediate document is generated from the OCR result, third classification information is generated from the key elements in the content of the first intermediate document, and the electronic document to be archived is placed in the corresponding category or directory according to the third classification information.
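The fallback path can be sketched as a small control-flow wrapper. Everything passed in here (`segment`, `ocr_full_page`, `classify_by_layout`, and the keyword `rules`) is a hypothetical callable or table supplied by the caller; the sketch only shows the order of operations, assuming that a failed segmentation is signalled by an empty result.

```python
def archive_category(page, segment, ocr_full_page, classify_by_layout, rules):
    """Classify by layout when possible, otherwise fall back to full-page OCR."""
    layout = segment(page)
    if layout:  # layout structure information was obtained
        return classify_by_layout(layout)
    # Fallback: OCR the whole page to build the "first intermediate document".
    intermediate = ocr_full_page(page)
    for keyword, category in rules:
        if keyword in intermediate:
            return category  # plays the role of the third classification information
    return "unclassified"


# Stubs standing in for real components.
handwritten = archive_category(
    "scan-1",
    segment=lambda page: [],
    ocr_full_page=lambda page: "IOU: borrowed 500 yuan",
    classify_by_layout=lambda layout: "judgment",
    rules=[("IOU", "evidence")],
)
structured = archive_category(
    "scan-2",
    segment=lambda page: [("title", (0.2, 0.0, 0.8, 0.1))],
    ocr_full_page=lambda page: "",
    classify_by_layout=lambda layout: "judgment",
    rules=[],
)
```

Full-page OCR is reached only when segmentation returns nothing, which keeps the expensive step off the common path.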
Further, in a preferred embodiment provided herein, the method is used for archiving judicial case files.
It can be understood that, with the development of judicial informatization, the number of judicial case files is growing geometrically. Court case file materials come from multiple sources, are heterogeneous, and are massive in volume, while their information categories are standardized and consistent.
The electronic document filing method can be used for archiving judicial case files. The judicial case files to be archived are received and analyzed with an image segmentation algorithm to obtain layout structure information, and are then placed in the corresponding category or directory according to the layout structure information and the archiving principle, for example by document category such as complaints, judgments, court hearing transcripts, and evidence materials. Different documents have different layout structures. For example, a complaint typically consists of separate layout areas for the title, the claims, the facts and reasons, the applicable statutes, and so on; a judgment consists of the title, case number, plaintiff and defendant information, the main text of the judgment, and the signature block; and evidence materials usually include an evidence catalogue table and the various items of evidence. Because the layout structures differ, they can be obtained directly with an image segmentation algorithm, so that the document category can be determined and the document archived. In addition, archiving efficiency can be further improved by constructing a classifier with a machine learning algorithm as described above.
For a judgment document, the title part can be determined as the first area of the layout. The title area is recognized with OCR to obtain details such as the case type (civil, criminal, or administrative) and the trial level (determined from information such as the appellant or appellee). Classification information is generated from this information, and the judicial case files are archived accordingly.
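As a rough illustration of extracting the case type and an appeal indicator from recognized text, the sketch below matches a few Chinese keywords that commonly appear in court documents (民事 civil, 刑事 criminal, 行政 administrative; 上诉人 appellant, 被上诉人 appellee). The keyword tables, the function name, and the returned dictionary layout are assumptions for this example, not the rules used in the application.

```python
# Illustrative keyword tables; a production rule set would be far more complete.
CASE_TYPES = {"民事": "civil", "刑事": "criminal", "行政": "administrative"}
APPEAL_MARKERS = ("被上诉人", "上诉人")  # appellee, appellant


def classify_judgment(recognized_text):
    """Derive classification information from OCR text of a judgment document."""
    case_type = next(
        (name for keyword, name in CASE_TYPES.items() if keyword in recognized_text),
        None,
    )
    # The presence of an appellant or appellee suggests an appeal-level document.
    is_appeal = any(marker in recognized_text for marker in APPEAL_MARKERS)
    return {"case_type": case_type, "is_appeal": is_appeal}


appeal_info = classify_judgment("民事判决书 上诉人张三 被上诉人李四")
first_info = classify_judgment("刑事判决书")
```

The result could then be mapped to a directory such as one per case type, with appeal documents kept apart from first-instance ones.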
For electronic documents whose layout information cannot be obtained, such as handwritten certificates, scanned IOUs and other related evidence materials, the content of the document can be obtained through OCR recognition and an intermediate document generated from it. Key information is extracted from the content of the intermediate document, classification information is determined by matching classification rules, and the electronic document to be archived is then classified into the corresponding category or directory according to the classification information.
The present application further provides an electronic document filing apparatus 100, comprising:
a receiving module 11, configured to receive an electronic document to be archived;
an image segmentation module 12, configured to analyze the electronic document to be archived by using an image segmentation algorithm to acquire layout structure information;
and a document filing module 13, configured to archive the electronic document to be archived according to the layout structure information.
The electronic document filing apparatus 100 provided by the present application can archive electronic documents automatically. The electronic documents to be archived may be of many kinds, including official documents, judgment documents, administrative decisions, personnel information registration forms, hospital medical records, academic papers and other electronic documents that need to be archived.
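The cooperation of the three modules can be sketched as follows. This is an illustrative outline under stated assumptions: the class `FilingApparatus`, the `LayoutRegion` record and the two callables are hypothetical names, and the segmentation and filing logic are injected as plain functions so that any concrete algorithm can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class LayoutRegion:
    category: str                    # layout category, e.g. "title"
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


# Hypothetical signatures for module 12 and module 13.
Segmenter = Callable[[bytes], List[LayoutRegion]]
Filer = Callable[[bytes, List[LayoutRegion]], str]


class FilingApparatus:
    """Sketch of apparatus 100: receive, segment, then file."""

    def __init__(self, segment: Segmenter, file_document: Filer) -> None:
        self.segment = segment              # image segmentation module 12
        self.file_document = file_document  # document filing module 13
        self.archive: Dict[str, List[str]] = {}

    def receive(self, name: str, page_image: bytes) -> str:
        """Receiving module 11: accept a document and run the pipeline."""
        regions = self.segment(page_image)
        category = self.file_document(page_image, regions)
        # Record the document under its category (directory).
        self.archive.setdefault(category, []).append(name)
        return category
```

Injecting the two callables keeps the sketch independent of any particular segmentation model or classification rule.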
Image segmentation is an important research direction in the field of computer vision and an important part of semantic image understanding. It refers to the process of dividing an image into several regions with similar properties; from a mathematical point of view, it is the process of dividing an image into mutually disjoint regions. In recent years, with deep learning, image segmentation technology has advanced dramatically. Through different image segmentation algorithms, an electronic document can be segmented according to its elements and their distribution. Electronic documents to be archived usually have a certain layout structure. For example, form data generated by online filling usually follows certain standards. For another example, official documents of government agencies typically have strict formatting specifications. Academic papers and other documents likewise follow format standards. In addition, documents produced in specific fields such as medical care, the judiciary and law enforcement have fixed formats. Therefore, the image segmentation module 12 can be used to analyze the layout structure of the page image and obtain layout structure information.
It will be appreciated that different types of electronic documents to be archived typically have different layouts. For example, general documents usually have a title, a body and a signature block; official documents have a title, a recipient unit, a body and a signature block; a staff information registration form is usually a fixed-format table; and academic papers usually have the article or chapter title in the header and footnotes at the bottom of the page. From the layout structure information, a preliminary classification of a document can be determined and the document archived accordingly. For documents that cannot be classified directly from the layout structure information, the layout structure information can be processed further according to the archiving requirement and principle to obtain the specific content of the relevant layout area, after which the documents are archived.
Further, in a preferred embodiment provided in the present application, analyzing the electronic document to be archived with an image segmentation algorithm to acquire layout structure information specifically includes:
acquiring the page element distribution characteristics of the electronic document to be archived and standardizing them;
segmenting the electronic document page into a plurality of layout areas by using an image segmentation algorithm, according to the standardized page element distribution characteristics;
aggregating the mapping relations between the element distribution characteristics of the layout areas and the layout categories, and determining the mapping relation between the element distribution characteristic sample space and the layout category sample space;
acquiring layout structure information according to the page element distribution characteristics of the electronic document to be archived, wherein the layout structure information comprises layout categories and coordinate information;
wherein the image segmentation algorithm comprises at least one of maskrcnn, fastrcnn and u-net.
Specifically, different electronic documents often have different layout structures, and different layouts carry different text or other content, so that they show different page element distribution characteristics. For example, the title is usually located in the top area of the electronic document, the signature block and the date are usually located in the bottom area, and different content elements (such as text, pictures, charts and tables) show different distribution characteristics on the page.
The page element distribution characteristics of the electronic document to be archived are acquired and standardized. The purpose of standardization is to convert the page element distribution characteristics into features that an algorithm can recognize. For example, information such as the spatial position and size of the area occupied by a page element, and the position and size of the specific content within that area, is standardized into an associative array, which is treated as a specific element block that can be segmented and identified. The electronic document is then segmented with the maskrcnn/fastrcnn/u-net algorithm, which divides the page into different layout parts pixel by pixel. Layout areas are divided on the principle that each area has standardized, uniform element characteristics and content. Different element distribution characteristics have different spatial positions (coordinate information) and layout content, and correspond to different layout categories. By aggregating the mapping relations between the element distribution characteristics of a large number of layout areas and their layout categories, the mapping relation between the element distribution characteristic sample space and the layout category sample space can be determined. The element distribution characteristics of the page to be archived are then analyzed according to this mapping relation to obtain layout structure information, which includes the layout category and coordinate information.
maskrcnn, fastrcnn and u-net are all existing image segmentation tools whose specific methods are elaborated in the relevant literature and are not described in detail herein.
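One possible reading of the standardization step is sketched below: a region's pixel coordinates are converted into page-relative values and returned as an associative array. The function name `normalize_box` and the keys `cx`, `cy`, `w` and `h` are hypothetical; the text does not specify which features are used.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def normalize_box(box: Box, page_width: int, page_height: int) -> Dict[str, float]:
    """Convert a pixel box into page-relative position and size in [0, 1]."""
    left, top, right, bottom = box
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    if not (0 <= left < right <= page_width and 0 <= top < bottom <= page_height):
        raise ValueError("box must lie inside the page and have positive area")
    return {
        "cx": (left + right) / 2 / page_width,   # horizontal centre
        "cy": (top + bottom) / 2 / page_height,  # vertical centre
        "w": (right - left) / page_width,        # relative width
        "h": (bottom - top) / page_height,       # relative height
    }
```

Page-relative values make regions from scans of different resolutions comparable, which is one reason a standardization step of this kind is useful before aggregating samples.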
Further, in a preferred embodiment provided by the present application, the layout structure information includes a layout category and coordinate information, and the layout category includes at least one of a background, a title, a text, a picture, a table, a header, and a footer.
It is understood that electronic documents usually have layout elements such as a background, title, body text, header and footer, and contain elements such as text, pictures and tables. Different layout areas and positions correspond to different layout categories and have different element distribution characteristics. The electronic document is divided according to these elements, and the layout structure information is acquired by recognizing the image information and position coordinates of the different layout areas. For example, the background typically covers the entire page area, the title is typically centered at the top of the document, the body is typically relatively regular paragraphs of text, and a table has distinct regular borders. By analyzing the background, title, text, pictures, tables, header and footer with an image segmentation algorithm, the layout categories and coordinate information can be obtained.
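How a pixel-level segmentation result turns into layout categories plus coordinates can be sketched as follows. The sketch assumes the segmentation model has already produced a per-pixel label map (a 2-D list of category indices); the function name `layout_from_label_map` is hypothetical, and merging all pixels of one category into a single bounding box is a simplification made for the example.

```python
from typing import Dict, List, Tuple

# Layout categories named in the text; the index order is an assumption.
CATEGORIES = ["background", "title", "text", "picture", "table", "header", "footer"]

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive pixels


def layout_from_label_map(label_map: List[List[int]]) -> Dict[str, Box]:
    """Return one bounding box per layout category found in the label map."""
    boxes: Dict[str, Box] = {}
    for y, row in enumerate(label_map):
        for x, label in enumerate(row):
            name = CATEGORIES[label]
            if name in boxes:
                left, top, right, bottom = boxes[name]
                # Grow the existing box to include this pixel.
                boxes[name] = (min(left, x), min(top, y), max(right, x), max(bottom, y))
            else:
                boxes[name] = (x, y, x, y)
    return boxes
```

A production system would more likely keep one box per connected region, so that two separate tables on a page are not merged into one.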
Further, in a preferred embodiment provided in the present application, the document filing module 13, which archives the electronic document to be archived according to the layout structure information, is specifically configured for:
determining a first region of the layout according to the layout structure information;
performing OCR recognition on a first area of the layout to generate first classification information;
and archiving the electronic document to be archived according to the first classification information.
Specifically, according to the layout structure information and the archiving principle, the layout area containing the specific information that determines how the electronic document is to be archived is determined as the first area. For example, if electronic documents are to be archived by date, the signature block at the bottom of the page is typically determined as the first area of the layout. If electronic documents are to be archived by document type, the title at the top of the page is typically determined as the first area of the layout.
After the first area of the layout is determined, it is recognized with OCR to obtain the content of that area, and first classification information is generated from the content. For example, when the signature block at the bottom of the page is the first area, the date extracted from it after recognition is the first classification information. Alternatively, when the title at the top of the page is the first area, the key fields of the title that relate to the document type (official documents such as notices and decisions, and judgment documents such as civil judgments, criminal judgments, administrative judgments and the like) are extracted after recognition to generate the first classification information.
According to the first classification information (date, document type and the like), the electronic document to be archived is classified into the corresponding category or directory following the archiving principle.
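The selection of the first area and the generation of first classification information can be sketched as follows. Several points are assumptions made for illustration: the signature block is taken to be the lowest "text" region, dates are assumed to appear in `YYYY-MM-DD` form, the keyword table `DOC_TYPE_KEYWORDS` is hypothetical, and OCR output is passed in as a string rather than produced by a real OCR engine.

```python
import re
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
Region = Tuple[str, Box]          # (layout category, coordinates)

DOC_TYPE_KEYWORDS = ("notice", "decision", "judgment")  # hypothetical table
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")     # assumed date format


def pick_first_region(regions: List[Region], principle: str) -> Optional[Region]:
    """Choose the layout area that carries the information needed for filing."""
    if principle == "type":
        # Archive by document type: use the topmost title region.
        candidates = [r for r in regions if r[0] == "title"]
        return min(candidates, key=lambda r: r[1][1]) if candidates else None
    if principle == "date":
        # Archive by date: assume the lowest text region is the signature block.
        candidates = [r for r in regions if r[0] == "text"]
        return max(candidates, key=lambda r: r[1][1]) if candidates else None
    raise ValueError(f"unknown archiving principle: {principle}")


def first_classification(ocr_text: str, principle: str) -> Optional[str]:
    """Turn the OCR text of the first area into first classification information."""
    if principle == "date":
        match = DATE_PATTERN.search(ocr_text)
        return match.group(0) if match else None
    lowered = ocr_text.lower()
    return next((k for k in DOC_TYPE_KEYWORDS if k in lowered), None)
```

In use, the box returned by `pick_first_region` would be cropped from the page image and passed to an OCR engine, and the resulting text handed to `first_classification`.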
Further, in a preferred embodiment provided by the present application, the first area of the layout is a title area.
It will be appreciated that the title area typically carries the most concentrated and essential information for archiving an electronic document. Therefore, by determining the title area as the first area of the layout and performing OCR recognition on it, the key information related to document classification can be acquired effectively, first classification information is generated, and the electronic document to be archived is classified into the corresponding category or directory according to the first classification information.
Further, in a preferred embodiment provided in the present application, the document filing module 13, which archives the electronic document to be archived according to the layout structure information, is further configured for:
inputting the layout structure information into a file classifier to generate second classification information;
and archiving the electronic document to be archived according to the second classification information.
Specifically, a file classifier can be constructed through a machine learning algorithm to optimize the archiving of electronic documents. Classification is a very important method in data mining. The idea is to learn a classification function or construct a classification model from existing data; the function or model maps data records to one of a set of given categories and can therefore be applied to prediction. "Classifier" is a general term for methods that classify samples in data mining, and covers algorithms such as decision trees, logistic regression, naive Bayes and neural networks.
Because different electronic documents to be archived have different layout structure information, a certain amount of layout structure information and the corresponding document categories is collected, and an algorithm is selected to construct and train a file classifier that determines the document category from the layout categories and coordinate information. The layout structure information of the current electronic document to be archived is input into the file classifier to obtain its document category, second classification information is generated from that category, and the electronic document is classified into the corresponding category or directory according to the second classification information. For example, litigation requires identity documents such as identity cards and business licenses; an identity card is usually photocopied with its front and back on a single page, so identity documents can be recognized quickly by the file classifier and classified into the corresponding category or directory.
Further, in a preferred embodiment provided in the present application, the inputting the layout structure information into a file classifier to generate second classification information specifically includes:
determining the category of the electronic document, acquiring the layout structure information of the electronic document, and establishing a mapping relation between the layout structure information and the category of the electronic document;
aggregating the statistical mapping relation between the layout structure information and the category of the electronic document, and determining the mapping relation between the layout structure sample space and the category sample space of the electronic document;
constructing a file classifier according to a mapping relation between a layout structure sample space and an electronic document category sample space;
inputting the layout structure information of the electronic document to be archived into a file classifier to acquire the category of the electronic document to be archived;
generating second classification information according to the classification of the electronic document to be archived;
the file classifier is constructed and optimized through at least one of SVM, random forest and linear regression.
Specifically, different electronic documents to be archived are collected and their document categories are labelled manually. The page of each electronic document is analyzed with an image segmentation algorithm to obtain the layout categories and coordinate information of its different areas, and a mapping relation between the layout structure information and the category of the electronic document is established. The mapping relations between layout structure information and document categories are aggregated, and the layout categories, coordinate information and document categories are input into the classifier for training, which determines the mapping relation between the layout structure sample space and the electronic document category sample space. The classifier model is obtained by training on a certain amount of data, for example 1000 batches. During use, the output of the classifier is continuously adjusted and optimized according to specific conditions to improve its accuracy.
SVM, random forest and linear regression are all existing general-purpose machine learning algorithms whose specific methods are elaborated in the relevant literature and are not described in detail herein.
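The train-then-predict flow can be illustrated with a deliberately simple stand-in. The sketch below uses a nearest-centroid classifier written in plain Python, not the SVM or random forest named in the text, so that it runs without third-party libraries. The feature choice (share of page area per layout category), the class name `NearestCentroidClassifier` and the function `featurize` are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

CATEGORIES = ("background", "title", "text", "picture", "table", "header", "footer")
# A region is (layout category, (left, top, right, bottom)) in page-relative units.
Region = Tuple[str, Tuple[float, float, float, float]]


def featurize(regions: Sequence[Region]) -> List[float]:
    """Share of the page area covered by each layout category."""
    areas = dict.fromkeys(CATEGORIES, 0.0)
    for category, (left, top, right, bottom) in regions:
        areas[category] += max(0.0, right - left) * max(0.0, bottom - top)
    return [areas[c] for c in CATEGORIES]


class NearestCentroidClassifier:
    """Stand-in file classifier: one mean feature vector per document category."""

    def fit(self, samples: Sequence[Sequence[Region]], labels: Sequence[str]):
        sums: Dict[str, List[float]] = defaultdict(lambda: [0.0] * len(CATEGORIES))
        counts: Dict[str, int] = defaultdict(int)
        for regions, label in zip(samples, labels):
            for i, value in enumerate(featurize(regions)):
                sums[label][i] += value
            counts[label] += 1
        self.centroids = {
            label: [v / counts[label] for v in vec] for label, vec in sums.items()
        }
        return self

    def predict(self, regions: Sequence[Region]) -> str:
        x = featurize(regions)
        # Pick the category whose centroid is closest in squared Euclidean distance.
        return min(
            self.centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, self.centroids[lab])),
        )
```

The predicted label plays the role of the second classification information; swapping in an SVM or random forest would change only the `fit` and `predict` internals, not the surrounding flow.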
Further, in a preferred embodiment provided in the present application, the document filing module 13, which archives the electronic document to be archived according to the layout structure information, is further configured for:
when acquisition of the layout structure information fails, performing OCR recognition on the electronic document to be archived to generate a first intermediate document;
determining third classification information of the first intermediate document according to the first intermediate document;
and archiving the electronic document to be archived according to the third classification information.
Specifically, when the electronic document to be archived has no layout structure that is easy to segment, or its layout structure information cannot be acquired through the segmentation algorithm, OCR recognition is performed on the electronic document to obtain the content of the corresponding page. A first intermediate document is generated from the OCR result, third classification information is generated from the key elements in the content of the first intermediate document, and the electronic document to be archived is classified into the corresponding category or directory according to the third classification information.
Further, in a preferred embodiment provided by the present application, the electronic document filing apparatus is used for filing judicial case files.
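The fallback path can be sketched as follows. The sketch assumes that a failed segmentation is signalled by an empty region list, that the OCR engine is supplied as a callable returning text, and that classification rules are simple keyword tables; the names `RULES`, `third_classification`, `archive` and the `"unclassified"` directory are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical classification rules: category -> keywords to look for.
RULES: Dict[str, Sequence[str]] = {
    "loan note": ("iou", "borrowed", "repay"),
    "certificate": ("hereby certify", "certificate"),
}


def third_classification(
    intermediate_text: str, rules: Dict[str, Sequence[str]] = RULES
) -> Optional[str]:
    """Match the intermediate document against keyword rules."""
    text = intermediate_text.lower()
    scores = {cat: sum(text.count(k) for k in kws) for cat, kws in rules.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def archive(
    page: bytes,
    segment: Callable[[bytes], List[object]],
    ocr: Callable[[bytes], str],
    classify_by_layout: Callable[[List[object]], str],
) -> str:
    """Use layout-based filing when possible, otherwise fall back to full-page OCR."""
    regions = segment(page)
    if regions:
        return classify_by_layout(regions)
    # Segmentation produced nothing: OCR the whole page into an intermediate document.
    intermediate = ocr(page)
    return third_classification(intermediate) or "unclassified"
```

Counting keyword hits is only one way to match classification rules; the text leaves the rule format open.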
It can be understood that, with the development of judicial informatization, the number of judicial case files is growing geometrically. Court case-file materials are multi-source, heterogeneous and massive in volume, while their information categories are standardized and consistent.
The electronic document filing device can be used for filing judicial case files. The judicial case file to be archived is received and analyzed with an image segmentation algorithm to acquire layout structure information, and is then classified into a corresponding category or directory according to the layout structure information and the archiving principle. For example, files may be archived by document category, such as complaints, judgments, court trial transcripts and evidence materials. Different documents have different layout structures. For example, a complaint typically consists of separate regions for the title, the litigation requests, the facts and reasons, the statutes, and so forth; a judgment consists of the title, the case number, the plaintiff and defendant information, the main text of the judgment and the signature block; and evidence materials usually include an evidence catalogue table and the various pieces of evidence. Because these layout structures can be obtained directly through an image segmentation algorithm, the category of a document can be determined and the document filed accordingly. In addition, filing efficiency can be further improved by constructing a classifier with a machine learning algorithm.
For a judgment document, the title part can be determined as the first region of the layout. The title region is recognized through OCR (optical character recognition) to obtain detailed information such as the case type (civil, criminal or administrative) and the trial level (judged from information such as the appellant or the appellee). Classification information is generated from this information, and the judicial case file is archived according to the classification information.
For electronic documents whose layout information cannot be obtained, such as handwritten certificates, scanned IOUs and other related evidence materials, the content of the document can be obtained through OCR recognition and an intermediate document generated from it. Key information is extracted from the content of the intermediate document, classification information is determined by matching classification rules, and the electronic document to be archived is then classified into the corresponding category or directory according to the classification information.
In a typical configuration, a computer may include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus comprising the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (5)

1. An electronic document archiving method, comprising:
receiving an electronic document to be archived;
adopting an image segmentation algorithm to analyze an electronic document to be filed and acquiring layout structure information, specifically comprising:
acquiring the distribution characteristics of the page elements of the electronic document to be filed and carrying out standardized processing,
according to the document page element distribution characteristics of standardized processing, adopting image segmentation algorithm to segment the electronic document page into several layout areas,
aggregating the mapping relation between the element distribution characteristics of the layout areas and the layout types, determining the mapping relation between the element distribution characteristic sample space and the layout type sample space,
acquiring layout structure information according to the distribution characteristics of page elements of the electronic document to be filed, wherein the layout structure information comprises layout categories and coordinate information,
wherein the image segmentation algorithm comprises at least one of maskrcnn and u-net,
the layout structure information comprises a layout type and coordinate information, wherein the layout type comprises at least one of a background, a title, a text, a picture, a table, a header and a footer;
according to the layout structure information, the electronic document to be archived is archived, which comprises the following steps:
determining a first region of the layout according to the layout structure information,
performing OCR recognition on the first area of the layout to generate first classification information,
archiving the electronic document to be archived according to the first classification information,
the first area of the layout is a title area;
when the layout structure information is failed to be acquired, performing OCR recognition on the electronic document to be archived to generate a first intermediate document, determining third classification information of the first intermediate document according to the first intermediate document, and archiving the electronic document to be archived according to the third classification information.
2. The electronic document filing method according to claim 1, wherein the electronic document to be filed is filed in accordance with the layout structure information, further comprising:
inputting the layout structure information into a file classifier to generate second classification information;
and archiving the electronic document to be archived according to the second classification information.
3. The method of claim 2, wherein inputting the layout structure information into a file classifier to generate second classification information comprises:
determining the category of the electronic document, acquiring the layout structure information of the electronic document, and establishing a mapping relation between the layout structure information and the category of the electronic document;
aggregating the statistical mapping relation between the layout structure information and the category of the electronic document, and determining the mapping relation between the layout structure sample space and the category sample space of the electronic document;
constructing a file classifier according to a mapping relation between a layout structure sample space and an electronic document category sample space;
inputting the layout structure information of the electronic document to be archived into a file classifier to obtain the category of the electronic document to be archived;
generating second classification information according to the classification of the electronic document to be archived;
the file classifier is constructed and optimized through at least one of SVM, random forest and linear regression.
4. The electronic document filing method according to claim 1, wherein the method is used for filing a judicial portfolio.
5. An electronic document filing apparatus, comprising:
the receiving module is used for receiving the electronic document to be archived;
the image segmentation module is used for analyzing the electronic document to be filed by adopting an image segmentation algorithm to acquire layout structure information, and is specifically used for:
acquiring the distribution characteristics of the page elements of the electronic document to be filed and carrying out standardized processing,
according to the document page element distribution characteristics of standardized processing, adopting image segmentation algorithm to segment the electronic document page into several layout areas,
aggregating the mapping relation between the element distribution characteristics of the layout areas and the layout types, determining the mapping relation between the element distribution characteristic sample space and the layout type sample space,
acquiring layout structure information according to the distribution characteristics of page elements of the electronic document to be filed, wherein the layout structure information comprises layout categories and coordinate information,
wherein the image segmentation algorithm comprises at least one of maskrcnn and u-net,
the layout structure information comprises layout categories and coordinate information, wherein the layout categories comprise at least one of background, title, text, picture, table, header and footer;
the document filing module is used for filing the electronic document to be filed according to the layout structure information, and is specifically used for:
determining a first region of the layout according to the layout structure information,
performing OCR recognition on a first area of the layout to generate first classification information,
archiving the electronic document to be archived according to the first classification information,
the first area of the layout is a title area;
when the layout structure information is failed to be acquired, performing OCR recognition on the electronic document to be archived to generate a first intermediate document, determining third classification information of the first intermediate document according to the first intermediate document, and archiving the electronic document to be archived according to the third classification information.
CN202011619714.2A 2020-12-31 2020-12-31 Electronic document filing method and device Active CN112733658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011619714.2A CN112733658B (en) 2020-12-31 2020-12-31 Electronic document filing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011619714.2A CN112733658B (en) 2020-12-31 2020-12-31 Electronic document filing method and device

Publications (2)

Publication Number Publication Date
CN112733658A CN112733658A (en) 2021-04-30
CN112733658B true CN112733658B (en) 2022-11-25

Family

ID=75608135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011619714.2A Active CN112733658B (en) 2020-12-31 2020-12-31 Electronic document filing method and device

Country Status (1)

Country Link
CN (1) CN112733658B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204610A (en) * 2021-05-06 2021-08-03 广东博维创远科技有限公司 Automatic cataloguing method based on criminal case electronic file and computer readable storage device
CN113688872A (en) * 2021-07-28 2021-11-23 达观数据(苏州)有限公司 Document layout classification method based on multi-mode fusion
CN114241501B (en) * 2021-12-20 2023-03-10 北京中科睿见科技有限公司 Image document processing method and device and electronic equipment
CN115422125B (en) * 2022-09-29 2023-05-19 浙江星汉信息技术股份有限公司 Electronic document automatic archiving method and system based on intelligent algorithm
CN116758561A (en) * 2023-08-16 2023-09-15 湖北微模式科技发展有限公司 Document image classification method and device based on multi-mode structured information fusion

Citations (4)

Publication number Priority date Publication date Assignee Title
JPH05342408A (en) * 1991-04-04 1993-12-24 Fuji Xerox Co Ltd Document image filing device
JP2002014981A (en) * 2000-06-29 2002-01-18 Mitsubishi Electric Corp Document filing device
JP2002312385A (en) * 2001-04-18 2002-10-25 Mitsubishi Electric Corp Document automated dividing device
CN101226595A (en) * 2007-01-15 2008-07-23 夏普株式会社 Document image processing apparatus and document image processing process

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
JP2788804B2 (en) * 1991-09-17 1998-08-20 日本電気アイシーマイコンシステム株式会社 Element region extraction method
CN105512197A (en) * 2015-11-27 2016-04-20 广州宝钢南方贸易有限公司 Digitized archiving device of documents and archiving and searching device thereof
US11016035B2 (en) * 2017-09-18 2021-05-25 Elite Semiconductor Inc. Smart defect calibration system and the method thereof
CN107908745A (en) * 2017-11-16 2018-04-13 理光图像技术(上海)有限公司 Masses of Document scanning collating unit, method, medium and equipment
CN109344815B (en) * 2018-12-13 2021-08-13 深源恒际科技有限公司 Document image classification method
CN111046784B (en) * 2019-12-09 2024-02-20 科大讯飞股份有限公司 Document layout analysis and identification method and device, electronic equipment and storage medium
CN112052749A (en) * 2020-08-20 2020-12-08 中国建设银行股份有限公司 Archive filing method and device, electronic equipment and computer readable storage medium

Non-Patent Citations (1)

Title
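Discussion on the Functions of an Electronic Archiving System for Administrative Documents; Liu Shujun (刘淑君); 《兰台内外》; 2020-01-28 (No. 03); pp. 39-40 *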

Also Published As

Publication number Publication date
CN112733658A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
CN112733658B (en) Electronic document filing method and device
Kavasidis et al. A saliency-based convolutional neural network for table and chart detection in digitized documents
Siersdorfer et al. Analyzing and predicting sentiment of images on the social web
US8260062B2 (en) System and method for identifying document genres
CA3117374C (en) Sensitive data detection and replacement
Joseph Effect of supervised learning methodologies in offline handwritten Thai character recognition
Clinchant et al. Comparing machine learning approaches for table recognition in historical register books
Karaa et al. Mining multimedia documents
Tian et al. Image classification based on the combination of text features and visual features
Salih et al. An effective bi-layer content-based image retrieval technique
Lee Machine learning, template matching, and the International Tracing Service digital archive: Automating the retrieval of death certificate reference cards from 40 million document scans
Kalaiarasi et al. Clustering of near duplicate images using bundled features
Salih et al. Two-layer content-based image retrieval technique for improving effectiveness
Yu et al. Connecting people in photo-sharing sites by photo content and user annotations
CN115880702A (en) Data processing method, device, equipment, program product and storage medium
Singhal et al. Gaussian local ternary co-occurrence pattern for image retrieval
Kalaiarasi et al. Visual content based clustering of near duplicate web search images
Mehri Historical document image analysis: a structural approach based on texture
Vajda et al. Large image modality labeling initiative using semi-supervised and optimized clustering
Abkrakhmanov et al. A Novel 2D Deep Convolutional Neural Network for Multimodal Document Categorization
Hong et al. Information Extraction and Analysis on Certificates and Medical Receipts
Qiu et al. Evaluation of Generative AI Q&A Chatbot Chained to Optical Character Recognition Models for Financial Documents
John Bosco et al. Improved similar images retrieval: Dynamic multi-feature of fusion a method with texture features
Rekathati Curating news sections in a historical Swedish news corpus
CN115730074A (en) File classification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information
Inventor after: He Min|Zhao Yue|Zhu Xiangyu|Wang Yajing|Liu Ming
Inventor before: He Min|Zhao Yue|Zhu Xiangyu|Huang Fulin|Liu Ming
CI03 Correction of invention patent
Correction item: Inventor
Correct: He Min|Zhao Yue|Zhu Xiangyu|Huang Fulin|Liu Ming
Incorrect: He Min|Zhao Yue|Zhu Xiangyu|Wang Yajing|Liu Ming
Number: 12-01
Volume: 39