CN112733658B - Electronic document filing method and device

Info

Publication number: CN112733658B
Application number: CN202011619714.2A
Authority: CN
Inventors: 贺敏; 赵岳; 朱相宇; 黄福林; 刘明
Original assignee: Beijing Thunisoft Information Technology Co ltd
Current assignee: Beijing Thunisoft Information Technology Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2022-11-25
Anticipated expiration: 2040-12-31
Also published as: CN112733658A

Abstract

The application discloses an electronic document filing method and device. The method comprises the following steps: receiving an electronic document to be archived; analyzing the electronic document to be archived by adopting an image segmentation algorithm to obtain layout structure information; and archiving the electronic document to be archived according to the layout structure information. In this method, the layout of the electronic document to be archived is analyzed with an image segmentation algorithm, and OCR recognition is performed on the regions that carry key information according to the layout structure, so that the electronic document is classified and catalogued. The method avoids the resource consumption and data redundancy caused by large-scale OCR recognition, and thereby improves the precision and efficiency of electronic document filing.

Description

Electronic document filing method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for archiving an electronic document.

Background

Filing and collating of electronic documents has long been done manually. With the development of artificial intelligence technology, many automatic document classification and cataloguing methods have appeared in recent years, but these products rely heavily on optical character recognition (OCR). OCR places high demands on computing resources, and performing OCR on every scanned page of an electronic document inevitably degrades performance. To address this, some researchers use deep learning to identify the first and last pages of each material before OCR, and then perform OCR and text analysis only on the first page in order to classify the material. However, when the layout of the document is relatively disordered or the key information is unevenly distributed, performing OCR on the entire first page requires many rules to process the recognition result, and recognizing the whole page easily introduces redundant decision information.

Therefore, there is a need for a more efficient method of identifying and categorizing electronic documents.

Disclosure of Invention

The present application uses deep learning: an image segmentation algorithm analyzes the layout of the electronic document to be archived, and the document is classified and catalogued by recognizing and analyzing the content of key layout areas, such as the title, body text, header, and footer, according to the layout structure. This effectively improves document classification efficiency and the utilization of computing resources.

The application provides an electronic document filing method, which comprises the following steps:

receiving an electronic document to be archived;

analyzing an electronic document to be filed by adopting an image segmentation algorithm to obtain layout structure information;

and archiving the electronic document to be archived according to the layout structure information.
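The three steps above can be sketched as a small pipeline. This is a minimal illustration only, not the patented implementation: `LayoutRegion`, `file_document`, `toy_segment`, and `toy_archive` are hypothetical names, and the toy functions stand in for a trained segmentation model and a real archiving rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LayoutRegion:
    layout_type: str                   # e.g. "title", "text", "table"
    box: Tuple[int, int, int, int]     # (left, top, right, bottom) in pixels


def file_document(page: object,
                  segment: Callable[[object], List[LayoutRegion]],
                  archive: Callable[[List[LayoutRegion]], str]) -> str:
    """Receive a page, analyze its layout, and return its archive category."""
    regions = segment(page)    # step 2: layout structure information
    return archive(regions)    # step 3: archive according to the layout


# Toy stand-ins (assumptions) so the sketch runs without a trained model.
def toy_segment(page: object) -> List[LayoutRegion]:
    return [LayoutRegion("title", (100, 40, 500, 90)),
            LayoutRegion("text", (60, 120, 540, 700))]


def toy_archive(regions: List[LayoutRegion]) -> str:
    types = {r.layout_type for r in regions}
    return "form" if "table" in types else "text_document"


category = file_document("scanned-page.png", toy_segment, toy_archive)
print(category)
```

Passing the segmenter and the archiving rule in as callables keeps the flow independent of any particular segmentation model.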

Further, in a preferred embodiment provided in the present application, an image segmentation algorithm is used to analyze an electronic document to be archived and obtain layout structure information, and the method specifically includes:

acquiring the distribution characteristics of the page elements of the electronic document to be filed and carrying out standardization processing;

according to the document page element distribution characteristics of standardized processing, an image segmentation algorithm is adopted to segment the electronic document page into a plurality of layout areas;

aggregating the mapping relation between the element distribution characteristics of the layout area and the layout type, and determining the mapping relation between the element distribution characteristic sample space and the layout type sample space;

acquiring layout structure information according to the page element distribution characteristics of the electronic document to be filed, wherein the layout structure information comprises layout categories and coordinate information;

wherein the image segmentation algorithm comprises at least one of Mask R-CNN, Fast R-CNN and U-Net.

Further, in a preferred embodiment provided by the present application, the layout structure information includes a layout type and coordinate information, and the layout type includes at least one of a background, a title, a text, a picture, a table, a header, and a footer.

Further, in a preferred embodiment provided in the present application, the method for archiving an electronic document to be archived according to layout structure information further includes:

determining a first region of the layout according to the layout structure information;

performing OCR recognition on a first area of the layout to generate first classification information;

and archiving the electronic document to be archived according to the first classification information.

Further, in a preferred embodiment provided by the present application, the first area of the layout is a title area.

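Further, in a preferred embodiment provided in the present application, archiving the electronic document to be archived according to the layout structure information further includes:

inputting the layout structure information into a file classifier to generate second classification information;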

and archiving the electronic document to be archived according to the second classification information.

Further, in a preferred embodiment provided herein, the document classifier is constructed and optimized by at least one of SVM, random forest, and linear regression.

Further, in a preferred embodiment provided by the present application, when obtaining the layout structure information fails, performing OCR recognition on an electronic document to be archived to generate a first intermediate document;

determining third classification information of the first intermediate document according to the first intermediate document;

and archiving the electronic document to be archived according to the third classification information.

Further, in a preferred embodiment provided herein, the method is used for archiving judicial case files.

The present application further provides an electronic document filing apparatus, comprising:

the receiving module is used for receiving the electronic document to be archived;

the image segmentation module is used for analyzing the electronic document to be archived by adopting an image segmentation algorithm to acquire layout structure information;

and the document filing module is used for filing the electronic document to be filed according to the layout structure information.

According to the electronic document filing method, the layout of the electronic document to be archived is analyzed with an image segmentation algorithm, and OCR recognition is performed on the areas that carry key information according to the layout structure, so that the electronic document is classified and catalogued. The method avoids the resource consumption and data redundancy caused by large-scale OCR recognition, and thereby improves the precision and efficiency of electronic document filing.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

Fig. 1 is a flowchart of an electronic document archiving method according to an embodiment of the present application.

Fig. 2 is a schematic structural diagram of an electronic document filing apparatus according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.

Referring to fig. 1, an electronic document filing method provided in the embodiment of the present application specifically includes the following steps:

S100: An electronic document to be archived is received.

The electronic documents to be archived can be of various kinds, including official documents, judgment documents, administrative decisions, personnel information registration forms, hospital diagnosis and treatment records, academic papers, and other electronic documents that need to be archived.

S200: The electronic document to be archived is analyzed with an image segmentation algorithm to obtain layout structure information.

Image segmentation is an important research direction in the field of computer vision and an important part of image semantic understanding. Image segmentation refers to the process of dividing an image into several regions having similar properties; from a mathematical point of view, it is the process of dividing an image into mutually disjoint regions. In recent years, driven by deep learning, image segmentation technology has developed rapidly. With different image segmentation algorithms, an electronic document can be segmented according to its document elements and their distribution. Electronic documents to be archived usually have a certain layout structure. For example, form data generated by online filling usually follows certain standard requirements. As another example, official documents of government agencies typically have strict formatting specifications. Academic papers and official documents both follow certain format standards. In addition, documents produced in specific fields such as medical treatment, justice, and law enforcement have fixed formats. Therefore, the layout structure can be analyzed with an image segmentation algorithm to obtain layout structure information.

Further, in a preferred embodiment provided by the present application, analyzing the electronic document to be archived by adopting an image segmentation algorithm to obtain layout structure information specifically includes:

acquiring the page element distribution characteristics of the electronic document to be archived and standardizing them;

segmenting the electronic document page into a plurality of layout areas by adopting an image segmentation algorithm according to the standardized page element distribution characteristics;

aggregating the mapping relations between the element distribution characteristics of the layout areas and the layout types, and determining the mapping relation between the element distribution characteristic sample space and the layout type sample space;

acquiring layout structure information according to the page element distribution characteristics of the electronic document to be archived, wherein the layout structure information comprises layout types and coordinate information.

Specifically, different electronic documents often have different layout structures, and different layouts arrange text or other content differently, so they show different page element distribution characteristics. For example, the title is usually located in the top area of an electronic document, the signature block and date are usually located in the bottom area, and different content elements (e.g., text, pictures, charts, tables) show different element distribution characteristics on the page.

The page element distribution characteristics of the electronic document to be archived are acquired and standardized. The purpose of standardization is to convert the page element distribution characteristics into features that an algorithm can recognize. For example, information such as the spatial position and size of the area occupied by a page element, and the position and size of the specific content distributed within that area, is standardized into an associative array, which is treated as a specific element block that can be segmented and recognized. The electronic document is then segmented with the Mask R-CNN, Fast R-CNN, or U-Net algorithm and divided pixel by pixel into different layout parts. Layout areas are divided on the principle that each area has standardized, uniform element characteristics and content information. Different element distribution characteristics have different spatial positions (coordinate information) and layout content information, and correspond to different layout types. By aggregating the mapping relations between the element distribution characteristics of a large number of layout areas and their layout types, the mapping relation between the element distribution characteristic sample space and the layout type sample space can be determined. The element distribution characteristics of the electronic document page to be archived are analyzed according to this mapping relation to obtain layout structure information. The layout structure information includes a layout type and coordinate information.
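As one possible reading of the standardization step, the position and size of a page element can be expressed relative to the page dimensions and stored in an associative array. The sketch below is an assumption for illustration: the function name `normalize_box` and the chosen feature keys are hypothetical and are not specified by the application.

```python
from typing import Dict, Tuple


def normalize_box(box: Tuple[int, int, int, int],
                  page_width: int,
                  page_height: int) -> Dict[str, float]:
    """Convert a pixel box (left, top, right, bottom) into page-relative features."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    left, top, right, bottom = box
    width = (right - left) / page_width
    height = (bottom - top) / page_height
    return {
        "center_x": (left + right) / 2 / page_width,   # horizontal position, 0..1
        "center_y": (top + bottom) / 2 / page_height,  # vertical position, 0..1
        "width": width,                                # fraction of page width
        "height": height,                              # fraction of page height
        "area": width * height,                        # fraction of page area
    }


# A title-like element near the top of a 1000 x 2000 pixel page.
features = normalize_box((100, 50, 500, 100), page_width=1000, page_height=2000)
print(features)
```

Page-relative values make elements from scans of different resolutions comparable, which is one reasonable way to obtain features an algorithm can recognize.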

Mask R-CNN, Fast R-CNN, and U-Net are all existing image segmentation algorithms whose specific methods are elaborated in the relevant literature and are not described in detail herein.

Further, in a preferred embodiment provided by the present application, the layout structure information includes a layout type and coordinate information, where the layout type includes at least one of a background, a title, a text, a picture, a table, a header, and a footer.

It is understood that electronic documents usually have layout settings such as background, title, body text, header, and footer, and contain elements such as text, pictures, and tables. Different layout areas and positions correspond to different layout types and have different element distribution characteristics. The electronic document is segmented according to the above elements, and the layout structure information can be acquired by recognizing the image information and position coordinates of the different layout areas. For example, the background usually covers the entire page, the title is usually located at the top center of the page, the body text usually consists of relatively regular paragraphs, and a table has distinct, regular borders. By analyzing the background, title, body text, pictures, tables, header, and footer with an image segmentation algorithm, the different layout types and their coordinate information can be obtained.
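To illustrate how a layout type and coordinate information could be derived from a segmentation result, the sketch below takes a per-pixel label map and returns one bounding box per layout type. The label numbering in `LAYOUT_TYPES` and the function name `layout_structure` are assumptions, and the sketch treats all pixels with the same label as a single region, which a real system would refine.

```python
from typing import Dict, List

# Hypothetical label ids for the layout types named in the text.
LAYOUT_TYPES = {0: "background", 1: "title", 2: "text", 3: "picture",
                4: "table", 5: "header", 6: "footer"}


def layout_structure(label_map: List[List[int]]) -> List[Dict[str, object]]:
    """Return the layout type and bounding box (left, top, right, bottom) per label."""
    boxes: Dict[int, List[int]] = {}
    for y, row in enumerate(label_map):
        for x, label in enumerate(row):
            if label == 0:
                continue  # the background carries no archiving information here
            if label not in boxes:
                boxes[label] = [x, y, x, y]
            else:
                b = boxes[label]
                b[0], b[1] = min(b[0], x), min(b[1], y)
                b[2], b[3] = max(b[2], x), max(b[3], y)
    return [{"type": LAYOUT_TYPES[label], "box": tuple(box)}
            for label, box in sorted(boxes.items())]


# A tiny 4 x 4 "page": a title on the first row, body text on the last two rows.
label_map = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
]
structure = layout_structure(label_map)
print(structure)
```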

S300: The electronic document to be archived is archived according to the layout structure information.

It will be appreciated that different types of electronic documents to be archived typically have different layouts. For example, documents generally have a title, a body, and a signature block, while official documents have a title, a receiving unit, a body, and a signature block; a personnel information registration form is usually a table in a fixed format; and academic papers usually have the article title or chapter title in the header and footnotes at the bottom of the page. The preliminary category of a document can be determined from its layout structure information, and the document is then archived according to that category. For documents that cannot be classified directly from the layout structure information, the layout structure information can be processed further according to the archiving requirements and principles to obtain the specific content of the relevant layout areas, and the documents are then archived accordingly.

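Further, in a preferred embodiment provided in the present application, archiving the electronic document to be archived according to the layout structure information further includes:

determining a first area of the layout according to the layout structure information;

performing OCR recognition on the first area of the layout to generate first classification information;

archiving the electronic document to be archived according to the first classification information.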

Specifically, according to the layout structure information and the archiving principle, the layout area containing the specific information that determines the archiving category of the electronic document is determined as the first area. For example, if an electronic document needs to be archived by date, the signature block at the bottom of the page is typically determined as the first area of the layout. If an electronic document is to be archived by document type, the title at the top of the page is typically determined as the first area of the layout.

After the first area of the layout is determined, it is recognized with OCR to obtain the content of that area, and first classification information is generated from the content. For example, if the signature block at the bottom of the page is determined as the first area, the date extracted from the signature block after recognition is the first classification information. Alternatively, if the title at the top of the page is determined as the first area, the key fields of the title that relate to the document type (official documents such as notices and decisions, and judgment documents such as civil judgments, criminal judgments, administrative judgments, and the like) are recognized and extracted to generate the first classification information.

According to the first classification information (date, document type, and the like), the electronic document to be archived is placed into the corresponding category or catalogue following the archiving principle.
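The selection of a first area followed by OCR on that area alone might look like the sketch below. All names are hypothetical: `ocr` is any callable that returns the text inside a box (here a fake lookup table, not a real OCR engine), the English keywords stand in for the document-type terms, and choosing the lowest region on the page as the signature area is an assumption made for illustration.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]           # (left, top, right, bottom)
Region = Dict[str, object]                # {"type": str, "box": Box}

TYPE_KEYWORDS = ["Civil Judgment", "Criminal Judgment",
                 "Administrative Judgment", "Notice", "Decision"]


def pick_first_region(regions: List[Region], principle: str) -> Optional[Region]:
    """Choose the layout area that carries the information needed for filing."""
    if principle == "document_type":
        titles = [r for r in regions if r["type"] == "title"]
        return titles[0] if titles else None
    if principle == "date":
        # Assumption: the signature area is the lowest non-background region.
        candidates = [r for r in regions if r["type"] != "background"]
        return max(candidates, key=lambda r: r["box"][3], default=None)
    raise ValueError(f"unknown filing principle: {principle}")


def first_classification(page: object, regions: List[Region],
                         ocr: Callable[[object, Box], str],
                         principle: str) -> Optional[str]:
    """Run OCR on the first area only and derive the first classification info."""
    region = pick_first_region(regions, principle)
    if region is None:
        return None
    text = ocr(page, region["box"])
    if principle == "date":
        match = re.search(r"\d{4}-\d{2}-\d{2}", text)
        return match.group(0) if match else None
    return next((k for k in TYPE_KEYWORDS if k.lower() in text.lower()), None)


regions = [{"type": "title", "box": (100, 40, 500, 90)},
           {"type": "text", "box": (60, 120, 540, 700)},
           {"type": "text", "box": (300, 720, 540, 780)}]
fake_text = {(100, 40, 500, 90): "People's Court Civil Judgment",
             (300, 720, 540, 780): "Issued on 2020-12-31"}


def fake_ocr(page: object, box: Box) -> str:
    return fake_text.get(box, "")


by_type = first_classification(None, regions, fake_ocr, "document_type")
by_date = first_classification(None, regions, fake_ocr, "date")
print(by_type, by_date)
```

Only one small box is passed to the OCR callable per document, which is the source of the resource saving the method describes.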

It will be appreciated that the title area typically carries the most concentrated and essential information for archiving an electronic document. Therefore, by determining the title area as the first area of the layout and performing OCR recognition on it, the key information related to document classification can be acquired effectively, first classification information is generated, and the electronic document to be archived is placed into the corresponding category or catalogue according to the first classification information.

Specifically, a file classifier can be constructed through a machine learning algorithm, and the archiving method of the electronic document is optimized. Classification is a very important method in data mining. The concept of classification is to learn a classification function or construct a classification model based on the existing data, which can map the data records in the database to a certain category, so as to be applied to data prediction. The classifier is a general term of a method for classifying samples in data mining, and comprises algorithms such as decision trees, logistic regression, naive Bayes, neural networks and the like.

Based on the differing layout structure information of different electronic documents to be archived, a certain amount of layout structure information and the corresponding document categories are collected, and an algorithm is selected to construct and train a file classifier, so that the document category is determined from the layout types and coordinate information. The layout structure information of the current electronic document to be archived is input into the file classifier, which returns the document category; second classification information is generated from that category, and the electronic document is placed into the corresponding category or catalogue according to the second classification information. For example, litigation usually involves identity documents such as identity cards and business licenses, and the front and back of an identity card are usually copied onto a single page, so such documents can be recognized quickly by the file classifier and placed into the corresponding category or catalogue.

Further, in a preferred embodiment provided by the present application, the inputting the layout structure information into a file classifier, and generating second classification information specifically includes:

determining the category of the electronic document, acquiring the layout structure information of the electronic document, and establishing a mapping relation between the layout structure information and the category of the electronic document;

aggregating the statistical mapping relation between the layout structure information and the category of the electronic document, and determining the mapping relation between the layout structure sample space and the category sample space of the electronic document;

constructing a file classifier according to a mapping relation between a layout structure sample space and an electronic document category sample space;

inputting the layout structure information of the electronic document to be archived into a file classifier to acquire the category of the electronic document to be archived;

generating second classification information according to the classification of the electronic document to be archived;

the file classifier is constructed and optimized through at least one of SVM, random forest and linear regression.

Specifically, different electronic documents to be archived are obtained and their document categories are labeled manually. The pages of the electronic documents are analyzed with an image segmentation algorithm to obtain the layout types and coordinate information of the different areas, and a mapping relation between the layout structure information and the category of each electronic document is established. These mapping relations are aggregated, and the layout types, coordinate information, and document categories are input into a classifier for training, which determines the mapping relation between the layout structure sample space and the electronic document category sample space. The classifier model is obtained by training on a certain amount of data, for example 1000 batches. During use, the classifier is continuously adjusted and optimized according to its results in specific cases, which improves its accuracy.
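The training procedure can be illustrated with a deliberately simple stand-in. The sketch below uses a nearest-centroid classifier written in plain Python instead of the SVM or random forest named in the text, purely to stay dependency-free; the feature choice (fraction of page area per layout type), the class name, and the toy training pages are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List

LAYOUT_TYPES = ["title", "text", "picture", "table", "header", "footer"]


def to_features(regions: List[dict], page_w: int, page_h: int) -> List[float]:
    """Fraction of the page area covered by each layout type."""
    areas = dict.fromkeys(LAYOUT_TYPES, 0.0)
    for r in regions:
        left, top, right, bottom = r["box"]
        if r["type"] in areas:
            areas[r["type"]] += (right - left) * (bottom - top) / (page_w * page_h)
    return [areas[t] for t in LAYOUT_TYPES]


class NearestCentroidClassifier:
    """Stand-in for the file classifier: one mean feature vector per category."""

    def fit(self, X: List[List[float]], y: List[str]) -> "NearestCentroidClassifier":
        sums: Dict[str, List[float]] = {}
        counts: Dict[str, int] = defaultdict(int)
        for vec, label in zip(X, y):
            previous = sums.get(label, [0.0] * len(vec))
            sums[label] = [s + v for s, v in zip(previous, vec)]
            counts[label] += 1
        self.centroids = {label: [s / counts[label] for s in total]
                          for label, total in sums.items()}
        return self

    def predict(self, vec: List[float]) -> str:
        def squared_distance(centroid: List[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(vec, centroid))
        return min(self.centroids, key=lambda c: squared_distance(self.centroids[c]))


def pictures(*boxes):
    return [{"type": "picture", "box": b} for b in boxes]


def form(title_box, table_box):
    return [{"type": "title", "box": title_box}, {"type": "table", "box": table_box}]


# Manually labeled toy pages (1000 x 1000 pixels): layout structure -> category.
train = [
    (pictures((200, 100, 800, 450), (200, 550, 800, 900)), "identity_document"),
    (pictures((250, 150, 750, 450), (250, 550, 750, 850)), "identity_document"),
    (form((300, 50, 700, 100), (100, 150, 900, 900)), "registration_form"),
    (form((300, 60, 700, 110), (100, 200, 900, 800)), "registration_form"),
]
X = [to_features(regions, 1000, 1000) for regions, _ in train]
y = [label for _, label in train]
classifier = NearestCentroidClassifier().fit(X, y)

# A new page with two pictures, like the front and back of an identity card.
new_page = pictures((220, 120, 780, 460), (220, 560, 780, 880))
predicted = classifier.predict(to_features(new_page, 1000, 1000))
print(predicted)
```

In practice the same `fit` and `predict` flow would be carried out with a library implementation of one of the algorithms the text lists, trained on far more labeled pages.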

SVM, random forest, and linear regression are all existing general-purpose machine learning algorithms whose specific methods are elaborated in the relevant literature and are not described in detail herein.

Further, in a preferred embodiment provided in the present application, the electronic document filing method further includes:

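when the acquisition of the layout structure information fails, performing OCR recognition on the electronic document to be archived to generate a first intermediate document;

determining third classification information according to the first intermediate document;

archiving the electronic document to be archived according to the third classification information.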

Specifically, when the electronic document to be archived does not have a layout structure that is easy to segment, or when layout structure information cannot be acquired through the segmentation algorithm, OCR recognition is performed on the electronic document to obtain the content of the corresponding pages. A first intermediate document is generated from the OCR result, third classification information is generated from the key elements in the content of the first intermediate document, and the electronic document to be archived is placed into the corresponding category or catalogue according to the third classification information.
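The fallback path can be sketched as follows. The sketch assumes that a failed layout analysis shows up as an empty result; the function names, the keyword rules in `KEYWORD_RULES`, and the toy callables are hypothetical and only illustrate the control flow.

```python
import re
from typing import Callable, Dict, List

# Hypothetical keyword rules applied to the full-page OCR text.
KEYWORD_RULES: Dict[str, str] = {
    "iou": "evidence_material",
    "certificate": "evidence_material",
    "judgment": "judgment_document",
}


def classify_text(text: str) -> str:
    """Derive the third classification information from key words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    for keyword, category in KEYWORD_RULES.items():
        if keyword in words:
            return category
    return "unclassified"


def file_with_fallback(page: object,
                       segment: Callable[[object], List[dict]],
                       classify_layout: Callable[[List[dict]], str],
                       full_page_ocr: Callable[[object], str]) -> str:
    """Prefer layout-based filing; fall back to full-page OCR when it fails."""
    regions = segment(page)
    if regions:                                  # layout structure was obtained
        return classify_layout(regions)
    intermediate_document = full_page_ocr(page)  # first intermediate document
    return classify_text(intermediate_document)


# Toy callables: a handwritten note for which segmentation finds no regions.
result = file_with_fallback(
    page="handwritten-note.png",
    segment=lambda page: [],
    classify_layout=lambda regions: "registration_form",
    full_page_ocr=lambda page: "IOU: borrowed 5000 yuan, to be repaid in one year.",
)
print(result)
```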

It can be understood that, with the development of judicial informatization, the number of judicial case files is growing geometrically. Court case file materials are multi-source, heterogeneous, and massive in volume, while their information categories are standardized and consistent.

The electronic document filing method can be used for archiving judicial case files. The judicial case file to be archived is received and analyzed with an image segmentation algorithm to obtain layout structure information. The case file is then placed into the corresponding category or catalogue according to the layout structure information and the archiving principle, for example by document category such as complaint, judgment, court trial transcript, and evidence materials. Different documents have different layout structures. For example, a complaint typically consists of layout areas for the title, the litigation requests, the facts and reasons, the legal provisions, and so forth; a judgment consists of the title, the case number, the plaintiff and defendant information, the main text of the judgment, and the signature block; and evidence materials usually include a tabular evidence catalogue and the various items of evidence. Because these layout structures can be obtained directly through an image segmentation algorithm, the categories of the documents can be determined and the documents archived. In addition, archiving efficiency can be further improved by constructing a classifier with a machine learning algorithm.

For a judgment document, the title can be determined as the first area of the layout. The title area is recognized with OCR to obtain details such as the case type (civil, criminal, or administrative) and the trial level (determined from information such as the appellant or appellee); classification information is generated from this information, and the judicial case file is archived according to the classification information.
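A rule-based reading of this step might look like the sketch below. It is an illustration under stated assumptions: the English keywords stand in for the terms that would appear in the recognized text, the function name `classify_judgment` is hypothetical, and treating the presence of an appellant or appellee as a sign of a second-instance case is an assumed rule, not one stated in the application.

```python
from typing import Dict

CASE_TYPE_KEYWORDS = ("civil", "criminal", "administrative")
APPEAL_PARTY_KEYWORDS = ("appellant", "appellee")


def classify_judgment(title_text: str) -> Dict[str, str]:
    """Derive case type and trial level from the OCR text of the title area."""
    lowered = title_text.lower()
    case_type = next((k for k in CASE_TYPE_KEYWORDS if k in lowered), "unknown")
    # Assumed rule: appeal parties indicate a second-instance case.
    is_appeal = any(k in lowered for k in APPEAL_PARTY_KEYWORDS)
    return {
        "case_type": case_type,
        "trial_level": "second_instance" if is_appeal else "first_instance",
    }


appeal_case = classify_judgment("Civil Judgment. Appellant: Zhang San. Appellee: Li Si.")
trial_case = classify_judgment("Criminal Judgment. Defendant: Wang Wu.")
print(appeal_case, trial_case)
```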

For electronic documents whose layout information cannot be obtained, such as handwritten certificates, scanned IOUs, and other related evidence materials, the content of the document can be obtained through OCR recognition. An intermediate document is generated from the content, the key information in the intermediate document is extracted, classification information is determined by matching classification rules, and the electronic document to be archived is then placed into the corresponding category or catalogue according to the classification information.

The present application further provides an electronic document filing apparatus 100, comprising:

a receiving module 11, configured to receive an electronic document to be archived;

an image segmentation module 12, configured to analyze the electronic document to be archived by adopting an image segmentation algorithm to acquire layout structure information;

and a document filing module 13, configured to archive the electronic document to be archived according to the layout structure information.
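The module structure of the apparatus can be mirrored by three small classes composed into one object. This is a structural sketch only: the class and method names are hypothetical, the segmenter is injected as a plain callable rather than a trained model, and the filing rule inside `DocumentFilingModule` is a placeholder.

```python
from typing import Callable, Dict, List


class ReceivingModule:
    """Corresponds to receiving module 11."""

    def receive(self, document: object) -> object:
        if document is None:
            raise ValueError("no document received")
        return document


class ImageSegmentationModule:
    """Corresponds to image segmentation module 12."""

    def __init__(self, segmenter: Callable[[object], List[dict]]) -> None:
        self._segmenter = segmenter

    def analyze(self, document: object) -> List[dict]:
        return self._segmenter(document)


class DocumentFilingModule:
    """Corresponds to document filing module 13; keeps a simple in-memory archive."""

    def __init__(self) -> None:
        self.archive: Dict[str, List[object]] = {}

    def file(self, document: object, layout: List[dict]) -> str:
        has_table = any(region["type"] == "table" for region in layout)
        category = "form" if has_table else "text_document"  # placeholder rule
        self.archive.setdefault(category, []).append(document)
        return category


class ElectronicDocumentFilingApparatus:
    """Corresponds to apparatus 100, wiring the three modules together."""

    def __init__(self, segmenter: Callable[[object], List[dict]]) -> None:
        self.receiving = ReceivingModule()
        self.segmentation = ImageSegmentationModule(segmenter)
        self.filing = DocumentFilingModule()

    def run(self, document: object) -> str:
        received = self.receiving.receive(document)
        layout = self.segmentation.analyze(received)
        return self.filing.file(received, layout)


apparatus = ElectronicDocumentFilingApparatus(
    segmenter=lambda doc: [{"type": "table", "box": (100, 150, 900, 900)}])
filed_as = apparatus.run("registration-form.png")
print(filed_as, apparatus.filing.archive)
```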

By running the electronic document filing apparatus 100 provided by the present application, electronic documents can be archived automatically. The electronic documents to be archived can be of various kinds, including official documents, judgment documents, administrative decisions, personnel information registration forms, hospital medical records, academic papers, and other electronic documents that need to be archived.

Image segmentation is an important research direction in the field of computer vision and an important part of image semantic understanding. Image segmentation refers to the process of dividing an image into several regions having similar properties; from a mathematical point of view, it is the process of dividing an image into mutually disjoint regions. In recent years, driven by deep learning, image segmentation technology has developed rapidly. With different image segmentation algorithms, an electronic document can be segmented according to its document elements and their distribution. Electronic documents to be archived usually have a certain layout structure. For example, form data generated by online filling usually follows certain standard requirements. As another example, official documents of government agencies typically have strict formatting specifications. Academic papers and official documents both follow certain format standards. In addition, documents produced in specific fields such as medical treatment, justice, and law enforcement have fixed formats. Therefore, the image segmentation module 12 can be used to analyze the layout structure of the document to obtain layout structure information.

acquiring the distribution characteristics of the page elements of the electronic document to be filed and carrying out standardized processing;

Specifically, different electronic documents often have different layout structures, and different layouts arrange text or other content differently, so they show different page element distribution characteristics. For example, the title is usually located in the top area of an electronic document, the signature block and date are usually located in the bottom area, and different content elements (such as text, pictures, charts, and tables) show different element distribution characteristics on the page.

The page element distribution characteristics of the electronic document to be archived are acquired and standardized. The purpose of standardization is to convert the page element distribution characteristics into features that an algorithm can recognize. For example, information such as the spatial position and size of the area occupied by a page element, and the position and size of the specific content distributed within that area, is standardized into an associative array, which is treated as a specific element block that can be segmented and recognized. The electronic document is then segmented with the Mask R-CNN, Fast R-CNN, or U-Net algorithm and divided pixel by pixel into different layout parts. Layout areas are divided on the principle that each area has standardized, uniform element characteristics and content information. Different element distribution characteristics have different spatial positions (coordinate information) and layout content information, and correspond to different layout types. By aggregating the mapping relations between the element distribution characteristics of a large number of layout areas and their layout types, the mapping relation between the element distribution characteristic sample space and the layout type sample space can be determined. The element distribution characteristics of the electronic document page to be archived are analyzed according to this mapping relation to obtain layout structure information. The layout structure information includes a layout type and coordinate information.

Mask R-CNN, Fast R-CNN, and U-Net are all existing image segmentation tools whose specific methods are described in the relevant literature and are not detailed here.

It is understood that electronic documents usually have layout settings such as background, title, text, header, and footer, and contain elements such as text, pictures, and tables. Different layout areas and positions correspond to different layout categories and have different element distribution characteristics. The electronic document is segmented according to these elements, and the layout structure information can be acquired by recognizing the image information and the position coordinates of the different layout areas. For example, the background typically covers the entire page area, the title is typically centered at the top of the document, the body is typically relatively regular paragraphed text, and a table has distinct regular borders. The layout categories and coordinate information can be obtained by analyzing the background, title, text, pictures, tables, header, and footer with an image segmentation algorithm.
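As an illustration of how layout categories and coordinate information might be read off a segmentation result, the sketch below takes a per-pixel label mask (a 2-D list of integer ids, as a segmentation model might produce) and returns one bounding box per category. The label ids, the `LAYOUT_CATEGORIES` table, and the function name are assumptions made for this example. For simplicity, all pixels of one category are merged into a single box, and the background, which covers the whole page, is skipped.

```python
# Hypothetical label ids; the category names follow the description.
LAYOUT_CATEGORIES = {
    0: "background",
    1: "title",
    2: "text",
    3: "picture",
    4: "table",
    5: "header",
    6: "footer",
}


def layout_structure(mask):
    """Return {category: (left, top, right, bottom)} in pixel coordinates."""
    boxes = {}
    for row_idx, row in enumerate(mask):
        for col_idx, label in enumerate(row):
            if label == 0:  # background spans the page, so no box is kept
                continue
            name = LAYOUT_CATEGORIES[label]
            if name not in boxes:
                boxes[name] = [col_idx, row_idx, col_idx, row_idx]
            else:
                box = boxes[name]
                box[0] = min(box[0], col_idx)
                box[1] = min(box[1], row_idx)
                box[2] = max(box[2], col_idx)
                box[3] = max(box[3], row_idx)
    return {name: tuple(box) for name, box in boxes.items()}
```

A real page would usually need connected-component analysis so that several separate text blocks or pictures each get their own box.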

Further, in a preferred embodiment provided in the present application, the document filing module 13 is configured to file the electronic file to be filed according to the layout structure information, and is specifically configured to:

After the first area of the layout is determined, it is recognized using OCR to obtain the content information of that area, and first classification information is generated from the content information. For example, the signature block at the bottom of the page is determined as the first area of the layout, and the date extracted from the signature block after recognition serves as the first classification information. Alternatively, the title part at the top of the page is determined as the first area of the layout, and after recognition the key fields of the title related to the document type (official documents such as notices and decisions, and judgment documents such as civil judgments, criminal judgments, administrative judgments, and the like) are extracted to generate the first classification information.

It will be appreciated that the title area typically carries the most concentrated and essential information for archiving an electronic document. Therefore, by determining the title area as the first area of the layout and performing OCR recognition on it, the key information related to document classification can be acquired effectively, the first classification information is generated, and the electronic document to be archived is classified into the corresponding category or directory according to the first classification information.
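The title-first flow can be sketched as follows. The OCR engine is passed in as a callable, since the application does not name one; `TITLE_KEYWORDS`, `first_classification`, and the English keywords are hypothetical stand-ins for the key fields mentioned above. The point of the sketch is that only the title region is sent to OCR, not the whole page.

```python
# Hypothetical keyword list; more specific phrases come first.
TITLE_KEYWORDS = (
    "civil judgment",
    "criminal judgment",
    "administrative judgment",
    "notice",
    "decision",
)


def first_classification(layout, ocr, keywords=TITLE_KEYWORDS):
    """Run OCR on the title region only and map its text to a category.

    layout: {category: box} as produced by the layout analysis step.
    ocr:    callable taking a box and returning the recognized text
            (an assumed interface, not a real library call).
    Returns the matched keyword, or None if no title or no match.
    """
    title_box = layout.get("title")
    if title_box is None:
        return None
    text = ocr(title_box).lower()
    for keyword in keywords:
        if keyword in text:
            return keyword
    return None
```

Returning `None` leaves room for the caller to fall back to another classification path when the title region is missing or unrecognized.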

Further, in a preferred embodiment provided in the present application, the document filing module 13 is configured to file the electronic file to be filed according to the layout structure information, and is further configured to:

A certain amount of layout structure information from different electronic documents, together with the corresponding file categories, is collected, and an algorithm is selected to construct and train a file classifier, so that the document category can be judged from the layout categories and coordinate information. The layout structure information of the current electronic document to be filed is input into the file classifier to obtain its file category, second classification information is generated from that file category, and the electronic document to be filed is classified into the corresponding category or directory according to the second classification information. For example, litigation requires identity documents such as identity cards and business licenses; copies of the front and back of an identity card usually appear together on a single page, so the file classifier can quickly recognize identity documents and classify them into the corresponding category or directory.

Further, in a preferred embodiment provided in the present application, the inputting the layout structure information into a file classifier to generate second classification information specifically includes:

Specifically, different electronic documents to be filed are obtained and their document categories are labeled manually. The pages of the electronic documents are analyzed with an image segmentation algorithm to obtain the layout categories and coordinate information of the different areas, and a mapping between the layout structure information and the category of each electronic document is established. These mappings are aggregated, the layout categories, coordinate information, and document categories are input into a classifier for training, and the mapping between the layout structure sample space and the electronic document category sample space is determined. The classifier model is obtained by training on a certain amount of data, for example 1000 batches. During use, the classifier is continuously adjusted and optimized according to its results in specific cases, which improves its judgment accuracy.
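The application leaves the choice of classification algorithm open. Purely as an illustration, the sketch below uses a nearest-centroid classifier, which is an assumption of this example and not the method of the application. Layout structure information is assumed to be a dictionary of page-relative boxes `(x, y, w, h)` per layout category; `to_features` and `NearestCentroidFileClassifier` are hypothetical names.

```python
import math
from collections import defaultdict

CATEGORIES = ("title", "text", "picture", "table", "header", "footer")
DIM = 4 * len(CATEGORIES)


def to_features(layout):
    """Flatten {category: (x, y, w, h)} into a fixed-length vector.

    Categories absent from the page contribute zeros.
    """
    vec = []
    for name in CATEGORIES:
        vec.extend(layout.get(name, (0.0, 0.0, 0.0, 0.0)))
    return vec


class NearestCentroidFileClassifier:
    """Assigns the file category whose mean layout vector is closest."""

    def fit(self, layouts, labels):
        sums = defaultdict(lambda: [0.0] * DIM)
        counts = defaultdict(int)
        for layout, label in zip(layouts, labels):
            for i, value in enumerate(to_features(layout)):
                sums[label][i] += value
            counts[label] += 1
        self.centroids = {
            label: [value / counts[label] for value in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, layout):
        vec = to_features(layout)
        return min(
            self.centroids,
            key=lambda label: math.dist(vec, self.centroids[label]),
        )
```

With manually labeled samples, `fit` corresponds to aggregating the mapping between layout structure and document category, and `predict` yields the file category from which the second classification information is generated.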

When acquisition of the layout structure information fails, OCR recognition is performed on the electronic document to be archived to generate a first intermediate document;

Further, in a preferred embodiment provided by the present application, the electronic file filing apparatus is used for filing judicial files.

The electronic document filing device can be used for filing judicial case files. The judicial case file to be archived is received and analyzed with an image segmentation algorithm to acquire layout structure information. The judicial case file is then classified into the corresponding category or directory according to the layout structure information and the archiving rules, for example by document category such as complaints, judgments, court trial transcripts, and evidence materials. Different documents have different layout structures. For example, a complaint typically consists of separate layout areas for the title, the litigation request, the facts and reasons, the legal provisions, and so forth; a judgment consists of the title, the case number, the plaintiff and defendant information, the main text of the judgment, and the signature block; and evidence materials usually include an evidence catalogue table and the various items of evidence. Because different documents usually have different layout structures, and the layout structure can be obtained directly through an image segmentation algorithm, the category of a document can be judged and the document filed accordingly. In addition, filing efficiency can be further improved by constructing a classifier with a machine learning algorithm.

For a judgment document, the title part can be determined as the first area of the layout. The title area is recognized with OCR (optical character recognition) to obtain detailed information such as the case type (civil, criminal, or administrative) and the trial level (judged from information such as the appellant or the appellee), classification information is generated from this information, and the judicial case file is filed according to the classification information.
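A minimal sketch of this step is given below. It works on already recognized text and makes two assumptions that the application does not state: that the presence of "appellant" or "appellee" indicates a second-instance document, and that archive directories are named after the case type and trial level. All names and the directory scheme are hypothetical.

```python
CASE_TYPES = ("civil", "criminal", "administrative")


def classify_judgment(recognized_text):
    """Derive case type and trial level from OCR text of a judgment."""
    lowered = recognized_text.lower()
    case_type = next((t for t in CASE_TYPES if t in lowered), "unknown")
    # Assumption: appellant/appellee wording implies a second-instance case.
    has_appeal_party = "appellant" in lowered or "appellee" in lowered
    trial_level = "second_instance" if has_appeal_party else "first_instance"
    return {"case_type": case_type, "trial_level": trial_level}


def archive_path(info):
    """Build a hypothetical archive directory from the classification info."""
    return f"judgments/{info['case_type']}/{info['trial_level']}"
```

For example, text containing "Civil Judgment" and "Appellant" would be filed under `judgments/civil/second_instance` in this scheme.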

In a typical configuration, a computer may include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus comprising the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. An electronic document archiving method, comprising:

receiving an electronic document to be archived;

adopting an image segmentation algorithm to analyze an electronic document to be filed and acquiring layout structure information, specifically comprising:

acquiring the distribution characteristics of the page elements of the electronic document to be filed and carrying out standardized processing,

according to the document page element distribution characteristics of standardized processing, adopting image segmentation algorithm to segment the electronic document page into several layout areas,

aggregating the mapping relation between the element distribution characteristics of the layout areas and the layout types, determining the mapping relation between the element distribution characteristic sample space and the layout type sample space,

acquiring layout structure information according to the distribution characteristics of page elements of the electronic document to be filed, wherein the layout structure information comprises layout categories and coordinate information,

wherein the image segmentation algorithm comprises at least one of maskrcnn and u-net,

the layout structure information comprises a layout type and coordinate information, wherein the layout type comprises at least one of a background, a title, a text, a picture, a table, a header and a footer;

according to the layout structure information, the electronic document to be archived is archived, which comprises the following steps:

determining a first region of the layout according to the layout structure information,

performing OCR recognition on the first area of the layout to generate first classification information,

archiving the electronic document to be archived according to the first classification information,

the first area of the layout is a title area;

when acquisition of the layout structure information fails, performing OCR recognition on the electronic document to be archived to generate a first intermediate document, determining third classification information according to the first intermediate document, and archiving the electronic document to be archived according to the third classification information.

2. The electronic document filing method according to claim 1, wherein the electronic document to be filed is filed in accordance with the layout structure information, further comprising:

3. The method of claim 2, wherein inputting the layout structure information into a file classifier to generate second classification information comprises:

inputting the layout structure information of the electronic document to be archived into a file classifier to obtain the category of the electronic document to be archived;

4. The electronic document filing method according to claim 1, wherein the method is used for filing a judicial portfolio.

5. An electronic document filing apparatus, comprising:

the image segmentation module is used for analyzing the electronic document to be filed by adopting an image segmentation algorithm to acquire layout structure information, and is specifically used for:

wherein the image segmentation algorithm comprises at least one of maskrcnn and u-net,

the layout structure information comprises layout categories and coordinate information, wherein the layout categories comprise at least one of background, title, text, picture, table, header and footer;

The document filing module is used for filing the electronic document to be filed according to the layout structure information, and is specifically used for:

performing OCR recognition on a first area of the layout to generate first classification information,

the first area of the layout is a title area;