CN112199330A

CN112199330A - Mixed document filing method, filing device and storage medium

Info

Publication number: CN112199330A
Application number: CN202011055808.1A
Authority: CN
Inventors: 彭健
Original assignee: Shaoguan Power Supply Bureau Guangdong Power Grid Co Ltd
Current assignee: Shaoguan Power Supply Bureau Guangdong Power Grid Co Ltd
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2021-01-08

Abstract

The embodiment of the invention discloses a mixed document filing method, a filing device and a storage medium. The mixed document filing method comprises the following steps: acquiring a mixed document scanned by a printer, wherein the mixed document comprises N documents, each document comprises at least one data page, and N is greater than or equal to 2; performing document separation on the mixed document to obtain N folders, wherein each folder stores all data pages of one document; and sequentially identifying the type of the document stored in each folder, and storing the folder into the archive directory folder corresponding to the type of the document stored in the folder. The method addresses the heavy manual involvement, complex operation, low working efficiency and high labor cost of the prior art, and achieves automatic segmentation, type identification and archiving of the mixed document.

Description

Mixed document filing method, filing device and storage medium

Technical Field

The embodiment of the invention relates to a document automatic classification technology, in particular to a mixed document filing method, a filing device and a storage medium.

Background

With the establishment of enterprise-level systems, the informatization projects of local and municipal bureaus consist mainly of information maintenance and information repair projects, and the electronic archiving and classified storage management of the documents of such projects, especially the paper settlement documents produced each month, is becoming an important task.

At present, the electronic filing of paper documents generally proceeds as follows: first, the paper documents to be filed are automatically scanned page by page; after scanning, the whole scanned file is split manually and mechanically using splitting software; finally, the split documents are stored in a pre-established electronic directory to serve as the main attachments of the electronic archive or of a system workflow.

However, every stage of this electronic filing process requires manual participation, which results in complex operation, low working efficiency and high labor cost.

Disclosure of Invention

The invention provides a mixed document filing method, a filing device and a storage medium, which are used for realizing automatic segmentation, type identification and filing of a mixed document.

In a first aspect, an embodiment of the present invention provides a hybrid document archiving method, where the hybrid document archiving method includes:

acquiring a mixed document scanned by a printer, wherein the mixed document comprises N documents, each document comprises at least one data page, and N is more than or equal to 2;

carrying out document separation on the mixed document to obtain N folders, wherein each folder stores all data pages of a document;

and sequentially identifying the type of the document stored in each folder, and storing the folder into an archive directory folder corresponding to the type of the document stored in the folder.

Optionally, the N documents are arranged in sequence, and a first mark is provided on a first data page of each document.

Optionally, the performing document separation on the mixed document to obtain N folders includes:

calling an image recognition interface API (application program interface) through a Python script to recognize the mixed document and obtain all data pages provided with the first marks;

and putting data pages from a first data page to a previous page of a second data page into a folder, wherein the first data page and the second data page are two adjacent data pages provided with the first mark, and the first data page is positioned before the second data page.

Optionally, the N documents are arranged in a disordered manner or in a sequential manner, and all data pages of each document are provided with a second mark and a third mark; the second mark is used to indicate to which document the data page belongs, and the third mark is used to indicate the position of the data page in the document.

Optionally, the performing document separation on the mixed document to obtain N folders includes:

calling an image recognition interface API through a Python script to recognize the mixed document, putting the data pages with the same second mark into a folder, and sorting the data pages in the folder according to the third mark.

Optionally, the performing document separation on the mixed document to obtain N folders includes:

learning the historical mixed document to obtain a training model;

and carrying out document separation on the mixed document according to the training model to obtain N folders.

Optionally, the sequentially identifying the type of the document stored in each folder includes:

the type of the document stored in each folder is identified in turn based on the elements of all the data pages of a document stored in each folder.

Optionally, the elements include: at least one of a title element, a contract name element, and a settlement month element.

In a second aspect, an embodiment of the present invention further provides a hybrid document filing device, where the hybrid document filing device comprises:

the mixed document acquisition module is used for acquiring a mixed document scanned by a printer, the mixed document comprises N parts of documents, each part of document comprises at least one data page, and N is more than or equal to 2;

the document separation module is used for carrying out document separation on the mixed document to obtain N folders, and each folder stores all data pages of one document;

and the document identification and storage module is used for sequentially identifying the type of the document stored in each folder and storing the folder into an archive directory folder corresponding to the type of the document stored in the folder.

In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the hybrid document archiving method according to the first aspect.

The invention provides a mixed document filing method, which comprises the following steps: acquiring a mixed document scanned by a printer, wherein the mixed document comprises N documents, each document comprises at least one data page, and N is greater than or equal to 2; performing document separation on the mixed document to obtain N folders, wherein each folder stores all data pages of one document; and sequentially identifying the type of the document stored in each folder, and storing the folder into the archive directory folder corresponding to the type of the document stored in the folder. The method addresses the heavy manual involvement, complex operation, low working efficiency and high labor cost of the prior art, and achieves automatic segmentation, type identification and archiving of the mixed document.

Drawings

FIG. 1 is a flowchart of a hybrid document archiving method according to a first embodiment of the present invention;

FIG. 2 is a flowchart of a hybrid document archiving method according to a second embodiment of the present invention;

FIG. 3 is a flowchart of a hybrid document archiving method according to a third embodiment of the present invention;

FIG. 4 is a flowchart of a hybrid document archiving method according to a fourth embodiment of the present invention;

FIG. 5 is a flowchart of a hybrid document archiving method according to a fifth embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a hybrid document filing apparatus according to a sixth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of a hybrid document archiving method according to an embodiment of the present invention, and referring to fig. 1, the embodiment is applicable to an implementation process of the hybrid document archiving method, and the method may be executed by a hybrid document archiving apparatus, and specifically includes the following steps:

step 100, acquiring a mixed document scanned by a printer, wherein the mixed document comprises N documents, each document comprises at least one data page, and N is more than or equal to 2.

The mixed document scanned by the printer may be an electronic document, such as a PDF document or an electronic image. Before the mixed document scanned by the printer is acquired, the method further includes: an operator has the paper copies of the various documents to be filed scanned automatically by a printer to generate the electronic version of the mixed document.

The mixed document includes N documents, where the N documents may be N documents of the same type, N documents of different types, multiple documents of multiple types, and the like, and are not limited herein.

Step 200, performing document separation on the mixed document to obtain N folders, wherein each folder stores all data pages of a document.

Each document in the mixed document is separated, each separated document is independently stored in one folder, and N folders are obtained, and all data pages of one document are stored in each folder.

Step 300, sequentially identifying the type of the document stored in each folder, and storing the folder into an archive directory folder corresponding to the type of the document stored in the folder.

The corresponding archive directory folder is established according to the document type, and can be used for storing the documents stored in the identified folder of the corresponding document type.
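As an illustration only, the storing part of step 300 may be sketched as follows. The function name store_in_archive, the directory layout archive_root/<type>/<folder> and the refusal to overwrite an existing folder are assumptions made for this sketch and are not specified by the embodiment.

```python
import shutil
import tempfile
from pathlib import Path


def store_in_archive(folder: Path, doc_type: str, archive_root: Path) -> Path:
    """Move one separated-document folder under archive_root/<doc_type>/."""
    type_dir = archive_root / doc_type
    # Establish the archive directory folder for this type on first use.
    type_dir.mkdir(parents=True, exist_ok=True)
    destination = type_dir / folder.name
    if destination.exists():
        # Assumed policy: never overwrite a folder that is already archived.
        raise FileExistsError(f"{destination} is already archived")
    shutil.move(str(folder), str(destination))
    return destination


# Usage with throwaway directories (all names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    doc_folder = root / "doc_001"
    doc_folder.mkdir()
    (doc_folder / "page_001.png").write_bytes(b"")
    archived = store_in_archive(doc_folder, "contract", root / "archive")
    print(archived.relative_to(root))
```

Creating the type directory lazily means that a document type seen for the first time does not require the archive directory to be prepared by hand beforehand.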

The working principle of the mixed document filing method is as follows: first, an operator has the paper copies of the various documents to be filed scanned automatically by a printer to generate an electronic mixed document, and the electronic mixed document scanned by the printer is acquired, wherein the mixed document comprises N documents, each document comprises at least one data page, and N is greater than or equal to 2; then, document separation is performed on the mixed document to obtain N folders, wherein each folder stores all data pages of one document; finally, the type of the document stored in each folder is identified in turn, and the folder is stored into the archive directory folder corresponding to that type. In this way, each document in the mixed document can be accurately separated, its type can be identified, and the documents can be stored and archived by type.

The technical solution of the present embodiment provides a hybrid document filing method, which includes: acquiring a mixed document scanned by a printer, wherein the mixed document comprises N documents, each document comprises at least one data page, and N is greater than or equal to 2; performing document separation on the mixed document to obtain N folders, wherein each folder stores all data pages of one document; and sequentially identifying the type of the document stored in each folder, and storing the folder into the archive directory folder corresponding to the type of the document stored in the folder. In this way, each document in the mixed document can be accurately separated, its type can be identified, and the documents can be stored and archived by type. The method addresses the heavy manual involvement, complex operation, low working efficiency and high labor cost of the prior art, and achieves automatic segmentation, type identification and archiving of the mixed document.
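As an illustration only, steps 100 to 300 may be sketched as the following skeleton. Representing a data page as a string, passing the separation and identification steps in as callables, and the toy page labels are all assumptions made for this sketch; the embodiment does not prescribe any of them.

```python
from typing import Callable, Dict, List, Sequence

Page = str  # stand-in for one scanned data page, e.g. an image path


def archive_mixed_document(
    pages: Sequence[Page],
    separate: Callable[[Sequence[Page]], List[List[Page]]],
    identify_type: Callable[[List[Page]], str],
) -> Dict[str, List[List[Page]]]:
    """Take the scanned pages (step 100), split them (200), file by type (300)."""
    documents = separate(pages)  # one page list ("folder") per document
    if len(documents) < 2:
        raise ValueError("a mixed document is expected to contain N >= 2 documents")
    archive: Dict[str, List[List[Page]]] = {}
    for document in documents:  # identify the type of each folder in turn
        archive.setdefault(identify_type(document), []).append(document)
    return archive


def toy_separate(pages: Sequence[Page]) -> List[List[Page]]:
    """Hypothetical separator: a label ending in ':1' opens a new document."""
    documents: List[List[Page]] = []
    for page in pages:
        if page.endswith(":1") or not documents:
            documents.append([])
        documents[-1].append(page)
    return documents


def toy_identify(document: List[Page]) -> str:
    """Hypothetical classifier: the type is the label prefix of the first page."""
    return document[0].split(":")[0]


result = archive_mixed_document(
    ["contract:1", "contract:2", "notice:1"], toy_separate, toy_identify
)
print(result)
```

Keeping separation and identification as interchangeable callables mirrors the way the later embodiments vary only the separation step or only the identification step.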

Example two

Fig. 2 is a flowchart of a hybrid document filing method provided in the second embodiment of the present invention, based on the above embodiment, optionally, N documents are arranged in sequence, and the first data page of each document is provided with a first mark.

In the mixed document scanned by the printer, neither the order of the data pages within each document nor the order of the documents is disturbed, so the N documents in the mixed document are arranged in sequence. A first mark can therefore be placed on the first data page of each document and used to separate the mixed document in a later step. The first mark may be a stamp, a marking pattern, a letter or symbol, and the like. Referring to fig. 2, the hybrid document filing method includes the following steps:

Step 110, calling an image recognition interface API through a Python script to recognize the mixed document, and acquiring all data pages provided with the first mark.

The mixed document can be recognized by calling, through a Python script, the image recognition interface APIs provided by various open source websites, so as to identify all data pages provided with the first mark.

Step 120, putting the data pages from the first data page to the previous page of the second data page into a folder, wherein the first data page and the second data page are two adjacent data pages provided with the first mark, and the first data page is positioned in front of the second data page, so that the mixed document can be subjected to document separation to obtain N folders.

The first data page may be the first page of the first document in the sequence of N documents, the second data page may be the first page of the second document, and so on: the third data page may be the first page of the third document, the fourth data page may be the first page of the fourth document, …, and the Nth data page may be the first page of the Nth document. The first data page and the second data page are two adjacent data pages provided with the first mark, and the first data page is positioned before the second data page. Similarly, the second and third data pages are two adjacent data pages provided with the first mark, with the second positioned before the third; the third and fourth data pages are two adjacent data pages provided with the first mark, with the third positioned before the fourth; …; and the (N-1)th and Nth data pages are two adjacent data pages provided with the first mark, with the (N-1)th positioned before the Nth. The Nth document has no following marked page, so its data pages run from the Nth data page to the last page of the mixed document. Therefore, the first page of each document can be identified, and the mixed document can be subjected to document separation to obtain N folders.

In the technical solution of this embodiment, the working principle of the mixed document filing method is as follows: first, an operator has the paper copies of the various documents to be filed scanned automatically by a printer to generate an electronic mixed document, and the electronic mixed document scanned by the printer is acquired, wherein the mixed document comprises N documents, each document comprises at least one data page, and N is greater than or equal to 2; then, an image recognition interface API is called through a Python script to recognize the mixed document and acquire all data pages provided with the first mark, and the data pages from a first data page to the page preceding a second data page are put into one folder, wherein the first data page and the second data page are two adjacent data pages provided with the first mark and the first data page is positioned before the second data page, so that the mixed document is separated into N folders; finally, the type of the document stored in each folder is identified in turn, and the folder is stored into the archive directory folder corresponding to that type. In this way, each document in the mixed document can be accurately separated, its type can be identified, and the documents can be stored and archived by type.

It should be noted that, in the technical solution of this embodiment, the first marks on the first pages of the documents may be the same symbol or different symbols. Because the N documents are arranged in sequence, when the same symbol is used, only that one symbol needs to be recognized to find the start of every document, which speeds up separation and improves the overall efficiency of mixed document filing.
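As an illustration only, the separation of steps 110 and 120 may be sketched as follows. The predicate has_first_mark stands in for the call to an image recognition interface API, which the embodiment does not name; the function name and the (name, marked) page tuples are assumptions made for this sketch.

```python
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")


def split_on_first_mark(
    pages: Sequence[P], has_first_mark: Callable[[P], bool]
) -> List[List[P]]:
    """Group pages in scan order so that every group starts at a marked page."""
    documents: List[List[P]] = []
    for page in pages:
        if has_first_mark(page):
            documents.append([page])  # a marked page opens a new document
        elif documents:
            documents[-1].append(page)  # unmarked pages join the current document
        else:
            raise ValueError("the first scanned page carries no first mark")
    return documents


# Usage: each page is a (name, marked) tuple; in practice the second field
# would come from image recognition rather than being known in advance.
scanned = [("p1", True), ("p2", False), ("p3", True), ("p4", False), ("p5", False)]
folders = split_on_first_mark(scanned, lambda page: page[1])
print([[name for name, _ in folder] for folder in folders])
```

The last group simply runs to the final scanned page, since no marked page follows it.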

Example three

Fig. 3 is a flowchart of a hybrid document archiving method provided in the third embodiment of the present invention. On the basis of the above embodiment, optionally, N documents are arranged in a disordered manner or in a sequential manner, and all data pages of each document are provided with a second mark and a third mark; the second mark is used to indicate to which document the data page belongs and the third mark is used to indicate the position of the data page in the document.

The sequence of the data pages in each document in the mixed document scanned by the printer may be chaotic or sequential, and the sequence between different documents may be sequential or chaotic, so that N documents in the mixed document may be chaotic or sequential. All data pages of each document are thus provided with a second mark for indicating to which document the data page belongs in particular and a third mark for indicating the position of the data page in the document.

Optionally, referring to fig. 3, the specific steps of the hybrid document archiving method are as follows:

Step 210, calling an image recognition interface API through a Python script to recognize the mixed document, putting the data pages with the same second mark into a folder, and sequencing the data pages in the folder according to the third mark, so that the mixed document can be subjected to document separation to obtain N folders.

The second mark is used to indicate to which document a data page belongs, and the third mark is used to indicate the position of the data page in that document. For example, the second marks may be A1, A2, A3, …, An, where A1 indicates the first document, A2 indicates the second document, …, and An indicates the Nth document. The third marks may be a1, a2, …, ap; b1, b2, …, bm; …; t1, t2, …, tn, where a1, a2, …, ap indicate the positions of the data pages in the first document, each of which carries the A1 mark; b1, b2, …, bm indicate the positions of the data pages in the second document, each of which carries the A2 mark; …; and t1, t2, …, tn indicate the positions of the data pages in the Nth document, each of which carries the An mark. Specifically, the Python script calls the image recognition interface API to recognize the mixed document, puts the data pages with the same second mark into one folder, and sorts the data pages in that folder according to the third mark. For example, the data pages carrying the A1 mark are put into the first folder and sorted according to the marks a1, a2, …, ap; the data pages carrying the A2 mark are put into the second folder and sorted according to the marks b1, b2, …, bm; …; and the data pages carrying the An mark are put into the Nth folder and sorted according to the marks t1, t2, …, tn. In this way, the data pages belonging to each document can be identified and ordered, so that the mixed document can be subjected to document separation to obtain N folders.

In the technical solution of this embodiment, the working principle of the mixed document filing method is as follows: first, an operator has the paper copies of the various documents to be filed scanned automatically by a printer to generate an electronic mixed document, and the electronic mixed document scanned by the printer is acquired, wherein the mixed document comprises N documents, each document comprises at least one data page, and N is greater than or equal to 2; then, an image recognition interface API is called through a Python script to recognize the mixed document, the data pages with the same second mark are put into one folder, and the data pages in the folder are sorted according to the third mark, so that the mixed document is separated into N folders; finally, the type of the document stored in each folder is identified in turn, and the folder is stored into the archive directory folder corresponding to that type. In this way, each document in the mixed document can be accurately separated, its type can be identified, and the documents can be stored and archived by type.
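As an illustration only, the grouping and sorting of step 210 may be sketched as follows. It is assumed here that image recognition has already decoded the second mark into a string and the third mark into an integer position; the MarkedPage type, its field names and the file names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class MarkedPage(NamedTuple):
    image: str     # hypothetical path of the scanned page
    doc_mark: str  # second mark: which document the page belongs to, e.g. "A1"
    position: int  # third mark: position of the page within that document


def group_and_sort(pages: Iterable[MarkedPage]) -> Dict[str, List[MarkedPage]]:
    """Put pages with the same second mark together, ordered by third mark."""
    folders: Dict[str, List[MarkedPage]] = defaultdict(list)
    for page in pages:
        folders[page.doc_mark].append(page)
    for doc_pages in folders.values():
        doc_pages.sort(key=lambda page: page.position)
    return dict(folders)


# Usage: pages arrive in a disordered scan order.
scanned = [
    MarkedPage("s3.png", "A2", 1),
    MarkedPage("s1.png", "A1", 2),
    MarkedPage("s4.png", "A1", 1),
    MarkedPage("s2.png", "A2", 2),
]
folders = group_and_sort(scanned)
print({mark: [p.image for p in doc] for mark, doc in folders.items()})
```

Because every page carries both marks, the result does not depend on the order in which the pages were scanned.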

Example four

Fig. 4 is a flowchart of a hybrid document archiving method provided in the fourth embodiment of the present invention, and referring to fig. 4, on the basis of the foregoing embodiment, the hybrid document archiving method includes the following specific steps:

Step 310, learning the historical mixed document to obtain a training model.

The training model can be obtained by learning historical electronic documents such as notices, issued documents, request forms, application forms, inspection data, contracts, integrity agreements, technical agreements and the like.

Step 320, performing document separation on the mixed document according to the training model to obtain N folders.

The method comprises the steps of inputting a mixed document to be separated into a training model, and carrying out document separation on the mixed document through the training model to obtain N folders.

In the technical solution of this embodiment, the working principle of the mixed document filing method is as follows: first, an operator has the paper copies of the various documents to be filed scanned automatically by a printer to generate an electronic mixed document, and the electronic mixed document scanned by the printer is acquired, wherein the mixed document comprises N documents, each document comprises at least one data page, and N is greater than or equal to 2; then, the historical mixed document is learned to obtain a training model, and document separation is performed on the mixed document according to the training model to obtain N folders; finally, the type of the document stored in each folder is identified in turn, and the folder is stored into the archive directory folder corresponding to that type. In this way, each document in the mixed document can be accurately separated, its type can be identified, and the documents can be stored and archived by type.
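As an illustration only, steps 310 and 320 may be sketched as follows. The embodiment does not specify the form of the training model; in this sketch a toy word-vote scorer stands in for it, and the assumptions are that each page has already been converted to text and that historical pages are labelled with whether they start a new document. All function names and sample texts are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def train_boundary_model(history: Iterable[Tuple[str, bool]]) -> Counter:
    """Learn per-word votes from (page_text, is_first_page) examples."""
    votes: Counter = Counter()
    for text, is_first_page in history:
        for word in set(text.lower().split()):
            votes[word] += 1 if is_first_page else -1
    return votes


def starts_document(model: Counter, text: str) -> bool:
    """Predict that a page opens a new document when its word votes are positive."""
    return sum(model[word] for word in set(text.lower().split())) > 0


def separate_with_model(model: Counter, page_texts: Sequence[str]) -> List[List[str]]:
    """Split the pages, in scan order, at every predicted first page."""
    documents: List[List[str]] = []
    for text in page_texts:
        if not documents or starts_document(model, text):
            documents.append([])
        documents[-1].append(text)
    return documents


# Usage with made-up historical pages.
history = [
    ("contract no 1 party a", True),
    ("terms continued page", False),
    ("notice of inspection", True),
    ("signature page continued", False),
]
model = train_boundary_model(history)
mixed = ["notice of works", "continued terms", "contract party b", "signature page"]
print(separate_with_model(model, mixed))
```

Any classifier that predicts document boundaries from page content could be substituted for the scorer without changing separate_with_model.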

Example five

Fig. 5 is a flowchart of a hybrid document archiving method provided in the fifth embodiment of the present invention, and referring to fig. 5, on the basis of the foregoing embodiment, the hybrid document archiving method includes the following specific steps:

Step 301, sequentially identifying the type of the document stored in each folder according to the elements of all data pages of a document stored in each folder.

Each folder stores one complete document, including all of its data pages. Elements such as text are recorded on the data pages, and the type of the document stored in each folder can be identified by recognizing the elements on the data pages in that folder. The document types may include electronic documents such as notices, issued documents, requests, applications, inspection data, workload and settlement amount confirmation tables, contracts, technical agreements and the like.

The title element may be the title of each document, for example, a Project A contract, a Project B contract, a Project A request, a Project B notice, and the like. The contract name element may be the contract name of a contract-type document.

It should be noted that a document generally includes a title, so the elements, which include at least one of a title element, a contract name element and a settlement month element, include at least the title element. Therefore, when a document has no contract name element and/or settlement month element, it can still be identified and classified through the title element, which avoids documents being left unidentified and avoids classification errors.
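As an illustration only, the element-based identification of step 301 may be sketched as follows. The element dictionary keys, the keyword table, the type names and the rule order (contract name first, then settlement month, then title as the fallback) are all assumptions made for this sketch; the embodiment only states which elements may be used and that the title element is always available.

```python
from typing import Dict, Optional

# Hypothetical table: keyword found in the title -> document type.
TITLE_KEYWORDS = {
    "contract": "contract",
    "request": "request",
    "notice": "notification",
}


def identify_type(elements: Dict[str, Optional[str]]) -> str:
    """Identify a document type from the elements extracted from its pages."""
    if elements.get("contract_name"):
        return "contract"  # assumed rule: a contract name implies a contract
    if elements.get("settlement_month"):
        return "settlement"  # assumed rule: a settlement month implies a settlement document
    # Fall back to the title element, which every document is expected to have.
    title = (elements.get("title") or "").lower()
    for keyword, doc_type in TITLE_KEYWORDS.items():
        if keyword in title:
            return doc_type
    return "unclassified"


# Usage: a document with only a title is still classified.
print(identify_type({"title": "Project B Notice"}))
print(identify_type({"title": "Project A", "contract_name": "Maintenance contract"}))
```

Returning an explicit "unclassified" value, rather than guessing, leaves such folders available for manual review.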

Example six

Fig. 6 is a schematic structural diagram of a hybrid document filing apparatus according to a sixth embodiment of the present invention, and referring to fig. 6, the hybrid document filing apparatus 10 includes:

the mixed document acquisition module 11 is used for acquiring a mixed document scanned by the printer, wherein the mixed document comprises N parts of documents, each part of document comprises at least one data page, and N is more than or equal to 2;

the document separation module 12 is configured to perform document separation on the mixed document to obtain N folders, and each folder stores all data pages of a document;

and a document identification and storage module 13, configured to sequentially identify the type of the document stored in each folder, and store the folder in an archive directory folder corresponding to the type of the document stored in the folder.

Optionally, the N documents are arranged in sequence, and the first data page of each document is provided with a first mark.

Optionally, the document separation module 12 is configured to perform document separation on the mixed document to obtain N folders, and includes:

calling an image recognition interface API through a Python script to recognize the mixed document and obtain all data pages provided with first marks;

and putting data pages from the first data page to a previous page of the second data page into a folder, wherein the first data page and the second data page are two adjacent data pages provided with first marks, and the first data page is positioned before the second data page.

Optionally, N documents are arranged in a disordered manner or in a sequential manner, and all data pages of each document are provided with a second mark and a third mark; the second mark is used to indicate to which document the data page belongs and the third mark is used to indicate the position of the data page in the document.

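Optionally, the document separation module 12 is configured to perform document separation on the mixed document to obtain N folders, and includes:

calling an image recognition interface API through a Python script to recognize the mixed document, putting the data pages with the same second mark into a folder, and sorting the data pages in the folder according to the third mark.

Optionally, the document separation module 12 is configured to perform document separation on the mixed document to obtain N folders, and includes:

learning the historical mixed document to obtain a training model;

and carrying out document separation on the mixed document according to the training model to obtain N folders.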

Optionally, the document identification and storage module 13 is configured to identify the type of the document stored in each folder in turn, and includes:

The technical solution of the present embodiment provides a hybrid document filing apparatus, which includes: a mixed document acquisition module, configured to acquire a mixed document scanned by a printer, wherein the mixed document comprises N documents, each document comprises at least one data page, and N is greater than or equal to 2; a document separation module, configured to perform document separation on the mixed document to obtain N folders, each folder storing all data pages of one document; and a document identification and storage module, configured to sequentially identify the type of the document stored in each folder and store the folder into the archive directory folder corresponding to the type of the document stored in the folder. In this way, each document in the mixed document can be accurately separated, its type can be identified, and the documents can be stored and archived by type. The apparatus addresses the heavy manual involvement, complex operation, low working efficiency and high labor cost of the prior art, and achieves automatic segmentation, type identification and archiving of the mixed document.

Example seven

An embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a hybrid document archiving method, the method including:

Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the hybrid document archiving method provided by any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the hybrid document filing apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A hybrid document archiving method, comprising:

2. The hybrid document filing method of claim 1, wherein the N documents are arranged in sequence, and a first mark is provided to a first data page of each document.

3. The method of claim 2, wherein the separating the documents into N folders comprises:

4. The hybrid document filing method according to claim 1, wherein the N documents are arranged in a chaotic or sequential arrangement, and all data pages of each document are provided with the second mark and the third mark; the second mark is used to indicate to which document the data page belongs, and the third mark is used to indicate the position of the data page in the document.

5. The hybrid document archiving method according to claim 4, wherein the document separating the hybrid document into N folders comprises:

6. The method of claim 1, wherein the separating the documents into N folders comprises:

learning the historical mixed document to obtain a training model;

7. The hybrid document archiving method according to claim 1, wherein the sequentially identifying the type of document stored in each folder comprises:

8. The hybrid document archiving method according to claim 7, wherein the element includes: at least one of a title element, a contract name element, and a settlement month element.

9. A hybrid document archive device, comprising:

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the hybrid document archiving method according to any one of claims 1 to 8.