CN117612182A - Document classification method, device, electronic equipment and medium


Info

Publication number
CN117612182A
Authority
CN
China
Prior art keywords
classified
document
classification
image
images
Prior art date
Legal status
Pending
Application number
CN202311604890.2A
Other languages
Chinese (zh)
Inventor
张舟
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202311604890.2A
Publication of CN117612182A


Abstract

A document classification method, a device, electronic equipment, and a medium are provided, which can be applied to the technical field of big data. The method comprises the following steps: acquiring a paper document to be classified and digitally scanning it to obtain images to be classified; performing text recognition and image analysis on the images to be classified; selecting images to be classified that have a unified ordering format and performing a first classification and ordering to form a first classified document; selecting images to be classified that contain chapter typesetting information and performing a second classification and ordering to form a second classified document; selecting images to be classified that have a contextual semantic order relationship and performing a third classification and ordering to form a third classified document; classifying the images to be classified outside the first, second, and third classified documents into a fourth classified document; and identifying and sorting the paper documents corresponding to the classified documents, and outputting the classified paper documents.

Description

Document classification method, device, electronic equipment and medium
Technical Field
The present invention relates to the field of big data technologies, and in particular, to a method, an apparatus, an electronic device, and a medium for classifying documents.
Background
Despite the continuous development of the financial industry, physical notes and paper documents still play an important role in the daily business processes of banks and financial institutions. These documents come in many forms, including contracts, application forms, and the like, and record important details of financial transactions and customer information. However, paper documents often face a series of challenges in handling and management, including confusion, loss, duplication, and misclassification, which can lead to business delays, inefficiency, and customer-service problems.
At present, under existing techniques and processes, the handling of physical bills and paper documents still depends on manual sorting and arrangement. Because this work is time-consuming, tedious, and error-prone, manual handling easily leads to documents being misclassified or mixed together, which causes trouble for subsequent business processes; at the same time, manual handling requires a great deal of time and labor, is inefficient, and easily causes business delays.
Disclosure of Invention
In view of the above-mentioned problems, according to a first aspect of the present invention, there is provided a document classification method, characterized by comprising: acquiring paper documents to be classified and digitally scanning them to obtain images to be classified; performing text recognition and image analysis on the images to be classified to obtain text content information and text position information of the images to be classified; based on the text position information, selecting the images to be classified that have a unified ordering format and performing a first classification and ordering to form a first classified document; based on the text content information, performing a first natural language analysis on the images to be classified that do not have the unified ordering format, selecting those that contain chapter typesetting information, and performing a second classification and ordering to form a second classified document; performing a second natural language analysis on the images to be classified that neither have the unified ordering format nor contain chapter typesetting information, selecting those that have a contextual semantic order relationship, and performing a third classification and ordering to form a third classified document; classifying the remaining images to be classified, outside the first, second, and third classified documents, into a fourth classified document; and identifying and sorting the paper documents corresponding to the first, second, third, and fourth classified documents, and outputting the classified paper documents.
According to some exemplary embodiments, the image analysis is performed on the image to be classified to obtain text position information, which specifically includes: performing image segmentation and region detection on the image to be classified to obtain target regions at the top, bottom and/or page edges of the image to be classified; extracting features of the target regions, wherein the features of a target region include a region height feature, a boundary line feature, a region text style feature, and a region relative position feature; and acquiring text position information based on the features of the target regions, wherein the text position information is used to determine whether header- and/or footer-like text marks and page number information are present.
According to some exemplary embodiments, the performing a first natural language analysis on the images to be classified that do not have the unified ordering format based on the text content information, selecting the images to be classified that contain chapter typesetting information, and performing a second classification and ordering to form a second classified document specifically includes: performing text style analysis, identifier detection, and keyword detection on the images to be classified that do not have the unified ordering format to obtain chapter typesetting information, wherein the chapter typesetting information includes the text styles, identifiers, and keywords of the chapter typesetting; screening the images to be classified that do not have the unified ordering format based on the chapter typesetting information to obtain documents with a standard chapter typesetting format; performing chapter division based on the chapter typesetting information to obtain a chapter division result; and performing a second classification and ordering on the documents with the standard chapter typesetting format according to the chapter division result to form a second classified document.
According to some exemplary embodiments, the performing a second natural language analysis on the images to be classified that do not have a unified ordering format and do not include chapter typesetting, selecting the images to be classified that have a context semantic order relationship, and performing a third classification ordering to form a third classification document, which specifically includes: carrying out context semantic engagement analysis on the images to be classified which do not have a uniform ordering format and do not contain chapter typesetting, and obtaining semantic links and semantic ordering relations of different images to be classified; screening the images to be classified which do not have a uniform ordering format and do not contain chapter typesetting based on the semantic relation to obtain documents with semantic relation; and carrying out third classification ordering on the documents with the semantic relation according to the semantic ordering relation to form a third classification document.
According to some exemplary embodiments, the image to be classified includes image information and an image feature value, the acquiring a paper document to be classified, and digitally scanning the paper document to acquire the image to be classified specifically includes: obtaining the image information based on the digital scanning, wherein the image information comprises outline dimension and color information; and extracting the characteristic value of the image information to obtain an image characteristic value.
According to some exemplary embodiments, the method further comprises: based on the image characteristic values, comparing the image characteristic values of the similar documents in the first classified document, the second classified document and the third classified document, identifying the manually written signature handwriting and seal stamping information, and obtaining a comparison result; and marking the suspected counterfeit file based on the comparison result.
According to some exemplary embodiments, before the identifying and sorting of the paper documents corresponding to the first, second, third and fourth categorized documents, the method further comprises: and carrying out second implicit marking on each image of each of the first classified document, the second classified document, the third classified document and the fourth classified document, wherein the content of the second implicit marking comprises the classification corresponding to the image and the sorting in the classification.
According to some exemplary embodiments, the paper documents include bills and paper files, and before text recognition is performed on the images to be classified, the method further includes: performing preliminary classification based on the outline size and color information of each image to be classified; and performing a first implicit marking on each image to be classified.
According to some exemplary embodiments, the identifying and sorting the paper documents corresponding to the first categorizing document, the second categorizing document, the third categorizing document and the fourth categorizing document specifically includes: identifying a corresponding categorization and ranking of the paper documents according to the second implicit indicia; and binding according to preset binding requirements.
According to a second aspect of the present invention, there is provided a document classification apparatus, the apparatus comprising: the image acquisition module to be classified is used for: acquiring a paper document to be classified, digitally scanning the paper document, and acquiring an image to be classified; the text content information and text position information acquisition module is used for: performing character recognition and image analysis on the image to be classified to acquire character content information and character position information of the image to be classified; a first categorized document formation module for: selecting images to be classified with a unified ordering format based on the text position information to perform first classification ordering to form a first classification document; a second categorization document forming module for: based on the text content information, carrying out first natural language analysis on the images to be classified which do not have the unified ordering format, selecting the images to be classified which contain chapter typesetting, and carrying out second classification ordering to form second classification documents; a third classification document formation module for: carrying out second natural language analysis on the images to be classified which do not have a uniform ordering format and do not contain chapter typesetting, selecting the images to be classified which have a context semantic order relation, and carrying out third classification ordering to form a third classification document; a fourth categorization document forming module for: classifying images to be classified except the first classified document, the second classified document and the third classified document into a fourth classified document; and a classification module for: and identifying and sorting the paper documents corresponding to the first classified documents, the second classified documents, the third classified documents and the fourth classified documents, and outputting the classified paper documents.
According to some exemplary embodiments, the image to be classified acquisition module may include an image information acquisition unit and a feature value extraction unit.
According to some exemplary embodiments, the image information obtaining unit may be configured to obtain the image information based on the digitized scan, wherein the image information includes an external dimension and color information.
According to some exemplary embodiments, the feature value extracting unit may be configured to perform feature value extraction on the image information to obtain an image feature value.
According to some exemplary embodiments, the text content information and text position information acquisition module may include an image segmentation and region detection unit, a target region feature extraction unit, and a text position information extraction unit.
According to some exemplary embodiments, the image segmentation and region detection unit may be configured to perform image segmentation and region detection on the image to be classified, and obtain a target region of a top, a bottom and/or a page edge of the image to be classified.
According to some exemplary embodiments, the target region feature extraction unit may be configured to extract features of a target region, wherein the features of the target region include a region height feature, a boundary feature, a region text style feature, and a region relative position feature.
According to some exemplary embodiments, the text position information extracting unit may be configured to obtain text position information based on the features of the target area, where the text position information is used to determine whether there is a similar text identifier of a header and/or a footer and page number information.
According to some exemplary embodiments, the second categorized document formation module may include a chapter typesetting information acquisition unit, a standard chapter typesetting format document acquisition unit, a chapter division result acquisition unit, and a second categorized document formation unit.
According to some exemplary embodiments, the chapter typesetting information acquisition unit may be configured to perform text style analysis, identifier detection, and keyword detection on the images to be classified that do not have a unified ordering format, to obtain chapter typesetting information, where the chapter typesetting information includes the text styles, identifiers, and keywords of the chapter typesetting.
According to some exemplary embodiments, the standard chapter typesetting format document acquisition unit may be configured to screen the images to be classified that do not have a unified ordering format based on the chapter typesetting information, to obtain documents having a standard chapter typesetting format.
According to some exemplary embodiments, the chapter division result acquisition unit may be configured to perform chapter division based on the chapter typesetting information to obtain a chapter division result.
According to some exemplary embodiments, the second categorized document formation unit may be configured to perform a second classification and ordering on the documents with the standard chapter typesetting format according to the chapter division result, to form a second categorized document.
According to some example embodiments, the third classification document formation module may include a context semantic analysis unit, a semantic relationship document acquisition unit, and a third classification document formation unit.
According to some exemplary embodiments, the contextual semantic analysis unit may be configured to perform contextual semantic link analysis on the images to be classified that do not have a unified ordering format and do not include chapter typesetting, to obtain semantic links and semantic ordering relationships of different images to be classified.
According to some exemplary embodiments, the semantic relation document obtaining unit may be configured to filter the images to be classified that do not have a unified ordering format and do not include chapter typesetting based on the semantic relation, to obtain a document having a semantic relation.
According to some exemplary embodiments, the third classification document forming unit may be configured to perform a third classification ranking on the documents with the semantic relationships according to the semantic ranking relationships, to form a third classification document.
According to some example embodiments, the classification module may include an identification unit and a binding unit.
According to some example embodiments, the identifying unit may be configured to identify a corresponding categorization and ranking of the paper documents according to the second implicit mark.
According to some exemplary embodiments, the binding unit may be configured to bind according to a preset binding requirement.
According to some example embodiments, the document classification apparatus may further include a second implicit marking module, which may be configured to perform a second implicit marking on each image of each of the first classified document, the second classified document, the third classified document, and the fourth classified document, the content of the second implicit marking including a classification corresponding to the image and a ranking in the classification.
According to some example embodiments, the document classification apparatus may further include a suspected counterfeit file marking module.
According to some exemplary embodiments, the suspected counterfeit document marking module may include an alignment unit and a marking unit.
According to some exemplary embodiments, the comparing unit may be configured to compare image feature values of similar documents in the first classified document, the second classified document, and the third classified document based on the image feature values, identify handwriting and seal stamping information of the manually written signature, and obtain a comparison result.
According to some exemplary embodiments, the marking unit may be configured to mark a suspected counterfeit document based on the comparison result.
According to some example embodiments, the document classification apparatus may further include a preliminary processing module, which may include a preliminary classification unit and a first implicit marking unit.
According to some exemplary embodiments, the preliminary classification unit may be configured to perform preliminary classification based on the outline size and the color information of each of the images to be classified.
According to some exemplary embodiments, the first implicit marking unit may be configured to perform a first implicit marking on each of the images to be classified.
According to a third aspect of the present invention, there is provided an electronic device comprising: one or more processors; and a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method as described above.
According to a fourth aspect of the present invention there is provided a computer readable storage medium having stored thereon executable instructions which when executed by a processor cause the processor to perform a method as described above.
According to a fifth aspect of the present invention there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
One or more of the above embodiments have the following advantages or benefits: with the document classification method provided by the invention, document classification can be processed automatically through digital scanning, natural language analysis, and image feature analysis; at the same time, the identification and classification of document content become more reliable and the risk of misclassification is reduced, so the burden of manual processing is eased and the user experience is improved.
Drawings
The foregoing and other objects, features and advantages of the invention will be apparent from the following description of embodiments of the invention with reference to the accompanying drawings, in which:
fig. 1 schematically illustrates an application scenario diagram of a document classification method, device, equipment and medium according to an embodiment of the invention.
Fig. 2 schematically shows a flow chart of a document classification method according to an embodiment of the invention.
Fig. 3 schematically shows a flow chart of a method of acquiring an image to be classified according to an embodiment of the invention.
Fig. 4 schematically shows a flow chart of a method of obtaining text position information according to an embodiment of the invention.
FIG. 5 schematically illustrates a flowchart of a method of forming a second categorized document according to an embodiment of the invention.
FIG. 6 schematically illustrates a flowchart of a method of forming a third classified document according to an embodiment of the invention.
FIG. 7 schematically illustrates a flow chart of a method of marking a categorized document according to an embodiment of the invention.
FIG. 8 schematically illustrates a flow chart of a method of identifying and collating corresponding paper documents in accordance with an embodiment of the present invention.
Fig. 9 schematically shows a flowchart of a method of marking suspected counterfeit documents according to an embodiment of the invention.
FIG. 10 schematically shows a flow chart of a method of preliminary categorization according to an embodiment of the invention.
Fig. 11 schematically shows a block diagram of a document classification apparatus according to an embodiment of the present invention.
Fig. 12 schematically shows a block diagram of an electronic device adapted for a document classification method according to an embodiment of the invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C" are used, they should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the technical scheme of the invention, the acquisition, storage, and use of the user's personal information comply with the provisions of relevant laws and regulations, necessary security measures are taken, and public order and good morals are not violated.
First, technical terms described herein are explained and illustrated as follows.
Optical character recognition (OCR, Optical Character Recognition) is a technique for converting pictures, handwritten documents, or printed documents into machine-readable text. This makes it possible to digitize and further process a document (e.g., text searching, editing, or archiving). OCR technology has found wide application in a variety of fields including, but not limited to, scanning documents, processing invoices, recognizing license plates, and automated data entry. OCR first pre-processes the input image, for example by denoising, binarization, and tilt correction. It then segments text regions and lines, further splits the lines into individual characters, and finally recognizes each character by matching it against predefined character templates or by using a machine learning model.
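As a concrete illustration of this pipeline, the following is a minimal OCR sketch assuming OpenCV and the pytesseract wrapper around the Tesseract engine (with the chi_sim and eng language packs installed); the patent does not name a specific OCR engine, so the libraries, parameters, and file name below are illustrative assumptions only.

```python
import cv2
import pytesseract

def ocr_page(image_path: str) -> str:
    """Denoise, binarize, and recognize the text on one scanned page."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Denoising and Otsu binarization correspond to the pre-processing step described above.
    gray = cv2.fastNlMeansDenoising(gray, h=10)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Tesseract internally segments regions, lines, and characters before recognition.
    return pytesseract.image_to_string(binary, lang="chi_sim+eng")

if __name__ == "__main__":
    print(ocr_page("scan_0001.png"))  # hypothetical scanned page
```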
Natural language analysis (Natural Language Processing, NLP) is a branch of artificial intelligence dedicated to enabling computing devices to understand, interpret, and generate natural-language text or speech data. Implementations of NLP techniques typically involve large text corpora, machine learning algorithms, and deep learning models.
Implicit markers refer to methods and markers for marking information or elements in text or data in a non-obvious manner, which markers are typically used for subsequent processing, sorting, ordering or other automated tasks without interfering with the readability or understandability of the text or information.
In the business processes of banks and financial institutions, physical notes and documents are ubiquitous, and notes and documents of countless standards serve a great variety of purposes. Even in today's highly digitized environment, physical notes and paper documents remain an important form of business transaction and archiving. These documents come in a wide variety, including, but not limited to, contracts, application forms, bills, checks, and drafts. At the same time, processing these physical notes and documents faces a number of challenges.
The handling of physical notes and documents can be affected by various incidents and irregularities, resulting in disorder and inconsistency in the documents. For example, documents may be damaged during transmission or misclassified during archiving. These problems not only increase the complexity of business processes but may also cause audit and even legal problems.
At present, the sorting of messy notes and various paper files is almost entirely done by manual sorting and arrangement, and combing through these disordered notes and files by hand has the following defects:
1. Low efficiency: sorting and ordering messy files sheet by sheet manually requires a great amount of time and labor, so efficiency is low;
2. High error rate: bills and files are rarely of a single simple type, and even within one type an expected order exists; for a set of files with certain business attributes, bill files belonging to different customers or recording different information may be interleaved, and when such scattering occurs, operations such as font and stamp comparison are needed, so manual sorting is highly error-prone;
3. No authentication capability: if a malicious party deliberately forges a file and mixes the counterfeit into the pile, manual sorting cannot detect it.
Based on this, an embodiment of the present invention provides a document classification method, the method comprising: acquiring paper documents to be classified and digitally scanning them to obtain images to be classified; performing text recognition and image analysis on the images to be classified to obtain text content information and text position information of the images to be classified; based on the text position information, selecting the images to be classified that have a unified ordering format and performing a first classification and ordering to form a first classified document; based on the text content information, performing a first natural language analysis on the images to be classified that do not have the unified ordering format, selecting those that contain chapter typesetting information, and performing a second classification and ordering to form a second classified document; performing a second natural language analysis on the images to be classified that neither have the unified ordering format nor contain chapter typesetting information, selecting those that have a contextual semantic order relationship, and performing a third classification and ordering to form a third classified document; classifying the remaining images to be classified, outside the first, second, and third classified documents, into a fourth classified document; and identifying and sorting the paper documents corresponding to the first, second, third, and fourth classified documents, and outputting the classified paper documents. With the document classification method provided by the invention, document classification can be processed automatically through digital scanning, natural language analysis, and image feature analysis; at the same time, the identification and classification of document content become more reliable and the risk of misclassification is reduced, so the burden of manual processing is eased and the user experience is improved.
It should be noted that the document classification method, device, equipment and medium provided by the present invention can be used in the big data technical field and the financial field, as well as in various fields other than these. The application fields of the document classification method, device, equipment and medium provided by the embodiments of the invention are not limited.
In the technical scheme of the invention, the user information involved (including but not limited to personal information, image information, and device information such as location information) and the data involved (including but not limited to data for analysis, stored data, and displayed data) are information and data authorized by the user or fully authorized by all parties; the collection, storage, use, processing, transmission, provision, disclosure, and application of such data are conducted in accordance with the relevant laws, regulations, and standards of the relevant countries and regions, necessary security measures are taken, the public interest is not prejudiced, and corresponding operation entries are provided for the user to grant or refuse authorization.
Fig. 1 schematically illustrates an application scenario diagram of a document classification method, device, equipment and medium according to an embodiment of the invention.
As shown in fig. 1, an application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (by way of example only).
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the document classification method provided by the embodiment of the present invention may be generally performed by the server 105. Accordingly, the document classification apparatus provided in the embodiment of the present invention may be generally disposed in the server 105. The document classification method provided by the embodiment of the present invention may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the document classification apparatus provided by the embodiment of the present invention may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically shows a flow chart of a document classification method according to an embodiment of the invention.
As shown in fig. 2, the document classification method 200 of this embodiment may include operations S210 to S270.
In operation S210, a paper document to be classified is acquired, and the paper document is digitally scanned to acquire an image to be classified.
In embodiments of the present invention, a high definition image scanning device may be employed to capture high resolution images of documents to ensure clarity of the text and images, facilitating subsequent OCR and image analysis.
According to embodiments of the present invention, the paper document may have different colors, and thus, a color scanning function may be employed to identify an appearance feature of the document, such as a color mark or highlight.
Fig. 3 schematically shows a flow chart of a method of acquiring an image to be classified according to an embodiment of the invention.
As shown in fig. 3, the method for acquiring an image to be classified according to this embodiment may include operations S310 to S320, where the operations S310 to S320 can at least partially perform the above-described operation S210.
In operation S310, the image information is obtained based on the digitized scan, wherein the image information includes outline size and color information.
In embodiments of the invention, the color information may include a color pattern, a dominant color, and a color distribution of the image to distinguish different portions of the document or identify a particular type of document.
In operation S320, feature value extraction is performed on the image information, and an image feature value is obtained.
In embodiments of the present invention, the image feature values may include texture features, shape features, color features, and statistical features. In particular, texture features include identifying texture patterns in an image, such as dots, lines, blocks, etc., that help identify a particular type of document; the shape features are the shape, boundary and outline in the document, and feature information about the appearance of the document can be extracted; the color features may be used to extract color feature values, such as color histograms or color averages, for individual regions or elements; the statistical features may include statistical information of average brightness, contrast, standard deviation, etc. of the image.
In an embodiment of the present invention, the extraction of the image feature values may use image processing and computer vision techniques, such as feature detection algorithms, filters, color segmentation, and the like. The extracted feature values may be used for identification, classification, and ranking of documents. For example, if the system identifies a group of documents with similar color and shape characteristics, it may categorize them into the same category, thereby enabling automated classification and sorting. This helps to improve the efficiency of document management and reduces the need for human intervention.
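As a hedged illustration of this feature-value extraction, the sketch below computes size, color, shape, and statistical features with OpenCV and NumPy; the concrete feature set, histogram bin counts, and thresholding choices are assumptions, since the patent does not specify them.

```python
import cv2
import numpy as np

def extract_feature_values(image_path: str) -> dict:
    """Collect simple color, shape, and statistical feature values for one scanned image."""
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Color features: a coarse 16-bin histogram per BGR channel.
    hist = [cv2.calcHist([img], [c], None, [16], [0, 256]).flatten() for c in range(3)]
    # Statistical features: average brightness and contrast (standard deviation).
    mean, std = cv2.meanStdDev(gray)
    # Shape features: outline size plus the number of external contours after binarization.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return {
        "size": (w, h),
        "aspect_ratio": w / h,
        "color_hist": np.concatenate(hist),
        "mean_brightness": float(mean[0][0]),
        "contrast": float(std[0][0]),
        "contour_count": len(contours),
    }
```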
Referring back to fig. 2, in operation S220, text recognition and image analysis are performed on the image to be classified, and text content information and text position information of the image to be classified are obtained.
In embodiments of the present invention, text in an image may be recognized and converted to text format by OCR technology. Before character recognition, the image to be classified can be preprocessed, including noise removal and brightness and contrast adjustment of the image, so that characters can be clearly recognized.
In the embodiment of the invention, the text position information can be obtained by combining an image analysis method with an OCR technology.
Fig. 4 schematically shows a flow chart of a method of obtaining text position information according to an embodiment of the invention.
As shown in fig. 4, the method for acquiring text position information according to this embodiment may include operations S410 to S430, where operations S410 to S430 can at least partially perform operation S220 described above.
In operation S410, image segmentation and region detection are performed on the image to be classified, and a target region of the top, bottom and/or page edge of the image to be classified is obtained.
In embodiments of the present invention, image processing and computer vision algorithms may be used to segment the image, dividing it into different regions, including top, bottom and page edges; and identifying the target region in the image by a boundary detection method, a color or texture analysis method and the like of the region.
In operation S420, features of a target region are extracted, wherein the features of the target region include a region height feature, a boundary feature, a region text style feature, and a region relative position feature.
In embodiments of the present invention, the region height feature is represented as the height of each target region, which may provide information about the document layout, such as headers and footers, which are typically within a particular height range of the document; the boundary features represent horizontal and vertical boundaries within the document that may separate different regions such as header, body, and footer; the region text style feature represents text styles within a region, such as font, font size, thickness, color, etc., to identify text in headers, footers, and body; the region relative position feature represents position information of the target region with respect to the entire document image, such as the up-down position, the left-right position, and the like of the region in the image.
In operation S430, text position information is acquired based on the features of the target area, where the text position information is used to determine whether header- and/or footer-like text marks and page number information are present.
In an embodiment of the invention, the system may compute the location information of the target area based on the extracted features. For example, by analyzing the region height feature, it may be determined whether the area belongs to the header or footer portion. In addition, the text content extracted from the target area by OCR can be further used, so that text position information indicating whether header- and/or footer-like text marks and page number information exist can be obtained.
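A minimal sketch of this header/footer detection follows, assuming OCR via pytesseract; the band heights and page-number patterns below are illustrative assumptions rather than values taken from the patent.

```python
import re

import cv2
import pytesseract

# Matches page numbers such as "第 3 页", "- 3 -", or a short trailing number.
PAGE_NO = re.compile(r"第\s*(\d+)\s*页|-\s*(\d+)\s*-|\b(\d{1,3})\s*$")

def header_footer_info(image_path: str, band_ratio: float = 0.08) -> dict:
    """OCR the top and bottom bands of a page and look for a page number."""
    img = cv2.imread(image_path)
    h = img.shape[0]
    header = img[: int(h * band_ratio)]           # top target region
    footer = img[int(h * (1 - band_ratio)):]      # bottom target region
    header_text = pytesseract.image_to_string(header, lang="chi_sim+eng").strip()
    footer_text = pytesseract.image_to_string(footer, lang="chi_sim+eng").strip()
    match = PAGE_NO.search(footer_text) or PAGE_NO.search(header_text)
    page_no = next((int(g) for g in (match.groups() if match else ()) if g), None)
    return {"header": header_text, "footer": footer_text, "page_no": page_no}
```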
Referring back to fig. 2, in operation S230, images to be classified having a uniform ranking format are selected for the first classification ranking based on the text position information, and a first classification document is formed.
In an embodiment of the present invention, based on the text position information, the system may analyze the text layout and typesetting in the document to detect whether there is a uniform ranking format, which may include detecting whether there is a uniform header and footer, the same font style, a specific text identifier, etc. For documents having a uniform ranking format, the system may rank according to a specified ranking rule (e.g., in the order of page numbers of the same header document) to form a first categorized document.
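The first classification and ordering can be pictured as grouping pages that share a header and sorting them by page number; the sketch below assumes each page has already been reduced to a small dictionary (a hypothetical structure, not one defined in the patent) holding the header text and the detected page number.

```python
from collections import defaultdict

def first_classification(pages: list[dict]) -> dict[str, list[dict]]:
    """Group pages by header text and order each group by detected page number."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for page in pages:
        # Only pages with both a header and a page number exhibit the unified ordering format.
        if page.get("header") and page.get("page_no") is not None:
            groups[page["header"]].append(page)
    return {header: sorted(members, key=lambda p: p["page_no"])
            for header, members in groups.items()}
```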
In operation S240, based on the text content information, the images to be classified that do not have the unified ordering format are subjected to a first natural language analysis, and the images to be classified that contain chapter layout information are selected to be subjected to a second classification ordering, so as to form a second classification document.
FIG. 5 schematically illustrates a flowchart of a method of forming a second categorized document according to an embodiment of the invention.
As shown in fig. 5, the method of forming the second categorized document of this embodiment may include operations S510 to S540, and the operations S510 to S540 may at least partially perform the above-described operation S240.
In operation S510, text style analysis, identifier detection and keyword detection are performed on the images to be classified, which do not have the unified ordering format, to obtain chapter typesetting information, where the chapter typesetting information includes text styles, identifiers and keywords of chapter typesetting.
In an embodiment of the present invention, the first natural language analysis may include text style analysis, identifier detection, and keyword detection. Text style analysis examines the text in an image to be classified for style characteristics such as font, font size, weight, and color, which helps identify the style differences between chapter titles, subtitles, and body text in a document; at the same time, identifiers in the document, such as chapter numbers and title symbols, can be detected to identify the chapter structure; in addition, keywords or terms in the document can be identified to determine the topic or content of a chapter.
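For illustration, the identifier and keyword detection of operation S510 could be approximated with a few heading patterns, as in the sketch below; the patterns are assumptions and would need to be adapted to the actual document conventions.

```python
import re

CHAPTER_PATTERNS = [
    re.compile(r"^第[一二三四五六七八九十百\d]+[章节条]"),   # e.g. 第三章 / 第5条
    re.compile(r"^\d+(\.\d+)*\s+\S+"),                        # e.g. "2.1 Definitions"
    re.compile(r"^(Chapter|Section|Article)\s+\d+", re.IGNORECASE),
]

def detect_chapter_headings(lines: list[str]) -> list[tuple[int, str]]:
    """Return (line index, heading text) for lines that look like chapter headings."""
    headings = []
    for i, line in enumerate(lines):
        text = line.strip()
        if any(pattern.match(text) for pattern in CHAPTER_PATTERNS):
            headings.append((i, text))
    return headings
```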
In operation S520, the images to be classified that do not have the unified ordering format are filtered based on the chapter layout information, and a document having a canonical chapter layout format is obtained.
In the embodiment of the invention, based on the extracted chapter typesetting information, the system can detect whether a document has a standard chapter typesetting format, which may be implemented by extracting the typesetting characteristics of the document from the chapter typesetting information and matching them against a canonical format.
In operation S530, chapter division is performed based on the chapter layout information, and chapter division results are obtained.
In an embodiment of the present invention, based on the extracted chapter layout information, the system may determine the scope and boundaries of each chapter in the document, thereby performing chapter division. For example, the titles of the individual chapters may be identified for subsequent chapter identification and ordering.
In operation S540, the documents with the standard chapter type setting format are subjected to a second classification sorting according to the chapter division result to form a second classification document.
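A minimal sketch of the chapter division and second classification ordering (operations S530-S540) is given below; the chapter_no and page_no fields are hypothetical bookkeeping produced by the heading detection above, not structures defined in the patent.

```python
def second_classification(pages: list[dict]) -> list[dict]:
    """Order pages by the (hypothetical) chapter number assigned during chapter division."""
    # Pages without a detected chapter fall outside this second classified document.
    with_chapter = [p for p in pages if p.get("chapter_no") is not None]
    return sorted(with_chapter, key=lambda p: (p["chapter_no"], p.get("page_no", 0)))
```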
Referring back to fig. 2, in operation S250, the images to be classified that do not have the unified ranking format and do not include chapter layout information are subjected to a second natural language analysis, and the images to be classified that have the context semantic order relationship are selected to be subjected to a third ranking, forming a third ranking document.
FIG. 6 schematically illustrates a flowchart of a method of forming a third classified document according to an embodiment of the invention.
As shown in fig. 6, the method of forming the third classified document of the embodiment may include operations S610 to S630, and the operations S610 to S630 may at least partially perform the above-described operation S250.
In operation S610, performing context semantic link analysis on the images to be classified, which do not have a unified ordering format and do not include chapter typesetting, to obtain semantic links and semantic ordering relationships of different images to be classified.
In an embodiment of the invention, the second natural language analysis may include contextual semantic engagement analysis. Specifically, the text content extracted from the image to be classified can be subjected to natural language processing, including word segmentation, part-of-speech tagging, syntactic analysis and the like, so as to understand the basic grammar and structure of the text; and analyzing context semantic relationships in the text, including logical relationships between paragraphs, correlations between keywords, and the like, so as to obtain semantic relationships between text contents, including common topics, keyword repetition, context logical relationships, and the like, and semantic ordering relationships, including logical order, paragraph ordering, sentence ordering, and the like.
In operation S620, based on the semantic relation, the images to be classified that do not have a unified ordering format and do not include chapter typesetting are filtered, and documents with semantic relation are obtained.
In operation S630, the documents with the semantic relationships are subjected to a third classification ranking according to the semantic ranking relationships, so as to form a third classification document.
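As one hedged illustration of this contextual semantic linkage, the sketch below chains pages greedily by TF-IDF cosine similarity; TF-IDF is only a stand-in for the unspecified natural language analysis model, and the starting page and similarity threshold are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def third_classification(page_texts: list[str], threshold: float = 0.3) -> list[int]:
    """Return page indices chained greedily by semantic similarity to their predecessor."""
    if not page_texts:
        return []
    matrix = TfidfVectorizer().fit_transform(page_texts)
    sim = cosine_similarity(matrix)
    order, used, current = [0], {0}, 0
    while len(used) < len(page_texts):
        # Pick the most similar unused page as the likely next page in the context chain.
        score, nxt = max((sim[current][j], j)
                         for j in range(len(page_texts)) if j not in used)
        if score < threshold:
            break  # the remaining pages show no semantic link to this chain
        order.append(nxt)
        used.add(nxt)
        current = nxt
    return order
```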
Referring back to fig. 2, the images to be classified other than the first, second, and third classified documents are classified as fourth classified documents in operation S260.
In addition, in order to further improve classification efficiency, so that documents can be directly identified and classified by their implicit marks the next time the system processes them, an embodiment of the invention also provides a method for marking classified documents.
FIG. 7 schematically illustrates a flow chart of a method of marking a categorized document according to an embodiment of the invention.
As shown in fig. 7, the method of marking a categorized document of this embodiment may include operation S710.
In operation S710, a second implicit mark is performed on each image of each of the first, second, third, and fourth classified documents, the content of the second implicit mark including the classification corresponding to the image and the ranking in the classification.
In embodiments of the present invention, for each document image, the system may apply a second implicit mark indicating the specific categorization to which the image belongs, i.e., the first, second, third, or fourth classified document; this mark helps the system place each image in its corresponding document category and ensures that documents are organized and managed by categorization. At the same time, the second implicit mark can also record the sorting position of the image within its category, which can be a page number, a sequence number, or another sorting identifier, so as to help ensure that the images in a document are arranged in the correct order and to allow the system to manage the position and ordering of images more easily when a document needs to be rearranged, updated, or edited.
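The patent leaves the carrier of the implicit mark unspecified; one possible, purely illustrative realization is a small record stored as a JSON sidecar next to each image, as sketched below (the field names are assumptions).

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass
class ImplicitMark:
    category: str   # "first" | "second" | "third" | "fourth" classified document
    rank: int       # ordering of the image within its category

def write_mark(image_path: str, mark: ImplicitMark) -> None:
    """Store the second implicit mark next to the image as a sidecar file."""
    Path(image_path).with_suffix(".mark.json").write_text(json.dumps(asdict(mark)))

def read_mark(image_path: str) -> ImplicitMark:
    """Read the mark back so the next run can classify the image directly."""
    return ImplicitMark(**json.loads(Path(image_path).with_suffix(".mark.json").read_text()))
```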
Referring back to fig. 2, in operation S270, the paper documents corresponding to the first, second, third, and fourth classified documents are identified and sorted, and the sorted paper documents are output.
According to the embodiment of the present invention, the output of the final paper documents can be performed by a classification apparatus capable of applying the document classification method provided by the embodiment of the present invention. Specifically, the apparatus may provide a single output port component that outputs the first, second, third, and fourth classified documents in turn, or may provide multiple output ports that output them simultaneously.
It should be noted that the classification device listed here is only exemplary, and is not intended to limit the device structure capable of implementing the document classification method of the present invention, i.e., the device implementing the document classification method of the present invention may also include other output structures.
FIG. 8 schematically illustrates a flow chart of a method of identifying and collating corresponding paper documents in accordance with an embodiment of the present invention.
As shown in fig. 8, the method of recognizing and sorting corresponding paper documents of this embodiment may include operations S810 to S820, and the operations S810 to S820 may perform at least partially the operation S270.
In operation S810, a corresponding categorization and ranking of the paper documents is identified according to the second implicit mark.
In embodiments of the present invention, by reading the tag information, the system can accurately place each document image in its corresponding category and arrange it in the correct order of ordering.
In operation S820, stapling is performed according to a preset stapling requirement.
In the embodiment of the invention, the preset binding requirement can comprise integral binding, namely, related document images can be combined into an integral document, so that the storage, the transmission and the retrieval are convenient; it is also possible to bind separately, i.e. to bind only the documents that need to be bound.
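A hedged sketch of operations S810-S820 follows: the implicit marks written earlier are read back and the pages are grouped into the batches that the physical binding step would receive; the sidecar-file convention and the merge_all flag are illustrative assumptions.

```python
import json
from collections import defaultdict
from pathlib import Path

def collate_for_binding(image_paths: list[str], merge_all: bool = False) -> list[list[str]]:
    """Group images by their implicit category mark and order each group by rank."""
    batches: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for path in image_paths:
        mark = json.loads(Path(path).with_suffix(".mark.json").read_text())
        batches[mark["category"]].append((mark["rank"], path))
    ordered = [[p for _, p in sorted(group)] for group in batches.values()]
    # "Integral binding" merges everything into one volume; otherwise one batch per category.
    return [sum(ordered, [])] if merge_all else ordered
```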
In addition, the comparison and recognition of the similar documents and the identification of the manually written signature and the seal information can be performed based on the image characteristic values, so that the document safety is enhanced, and the credibility of the document management is improved.
Fig. 9 schematically shows a flowchart of a method of marking suspected counterfeit documents according to an embodiment of the invention.
As shown in fig. 9, the method of suspected counterfeit document marking of the embodiment may include operations S910 to S920.
In operation S910, based on the image feature values, image feature values of similar documents in the first classified document, the second classified document and the third classified document are compared, and the manually written signature handwriting and seal information are identified, so as to obtain a comparison result.
In embodiments of the invention, the system may use image feature values for comparison for documents having the same classification. These feature values may be visual features, color distribution, shape, etc. of the image, as well as OCR recognition results of the text image, with the aim of finding similarities to determine if the same manually written signature or stamp information is present, which helps to identify copies or versions of the same document.
In operation S920, a mark of the suspected counterfeit document is performed based on the comparison result.
In embodiments of the present invention, once similarities are found, the system further analyzes the images to identify the handwriting of manually written signatures and the stamp information, which may include recognizing signatures, handwritten text, stamp patterns, and the like; this information can be used to verify the authenticity and legitimacy of a document.
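As an illustrative sketch of this comparison, the snippet below compares color histograms of a fixed seal/signature region in two images of the same class; the region coordinates, histogram settings, and 0.85 threshold are assumptions, not values from the patent.

```python
import cv2

def compare_seal_regions(img_a_path: str, img_b_path: str,
                         region=(0.6, 0.75, 1.0, 1.0), threshold: float = 0.85) -> dict:
    """Compare the color histograms of a presumed seal/signature region in two images."""
    def crop_hist(path: str):
        img = cv2.imread(path)
        h, w = img.shape[:2]
        x0, y0, x1, y1 = region  # fractional bottom-right region where seals often sit
        crop = img[int(h * y0):int(h * y1), int(w * x0):int(w * x1)]
        hist = cv2.calcHist([crop], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        return cv2.normalize(hist, hist).flatten()
    score = cv2.compareHist(crop_hist(img_a_path), crop_hist(img_b_path), cv2.HISTCMP_CORREL)
    # Low similarity between documents that should carry the same seal is flagged for review.
    return {"similarity": float(score), "suspected_counterfeit": score < threshold}
```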
In addition, before character recognition is performed on the images to be classified, and character content information of the images to be classified is extracted, preliminary classification can be performed on the basis of the outline size and the color information of each image to be classified.
FIG. 10 schematically shows a flow chart of a method of preliminary categorization according to an embodiment of the invention.
As shown in fig. 10, the method of preliminary classification of this embodiment may include operation S1010 and operation S1020.
In operation S1010, preliminary classification is performed based on the outline size and color information of each of the images to be classified.
In the embodiment of the invention, the outline dimension and the color information of each image to be classified provide some basic characteristics, and particularly, the outline dimension and the color information of the entity bill and the paper document have large difference and can be used for preliminary classification. For example, a large image may belong to a paper document; the color information may also provide information related to the kind of document.
In operation S1020, a first implicit marking is performed on each of the images to be classified.
In embodiments of the present invention, the first implicit mark may be a preliminary categorization label or symbol. For example, if the system preliminarily determines that an image is a paper document, a label "paper document" may be added.
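A minimal sketch of this preliminary classification by outline size and color is given below; the pixel threshold and the "note" / "paper document" labels are illustrative assumptions.

```python
import cv2

def preliminary_classify(image_path: str) -> dict:
    """Assign a first implicit mark from outline size and mean color alone."""
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    mean_bgr = img.mean(axis=(0, 1))
    # Bills/notes are usually much smaller than A4 paper documents scanned at the same
    # resolution; the 2000-pixel cut-off below is purely illustrative.
    label = "paper document" if max(h, w) > 2000 else "note"
    return {"path": image_path, "first_implicit_mark": label,
            "size": (w, h), "mean_color_bgr": mean_bgr.round(1).tolist()}
```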
According to the document classification method provided by the invention, document classification can be processed automatically through digital scanning, natural language analysis, and image feature analysis; at the same time, the identification and classification of document content become more reliable and the risk of misclassification is reduced, so the burden of manual processing is eased and the user experience is improved. Specifically, the following beneficial effects are achieved:
1. the drawbacks of the traditional manual sorting of disordered paper tickets and documents are overcome, the manual sorting, recognition and binding of documents by bank tellers during business handling are greatly simplified, documents can be quickly reordered even when a batch of documents is suddenly scattered, manpower is saved, and working efficiency is improved;
2. in the process of arranging the document sequence, format-based classification and comparison are adopted, and contextual natural language analysis is applied to the character information obtained through OCR character recognition, so that sequence recognition is more accurate;
3. the method is beneficial to processing large-scale document data and can cope with situations in which a large number of documents need to be managed;
4. the adopted image feature collection and recognition technology can effectively extract image features from paper documents, and can extract and compare identity verification information such as manual signatures and seal stamps in the documents, so that counterfeit documents can be found by intelligent means and errors caused by manual processing are avoided;
5. all documents identified in the first pass are given implicit marks, so that the next time the system needs to arrange them, classification and sorting can be carried out directly and conveniently by recognizing these implicit marks.
Based on the document classification method, the invention also provides a document classification device. The device will be described in detail below with reference to fig. 11.
Fig. 11 schematically shows a block diagram of a document classification apparatus according to an embodiment of the present invention.
As shown in fig. 11, the document classification apparatus 1100 according to this embodiment includes an image to be classified acquisition module 1110, a text content information and text position information acquisition module 1120, a first classified document forming module 1130, a second classified document forming module 1140, a third classified document forming module 1150, a fourth classified document forming module 1160, and a classification module 1170.
The image to be classified acquiring module 1110 may be configured to acquire a paper document to be classified, and digitally scan the paper document to acquire an image to be classified. In an embodiment, the image obtaining module 1110 to be classified may be used to perform the operation S210 described above, which is not described herein.
The text content information and text position information obtaining module 1120 may be configured to perform text recognition and image analysis on the image to be classified, to obtain text content information and text position information of the image to be classified. In an embodiment, the text content information and text position information obtaining module 1120 may be configured to perform the operation S220 described above, which is not described herein.
The first classified document forming module 1130 may be configured to select images to be classified having a unified ranking format for performing a first classified ranking based on the text position information, to form a first classified document. In an embodiment, the first categorized document forming module 1130 may be configured to perform the operation S230 described above, which is not described herein.
The second categorizing document forming module 1140 may be configured to perform a first natural language analysis on the images to be categorized that do not have a unified ranking format based on the text content information, and select the images to be categorized that include chapter typesetting to perform a second categorizing ranking, so as to form a second categorizing document. In an embodiment, the second categorizing document forming module 1140 can be used to perform the operation S240 described above, which is not described herein.
The third classification document forming module 1150 may be configured to perform a second natural language analysis on the images to be classified that do not have a unified ordering format and do not include chapter layout, and select the images to be classified that have a context semantic order relationship to perform a third classification ordering, so as to form a third classification document. In an embodiment, the third classification document formation module 1150 may be configured to perform the operation S250 described above, which is not described herein.
The fourth categorized document formation module 1160 may be configured to categorize images to be categorized other than the first categorized document, the second categorized document, and the third categorized document into a fourth categorized document. In an embodiment, the fourth categorized document forming module 1160 may be configured to perform the operation S260 described above, which is not described herein.
The classification module 1170 may be configured to identify and sort paper documents corresponding to the first classified document, the second classified document, the third classified document, and the fourth classified document, and output the sorted paper documents. In an embodiment, the classification module 1170 may be configured to perform the operation S270 described above, which is not described herein.
According to an embodiment of the present invention, the image obtaining module 1110 to be classified may include an image information obtaining unit and a feature value extracting unit.
The image information acquisition unit may be configured to acquire the image information based on the digitized scan, wherein the image information includes external dimensions and color information. In an embodiment, the image information obtaining unit may be configured to perform the operation S310 described above, which is not described herein.
The feature value extraction unit may be configured to perform feature value extraction on the image information, to obtain an image feature value. In an embodiment, the feature value extracting unit may be configured to perform the operation S320 described above, which is not described herein.
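A minimal sketch of what the feature value extraction unit might compute from the image information is given below, assuming the scanned image is available as an H × W × 3 array of RGB values; the bin count and the concatenation of outline size with a colour histogram are choices made for illustration only.

```python
# Hypothetical feature value extraction (operation S320): a fixed-length vector built
# from the physical outline size and a coarse per-channel colour histogram.
import numpy as np

def extract_feature_value(pixels: np.ndarray, width_mm: float, height_mm: float) -> np.ndarray:
    # 8 bins per RGB channel, normalised so the combined histogram sums to 1
    hist = [np.histogram(pixels[..., c], bins=8, range=(0, 255))[0] for c in range(3)]
    hist = np.concatenate(hist).astype(np.float64)
    hist /= max(hist.sum(), 1.0)
    # prepend the physical outline so that size differences also separate documents
    return np.concatenate(([width_mm, height_mm], hist))
```

A vector of this shape is also what the comparison unit described later could consume when looking for suspected counterfeit documents.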
According to an embodiment of the present invention, the text content information and text position information obtaining module 1120 may include an image segmentation and region detection unit, a target region feature extraction unit, and a text position information extraction unit.
The image segmentation and region detection unit can be used for carrying out image segmentation and region detection on the image to be classified, and acquiring target regions of the top, the bottom and/or the page edge of the image to be classified. In an embodiment, the image segmentation and region detection unit may be configured to perform the operation S410 described above, which is not described herein.
The target region feature extraction unit may be configured to extract features of a target region, where the features of the target region include a region height feature, a boundary feature, a region text style feature, and a region relative position feature. In an embodiment, the target area feature extraction unit may be configured to perform the operation S420 described above, which is not described herein.
The text position information extraction unit may be configured to obtain text position information based on the characteristics of the target area, where the text position information is used to determine whether there is a similar text identifier of a header and/or a footer and page number information. In an embodiment, the text position information extracting unit may be configured to perform the operation S430 described above, which is not described herein.
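As an illustration of how text position information could be derived once OCR has produced line boxes, the sketch below inspects lines falling in the top and bottom strips of a page and looks for header, footer and page-number patterns; the 12 % strip height, the regular expression and the data shapes are assumptions of this example rather than part of the claimed method.

```python
# Hypothetical realisation of operation S430: decide whether the top/bottom strips of a
# page hold a header, a footer, or a page number, given OCR lines with their y positions.
import re

PAGE_NUMBER_RE = re.compile(r"^\s*(?:page\s*)?[-–]?\s*\d{1,4}\s*[-–]?\s*$", re.IGNORECASE)

def extract_text_position_info(ocr_lines: list[tuple[str, float]], page_height: float) -> dict:
    """ocr_lines: (text, y_center_of_line); returns header/footer/page-number findings."""
    top_band, bottom_band = 0.12 * page_height, 0.88 * page_height
    info = {"header": None, "footer": None, "page_number": None}
    for text, y in ocr_lines:
        if y <= top_band and info["header"] is None:
            info["header"] = text.strip()
        elif y >= bottom_band:
            if PAGE_NUMBER_RE.match(text):
                info["page_number"] = int(re.sub(r"\D", "", text))
            elif info["footer"] is None:
                info["footer"] = text.strip()
    return info
```

Pages that share the same header/footer text and carry consecutive page numbers are exactly the images that the first classification sorting can order by text position information alone.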
According to an embodiment of the present invention, the second categorized document forming module 1140 may include a chapter typesetting information obtaining unit, a standard chapter typesetting format document obtaining unit, a chapter division result obtaining unit, and a second categorized document forming unit.
The chapter typesetting information obtaining unit may be configured to perform text style analysis, identifier detection, and keyword detection on the images to be classified that do not have a unified ordering format, to obtain chapter typesetting information, where the chapter typesetting information includes text styles, identifiers, and keywords of chapter typesetting. In an embodiment, the chapter typesetting information obtaining unit may be configured to perform the operation S510 described above, which is not described herein.
The standard chapter typesetting format document obtaining unit may be configured to screen the images to be classified that do not have a unified ordering format based on the chapter typesetting information, to obtain a document having a standard chapter typesetting format. In an embodiment, the standard chapter typesetting format document obtaining unit may be configured to perform the operation S520 described above, which is not described herein.
The chapter division result obtaining unit may be configured to perform chapter division based on the chapter typesetting information to obtain a chapter division result. In an embodiment, the chapter division result obtaining unit may be configured to perform the operation S530 described above, which is not described herein.
The second categorizing document forming unit may be configured to perform a second classification sorting on the documents with the standard chapter typesetting format according to the chapter division result, to form a second classification document. In an embodiment, the second categorized document forming unit may be configured to perform the operation S540 described above, which is not described herein.
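A compact sketch of the chapter-based screening and sorting is given below: it looks for a chapter identifier in each page's OCR text, keeps only pages that carry a recognisable heading, and orders them by chapter number. The heading pattern (English "Chapter/Section n" or Chinese "第n章") and the page representation are assumptions made for illustration.

```python
# Hypothetical sketch of operations S510-S540: detect chapter identifiers, screen out
# pages without a canonical heading, and sort the remaining pages by chapter number.
import re

CHAPTER_RE = re.compile(r"^(?:chapter|section)\s+(\d+)|^第\s*(\d+)\s*章",
                        re.IGNORECASE | re.MULTILINE)

def detect_chapter(text: str) -> int | None:
    m = CHAPTER_RE.search(text)
    if not m:
        return None
    return int(m.group(1) or m.group(2))

def second_classification_sort(pages: list[tuple[str, str]]) -> list[str]:
    """pages: (page_id, ocr_text); returns page ids ordered by detected chapter number."""
    with_chapters = [(detect_chapter(text), page_id) for page_id, text in pages]
    kept = [(ch, pid) for ch, pid in with_chapters if ch is not None]  # canonical layout only
    return [pid for _, pid in sorted(kept)]
```

Pages rejected here (no recognisable chapter heading) are the ones handed on to the context semantic analysis of the third classification.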
According to an embodiment of the present invention, the third classification document forming module 1150 may include a context semantic analysis unit, a semantic relationship document obtaining unit, and a third classification document forming unit.
The context semantic analysis unit can be used for carrying out context semantic link analysis on the images to be classified which do not have a uniform ordering format and do not contain chapter typesetting, and acquiring semantic links and semantic ordering relations of different images to be classified. In an embodiment, the contextual semantic analysis unit may be configured to perform the operation S610 described above, which is not described herein.
The semantic relation document acquisition unit may be configured to screen the images to be classified that do not have a unified ordering format and do not include chapter typesetting based on the semantic relation, to acquire documents having semantic relation. In an embodiment, the semantic relationship document obtaining unit may be configured to perform the operation S620 described above, which is not described herein.
The third classification document forming unit may be configured to perform third classification sorting on the documents with the semantic relationships according to the semantic sorting relationships to form third classification documents. In an embodiment, the third classification document forming unit may be configured to perform the operation S630 described above, which is not described herein.
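To make the context semantic engagement idea concrete, the sketch below scores how well the last words of one page lead into the first words of another and chains pages greedily from a given starting page. A production embodiment would presumably use a trained language model; the bag-of-words cosine overlap, the 40-word window and the greedy chaining are simplifying assumptions of this example.

```python
# Hypothetical sketch of operations S610-S630: score semantic engagement between page
# boundaries with a simple bag-of-words overlap, then chain pages greedily.
from collections import Counter
import math

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def engagement_score(prev_text: str, next_text: str, window: int = 40) -> float:
    tail = Counter(prev_text.lower().split()[-window:])
    head = Counter(next_text.lower().split()[:window])
    return _cosine(tail, head)

def third_classification_sort(pages: dict[str, str], start_id: str) -> list[str]:
    """Greedy ordering: from start_id, repeatedly append the page with the best engagement."""
    order, remaining = [start_id], set(pages) - {start_id}
    while remaining:
        best = max(remaining, key=lambda pid: engagement_score(pages[order[-1]], pages[pid]))
        order.append(best)
        remaining.remove(best)
    return order
```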
According to an embodiment of the present invention, the classification module 1170 may include an identification unit and a binding unit.
The identification unit may be configured to identify the corresponding classification and ranking of the paper documents according to the second implicit marks. In an embodiment, the identifying unit may be configured to perform the operation S810 described above, which is not described herein.
The binding unit may be used for binding according to preset binding requirements. In an embodiment, the binding unit may be configured to perform the operation S820 described above, which is not described herein.
According to an embodiment of the present invention, the document classification apparatus 1100 may further include a second implicit marking module, where the second implicit marking module may be configured to perform a second implicit marking on each image of each of the first classified document, the second classified document, the third classified document, and the fourth classified document, and content of the second implicit marking includes a classification corresponding to the image and a ranking in the classification. In an embodiment, the second implicit marking module may also be used to perform the operation S710 described above, which is not described herein.
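The following sketch suggests one possible data layout for the second implicit mark and how the classification module might read it back to obtain the binding order; the dictionary keys and the category-then-rank sort are assumptions of this illustration rather than part of the claimed method.

```python
# Hypothetical shape of the second implicit mark (operation S710) and how it could be
# read back by the identification unit (operation S810) before binding.
def apply_second_implicit_marks(classified: dict[str, list[str]]) -> dict[str, dict]:
    """classified maps a category name to an ordered list of image ids."""
    marks = {}
    for category, image_ids in classified.items():
        for position, image_id in enumerate(image_ids, start=1):
            marks[image_id] = {"category": category, "rank": position}
    return marks

def read_binding_order(marks: dict[str, dict]) -> list[str]:
    # Sort first by category, then by rank within the category, yielding the binding order.
    return sorted(marks, key=lambda i: (marks[i]["category"], marks[i]["rank"]))
```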
According to an embodiment of the present invention, the document classification apparatus 1100 may further include a suspected counterfeit file marking module.
According to an embodiment of the present invention, the suspected counterfeit document marking module may include an alignment unit and a marking unit.
The comparison unit can be used for comparing the image characteristic values of the similar documents in the first classification document, the second classification document and the third classification document based on the image characteristic values, identifying the manually written signature handwriting and seal stamping information and obtaining a comparison result. In an embodiment, the comparing unit may be configured to perform the operation S910 described above, which is not described herein.
The marking unit may be configured to mark a suspected counterfeit document based on the comparison result. In an embodiment, the marking unit may be used to perform the operation S920 described above, which is not described herein.
According to an embodiment of the present invention, the document classification apparatus 1100 may further include a preliminary processing module, which may include a preliminary classification unit and a first implicit marking unit.
The preliminary classification unit may be configured to perform preliminary classification based on the outline size and color information of each of the images to be classified. In an embodiment, the preliminary classifying unit may be configured to perform the operation S1010 described above, which is not described herein.
The first implicit marking unit may be configured to perform a first implicit marking on each of the images to be classified. In an embodiment, the first implicit flag unit may be used to perform the operation S1020 described above, which is not described herein.
According to an embodiment of the present invention, any of the image to be classified acquisition module 1110, the text content information and text position information acquisition module 1120, the first classified document forming module 1130, the second classified document forming module 1140, the third classified document forming module 1150, the fourth classified document forming module 1160, and the classification module 1170 may be combined and implemented in one module, or any one of the modules may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of these modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present invention, at least one of the image to be classified acquisition module 1110, the text content information and text position information acquisition module 1120, the first classified document forming module 1130, the second classified document forming module 1140, the third classified document forming module 1150, the fourth classified document forming module 1160, and the classification module 1170 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or any other reasonable manner of integrating or packaging the circuitry, or may be implemented in any one of, or a suitable combination of, software, hardware and firmware. Alternatively, at least one of these modules may be at least partially implemented as a computer program module which, when executed, performs the corresponding functions.
Fig. 12 schematically shows a block diagram of an electronic device adapted for a document classification method according to an embodiment of the invention.
As shown in fig. 12, the electronic apparatus 1200 according to the embodiment of the present invention includes a processor 1201 which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. The processor 1201 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 1201 may also include on-board memory for caching purposes. The processor 1201 may include a single processing unit or multiple processing units for performing the different actions of the method flow according to embodiments of the invention.
In the RAM 1203, various programs and data required for the operation of the electronic apparatus 1200 are stored. The processor 1201, the ROM 1202, and the RAM 1203 are connected to each other through a bus 1204. The processor 1201 performs various operations of the method flow according to the embodiment of the present invention by executing programs in the ROM 1202 and/or the RAM 1203. Note that the program may be stored in one or more memories other than the ROM 1202 and the RAM 1203. The processor 1201 may also perform various operations of the method flow according to embodiments of the present invention by executing programs stored in the one or more memories.
According to an embodiment of the invention, the electronic device 1200 may also include an input/output (I/O) interface 1205, which is also connected to the bus 1204. The electronic device 1200 may also include one or more of the following components connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 1208 including a hard disk or the like; and a communication section 1209 including a network interface card such as a LAN card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 1210 as needed, so that a computer program read out therefrom is installed into the storage section 1208 as needed.
The present invention also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present invention.
According to embodiments of the present invention, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the invention, the computer-readable storage medium may include ROM 1202 and/or RAM 1203 and/or one or more memories other than ROM 1202 and RAM 1203 described above.
Embodiments of the present invention also include a computer program product comprising a computer program containing program code for performing the method shown in the flowcharts. When the computer program product runs on a computer system, the program code causes the computer system to carry out the methods provided by the embodiments of the present invention.
The above-described functions defined in the system/apparatus of the embodiment of the present invention are performed when the computer program is executed by the processor 1201. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the invention.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed over a network medium in the form of a signal, and downloaded and installed via the communication section 1209 and/or installed from the removable medium 1211. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to wireless and wired media, or any suitable combination of the foregoing.
According to embodiments of the present invention, program code for carrying out the computer programs provided by embodiments of the present invention may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, "C" or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments of the present invention are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the invention, and such alternatives and modifications are intended to fall within the scope of the invention.

Claims (13)

1. A method of classifying documents, the method comprising:
acquiring a paper document to be classified, digitally scanning the paper document, and acquiring an image to be classified;
performing character recognition and image analysis on the image to be classified to acquire character content information and character position information of the image to be classified;
selecting images to be classified with a unified ordering format based on the text position information to perform first classification ordering to form a first classification document;
based on the text content information, carrying out first natural language analysis on the images to be classified which do not have the unified ordering format, selecting the images to be classified containing chapter typesetting information, and carrying out second classification ordering to form a second classification document;
carrying out second natural language analysis on the images to be classified which do not have a uniform ordering format and do not contain chapter typesetting information, selecting the images to be classified which have a context semantic sequence relationship, and carrying out third classification ordering to form a third classification document;
classifying images to be classified except the first classified document, the second classified document and the third classified document into a fourth classified document; and
identifying and sorting the paper documents corresponding to the first classified documents, the second classified documents, the third classified documents and the fourth classified documents, and outputting the classified paper documents.
2. The method according to claim 1, wherein the image analysis is performed on the image to be classified to obtain text position information, and the method specifically comprises:
performing image segmentation and region detection on the image to be classified to obtain target regions at the top, bottom and/or page edge of the image to be classified;
extracting characteristics of a target region, wherein the characteristics of the target region comprise region height characteristics, boundary line characteristics, region text style characteristics and region relative position characteristics; and
based on the characteristics of the target area, character position information is acquired, and the character position information is used for judging whether similar character identifiers of headers and/or footers and page number information exist or not.
3. The method according to claim 1 or 2, wherein the performing a first natural language analysis on the images to be classified that do not have a unified ranking format based on the text content information, selecting the images to be classified that include chapter layout information, and performing a second classification ranking to form a second classification document, specifically includes:
performing text style analysis, identifier detection and keyword detection on the images to be classified, which do not have the uniform ordering format, and obtaining chapter typesetting information, wherein the chapter typesetting information comprises text styles, identifiers and keywords of chapter typesetting;
screening the images to be classified which do not have the unified ordering format based on the chapter typesetting information to obtain a document with a standard chapter typesetting format;
performing chapter division based on the chapter typesetting information to obtain chapter division results; and
carrying out second classification and sorting on the documents with the standard chapter typesetting format according to the chapter dividing result to form second classification documents.
4. The method according to claim 3, wherein the performing a second natural language analysis on the images to be classified that do not have a unified ordering format and do not include chapter typesetting, selecting the images to be classified that have a context semantic order relationship, and performing a third classification ordering to form a third classification document, specifically includes:
carrying out context semantic engagement analysis on the images to be classified which do not have a uniform ordering format and do not contain chapter typesetting, and obtaining semantic links and semantic ordering relations of different images to be classified;
screening the images to be classified which do not have a uniform ordering format and do not contain chapter typesetting based on the semantic relation to obtain documents with semantic relation; and
carrying out third classification ordering on the documents with the semantic relation according to the semantic ordering relation to form a third classification document.
5. The method according to claim 1, 2 or 4, wherein the image to be classified includes image information and image feature values, the acquiring the paper document to be classified, digitally scanning the paper document, and acquiring the image to be classified, specifically includes:
obtaining the image information based on the digital scanning, wherein the image information comprises outline dimension and color information; and
extracting the characteristic value of the image information to obtain the image characteristic value.
6. The method of claim 5, wherein the method further comprises:
based on the image characteristic values, comparing the image characteristic values of the similar documents in the first classified document, the second classified document and the third classified document, identifying the manually written signature handwriting and seal stamping information, and obtaining a comparison result; and
marking suspected counterfeit documents based on the comparison result.
7. The method of claim 1, wherein prior to said identifying and sorting the paper documents corresponding to the first categorized document, the second categorized document, the third categorized document, and the fourth categorized document, the method further comprises:
carrying out second implicit marking on each image of each of the first classified document, the second classified document, the third classified document and the fourth classified document, wherein the content of the second implicit marking comprises the classification corresponding to the image and the sorting in the classification.
8. The method according to claim 7, wherein the identifying and sorting the paper documents corresponding to the first classified document, the second classified document, the third classified document and the fourth classified document specifically comprises:
identifying a corresponding categorization and ranking of the paper documents according to the second implicit indicia; and
binding according to preset binding requirements.
9. The method of claim 5, wherein the paper documents comprise tickets and paper files, and wherein prior to performing character recognition on the image to be classified and extracting text content information of the image to be classified, the method further comprises:
performing preliminary classification based on the outline size and the color information of each image to be classified; and
carrying out first implicit marking on each image to be classified.
10. A document classification apparatus, the apparatus comprising:
an image to be classified acquisition module, used for: acquiring a paper document to be classified, digitally scanning the paper document, and acquiring an image to be classified;
a text content information and text position information acquisition module, used for: performing character recognition and image analysis on the image to be classified to acquire character content information and character position information of the image to be classified;
a first categorized document formation module for: selecting images to be classified with a unified ordering format based on the text position information to perform first classification ordering to form a first classification document;
a second categorization document forming module for: based on the text content information, carrying out first natural language analysis on the images to be classified which do not have the unified ordering format, selecting the images to be classified which contain chapter typesetting, and carrying out second classification ordering to form second classification documents;
a third classification document formation module for: carrying out second natural language analysis on the images to be classified which do not have a uniform ordering format and do not contain chapter typesetting, selecting the images to be classified which have a context semantic order relation, and carrying out third classification ordering to form a third classification document;
a fourth categorization document forming module for: classifying images to be classified except the first classified document, the second classified document and the third classified document into a fourth classified document; and
a classification module for: and identifying and sorting the paper documents corresponding to the first classified documents, the second classified documents, the third classified documents and the fourth classified documents, and outputting the classified paper documents.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-9.
12. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1 to 9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.
CN202311604890.2A 2023-11-28 2023-11-28 Document classification method, device, electronic equipment and medium Pending CN117612182A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311604890.2A CN117612182A (en) 2023-11-28 2023-11-28 Document classification method, device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311604890.2A CN117612182A (en) 2023-11-28 2023-11-28 Document classification method, device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN117612182A true CN117612182A (en) 2024-02-27

Family

ID=89959299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311604890.2A Pending CN117612182A (en) 2023-11-28 2023-11-28 Document classification method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN117612182A (en)

Similar Documents

Publication Publication Date Title
US20210124919A1 (en) System and Methods for Authentication of Documents
US11195006B2 (en) Multi-modal document feature extraction
US9552516B2 (en) Document information extraction using geometric models
US11810070B2 (en) Classifying digital documents in multi-document transactions based on embedded dates
US9626555B2 (en) Content-based document image classification
US8064703B2 (en) Property record document data validation systems and methods
Clausner et al. The ENP image and ground truth dataset of historical newspapers
Cruz et al. Local binary patterns for document forgery detection
US20170287252A1 (en) Counterfeit Document Detection System and Method
US20070217692A1 (en) Property record document data verification systems and methods
US20040015775A1 (en) Systems and methods for improved accuracy of extracted digital content
KR20060044691A (en) Method and apparatus for populating electronic forms from scanned documents
CN112508011A (en) OCR (optical character recognition) method and device based on neural network
CN111914729A (en) Voucher association method and device, computer equipment and storage medium
Sirajudeen et al. Forgery document detection in information management system using cognitive techniques
CN114511866A (en) Data auditing method, device, system, processor and machine-readable storage medium
CN112508000B (en) Method and equipment for generating OCR image recognition model training data
CN112487982A (en) Merchant information auditing method, system and storage medium
CN111462388A (en) Bill inspection method and device, terminal equipment and storage medium
KR20180126352A (en) Recognition device based deep learning for extracting text from images
US20070217691A1 (en) Property record document title determination systems and methods
CN117612182A (en) Document classification method, device, electronic equipment and medium
Kumar et al. Line based robust script identification for indianlanguages
US20220044048A1 (en) System and method to recognise characters from an image
CN114443834A (en) Method and device for extracting license information and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination