CN116503889B

CN116503889B - File and electronic file processing method, device, equipment and storage medium

Info

Publication number: CN116503889B
Application number: CN202310066156.9A
Authority: CN
Inventors: 梅洪; 潘珮源; 胡晨; 陈金鹏
Original assignee: Suzhou Industrial Park Hangxing Information Technology Service Co ltd
Current assignee: Suzhou Industrial Park Hangxing Information Technology Service Co ltd
Priority date: 2023-01-18
Filing date: 2023-01-18
Publication date: 2024-01-19
Anticipated expiration: 2043-01-18
Also published as: CN116503889A

Abstract

The application discloses a method, a device, equipment and a storage medium for processing archives and electronic files. The method comprises the following steps: acquiring a file to be processed, wherein the file to be processed is obtained by electronically processing a paper archive; processing the file to be processed by using a recognition model to obtain first data; processing the first data by using an analysis model to obtain second data, wherein the second data is determined by arranging and combining the structured information of the file to be processed and/or supplementing it with information; and designating the second data as the characteristic information of the paper archive to execute an information entry operation. With this method, information can be extracted from paper archives and entered efficiently and correctly, the information is converted into usable data and then stored digitally, the efficiency of archive processing and entry is greatly improved, and the processing cost is reduced.

Description

File and electronic file processing method, device, equipment and storage medium

Technical Field

The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a storage medium for processing files and electronic documents.

Background

With the rapid development of digital technology, replacing traditional paper archive/file storage with electronic archiving has become a major trend. Existing electronic archives or electronic files can be entered by scanning the paper files and then recognizing them, so as to acquire and record the file information. However, this information cannot be used directly and can only be stored as is. Additional proofreading and analysis procedures are required before further use, which lowers the utilization efficiency of the electronic files. Moreover, different types of files require specific adaptations, which further reduces the processing efficiency of the electronic files.

Disclosure of Invention

The technical problem to be solved by the embodiment of the application is how to accurately and efficiently carry out digital processing and information input on the paper file.

In order to solve the above problems, the present application discloses a method, a device, equipment and a storage medium for processing archives and electronic documents.

According to a first aspect of the present application, an archive processing method is provided. The method comprises the following steps: acquiring a file to be processed, wherein the file to be processed is obtained by electronically processing a paper archive, or is an original electronic file; processing the file to be processed by using a recognition model to obtain first data; processing the first data by using an analysis model to obtain second data, wherein the second data is determined by arranging and combining the structured information of the file to be processed and/or supplementing it with information; and designating the second data as characteristic information of the paper archive or the original electronic file to execute an information entry operation.

According to some embodiments of the present application, the electronization process includes: digitally scanning the paper file to obtain a scanned image; performing an image preprocessing operation on the scanned image, comprising: one or more of graying, binarizing, decontaminating, tilt detecting and correcting.

According to some embodiments of the present application, the recognition model includes a machine learning model based on a word recognition algorithm, and the acquiring the first data includes: performing character recognition on the file to be processed by using the recognition model to acquire character information carried by the file to be processed; and designating the text information as the first data.

According to some embodiments of the application, the analytical model comprises a machine learning model based on a keyword extraction algorithm; the acquiring the second data includes: extracting keywords from the first data by using the analysis model to obtain structural information included in the first data; the second data is determined based on the structured information.

According to some embodiments of the present application, the file to be processed at least includes files obtained by electronically processing a document archive and an urban construction archive, and the original electronic file includes a text electronic document; the analysis model comprises a first analysis model corresponding to the document archive and a second analysis model corresponding to the urban construction archive; the processing the first data using an analysis model includes: determining the archive type of the file to be processed; and determining, based on the archive type, whether to use the first analysis model or the second analysis model to perform keyword extraction on the first data.

According to some embodiments of the present application, when the file to be processed is obtained based on a document archive, the structured information includes at least a person, a time, a place, an event, and the like; the determining the second data based on the structured information includes: constructing, based on the structured information, an independent sentence summarizing the document archive, which comprises sequentially combining the items of structured information and adding connectives between adjacent items; and designating the independent sentence as the second data.

According to some embodiments of the present application, when the file to be processed is obtained based on an urban construction archive, the structured information includes all information carried on the urban construction archive; the determining the second data based on the structured information includes: constructing a data table; and filling the data table based on the structured information to acquire the second data, which comprises filling the data table sequentially according to the order in which the structured information is arranged on the urban construction archive.

According to some embodiments of the present application, the method further comprises: updating the identification model based on the first data; updating the analytical model based on the second data.

According to a second aspect of the present application, an archive processing device is provided. The device comprises: an acquisition module configured to acquire a file to be processed, wherein the file to be processed is obtained by electronically processing a paper archive, or is an original electronic file; a first processing module configured to process the file to be processed by using a recognition model to acquire first data; a second processing module configured to process the first data by using an analysis model to obtain second data, wherein the second data is determined by arranging and combining the structured information of the file to be processed and/or supplementing it with information; and an execution module configured to designate the second data as characteristic information of the paper archive to execute an information entry operation.

According to some embodiments of the application, to perform the electronic processing, the obtaining module is configured to: digitally scanning the paper file to obtain a scanned image; performing an image preprocessing operation on the scanned image, comprising: one or more of graying, binarizing, decontaminating, tilt detecting and correcting.

According to some embodiments of the application, the recognition model includes a machine learning model based on a word recognition algorithm, and to obtain the first data, the first processing module is configured to: performing character recognition on the file to be processed by using the recognition model to acquire character information carried by the file to be processed; and designating the text information as the first data.

According to some embodiments of the application, the analytical model comprises a machine learning model based on a keyword extraction algorithm; to obtain the second data, the second processing module is configured to: extracting keywords from the first data by using the analysis model to obtain structural information included in the first data; the second data is determined based on the structured information.

According to some embodiments of the present application, the file to be processed at least includes files obtained by electronically processing a document archive and an urban construction archive, and the original electronic file includes a text electronic document; the analysis model comprises a first analysis model corresponding to a document archive or a text electronic document and a second analysis model corresponding to an urban construction archive; to process the first data using the analysis model, the second processing module is configured to: determine the archive type of the file to be processed; and determine, based on the archive type, whether to use the first analysis model or the second analysis model to perform keyword extraction on the first data.

According to some embodiments of the present application, when the file to be processed is obtained based on a document archive, the structured information includes at least a person, a time, a place, an event, and the like; to determine the second data based on the structured information, the second processing module is configured to: construct, based on the structured information, an independent sentence summarizing the document archive, which comprises sequentially combining the items of structured information and adding connectives between adjacent items; and designate the independent sentence as the second data.

According to some embodiments of the present application, when the file to be processed is obtained based on an urban construction archive, the structured information includes all information carried on the urban construction archive; to determine the second data based on the structured information, the second processing module is configured to: construct a data table; and fill the data table based on the structured information to acquire the second data, which comprises filling the data table sequentially according to the order in which the structured information is arranged on the urban construction archive.

According to some embodiments of the present application, the apparatus further comprises an update module configured to: updating the identification model based on the first data; updating the analytical model based on the second data.

According to a third aspect of the present application, an apparatus is provided. The apparatus comprises a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method as described above.

According to a fourth aspect of the present application, a computer readable storage medium is provided. The storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method as described above.

According to the archive processing method disclosed by the application, information extraction and input can be efficiently and correctly carried out on the archive, the archive is stored after being converted into usable data, the processing input efficiency of the archive is greatly improved, and meanwhile, the processing cost is reduced.

Drawings

The present application will be further illustrated by way of example embodiments, which will be described in detail with reference to the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:

FIG. 1 is an exemplary flow chart of a method of processing an archive or electronic file, shown in accordance with some embodiments of the present application;

FIG. 2 is an exemplary block diagram of a data processing system for archive or electronic file processing, shown in accordance with some embodiments of the present application;

FIG. 3 is an exemplary functional block diagram of a data processing system for archive or electronic file processing, shown in accordance with some embodiments of the present application.

Detailed Description

In order to make the above objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. This application is, however, susceptible of embodiment in many other forms than those described herein and similar modifications can be made by those skilled in the art without departing from the spirit of the application, and therefore the application is not to be limited to the specific embodiments disclosed below.

It will be understood that when an element is referred to as being "mounted" to another element, it can be directly mounted to the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like are used herein for illustrative purposes only.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

Some preferred embodiments of the present application are described below with reference to the accompanying drawings. It should be noted that the following description is for illustrative purposes and is not intended to limit the scope of the present application.

FIG. 1 is an exemplary flow chart of a method of processing an archive or electronic file, according to some embodiments of the present application. In some embodiments, processing method 100 may be performed by data processing system 200. For example, the processing method 100 may be stored in a storage device (e.g., a self-contained memory unit or an external memory device of the data processing system 200) in the form of a program or instructions that, when executed, may implement the processing method 100. As shown in fig. 1, the processing method 100 may include the following operations.

Step 110, obtain the file to be processed. In some embodiments, this step may be performed by the acquisition module 210.

In some embodiments, the file to be processed may be obtained by electronically processing a paper file. For example, the acquisition module 210 may digitally scan the paper file to obtain a scanned image. For example, an image acquisition device such as a camera or a scanner is invoked to scan the paper file, and the content contained in the paper file is captured and converted to generate the scanned image. The format of the scanned image may be DOC, DOCX, XLS, XLSX, DBF, BMP, JPG, TIFF, GIF, PNG, CAD, TXT, PDF, OFD or another file format, as determined by the file conversion algorithm or software used. In some embodiments, the file to be processed may also be an original electronic file, such as an office automation electronic file, i.e., an electronic document produced by an institution, organization or the like using an office automation system in its daily work.

It is appreciated that paper documents do not all share a uniform specification. For example, paper documents may be in color or in black and white, paper sizes may differ, and there may be spots or partial defects on the paper. Moreover, the scanned image obtained by digitally scanning the paper file may deviate from the original document, for example through tilting or warping. Thus, after the scanned image is acquired, the acquisition module 210 may perform an image preprocessing operation on the scanned image to acquire the final file to be processed. In some embodiments, the image preprocessing operations may include one or more of graying, binarizing, decontamination, and tilt detection and correction. When the paper file is a color document, the interference information contained in the color pixels needs to be filtered out. Accordingly, the scanned image may be subjected to grayscale processing to convert pixels described by three-dimensional information (for example, RGB) into pixels described by one-dimensional information (for example, gray level). In this way, the color file is converted into a grayscale file, so that interference information is removed, contrast is enhanced, and subsequent processing is facilitated.

After the original color file is subjected to grayscale processing, binarization may further be performed to separate the text from the background. Exemplary binarization methods may include, but are not limited to, global thresholding methods (e.g., the fixed threshold method, the maximum inter-class variance method, etc.), local thresholding methods (e.g., adaptive thresholding, the Niblack algorithm, the Sauvola algorithm), morphological operations such as erosion, dilation, opening and closing, or any combination thereof. In one possible embodiment, because the electronic processing is applied to the whole sheet of paper, binarization may be performed on the grayscale document using a global thresholding method. A binary image consisting of black (represented as 1 or 0) and white (represented as 0 or 255), i.e., a black-and-white file, is thereby obtained. Therefore, when the paper file is a color document, the scanned image after grayscale processing and binarization has the same image properties as a scanned image obtained from a black-and-white document, which facilitates unified subsequent processing.
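The grayscale and global-threshold binarization steps described above can be sketched as follows. This is a minimal illustration only, assuming the image is held as nested Python lists of RGB tuples; the function names `to_gray`, `otsu_threshold` and `binarize`, the BT.601 luma weights, and the convention that white is 255 are assumptions made for the sketch and are not specified by the application. The threshold is chosen with the maximum inter-class variance (Otsu) method named in the text.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def to_gray(image: List[List[RGB]]) -> List[List[int]]:
    """Map each RGB pixel to one 8-bit gray value (BT.601 luma weights, an assumption)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in image]


def otsu_threshold(gray: List[List[int]]) -> int:
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]            # pixels with gray level <= t
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0          # pixels with gray level > t
        if w1 == 0:
            break
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(gray: List[List[int]], threshold: int) -> List[List[int]]:
    """Pixels brighter than the threshold become white (255), the rest black (0)."""
    return [[255 if value > threshold else 0 for value in row] for row in gray]
```

A production system would more likely rely on an image library for these operations; the pure-Python form is used here only to keep the sketch self-contained.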

It is known that paper documents, as printed text, are mostly composed of horizontal lines (or vertical columns) of text parallel to the page edges. That is, the tilt angle between the text content and the page edge is zero. However, when the paper document is scanned, whether manually or automatically by a machine, the text content in the resulting scanned image may be tilted. Tilt can have a significant impact on subsequent character segmentation, recognition, image compression, and the like. Therefore, tilt detection and correction are required so that the text content on the scanned image (for example, obtained by scanning a black-and-white paper document) or on the image after grayscale processing and binarization (for example, obtained from the scan of a color paper document) is aligned horizontally and vertically, thereby ensuring the correctness of subsequent processing. For example, the boundary of the text content may first be determined using an edge detection algorithm, and the tilt angle may then be determined using a method such as the Radon transform, the Hough transform, or linear regression. Finally, tilt correction is performed by rotating the image with an affine transformation based on the tilt angle.
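As an illustration of the linear regression option mentioned above, the sketch below estimates the tilt angle from sample points along a text line and then rotates the points back with an affine transformation. It assumes the points (for example, baseline positions found by edge detection) are already available as coordinate pairs; the names `skew_angle`, `rotate` and `deskew` are hypothetical, and rotating about the origin is a simplification.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def skew_angle(points: List[Point]) -> float:
    """Angle (radians) of the least-squares line fitted through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.atan2(sxy, sxx)  # equals atan(slope) when sxx > 0


def rotate(points: List[Point], angle: float,
           center: Point = (0.0, 0.0)) -> List[Point]:
    """Affine map p -> R(angle) * (p - center) + center."""
    cx, cy = center
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(cx + cos_a * (x - cx) - sin_a * (y - cy),
             cy + sin_a * (x - cx) + cos_a * (y - cy))
            for x, y in points]


def deskew(points: List[Point]) -> List[Point]:
    """Rotate by the negative of the detected angle so the line becomes horizontal."""
    return rotate(points, -skew_angle(points))
```

In practice the same rotation would be applied to every pixel of the image rather than to a handful of sample points, typically through an image library's affine warp.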

After the tilt correction is completed, if there are stains or spots on the paper document, decontamination is required. For example, object detection may be performed on the image to determine the stain/spot area, and the pixels of the stain/spot area may then be replaced with the pixel values of the background area (e.g., the black and white pixel values 1 and 0, or 0 and 255). Completing the decontamination prevents abnormal results in subsequent processing.
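The pixel replacement step can be sketched as follows, assuming the stain area has already been located by a separate detector and is given as a bounding box. The function name `remove_stain`, the box convention, and the default background value of 255 are assumptions for illustration only.

```python
from typing import List, Tuple

# (top, left, bottom, right); bottom and right are exclusive (assumed convention)
Box = Tuple[int, int, int, int]


def remove_stain(image: List[List[int]], box: Box,
                 background: int = 255) -> List[List[int]]:
    """Return a copy of the image with the boxed area set to the background value."""
    top, left, bottom, right = box
    cleaned = [row[:] for row in image]
    for r in range(max(top, 0), min(bottom, len(cleaned))):
        for c in range(max(left, 0), min(right, len(cleaned[r]))):
            cleaned[r][c] = background
    return cleaned
```

Note that overwriting a whole box also erases any text inside it, so a real detector would need to return a tight region around the stain.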

Step 120, processing the file to be processed by using the recognition model to acquire first data. In some embodiments, this step may be performed by the first processing module 220.

In some embodiments, the recognition model may include a machine learning model based on a text recognition algorithm, including but not limited to CRNN, 2D-CTC, Attention, ACE, SVTR, R²AM, SAR, TrOCR, etc. For example, the first processing module 220 may train an initial recognition model on a number of sample images that have undergone the aforementioned image preprocessing, together with their corresponding ground-truth values (i.e., text content). The parameters of the model are adjusted according to the difference between the model output and the ground truth. Training may be stopped once a preset condition is met (e.g., the number of training rounds reaches a predetermined value, or the accuracy of the recognition result exceeds 99.5%). The final model is the recognition model. As another example, the recognition model may be a trained model that a public model platform makes available to users. The first processing module 220 may call the APIs of such public model platforms to upload the file to be processed for subsequent text recognition. It is known that archives are of many types, such as document archives, personnel archives, litigation archives, accounting archives, judicial archives, foreign exchange archives, urban construction archives, patent archives, etc., and that the format and content of each type differ. Therefore, to recognize the text of the various archives accurately, the recognition model may be formed by combining a plurality of sub-models, with a corresponding sub-model used to process each type of archive.

In some embodiments, the first processing module 220 may perform text recognition on the file to be processed by using the recognition model to obtain the text information carried by the file to be processed. The text information may include all text-related information on the file to be processed, for example, the specific content of the file, the text in an affixed seal, handwritten text on the file such as a signature, or any combination thereof. The first processing module 220 may call different sub-models of the recognition model to process different types of archives so as to obtain processing results quickly and accurately. The first processing module 220 may designate the text information as the first data.

In some embodiments, the recognition model may be pre-trained and stored in an on-board memory unit or an off-board memory device of data processing system 200. The first processing module 220 may be in data communication with these self-contained memory units or external memory devices to obtain the identification model.

Step 130, processing the first data by using an analysis model to acquire second data. In some embodiments, this step may be performed by the second processing module 230.

In some embodiments, the analysis model may include a machine learning model based on a keyword extraction algorithm, including unsupervised keyword extraction algorithms (e.g., keyword extraction based on statistical features such as the TF-IDF algorithm, keyword extraction based on a word graph model such as the TextRank algorithm, and keyword extraction based on a topic model such as the LDA algorithm) and supervised keyword extraction algorithms (e.g., keyword extraction based on Word2Vec word clustering, information gain keyword extraction, mutual information keyword extraction, chi-square test keyword extraction, and keyword extraction based on tree models such as decision trees or random forests). The second processing module 230 may perform keyword extraction on the first data using the analysis model to obtain the structured information included in the first data.
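Of the algorithms listed above, TF-IDF is the simplest to illustrate. The sketch below ranks the words of one recognized text against a small collection of other texts. It assumes the texts have already been split into words (Chinese text would first need word segmentation, which is not shown), and the function name `tfidf_keywords` and its parameters are hypothetical.

```python
import math
from collections import Counter
from typing import List


def tfidf_keywords(docs: List[List[str]], index: int, top_k: int = 3) -> List[str]:
    """Return the top_k words of docs[index] ranked by TF-IDF score."""
    target = docs[index]
    if not target:
        return []
    # Document frequency: in how many documents each word appears.
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    term_freq = Counter(target)
    scores = {
        word: (count / len(target)) * math.log(len(docs) / doc_freq[word])
        for word, count in term_freq.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:top_k]
```

Words that occur in every document receive a score of zero, so common boilerplate terms drop to the bottom of the ranking.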

In some embodiments, the file to be processed may at least include files obtained by electronically processing a document archive and an urban construction archive, and the original electronic file includes a text electronic document. The structured information may differ when the file to be processed is acquired from different types of paper archives. Thus, the analysis model may also have different sub-models for different types of files to be processed (i.e., types of paper archives). For example, the analysis model may include at least a first analysis model corresponding to a document archive or a text electronic document and a second analysis model corresponding to an urban construction archive. The second processing module 230 may first determine the archive type of the file to be processed, and then determine, based on the archive type, whether to use the first analysis model or the second analysis model to perform keyword extraction on the first data. The archive type may be determined from the type of recognition model that processed the file to be processed. For example, if the recognition model that processed the file is the sub-model for document archives, the archive type may be a document archive; if it is the sub-model for urban construction archives, the archive type may be an urban construction archive. The archive type may also be determined from externally entered auxiliary information. For example, before processing of the file to be processed starts, an external operator may enter its archive type through the data interface of the data processing system 200, so that the corresponding analysis model can be invoked at this step to perform keyword extraction on the first data.
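The selection logic described above can be sketched as a small dispatch table. Everything named here is hypothetical: the type labels, the sub-model identifiers in `SUBMODEL_TO_TYPE`, and the two stub analysis functions stand in for real models. Giving the operator-entered type precedence over the inferred one is also an assumption; the application only states that either source may be used.

```python
from typing import Callable, Dict, List, Optional

Extractor = Callable[[str], List[str]]

# Hypothetical mapping from recognition sub-model name to archive type.
SUBMODEL_TO_TYPE: Dict[str, str] = {
    "document_ocr": "document",
    "urban_construction_ocr": "urban_construction",
}


def first_analysis_model(text: str) -> List[str]:
    """Stub for the model handling document archives and text electronic documents."""
    return ["first", text]


def second_analysis_model(text: str) -> List[str]:
    """Stub for the model handling urban construction archives."""
    return ["second", text]


def resolve_archive_type(recognition_submodel: Optional[str],
                         operator_input: Optional[str]) -> str:
    """Use the operator-entered type if present, else infer it from the sub-model."""
    if operator_input:
        return operator_input
    if recognition_submodel in SUBMODEL_TO_TYPE:
        return SUBMODEL_TO_TYPE[recognition_submodel]
    raise ValueError("archive type cannot be determined")


def select_analysis_model(archive_type: str) -> Extractor:
    """Pick the analysis model that corresponds to the archive type."""
    models: Dict[str, Extractor] = {
        "document": first_analysis_model,
        "text_electronic_document": first_analysis_model,
        "urban_construction": second_analysis_model,
    }
    if archive_type not in models:
        raise ValueError(f"unsupported archive type: {archive_type}")
    return models[archive_type]
```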

In some embodiments, when the file to be processed is obtained based on a document archive, the structured information may include a plurality of keywords indicating the primary content of the document archive, for example a person, a time, a place, an event, etc. The second processing module 230 may extract the keywords in the first data using the first analysis model and construct, based on the keywords, an independent sentence summarizing the document archive. For example, assume that the structured information includes "Zhang San", "January 1", "training institution", and "training for 10 days". The first analysis model may sequentially combine these items of structured information and add connectives between adjacent items to obtain an independent sentence (i.e., the second data) summarizing the file to be processed. For example, the first analysis model may output "Starting from January 1, Zhang San trained at the training institution for 10 days." The added connectives may include "from", "at", and "starting".
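The sequential combination with connectives can be sketched as a simple interleaving of two lists. The function name `build_sentence`, the English connectives, and the convention that an empty string means no connective are assumptions for illustration; the application does not specify how the connectives are chosen.

```python
from typing import List


def build_sentence(items: List[str], connectives: List[str]) -> str:
    """Join structured items in order, placing one connective between each adjacent pair."""
    if len(connectives) != len(items) - 1:
        raise ValueError("need exactly one connective per adjacent pair of items")
    parts = [items[0]]
    for connective, item in zip(connectives, items[1:]):
        if connective:  # an empty string means no connective is inserted
            parts.append(connective)
        parts.append(item)
    return " ".join(parts) + "."


summary = build_sentence(
    ["Zhang San", "January 1", "a training institution", "attended training for 10 days"],
    ["starting from", "at", ""],
)
```

Because the items are joined strictly in their extracted order, the English output reads less naturally than the Chinese original would; a real system would choose connectives and ordering according to the target language.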

In some embodiments, when the file to be processed is obtained based on an urban construction archive, the structured information may include all information carried on the urban construction archive. An urban construction archive typically displays information such as the urban construction unit, project name, reference number, sub-item name, sub-item number, design discipline, drawing number, drawing name, drawings, dates and the like. In addition, the archive may carry various seals as well as handwritten signatures, such as the signature of the person responsible for the project and the signatures of the relevant personnel on the completion seal. This information is essential to the urban construction archive. Moreover, an urban construction archive is generally presented in tabular form. The second processing module 230 may extract the various types of information in the table (e.g., headers and row and column data) using the second analysis model, fill the information into a constructed data table (e.g., an Excel spreadsheet), and take the filled data table as the second data. The form of the constructed data table may be consistent with the form of the table in the urban construction archive. The second analysis model may fill the constructed data table in order, based on the position of each item of structured information in the table of the urban construction archive (e.g., which row and which column it occupies). The filled data table may be designated as the second data.
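The ordered table filling can be sketched as follows, assuming each extracted item arrives with the row and column it occupied in the source table. The `Cell` tuple, the function names, and the choice of CSV output (in place of the Excel spreadsheet mentioned in the text) are assumptions made to keep the sketch dependency-free; the sample values are invented.

```python
import csv
import io
from typing import List, Tuple

Cell = Tuple[int, int, str]  # (row index, column index, recognized text)


def fill_table(cells: List[Cell]) -> List[List[str]]:
    """Place each item at its original row and column; missing cells stay empty."""
    if not cells:
        return []
    n_rows = max(row for row, _, _ in cells) + 1
    n_cols = max(col for _, col, _ in cells) + 1
    table = [[""] * n_cols for _ in range(n_rows)]
    for row, col, text in sorted(cells):  # fill in row-major order
        table[row][col] = text
    return table


def to_csv(table: List[List[str]]) -> str:
    """Serialize the filled table as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()
```

For example, `fill_table([(1, 1, "A-01"), (0, 0, "Project name"), (0, 1, "Drawing number"), (1, 0, "Riverside Bridge")])` restores the header row and the data row in their original positions regardless of the order in which the cells were extracted.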

Step 140, designating the second data as characteristic information of the paper file to execute an information entry operation. In some embodiments, this step may be performed by the execution module 240.

In some embodiments, the characteristic information may refer to information that can be used to indicate the primary content or a content summary of the paper archive. For example, when the paper file is a document archive, the second data serving as the characteristic information is the keywords/abstract of the document archive, which concisely and clearly indicates the matters the document archive concerns. When the paper file is an urban construction archive, the second data serving as the characteristic information is fully consistent with the information carried by the urban construction archive, so that the key information is presented without omission. The execution module 240 can enter the characteristic information directly in real time, so as to realize real-time storage of the data.

In the present application, information entry for paper files can be handled automatically: machine learning models are used to organize the files and summarize their information automatically, and the information is stored immediately. File processing is therefore efficient.

In some embodiments, the first data and the second data may also be used to update the recognition model and the analysis model, respectively. For example, the first data and the second data may be used as training samples to update the recognition model and the analysis model, thereby improving the applicability and accuracy of both models.
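Taken together, steps 110 to 140 and the collection of samples for later model updates can be sketched as one small pipeline. The class name `ArchivePipeline`, its fields, and the use of plain callables in place of the modules and models are all assumptions for illustration; how the collected samples are actually used to retrain the models is not specified here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class ArchivePipeline:
    preprocess: Callable[[Any], Any]   # step 110: image preprocessing
    recognize: Callable[[Any], str]    # step 120: recognition model
    analyze: Callable[[str], Any]      # step 130: analysis model
    store: Callable[[Any], None]       # step 140: information entry
    recognition_samples: List[Tuple[Any, str]] = field(default_factory=list)
    analysis_samples: List[Tuple[str, Any]] = field(default_factory=list)

    def run(self, scanned: Any) -> Any:
        file_to_process = self.preprocess(scanned)
        first_data = self.recognize(file_to_process)
        second_data = self.analyze(first_data)
        self.store(second_data)
        # Keep input/output pairs so both models can be updated later.
        self.recognition_samples.append((file_to_process, first_data))
        self.analysis_samples.append((first_data, second_data))
        return second_data
```

Passing the stages in as callables keeps the sketch independent of any particular recognition or keyword extraction library.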

It should be noted that the above description of the steps in fig. 1 is only for illustration and description, and does not limit the application scope of the present specification. Various modifications and changes to the steps of fig. 1 may be made by those skilled in the art under the guidance of this specification. However, such modifications and variations are still within the scope of the present description.

FIG. 2 is an exemplary block diagram of a data processing system according to some embodiments of the present description. The data processing system can realize automatic processing and information input of files. As shown in fig. 2, data processing system 200 may include an acquisition module 210, a first processing module 220, a second processing module 230, an execution module 240, and an update module 250.

The acquisition module 210 may acquire the file to be processed. The file to be processed can be obtained by electronically processing a paper file. For example, the acquisition module 210 may digitally scan the paper file to obtain a scanned image: an image acquisition device such as a camera or a scanner is called to scan the paper file, and the scanned image is generated by identifying and converting the content contained in the paper file. The format of the scanned image may be DOC, DOCX, XLS, XLSX, DBF, BMP, JPG, TIFF, GIF, PNG, CAD, TXT, PDF, OFD or another file format, as determined by the file conversion algorithm or software used. The acquisition module 210 may perform image preprocessing on the scanned image to acquire the file to be processed. The image preprocessing operations may include one or more of graying, binarization, decontamination, and tilt detection and correction. When the paper file is a color document, the interference information contained in the color pixels needs to be filtered out. Accordingly, the acquisition module 210 may perform graying on the scanned image and then perform binarization to further separate text from background. The acquisition module 210 may also perform tilt detection and correction to ensure the correctness of subsequent processing. After the tilt correction is completed, if there are stains or dirt on the paper file, the acquisition module 210 may also perform decontamination to prevent inaccurate results in subsequent processing.
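By way of example only, the graying and binarization steps may be sketched in pure Python as follows. The luma weights and the use of Otsu's method to select the threshold are assumptions made for illustration; the present application does not specify how the threshold is chosen. Tilt correction and decontamination are omitted, and the function names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_gray(image: List[List[Pixel]]) -> List[List[int]]:
    """Graying: weighted sum of R, G and B (ITU-R BT.601 luma weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in image]


def otsu_threshold(gray: List[List[int]]) -> int:
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]            # pixels at or below t
        sum0 += t * hist[t]
        w1 = total - w0          # pixels above t
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray: List[List[int]]) -> List[List[int]]:
    """Binarization: 1 marks dark (text) pixels, 0 marks background."""
    t = otsu_threshold(gray)
    return [[1 if value <= t else 0 for value in row] for row in gray]


dark, light = (10, 10, 10), (240, 240, 240)
scanned = [[dark, light], [light, dark]]
binary = binarize(to_gray(scanned))
```

Graying first reduces each color pixel to a single intensity, so that the threshold only has to separate dark text from a light background.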

The first processing module 220 may process the file to be processed using the recognition model to obtain the first data. The recognition model may include a machine learning model based on a text recognition algorithm, including but not limited to CRNN, 2D-CTC, Attention, ACE, SVTR, R²AM, SAR, TrOCR, etc. The first processing module 220 may perform text recognition on the file to be processed by using the recognition model to obtain the text information carried by the file to be processed. The text information may include all text-related information on the file to be processed, for example, the specific content of the file, the text in an attached signature, text written by hand on the file (such as a signature), or any combination thereof. The first processing module 220 may designate the text information as the first data.
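The present application does not describe how the output of the recognition model is converted into text. For recognizers trained with a CTC objective, such as CRNN, one common option is greedy decoding: the most likely label of each frame is taken, consecutive repeats are collapsed, and blank labels are removed. The following sketch assumes that the per-frame labels are already available; the function name, the alphabet and the blank index are hypothetical.

```python
from itertools import groupby
from typing import Sequence


def ctc_greedy_decode(frame_labels: Sequence[int], alphabet: str, blank: int = 0) -> str:
    """Collapse consecutive repeated labels, then drop blanks.

    Label 0 is assumed to be the blank; label k (k >= 1) maps to alphabet[k - 1].
    """
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(alphabet[label - 1] for label in collapsed if label != blank)


# Frames: a a <blank> a b b <blank> c
decoded = ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 3], alphabet="abc")
```

The blank between the two runs of label 1 is what allows the doubled character to survive the collapsing step.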

The second processing module 230 may process the first data using an analysis model to obtain second data. The analysis model may include a machine learning model based on a keyword extraction algorithm, which may be an unsupervised or a supervised keyword extraction algorithm. The second processing module 230 may perform keyword extraction on the first data using the analysis model to obtain the structured information included in the first data. The second processing module 230 may process the first data using different analysis models for different types of files to be processed (i.e., types of paper files). The files to be processed include at least document files and urban construction files obtained after electronic processing. The analysis model may include at least a first analysis model corresponding to document files and a second analysis model corresponding to urban construction files. The second processing module 230 may first determine the file type of the file to be processed and then, based on the file type, determine whether to use the first analysis model or the second analysis model to perform keyword extraction on the first data. When the file to be processed is obtained from a document file, the structured information may include a plurality of keywords indicating the main content of the document file. The second processing module 230 may extract the keywords in the first data using the first analysis model and construct, based on the keywords, an independent sentence summarizing the document file. For example, the second processing module 230 may sequentially combine the items of structured information and add connectives between adjacent items to obtain an independent sentence summarizing the file to be processed. When the file to be processed is obtained from an urban construction file, the structured information may include all information carried on the urban construction file. The second processing module 230 may extract the various types of information in the table (e.g., header, row and column data) using the second analysis model, populate the information into a constructed data form (e.g., an Excel spreadsheet), and take the populated data form as the second data.
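The two branches described above (sentence construction for document files and table filling for urban construction files) may be sketched as follows. The sketch assumes that the structured information has already been extracted; the keyword extraction itself is not shown. The connectives, the field names, the file-type labels and the use of CSV in place of a spreadsheet are all illustrative assumptions.

```python
import csv
import io
from typing import Dict, List


def build_sentence(info: Dict[str, str]) -> str:
    """Combine time, place, person and event in order, with connectives in between."""
    return f"On {info['time']}, in {info['place']}, {info['person']} {info['event']}."


def fill_table(header: List[str], rows: List[List[str]]) -> str:
    """Fill a data table in the order found on the file; CSV stands in for a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def analyze(file_type: str, structured: dict) -> str:
    """Select the first or second analysis branch according to the file type."""
    if file_type == "document":
        return build_sentence(structured)
    if file_type == "urban_construction":
        return fill_table(structured["header"], structured["rows"])
    raise ValueError(f"unsupported file type: {file_type}")


summary = analyze("document", {
    "time": "18 January 2023",
    "place": "the meeting room",
    "person": "the review committee",
    "event": "approved the digitization plan",
})
table = analyze("urban_construction", {
    "header": ["Project", "Floor area"],
    "rows": [["Building A", "1200"], ["Building B", "950"]],
})
```

Dispatching on the file type before any processing keeps the two analysis branches independent, so a further file type could be supported by adding one more branch.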

The execution module 240 may be configured to designate the second data as characteristic information of the paper document to perform an information entry operation. The characteristic information may refer to information that can be used to indicate the main content or content summary of the paper archive. The execution module 240 can directly record the characteristic information in real time, so as to realize real-time storage of data.
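As one possible illustration of the information entry operation, the characteristic information may be written to a database as soon as it is produced. The use of SQLite, the table name and the column names below are assumptions made for the sketch; the present application does not specify the storage backend.

```python
import sqlite3


def enter_record(conn: sqlite3.Connection, archive_id: str, characteristic_info: str) -> None:
    """Store the characteristic information of one file immediately (hypothetical schema)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS archive_records "
            "(archive_id TEXT PRIMARY KEY, info TEXT)"
        )
        conn.execute(
            "INSERT OR REPLACE INTO archive_records (archive_id, info) VALUES (?, ?)",
            (archive_id, characteristic_info),
        )


conn = sqlite3.connect(":memory:")
enter_record(conn, "A-001", "summary sentence of the document file")
stored = conn.execute(
    "SELECT info FROM archive_records WHERE archive_id = ?", ("A-001",)
).fetchone()[0]
```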

The update module 250 may be used to update the recognition model and the analysis model. The update module 250 may update the recognition model and the analysis model using the first data and the second data as training samples, thereby improving the applicability and accuracy of the two models.
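A minimal sketch of such an update module is given below. It assumes that samples are buffered and that both models are updated once a batch has accumulated; the batching policy, the class name and the two retraining callables are hypothetical, since the present application does not state when or how retraining is triggered.

```python
from typing import Callable, List


class UpdateModule:
    """Hypothetical update module 250: buffers samples and retrains in batches."""

    def __init__(self,
                 retrain_recognition: Callable[[List[str]], None],
                 retrain_analysis: Callable[[List[str]], None],
                 batch_size: int = 2) -> None:
        self.retrain_recognition = retrain_recognition
        self.retrain_analysis = retrain_analysis
        self.batch_size = batch_size
        self.first_buffer: List[str] = []
        self.second_buffer: List[str] = []

    def add_sample(self, first_data: str, second_data: str) -> bool:
        """Store one sample pair; return True when an update was triggered."""
        self.first_buffer.append(first_data)
        self.second_buffer.append(second_data)
        if len(self.first_buffer) < self.batch_size:
            return False
        self.retrain_recognition(list(self.first_buffer))  # first data -> recognition model
        self.retrain_analysis(list(self.second_buffer))    # second data -> analysis model
        self.first_buffer.clear()
        self.second_buffer.clear()
        return True


calls = []
module = UpdateModule(lambda s: calls.append(("recognition", s)),
                      lambda s: calls.append(("analysis", s)))
triggered = [module.add_sample("text 1", "summary 1"),
             module.add_sample("text 2", "summary 2")]
```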

Additional description of the above modules may be found in the flow chart section of the present application, e.g., fig. 1.

It should be understood that the system shown in fig. 2 and its modules may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer-executable instructions and/or embodied in processor control code, for example provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system of the present specification and its modules may be implemented not only with hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field programmable gate arrays and programmable logic devices, but also with software executed by various types of processors, or with a combination of the above hardware circuits and software (e.g., firmware).

It should be noted that the above description of the modules is for convenience of description only and is not intended to limit the present description to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given the principles of the system, various modules may be combined arbitrarily or a subsystem may be constructed in connection with other modules without departing from such principles. For example, the first processing module 220 and the second processing module 230 may be different processing units under the same processing module, for acquiring the first data and the second data, respectively. For another example, each module may share one memory module, or each module may have a respective memory module. Such variations are within the scope of the present description.

Fig. 3 is an exemplary block diagram of a processing device, shown in accordance with some embodiments of the present application. Processing device 300 may include any of the components used to implement the systems described in embodiments of the present application. For example, processing device 300 may be implemented in hardware, software programs, firmware, or a combination thereof. For example, processing device 300 may implement data processing system 200. For convenience, only one processing device is depicted, but the computing functions described in embodiments of the present application may be implemented in a distributed manner on a set of similar platforms to distribute the processing load of the system.

In some embodiments, processing device 300 may include a processor 310, a memory 320, input/output components 330, and communication ports 340. In some embodiments, processor 310 (e.g., a CPU), in the form of one or more processors, may execute program instructions. In some embodiments, memory 320 includes different forms of program memory and data memory, such as a hard disk, read-only memory (ROM), random access memory (RAM), etc., for storing a wide variety of data files for processing and/or transmission by a computer. In some embodiments, input/output component 330 may be used to support input/output between processing device 300 and other components. In some embodiments, communication port 340 may be connected to a network for enabling data communication. An exemplary processing device may include program instructions stored in read-only memory (ROM), random access memory (RAM), and/or other types of non-transitory storage media for execution by processor 310. The methods and/or processes of the embodiments of the present description may be implemented in the form of program instructions. The processing device 300 may also receive the programs and data disclosed in the present application through network communication.

For ease of understanding, only one processor is schematically depicted in fig. 3. However, it should be noted that the processing device 300 in the embodiments of the present specification may include a plurality of processors, and thus the operations and/or methods described in the embodiments of the present specification as being implemented by one processor may also be implemented by a plurality of processors collectively or individually. For example, if in this specification the processor of the processing device 300 performs steps A and B, it should be understood that steps A and B may also be performed jointly or independently by two different processors of the processing device 300 (e.g., a first processor performing step A and a second processor performing step B, or the first and second processors jointly performing steps A and B).

Having described the basic concepts herein, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations to the present disclosure may occur to one skilled in the art. Such modifications, improvements, and adaptations are suggested by this specification and therefore remain within the spirit and scope of the exemplary embodiments of the present invention.

Meanwhile, the specification uses specific words to describe the embodiments of the specification. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present description. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present description may be combined as suitable.

Furthermore, those skilled in the art will appreciate that the various aspects of the specification can be illustrated and described in terms of several patentable categories or circumstances, including any novel and useful procedures, machines, products, or materials, or any novel and useful modifications thereof. Accordingly, aspects of the present description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the specification may take the form of a computer product, comprising computer-readable program code, embodied in one or more computer-readable media.

The computer storage medium may contain a propagated data signal with the computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take on a variety of forms, including electro-magnetic, optical, etc., or any suitable combination thereof. A computer storage medium may be any computer readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated through any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or a combination of any of the foregoing.

The computer program code necessary for operation of portions of the present description may be written in any one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python and the like, a conventional programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, a dynamic programming language such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or the code may be used in a cloud computing environment or offered as a service such as software as a service (SaaS).

Furthermore, the order in which the elements and sequences are processed, the use of numbers and letters, or other designations in the description are not intended to limit the order in which the processes and methods of the description are performed unless explicitly recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure by way of various examples, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the present disclosure. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.

Likewise, it should be noted that in order to simplify the presentation disclosed in this specification and thereby aid in understanding one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not intended to imply that the subject matter of the present description requires more features than are recited in the claims. Indeed, an embodiment may have fewer than all of the features of a single embodiment disclosed above.

In some embodiments, numbers describing quantities of components and attributes are used. It should be understood that such numbers used in the description of embodiments are, in some examples, modified by the modifier "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the number allows for a 20% variation. Accordingly, in some embodiments, numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties sought to be obtained by the individual embodiments. In some embodiments, the numerical parameters should take into account the specified significant digits and apply an ordinary rounding approach. Although the numerical ranges and parameters used to confirm the breadth of the range are approximations in some embodiments, in particular embodiments such numerical values are set as precisely as practicable.
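As a simple illustration of the 20% variation, a value may be treated as "about" a stated number when it deviates from that number by no more than 20% of the stated number. Reading the variation as a symmetric relative tolerance is an assumption made for this sketch, and the function name is hypothetical.

```python
def is_about(value: float, stated: float, variation: float = 0.20) -> bool:
    """Return True when value lies within +/- variation (relative) of the stated number."""
    return abs(value - stated) <= variation * abs(stated)


within = is_about(110, 100)   # deviation of 10%
outside = is_about(130, 100)  # deviation of 30%
```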

Each patent, patent application, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., referred to in this specification is incorporated herein by reference in its entirety. Excluded from this incorporation are application history documents that are inconsistent with or conflict with the content of this specification, as well as documents, currently or later attached to this specification, that limit the broadest scope of the claims of this specification. It is noted that, if the description, definition, and/or use of a term in material attached to this specification is inconsistent with or conflicts with what is described in this specification, the description, definition, and/or use of the term in this specification controls.

Finally, it should be understood that the embodiments described in this specification are merely illustrative of the principles of the embodiments of this specification. Other variations are possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present specification may be considered as consistent with the teachings of the present specification. Accordingly, the embodiments of the present specification are not limited to only the embodiments explicitly described and depicted in the present specification.

Claims

1. A method for processing archives and electronic documents, the method comprising:

acquiring a file to be processed, wherein the file to be processed is obtained by performing electronic processing on a paper file or an original electronic file, and at least comprises a document file obtained by performing electronic processing or a text electronic file;

processing the file to be processed by using an identification model to obtain first data;

processing the first data by using an analysis model to obtain second data; the second data is determined by the structural information of the file to be processed after being arranged and combined and/or information is supplemented, and the structural information at least comprises characters, time, places and events; the determining the second data includes:

constructing an independent sentence for summarizing the file to be processed based on the structured information, which comprises sequentially combining the structured information and adding connectives between adjacent items of structured information;

designating the independent sentence as the second data;

and designating the second data as characteristic information of the paper archive or the original electronic file to execute information input operation.

2. A method of processing files and electronic documents according to claim 1, wherein the electronic processing comprises:

digitally scanning the paper file to obtain a scanned image;

performing an image preprocessing operation on the scanned image, comprising: one or more of graying, binarizing, decontaminating, tilt detecting and correcting.

3. A method of processing archives and electronic documents as claimed in claim 1, wherein the recognition model includes a machine learning model based on a word recognition algorithm, the obtaining the first data includes:

performing character recognition on the file to be processed by using the recognition model to acquire character information carried by the file to be processed;

and designating the text information as the first data.

4. A method of processing archives and electronic documents as claimed in claim 1, wherein the analysis model includes a machine learning model based on a keyword extraction algorithm; the acquiring the second data includes:

extracting keywords from the first data by using the analysis model to obtain structural information included in the first data;

the second data is determined based on the structured information.

5. The method for processing files and electronic documents according to claim 4, wherein the files to be processed further comprise urban construction files obtained after electronic processing; the analysis model comprises a first analysis model corresponding to a document file or a text electronic file and a second analysis model corresponding to an urban construction file; the processing the first data using an analysis model includes:

determining the file type of the file to be processed;

determining, based on the file type, to perform keyword extraction on the first data using the first analysis model or the second analysis model.

6. The archive and electronic file processing method according to claim 5, wherein when the file to be processed is obtained based on an urban construction file, the structured information includes all information carried on the urban construction file; the determining the second data based on the structured information includes:

constructing a data table;

and filling the data table based on the structured information to acquire the second data, which comprises sequentially filling the data table according to the order of the structured information on the urban construction file.

7. A method of processing archives and electronic documents as claimed in claim 1, further comprising:

updating the identification model based on the first data;

updating the analytical model based on the second data.

8. A device for processing archives and electronic documents, the device comprising:

the file processing module is configured to obtain a file to be processed, wherein the file to be processed is obtained after electronic processing of a paper file or an original electronic file, and at least comprises a document file obtained after electronic processing or a text electronic file;

the first processing module is configured to process the file to be processed by using the identification model so as to acquire first data;

a second processing module configured to process the first data using an analytical model to obtain second data; the second data is determined by the structural information of the file to be processed after being arranged and combined and/or information is supplemented, and the structural information at least comprises characters, time, places and events; to determine the second data, the second processing module is further configured to:

designating the independent sentence as the second data;

and the execution module is configured to designate the second data as characteristic information of the paper archive or the original electronic file to execute information input operation.

9. The archive and electronic file processing device of claim 8, wherein to perform the electronic processing, the acquisition module is configured to:

digitally scanning the paper file to obtain a scanned image;

10. The archive and electronic file processing device of claim 8, wherein the recognition model comprises a machine learning model based on a word recognition algorithm, and wherein to obtain the first data, the first processing module is configured to:

and designating the text information as the first data.

11. An archive and electronic file processing device according to claim 8, wherein the analysis model includes a machine learning model based on a keyword extraction algorithm; to obtain the second data, the second processing module is configured to:

the second data is determined based on the structured information.

12. The archive and electronic file processing device according to claim 11, wherein the file to be processed further comprises an urban construction file obtained after electronic processing; the analysis model comprises a first analysis model corresponding to a document file or a text electronic file and a second analysis model corresponding to an urban construction file; to process the first data using an analysis model, the second processing module is configured to:

determine the file type of the file to be processed;

13. An archive and electronic file processing device according to claim 11, wherein when the file to be processed is obtained based on an urban construction file, the structured information includes all information carried on the urban construction file; to determine the second data based on the structured information, the second processing module is configured to:

construct a data table;

14. The archive and electronic file processing device of claim 8, further comprising an update module configured to:

updating the identification model based on the first data;

updating the analytical model based on the second data.

15. An archive and electronic document processing device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method according to any of claims 1-7.

16. A computer readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method according to any of claims 1-7.