CN113190502A - Archive management method based on deep learning - Google Patents
- Publication number
- CN113190502A
- Authority
- CN
- China
- Prior art keywords
- file
- picture
- archive
- files
- image
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/113—Details of archiving
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/103—Workflow collaboration or project management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The invention belongs to the technical field of machine learning, and in particular relates to an archive management method based on deep learning.
Description
Technical Field
The invention belongs to the technical field of machine learning, and in particular relates to an archive management method based on deep learning.
Background
At present, archive management is not supported by a digital platform. Archive compliance checks are still performed manually; because the number of archives is large, this requires a great deal of labor and time, and standardization and integrity cannot be well guaranteed. Unstructured data accounts for about 90% of archive data, and key information and knowledge are usually hidden in this unstructured data.
Deep learning (DL) is a newer research direction within machine learning. By learning the intrinsic patterns and hierarchical representations of sample data, including unstructured data, it gives a machine the ability to analyze and learn, so that it can recognize data such as text, images, and sound. Deep learning is a complex class of machine learning algorithms, and its results in image recognition far exceed those of earlier techniques. In the prior art, deep learning is widely applied in search, data mining, machine translation, natural language processing, and related fields.
In image recognition, archive data in the power grid industry is generally processed by using OCR to scan the text in an archive line by line to obtain unstructured data. Defects common in scanned images and digital photos in the archive data, such as distorted text lines, skew, and noise, can reduce recognition quality, so the recognition error rate is high.
Disclosure of Invention
To address the lack of digital platform support in existing archive management, the invention aims to provide an archive management method based on deep learning.
The deep learning-based archive management method of the invention comprises the following steps:
1) Archive collection: first, a configuration file is established on the terminal. During filing, the configuration file is read and it is judged whether the maximum volume number has been reached; the system then searches in sequence for the next archive directory that has not reached the maximum volume number, or automatically creates a new archive directory, and continues the filing operation. Paper archives are saved as pictures using external equipment, including a scanner. The system creates a directory on the file server according to the archive-related information and uploads the pictures automatically. If the scanning page is not closed and scanning continues, the system automatically treats the upload as a multi-page document. After uploading, the system automatically records the storage path of each file and the corresponding electronic archive number, and associates the electronic archive with the pictures, which completes the collection and upload of the electronic archive.
2) Archive assembly: an electronic archive library is established so that each electronic archive corresponds to a physical archive. A deep learning model verifies the electronic archive against the integrity requirements of the compliance rules, and the system reads the archive information using a recognition method: OCR scans the text in the archive to obtain unstructured data; operations such as noise reduction, binarization, and data cleaning are applied to the data; the valuable data is input into a decision tree to build a model; after the results are compared and checked for correctness, the parameters are modified and the weights adjusted to finally obtain a check model. The check model confirms the archive information, which is then automatically sorted and arranged according to the configured assembly template. This achieves intelligent, automatic assembly of archives, and one-click assembly into a bound volume according to user requirements.
3) Archive search and classification: the intelligent one-click volume grouping function groups archives that have passed inspection into volumes in an orderly, rule-based way, achieving structured management. Through keyword retrieval, intelligent analysis of content abstracts, attribute extraction, tag identification, and content association, the associations between the searched content and the electronic archives are summarized, providing data support for subsequent intelligent retrieval and analysis. At the same time, a big data intelligent analysis report is generated in real time.
4) Archive value appraisal: the value of an electronic archive is appraised on the basis of the technical advancement, functionality, typicality, and time limit of the production project archive to obtain a total archive value score, which serves as a reference for management decisions and daily work. A destruction standard score is set, and a destruction reminder and suggestion are issued for archives whose score falls below it. Big data is used for overall analysis and mining of value information, to display the relationships among projects in a targeted manner, and to compile and analyze the economic and social benefits that the projects bring.
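The binarization named in step 2) can be illustrated with a small sketch. The patent does not say which thresholding method is used; Otsu's method is assumed here purely as a common choice for scanned pages, and the function names `otsu_threshold` and `binarize` are hypothetical. The image is a plain list of rows of grey levels (0 to 255) so that the example has no dependencies.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Map dark pixels (ink) to 0 and bright pixels (paper) to 255."""
    t = otsu_threshold([p for row in image for p in row])
    return [[0 if p <= t else 255 for p in row] for row in image]


if __name__ == "__main__":
    scan = [[20, 30, 200], [25, 210, 220]]  # toy 2x3 greyscale scan
    print(binarize(scan))
```

In a real pipeline the binarized page would then be passed to the OCR engine; noise reduction and data cleaning are separate steps that this sketch does not cover.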
The recognition method in step 2) further comprises a picture recognition method based on deep learning: for picture archives within the electronic archives, the Keras deep learning framework and the VGG16 network model are used to recognize the content of the picture archives by searching images with an image, and features are extracted from the content of the picture archives.
Further, the image-based search is implemented through the following steps:
1) Extracting image features of the archive photos: to extract the image features of each photo archive accurately, the photos are first preprocessed to a uniform size. Keras loads a pre-trained VGG16 model and extracts the feature information of each image at a fully connected layer; the extracted information of each image is represented as a vector. The image features of the several pictures corresponding to each photo are aggregated into a vector set, and the vector sets are merged into an archive photo image feature index database.
2) Extracting image features of the picture to be retrieved: the goal is to find, in the photo archive, the photos that match the content of the picture to be retrieved. The key is to sample the feature information of the picture to be retrieved accurately: the picture is scaled and cropped with the Keras ImageDataGenerator tool into several pictures of size 224 × 224, from which a set of image feature vectors is extracted.
3) Comparing image features and returning the retrieval result: a locality-sensitive hashing algorithm compares the image feature vector set of the picture to be retrieved with the entries in the archive photo image feature index database; the greater the similarity value, the more similar the content of the two pictures. Apache Tomcat is deployed to host the image-based search as a Web application: the front end uploads the picture to be retrieved and displays the returned results, while the back end uses the Keras deep learning framework to extract features from the images in the photo archive one by one to obtain the image feature index database. To retrieve the set of pictures similar to the query picture, locality-sensitive hashing is used to build an index over the feature vectors extracted from all archive photos, and comparing against the features of the picture to be retrieved through this index speeds up the search.
The invention performs intelligent archive management based on deep learning technology, improves the utilization of archive resources and the level of intelligent archive management, reduces management costs, and helps ensure the integrity, authenticity, reliability, and usability of scientific and technical archives.
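The comparison in step 3) can be sketched with a small locality-sensitive hashing index. The patent does not specify the LSH family; random-hyperplane hashing for cosine similarity is assumed here, and the class name `HyperplaneLSH` and its parameters are hypothetical. The toy vectors stand in for VGG16 features (whose fully connected layers output 4096-dimensional vectors); no Keras call is made in this sketch.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class HyperplaneLSH:
    """Several hash tables, each keyed by the signs of random projections."""

    def __init__(self, dim: int, n_bits: int = 8, n_tables: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)
        ]
        self.tables: List[Dict[tuple, List[str]]] = [
            defaultdict(list) for _ in range(n_tables)
        ]
        self.vectors: Dict[str, Sequence[float]] = {}

    def _key(self, table_idx: int, v: Sequence[float]) -> tuple:
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            sum(p * x for p, x in zip(plane, v)) >= 0
            for plane in self.planes[table_idx]
        )

    def add(self, photo_id: str, v: Sequence[float]) -> None:
        self.vectors[photo_id] = v
        for i, table in enumerate(self.tables):
            table[self._key(i, v)].append(photo_id)

    def query(self, v: Sequence[float], top_k: int = 3) -> List[Tuple[str, float]]:
        # Only photos sharing a bucket with the query are compared exactly.
        candidates = set()
        for i, table in enumerate(self.tables):
            candidates.update(table.get(self._key(i, v), []))
        ranked = sorted(
            ((cosine(v, self.vectors[c]), c) for c in candidates), reverse=True
        )
        return [(c, s) for s, c in ranked[:top_k]]


if __name__ == "__main__":
    index = HyperplaneLSH(dim=4)
    index.add("photo_a", [1.0, 0.0, 0.0, 0.0])
    index.add("photo_b", [0.0, 1.0, 0.0, 0.0])
    print(index.query([2.0, 0.0, 0.0, 0.0]))
```

The speed-up comes from comparing the query only against photos that fall into the same bucket in at least one table, rather than against every vector in the index; more bits per table make buckets smaller, and more tables reduce the chance of missing a true neighbor.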
Detailed Description
Example 1: the archive management method is implemented as follows.
Archive collection: first, a configuration file is established on the terminal. During filing, the configuration file is read and it is judged whether the maximum volume number has been reached; the system searches in sequence for the next archive directory that has not reached the maximum volume number, or automatically creates a new archive directory, and continues the filing operation. Paper archives are saved as pictures using external equipment, including a scanner. OCR text extraction is first performed on the project information in the project text data; the resulting unstructured or semi-structured data undergoes regular preprocessing and noise-reduction preprocessing and is then input into a decision tree, where it is analyzed according to the specific business logic to obtain an analysis result. The parameters and weights are adjusted after the result has been manually reviewed and compared. A large amount of data is then input into an artificial neural network for training, with part of the data set held out as a test data set; once testing shows that the deviation of the results is within the error range, the model is formed. The system creates a directory on the file server according to the archive-related information and uploads the pictures automatically. If the scanning page is not closed and scanning continues, the system automatically treats the upload as a multi-page document. After uploading, the system automatically records the storage path of each file and the corresponding electronic archive number, and associates the electronic archive with the pictures, which completes the collection and upload of the electronic archive.
Archive assembly: an electronic archive library is established so that each electronic archive corresponds to a physical archive. A deep learning model verifies the integrity of the electronic archive against the compliance rules, and the system reads the archive information using a recognition method, including but not limited to OCR text recognition. For picture archives within the electronic archives, the Keras deep learning framework and the VGG16 network model are used to recognize the content of the picture archives by searching images with an image, and features are extracted from the content of the picture archives. The recognized archive information is automatically sorted and arranged according to the configured assembly template, which achieves intelligent, automatic assembly of archives and improves the efficiency of assembly work; archives are exported and bound on demand, which reduces assembly and binding costs.
Archive search and classification: through keyword retrieval, intelligent analysis of content abstracts, attribute extraction, tag identification, and content association, the associations between the searched content and the electronic archives are summarized, providing data support for subsequent intelligent retrieval and analysis. At the same time, a big data intelligent analysis report is generated in real time.
Archive value appraisal: the value of an electronic archive is appraised on the basis of the technical advancement, functionality, typicality, and time limit of the production project archive to obtain a total archive value score, which serves as a reference for management decisions and daily work. A destruction standard score is set, and a destruction reminder and suggestion are issued for archives whose score falls below it.
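The directory selection at the start of the collection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: "maximum volume number" is read here as a per-directory file limit, and the JSON key `max_files_per_volume`, the `volume_NNNN` naming scheme, and the functions `next_volume_dir` and `file_scan` are all hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def next_volume_dir(root: Path, max_files: int) -> Path:
    """Return the first volume directory with room, creating one if all are full."""
    root.mkdir(parents=True, exist_ok=True)
    volumes = sorted(
        p for p in root.iterdir() if p.is_dir() and p.name.startswith("volume_")
    )
    for vol in volumes:
        if sum(1 for f in vol.iterdir() if f.is_file()) < max_files:
            return vol
    new_vol = root / f"volume_{len(volumes) + 1:04d}"
    new_vol.mkdir()
    return new_vol


def file_scan(
    root: Path,
    config_path: Path,
    registry: Dict[str, List[str]],
    archive_no: str,
    page: int,
    image_bytes: bytes,
) -> Path:
    """Store one scanned page and record its path under the archive number."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    vol = next_volume_dir(root, config["max_files_per_volume"])
    target = vol / f"{archive_no}_p{page:03d}.png"
    target.write_bytes(image_bytes)
    # Associate the electronic archive number with the stored picture.
    registry.setdefault(archive_no, []).append(str(target))
    return target
```

Calling `file_scan` repeatedly with the same archive number and increasing page numbers mirrors the multi-page case described above: every page ends up in the registry entry for that archive, even when the pages spill over into a newly created directory.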
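The template-driven ordering in the assembly step can be sketched as below. The `ArchiveDoc` fields, the document type names, and the idea of reporting template entries with no matching document are assumptions made for illustration; the rule-based completeness check here is a simple stand-in and is not the deep learning verification model described above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ArchiveDoc:
    archive_no: str
    doc_type: str  # type recognised from the document, e.g. via OCR
    date: str      # ISO date string, so string order equals date order


def assemble(
    docs: List[ArchiveDoc], template: List[str]
) -> Tuple[List[ArchiveDoc], List[str]]:
    """Order documents by the template and list template types that are missing."""
    rank = {doc_type: i for i, doc_type in enumerate(template)}
    present = {d.doc_type for d in docs}
    missing = [t for t in template if t not in present]
    known = [d for d in docs if d.doc_type in rank]
    ordered = sorted(known, key=lambda d: (rank[d.doc_type], d.date, d.archive_no))
    return ordered, missing


if __name__ == "__main__":
    template = ["approval", "contract", "acceptance_report"]
    docs = [
        ArchiveDoc("A-003", "contract", "2021-03-01"),
        ArchiveDoc("A-001", "approval", "2021-01-15"),
        ArchiveDoc("A-002", "contract", "2021-02-10"),
    ]
    ordered, missing = assemble(docs, template)
    print([d.archive_no for d in ordered], missing)
```

Keeping the template as an ordered list of type names means a different assembly layout only requires a different list, and the `missing` result gives a simple signal that a volume is incomplete before it is exported for binding.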
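Keyword retrieval over recognized archive text can be sketched with an inverted index. The patent does not describe its retrieval data structure, so this is only one plausible approach; the function names are hypothetical, and the tokenizer handles ASCII words only (Chinese text would need a word segmenter, which is outside this sketch).

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each token to the set of archive numbers whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for archive_no, text in docs.items():
        for token in tokenize(text):
            index[token].add(archive_no)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return archive numbers that contain every keyword in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


if __name__ == "__main__":
    docs = {
        "A-001": "Substation upgrade project approval",
        "A-002": "Substation maintenance contract",
        "A-003": "Transmission line acceptance report",
    }
    index = build_index(docs)
    print(search(index, "substation contract"))
```

The same index could also hold extracted attributes or tags as extra tokens, which is one way the keyword, attribute, and tag signals mentioned above might be combined.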
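The scoring and destruction reminder can be sketched as a weighted average over the four named criteria. The patent does not give a formula, weights, or a scale, so the weighted mean, the 0 to 100 rating scale, the criterion keys, and the function names below are all assumptions.

```python
from typing import Dict

# The four criteria named in the text; the key names are illustrative.
CRITERIA = ("advancement", "functionality", "typicality", "time_limit")


def value_score(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of the per-criterion ratings."""
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(ratings[c] * weights[c] for c in CRITERIA) / total_weight


def appraise(
    ratings: Dict[str, float], weights: Dict[str, float], destroy_below: float
) -> Dict[str, object]:
    """Return the total score and whether a destruction reminder should be shown."""
    score = value_score(ratings, weights)
    return {"score": round(score, 2), "suggest_destruction": score < destroy_below}


if __name__ == "__main__":
    ratings = {"advancement": 80, "functionality": 60, "typicality": 40, "time_limit": 20}
    weights = {c: 1.0 for c in CRITERIA}
    print(appraise(ratings, weights, destroy_below=60))
```

The result is only a prompt: consistent with the text, a score below the destruction standard produces a reminder and suggestion, and the decision itself stays with the archive manager.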
Claims (2)
1. An archive management method based on deep learning, characterized in that it is implemented through the following steps:
(1) archive collection: first, establishing a configuration file on the terminal, reading the configuration file during filing, and judging whether the maximum volume number has been reached; searching in sequence for the next archive directory that has not reached the maximum volume number, or automatically creating a new archive directory, and continuing the filing operation; saving paper archives as pictures using external equipment, including a scanner; the system creating a directory on the file server according to the archive-related information and uploading the pictures automatically; if the scanning page is not closed and scanning continues, automatically treating the upload as a multi-page document; after uploading, the system automatically recording the storage path of each file and the corresponding electronic archive number, and associating the electronic archive with the pictures, thereby completing the collection and upload of the electronic archive;
(2) archive assembly: establishing an electronic archive library so that each electronic archive corresponds to a physical archive; verifying the electronic archive against the integrity requirements of the compliance rules using a deep learning model; the system reading the archive information using a recognition method, in which OCR scans the text in the archive to obtain unstructured data, operations such as noise reduction, binarization, and data cleaning are applied to the data, the valuable data is input into a decision tree to build a model, and, after the results are compared and checked for correctness, the parameters are modified and the weights adjusted to finally obtain a check model; confirming the archive information with the check model and automatically sorting and arranging it according to the configured assembly template, thereby achieving intelligent, automatic assembly of archives and one-click assembly into a bound volume according to user requirements;
(3) archive search and classification: the intelligent one-click volume grouping function grouping archives that have passed inspection into volumes in an orderly, rule-based way, achieving structured management; through keyword retrieval, intelligent analysis of content abstracts, attribute extraction, tag identification, and content association, summarizing the associations between the searched content and the electronic archives, and providing data support for subsequent intelligent retrieval and analysis; at the same time, generating a big data intelligent analysis report in real time;
(4) archive value appraisal: appraising the value of the electronic archive on the basis of the technical advancement, functionality, typicality, and time limit of the production project archive to obtain a total archive value score, which serves as a reference for management decisions and daily work; setting a destruction standard score, and issuing a destruction reminder and suggestion for archives whose score falls below it; using big data for overall analysis and mining of value information, displaying the relationships among projects in a targeted manner, and compiling and analyzing the economic and social benefits that the projects bring.
2. The archive management method based on deep learning of claim 1, wherein the recognition method of step (2) further comprises a picture recognition method based on deep learning; for picture archives within the electronic archives, the Keras deep learning framework and the VGG16 network model are used to recognize the content of the picture archives by searching images with an image, and features are extracted from the content of the picture archives, wherein the image-based search is implemented through the following steps:
(1) extracting image features of the archive photos: to extract the image features of each photo archive accurately, the photos are first preprocessed to a uniform size; Keras loads a pre-trained VGG16 model and extracts the feature information of each image at a fully connected layer, the extracted information of each image being represented as a vector; the image features of the several pictures corresponding to each photo are aggregated into a vector set, and the vector sets are merged into an archive photo image feature index database;
(2) extracting image features of the picture to be retrieved: to find, in the photo archive, the photos that match the content of the picture to be retrieved, the feature information of the picture to be retrieved is sampled accurately by scaling and cropping the picture with the Keras ImageDataGenerator tool into several pictures of size 224 × 224, from which a set of image feature vectors is extracted;
(3) comparing image features and returning the retrieval result: a locality-sensitive hashing algorithm compares the image feature vector set of the picture to be retrieved with the entries in the archive photo image feature index database, wherein the greater the similarity value, the more similar the content of the two pictures.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110102810 | 2021-01-26 | | |
CN2021101028108 | 2021-01-26 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113190502A (en) | 2021-07-30 |
Family
ID=76983856
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110489471.3A (Pending, published as CN113190502A) | Archive management method based on deep learning | 2021-01-26 | 2021-05-06 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113190502A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||