CN113190502A - Archive management method based on deep learning - Google Patents

Info

Publication number
CN113190502A
CN113190502A CN202110489471.3A CN202110489471A CN113190502A CN 113190502 A CN113190502 A CN 113190502A CN 202110489471 A CN202110489471 A CN 202110489471A CN 113190502 A CN113190502 A CN 113190502A
Authority
CN
China
Prior art keywords
file
picture
archive
files
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110489471.3A
Other languages
Chinese (zh)
Inventor
刘伊玲
杨坤丽
柯燕
周琼凤
吴冬梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Center of Yunnan Power Grid Co Ltd
Original Assignee
Information Center of Yunnan Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Center of Yunnan Power Grid Co Ltd
Publication of CN113190502A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/113 Details of archiving
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The invention belongs to the technical field of machine learning, and in particular relates to an archive management method based on deep learning.

Description

Archive management method based on deep learning
Technical Field
The invention belongs to the technical field of machine learning, and in particular relates to an archive management method based on deep learning.
Background
At present, archive management is not supported by a digital platform, and archive compliance checks are still performed manually. Because the number of archives is large, this requires a large investment of labor and time, and standardization and integrity cannot be well guaranteed. Unstructured data accounts for about 90% of archive data, and key information and knowledge are usually hidden in this unstructured data.
Deep learning (DL) is a new research direction in the field of machine learning. By learning the intrinsic rules and representation hierarchy of sample data, including unstructured data, it gives a terminal device the ability to analyze and learn, so that it can recognize data such as text, images and sound. Deep learning is a complex machine learning algorithm, and its results in image recognition far exceed those of earlier related techniques. In the prior art, deep learning is widely applied in fields such as search technology, data mining, machine learning, machine translation and natural language processing.
In image recognition, archive data in the power grid industry is generally processed by scanning the text in an archive line by line with OCR to obtain unstructured data. Defects common in the scanned images and digital photographs found in archive data, such as distorted text lines, skew and noise, may reduce recognition quality, so the recognition error rate is high.
Disclosure of Invention
To address the lack of digital platform support in existing archive management, the invention aims to provide an archive management method based on deep learning.
The archive management method based on deep learning disclosed by the invention is characterized by comprising the following steps:
1) Archive collection: first, a configuration file is established in a terminal. During filing, the configuration file is read and it is judged whether the maximum number of volumes has been reached; if so, the next archive directory that has not reached the maximum number of volumes is located in sequence, or a new archive directory is created automatically, and the filing operation continues. Paper archives are saved in picture format with the help of external equipment including a scanner. The system creates a directory on the file server according to the archive-related information and uploads the pictures automatically; if the scanning page is not closed and scanning continues, the system automatically treats the upload as a multi-page document. After uploading, the system automatically records the storage path of each file and the corresponding electronic archive number, and associates the electronic archive with the pictures, thereby completing the acquisition and uploading of the electronic archive.
2) Archive assembly: an electronic archive library is established so that each electronic archive corresponds to a physical archive. The system uses a deep learning model to verify the electronic archive against the integrity requirements of the compliance rules, and reads the archive information by a recognition method: the text in the archive is scanned by OCR to obtain unstructured data; the data is subjected to operations such as noise reduction, binarization and data cleaning; the valuable data is input into a decision tree to construct a model; after the authenticity of the results is compared, the parameters are modified and the weights adjusted to finally obtain a check model. The check model confirms the archive information, which is then automatically ordered and arranged according to a configured assembly template, realizing intelligent, automatic assembly of the archive and one-click assembly into a volume according to user requirements.
3) Archive search and classification: the intelligent one-click volume grouping function organizes archives that pass inspection into orderly, regular volumes, achieving structured management. Through keyword retrieval, intelligent analysis of content abstracts, attribute extraction, tag identification and content association, the associations between the searched content and the electronic archives are summarized, providing data support for subsequent intelligent retrieval and analysis. At the same time, a big data intelligent analysis report is generated in real time.
4) Archive value appraisal: the value of an electronic archive is appraised on the basis of the technical advancement, functionality, typicality and time limit of the production project archive to obtain a total archive value score, which provides a reference for management decisions and daily work. A destruction standard score is set, and a destruction reminder and suggestion is given for archives scoring below it. Big data is used for overall analysis and mining of value information, so that the relationships among projects are displayed in a targeted manner and the economic and social benefits brought by the projects are counted and analyzed.
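The directory selection in step 1) can be illustrated with a minimal sketch. It assumes that each archive directory holds at most a fixed number of volume files, that the limit comes from a configuration value, and that directories are named `dir_0001`, `dir_0002` and so on; the function names, the naming scheme and the use of a temporary folder in place of the file server are all hypothetical and are not specified in the source.

```python
import tempfile
from pathlib import Path


def pick_directory(root, max_volumes):
    """Return the first directory under root that is not full, creating one if needed."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    existing = sorted(p for p in root.iterdir() if p.is_dir())
    for directory in existing:
        # A directory counts as full once it holds max_volumes entries.
        if sum(1 for _ in directory.iterdir()) < max_volumes:
            return directory
    # Every directory is full (or none exists yet): create the next one.
    new_dir = root / f"dir_{len(existing) + 1:04d}"
    new_dir.mkdir()
    return new_dir


def file_volume(root, max_volumes, name, content):
    """Store one scanned volume and return its storage path for later association."""
    path = pick_directory(root, max_volumes) / name
    path.write_bytes(content)
    return path


# Demonstration with a hypothetical limit of two volumes per directory.
config = {"max_volumes": 2}
with tempfile.TemporaryDirectory() as tmp:
    paths = [file_volume(tmp, config["max_volumes"], f"vol_{i}.png", b"") for i in range(5)]
    names = [p.parent.name for p in paths]
print(names)
```

The returned path plays the role of the storage path that the system records alongside the electronic archive number.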
The recognition method in step 2) further comprises a picture recognition method based on deep learning: for picture files in the electronic archive, the Keras deep learning framework and the VGG16 network model are used to recognize the content of the picture archive by means of search-by-image, and features are extracted from the content of the picture archive.
Further, search-by-image is realized by the following steps:
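A minimal sketch of feature extraction with Keras and VGG16 is given here for orientation. The source only says that features are taken at a fully connected layer; the choice of the `fc2` layer, the ImageNet weights, the L2 normalization step and the function names are assumptions made for this sketch. TensorFlow is imported lazily so that the helper can be read and tested without it installed.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that later similarity comparisons are consistent."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_extractor(layer_name="fc2"):
    """Build a model that outputs a fully connected layer of pre-trained VGG16 (assumed: fc2)."""
    from tensorflow.keras.applications.vgg16 import VGG16
    from tensorflow.keras.models import Model

    base = VGG16(weights="imagenet", include_top=True)
    return Model(inputs=base.input, outputs=base.get_layer(layer_name).output)


def extract_features(extractor, image_path):
    """Resize one picture to 224 x 224 and return its normalized feature vector."""
    import numpy as np
    from tensorflow.keras.applications.vgg16 import preprocess_input
    from tensorflow.keras.utils import img_to_array, load_img

    img = load_img(image_path, target_size=(224, 224))
    batch = np.expand_dims(img_to_array(img), axis=0)
    features = extractor.predict(preprocess_input(batch))[0]
    return l2_normalize(features.tolist())
```

With the `fc2` layer, each picture is represented by a 4096-dimensional vector; a different layer would give a different dimensionality.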
1) Extracting image features of the archive photos: to extract the image features of each photo archive accurately, the photos are first preprocessed to a uniform size. Keras loads a pre-trained VGG16 model to extract the feature information of each image at a fully connected layer; after extraction, the information of each image is represented as a vector. The image features of the multiple pictures corresponding to each photo are aggregated into a vector set, and the vector sets are integrated into an archive photo image feature index database.
2) Extracting image features of the picture to be retrieved: to find the photos in the picture archive that match the content of the picture to be retrieved, the key is to sample the feature information of that picture accurately. The picture is scaled and cropped with the ImageDataGenerator tool into multiple pictures of size 224 × 224, from which a set of image feature vectors is extracted.
3) Comparing the image features and returning the retrieval result: the image feature vector set of the picture to be retrieved is compared with the entries in the archive photo image feature index database using a locality-sensitive hashing algorithm; the greater the similarity, the more alike the content of the two pictures. Apache Tomcat is deployed to host the search-by-image web application: the front end uploads the picture to be retrieved and displays the returned results, while the back end uses the Keras deep learning framework to extract features from the images in the photo archive one by one to obtain the image feature index library. To retrieve the set of pictures similar to a query picture, the locality-sensitive hashing algorithm is used to build an index over the feature vectors extracted from all archive pictures, and comparing the features of the picture to be retrieved against this index speeds up retrieval.
The invention performs intelligent archive management based on deep learning technology, improves the utilization of archive resources and the level of intelligent archive management, reduces management cost, and ensures the integrity, authenticity, reliability and usability of scientific and technical archives.
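The comparison in step 3) can be sketched with one common form of locality-sensitive hashing, random-hyperplane hashing for cosine similarity. The source does not say which LSH family is used, so this choice, the class and parameter names, and the use of one short vector per picture (rather than a set of VGG16 vectors per picture) are assumptions for illustration only.

```python
import math
import random


def make_hyperplanes(dim, n_bits, seed):
    """Draw n_bits random hyperplanes; each contributes one bit of the hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """Hash a vector to a bit tuple: which side of each hyperplane it falls on."""
    return tuple(
        1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
        for plane in planes
    )


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class LshIndex:
    """Several hash tables; similar vectors tend to share a bucket in at least one."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        self.tables = [(make_hyperplanes(dim, n_bits, seed + t), {}) for t in range(n_tables)]
        self.vectors = {}

    def add(self, key, vec):
        self.vectors[key] = vec
        for planes, buckets in self.tables:
            buckets.setdefault(signature(vec, planes), []).append(key)

    def query(self, vec, top_k=3):
        # Only vectors sharing a bucket with the query are compared exactly.
        candidates = set()
        for planes, buckets in self.tables:
            candidates.update(buckets.get(signature(vec, planes), []))
        scored = [(key, cosine(vec, self.vectors[key])) for key in candidates]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


# Toy 4-dimensional vectors stand in for real image features.
index = LshIndex(dim=4)
index.add("photo_a", [1.0, 0.2, 0.0, 0.1])
index.add("photo_b", [-1.0, -0.2, 0.0, -0.1])
results = index.query([1.0, 0.2, 0.0, 0.1])
```

The speed-up comes from comparing the query only against bucket candidates instead of every archive picture; more tables raise recall at the cost of memory.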
Detailed Description
Example 1: the archive management method is implemented as follows.
Archive collection: first, a configuration file is established in a terminal. During filing, the configuration file is read and it is judged whether the maximum number of volumes has been reached; if so, the next archive directory that has not reached the maximum number of volumes is located in sequence, or a new archive directory is created automatically, and the filing operation continues. Paper archives are saved in picture format with the help of external equipment including a scanner. OCR text extraction is first performed on the project information in the project text material; the resulting unstructured or semi-structured data undergoes regular preprocessing and noise reduction, and is then input into a decision tree, where it is analyzed according to the specific business logic to obtain an analysis result. After manual intervention and comparison of the results, the parameters and weights are adjusted. A large amount of data is then input into an artificial neural network for training, with part of the data set held out as a test set; once testing shows that the deviation of the results is within the error range, the model is formed. The system creates a directory on the file server according to the archive-related information and uploads the pictures automatically; if the scanning page is not closed and scanning continues, the system automatically treats the upload as a multi-page document. After uploading, the system automatically records the storage path of each document and the corresponding electronic archive number, and associates the electronic archive with the pictures, thereby completing the acquisition and uploading of the electronic archive.
Archive assembly: an electronic archive library is established so that each electronic archive corresponds to a physical archive. A deep learning model verifies the integrity of the electronic archive according to the compliance rules, and the system reads the archive information by a recognition method, which includes but is not limited to OCR text recognition. For picture files in the electronic archive, the Keras deep learning framework and the VGG16 network model are used to recognize the content of the picture archive by means of search-by-image, and features are extracted from its content. The recognized archive information is automatically ordered and arranged according to the configured assembly template, realizing intelligent, automatic assembly of the archive and improving assembly efficiency; the result is exported and bound as needed, reducing assembly and binding costs.
Archive search and classification: through keyword retrieval, intelligent analysis of content abstracts, attribute extraction, tag identification and content association, the associations between the searched content and the electronic archives are summarized, providing data support for subsequent intelligent retrieval and analysis. At the same time, a big data intelligent analysis report is generated in real time.
Archive value appraisal: the value of an electronic archive is appraised on the basis of the technical advancement, functionality, typicality and time limit of the production project archive to obtain a total archive value score, which provides a reference for management decisions and daily work. A destruction standard score is set, and a destruction reminder and suggestion is given for archives scoring below it.
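The preprocessing of OCR output before it reaches the decision tree might look like the following sketch. It assumes that "regular preprocessing" refers to regular-expression cleanup and that project information appears as "key: value" lines; both are interpretations rather than statements from the source, and the field names and the archive number in the sample are invented.

```python
import re

# Hypothetical pattern for "key: value" lines; accepts ASCII and full-width colons.
FIELD_PATTERN = re.compile(r"^(?P<key>[^:：]{1,40})[:：]\s*(?P<value>.+)$")


def clean_ocr_text(raw):
    """Remove typical OCR noise: stray control characters, ragged spacing, blank lines."""
    text = raw.replace("\u3000", " ")                      # full-width space
    text = re.sub(r"[^\S\n]+", " ", text)                  # collapse whitespace, keep newlines
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)   # drop control characters
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def extract_fields(text):
    """Turn cleaned lines into a dictionary of candidate features."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            fields[match.group("key").strip()] = match.group("value").strip()
    return fields


sample = "Project  name:\tSubstation A\r\n\r\n\x07Archive no：  YN-001  \n"
cleaned = clean_ocr_text(sample)
fields = extract_fields(cleaned)
```

The resulting dictionary is one possible form of the semi-structured input that a downstream classifier could consume.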
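Ordering recognized documents by an assembly template, together with a simple completeness check, can be sketched as follows. The document types in the template, the `Document` fields and the rule of sorting by date within a type are hypothetical; the source states only that documents are ordered according to a configured template and checked for integrity.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_type: str   # type recognized from the scanned content
    title: str
    date: str       # ISO date, so string order equals chronological order


# Hypothetical assembly template: the required document types, in binding order.
TEMPLATE = ["cover", "approval", "contract", "acceptance_report"]


def missing_types(docs, template):
    """Integrity check: list the template entries with no matching document."""
    present = {d.doc_type for d in docs}
    return [t for t in template if t not in present]


def assemble(docs, template):
    """Order documents by template position, then by date within the same type."""
    position = {t: i for i, t in enumerate(template)}
    known = [d for d in docs if d.doc_type in position]
    return sorted(known, key=lambda d: (position[d.doc_type], d.date))


docs = [
    Document("contract", "Supply contract B", "2020-03-01"),
    Document("cover", "Volume cover", "2020-01-01"),
    Document("contract", "Supply contract A", "2020-02-01"),
    Document("photo", "Site photo", "2020-02-15"),
]
ordered_titles = [d.title for d in assemble(docs, TEMPLATE)]
gaps = missing_types(docs, TEMPLATE)
```

Documents whose type is not in the template are left out of the bound volume here; a real system might instead route them to manual review.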
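Keyword retrieval over electronic archives is commonly built on an inverted index, sketched below. The source does not describe its retrieval data structure, so the index, the AND semantics for multi-word queries and the sample archive numbers are assumptions; the tokenizer handles only ASCII words, and Chinese text would need a word segmenter instead.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase ASCII word tokenizer (a placeholder for a real segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(archives):
    """Map each keyword to the set of archive numbers whose text contains it."""
    index = defaultdict(set)
    for archive_id, text in archives.items():
        for token in tokenize(text):
            index[token].add(archive_id)
    return index


def search(index, query):
    """Return the archives containing every keyword in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


archives = {
    "A-001": "Substation construction contract, 110 kV line",
    "A-002": "Transformer acceptance report for substation",
    "A-003": "Annual line inspection report",
}
index = build_index(archives)
```

For example, `search(index, "substation report")` returns only the archive whose text contains both words.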
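The total value score and the destruction reminder can be illustrated with a weighted sum. The four criteria come from the text; the weights, the 0 to 100 scale and the threshold of 40 are hypothetical values chosen for this sketch, since the source does not state how the criteria are combined.

```python
# Hypothetical weights for the four criteria named in the text; they sum to 1.
WEIGHTS = {
    "advancement": 0.3,
    "functionality": 0.3,
    "typicality": 0.2,
    "time_limit": 0.2,
}
DESTRUCTION_THRESHOLD = 40.0  # hypothetical destruction standard score


def total_score(scores, weights=WEIGHTS):
    """Combine per-criterion scores (0 to 100) into one weighted total."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


def appraise(archive_id, scores, threshold=DESTRUCTION_THRESHOLD):
    """Return the total score and whether a destruction reminder should be raised."""
    total = total_score(scores)
    return {
        "archive_id": archive_id,
        "total": round(total, 2),
        "suggest_destruction": total < threshold,
    }


keep = appraise("A-001", {"advancement": 80, "functionality": 70, "typicality": 60, "time_limit": 50})
drop = appraise("A-002", {"advancement": 20, "functionality": 30, "typicality": 10, "time_limit": 40})
```

The flag is only a suggestion, in line with the text: the decision to destroy an archive stays with the archivist.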

Claims (2)

1. An archive management method based on deep learning, characterized by being realized through the following steps:
(1) archive collection: first, a configuration file is established in a terminal; during filing, the configuration file is read and it is judged whether the maximum number of volumes has been reached; the next archive directory that has not reached the maximum number of volumes is located in sequence, or a new archive directory is created automatically, and the filing operation continues; paper archives are saved in picture format with the help of external equipment including a scanner; the system creates a directory on the file server according to the archive-related information and uploads the pictures automatically; if the scanning page is not closed and scanning continues, the system automatically treats the upload as a multi-page document; after uploading, the system automatically records the storage path of each file and the corresponding electronic archive number, and associates the electronic archive with the pictures, thereby completing the acquisition and uploading of the electronic archive;
(2) archive assembly: an electronic archive library is established so that each electronic archive corresponds to a physical archive; the system uses a deep learning model to verify the electronic archive against the integrity requirements of the compliance rules, and reads the archive information by a recognition method, wherein the text in the archive is scanned by OCR to obtain unstructured data, the data is subjected to operations such as noise reduction, binarization and data cleaning, the valuable data is input into a decision tree to construct a model, and after the authenticity of the results is compared, the parameters are modified and the weights adjusted to finally obtain a check model; the check model confirms the archive information, which is then automatically ordered and arranged according to a configured assembly template, realizing intelligent, automatic assembly of the archive and one-click assembly into a volume according to user requirements;
(3) archive search and classification: the intelligent one-click volume grouping function organizes archives that pass inspection into orderly, regular volumes, achieving structured management; through keyword retrieval, intelligent analysis of content abstracts, attribute extraction, tag identification and content association, the associations between the searched content and the electronic archives are summarized, providing data support for subsequent intelligent retrieval and analysis; at the same time, a big data intelligent analysis report is generated in real time;
(4) archive value appraisal: the value of an electronic archive is appraised on the basis of the technical advancement, functionality, typicality and time limit of the production project archive to obtain a total archive value score, which provides a reference for management decisions and daily work; a destruction standard score is set, and a destruction reminder and suggestion is given for archives scoring below it; big data is used for overall analysis and mining of value information, so that the relationships among projects are displayed in a targeted manner and the economic and social benefits brought by the projects are counted and analyzed.
2. The archive management method based on deep learning of claim 1, wherein the recognition method of step (2) further comprises a picture recognition method based on deep learning; for picture files in the electronic archive, the Keras deep learning framework and the VGG16 network model are used to recognize the content of the picture archive by means of search-by-image, and features are extracted from the content of the picture archive, wherein search-by-image is realized through the following steps:
(1) extracting image features of the archive photos: to extract the image features of each photo archive accurately, the photos are first preprocessed to a uniform size; Keras loads a pre-trained VGG16 model to extract the feature information of each image at a fully connected layer; after extraction, the information of each image is represented as a vector; the image features of the multiple pictures corresponding to each photo are aggregated into a vector set, and the vector sets are integrated into an archive photo image feature index database;
(2) extracting image features of the picture to be retrieved: to find the photos in the picture archive that match the content of the picture to be retrieved, the key is to sample the feature information of that picture accurately; the picture is scaled and cropped with the ImageDataGenerator tool into multiple pictures of size 224 × 224, from which a set of image feature vectors is extracted;
(3) comparing the image features and returning the retrieval result: the image feature vector set of the picture to be retrieved is compared with the entries in the archive photo image feature index database using a locality-sensitive hashing algorithm, wherein the greater the similarity, the more alike the content of the two pictures.
CN202110489471.3A 2021-01-26 2021-05-06 Archive management method based on deep learning Pending CN113190502A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110102810 2021-01-26
CN2021101028108 2021-01-26

Publications (1)

Publication Number Publication Date
CN113190502A (en) 2021-07-30

Family

ID=76983856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110489471.3A Pending CN113190502A (en) 2021-01-26 2021-05-06 Archive management method based on deep learning

Country Status (1)

Country Link
CN (1) CN113190502A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792169A (en) * 2021-09-16 2021-12-14 烟台市蓬莱区档案馆 Digital archive management method and system based on big data application
CN113792169B (en) * 2021-09-16 2022-05-10 烟台市蓬莱区档案馆 Digital archive management method and system based on big data application
CN114386078A (en) * 2022-03-22 2022-04-22 武汉汇德立科技有限公司 BIM-based construction project electronic archive management method and device
CN116401212A (en) * 2023-06-07 2023-07-07 东营市第二人民医院 Personnel file quick searching system based on data analysis
CN116401212B (en) * 2023-06-07 2023-08-11 东营市第二人民医院 Personnel file quick searching system based on data analysis
CN116934285A (en) * 2023-09-15 2023-10-24 济南泰格电子技术有限公司 Visual intelligent system and equipment for realizing digitization and entity file management
CN116934285B (en) * 2023-09-15 2023-12-22 济南泰格电子技术有限公司 Visual intelligent system and equipment for realizing digitization and entity file management

Similar Documents

Publication Publication Date Title
CN113190502A (en) Archive management method based on deep learning
US6501855B1 (en) Manual-search restriction on documents not having an ASCII index
CN110188077B (en) Intelligent classification method and device for electronic files, electronic equipment and storage medium
JPH07262224A (en) Preservation/processing method of document image
WO2012156774A1 (en) Method and apparatus for detecting visual words which are representative of a specific image category
CN106815605B (en) Data classification method and equipment based on machine learning
CN111860524A (en) Intelligent classification device and method for digital files
Tian et al. Image classification based on the combination of text features and visual features
Arief et al. Automated extraction of large scale scanned document images using Google vision OCR in apache Hadoop environment
Dixit et al. A survey on document image analysis and retrieval system
CN116663549B (en) Digitized management method, system and storage medium based on enterprise files
Rigaud et al. Toward speech text recognition for comic books
CN112464907A (en) Document processing system and method
CN115329169A (en) Archive filing calculation method based on deep neural model
Marinai A survey of document image retrieval in digital libraries
Abbas et al. Intelligent Document Finding using Optical Character Recognition and Tagging
TW202207109A (en) Document management method and system for engineering project
Bhagat et al. Complex document classification and integration with indexing
Krishnan et al. Content level access to Digital Library of India pages
CN115033543B (en) Self-service government affair data storage system and self-service government affair terminal
Naïve et al. Efficient Accreditation Document Classification Using Naïve Bayes Classifier
Öncevarlık et al. Two Level Document Image Classification
CN117493645B (en) Big data-based electronic archive recommendation system
Surmieda OCR-Enhanced Digital Asset Management System: Prototype Design and Construction
Cecchetto et al. ADVANCE: Automated Document Validation Aid with Nlp and Computer vision for fields Extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination