CN117251587A - Intelligent information mining method for digital archives

Info

Publication number: CN117251587A
Application number: CN202311534225.0A
Authority: CN
Inventors: 李燕强; 齐少华; 马国伟; 张泽宇
Original assignee: Beijing Yinduo Shuzhi Archives Technology Industry Development Co ltd
Current assignee: Beijing Yinduo Shuzhi Archives Technology Industry Development Co ltd
Priority date: 2023-11-17
Filing date: 2023-11-17
Publication date: 2023-12-19

Abstract

The invention discloses an intelligent information mining method for digital archives, which relates to the technical field of digital archive mining and comprises the following steps: step one, data preprocessing; step two, file classification; step three, file information extraction; step four, file marking; step five, file summarization; step six, file analysis; step seven, file recording. The advantages of the invention are as follows: when a person enters a digital file into the database, the file is flagged either as a file that can be extracted during file analysis or as a file that cannot be extracted. File analysis extracts the information of digital files of the same type from the database, and analysis across multiple stored digital files and the file at hand makes it easier to find hidden information in that file. The hidden information is then looked up with a search engine, so that personnel can quickly consult and understand the hidden information in the digital file.

Description

Intelligent information mining method for digital archives

Technical Field

The invention relates to the technical field of digital archive mining, in particular to an intelligent information mining method for digital archives.

Background

A digital archive is characterized by digitized collection resources, networked organization and transmission of information, an expanded service range, shared information resources, and convenient information retrieval. It refers to an information space for storing and using archival information resources, and is a digital archive cluster consisting of multiple archive resource groups, an archival information resource processing center, and archive user groups.

A digital archive is a combination of a content management system, an integration system, and a long-term digital information preservation system. As an archive whose main management objects are unstructured data such as electronic documents, records, and other information resources, it acts not only as a data center and as a platform for publishing and using information, but also provides orderly processing and integrated management. Orderly processing and management covers the whole file life cycle: collection, creation, confirmation, conversion, archiving, management, publication, and use. Integration carries the meaning of synthesis, fusion, and combination into a whole. For a digital archive, integrated management means applying integration theory across the whole life cycle of archival information resources: integration theory guides the management thinking, integration mechanisms form the core of management actions, and the boundaries between management business processes are broken down. The various archival information resource elements are planned and optimized as a whole, which raises their degree of order, improves the authenticity and integrity of information resources, and provides users with integrated services that meet their needs.

However, existing digital archive mining approaches make it inconvenient for personnel to consult digital files. Hidden information in a digital file is hard to look up promptly and hard to understand, so personnel may miss it.

Disclosure of Invention

The invention aims to provide an intelligent information mining method for digital archives.

To solve the problems set forth in the background, the invention provides the following technical solution: an intelligent information mining method for digital archives, comprising the following steps:

step one, data preprocessing: preprocessing the data in the digital file, reducing noise in the audio and video in the digital file, performing lemmatization (morphological reduction) on the text of documents and pictures, and extracting the text data from the digital file;

step two, file classification: classifying the text data in the digital file according to predefined categories;

step three, file information extraction: extracting key information and attributes from the digital file;

step four, file marking: marking items with a specific meaning that are identified in the digital file;

step five, file summarization: extracting summaries from the text, pictures, audio, and video in the digital file, the content of the digital file being extracted based on a statistical method and a graph model;

step six, file analysis: analyzing the correlations and rules among the text data of multiple digital files, identifying differing-rule information within digital files of the same category as classified in step two, looking up that differing information through a search engine, sorting the search results, and condensing the sorted results and appending them after the text to which they relate;

step seven, file recording: personnel review the digital file information and upload it to the database while classifying the file accordingly, so that the information can be called from the database during the file analysis of step six, and the rule information of the different files in the database is added to and refined.

As a further aspect of the invention: in step one, personnel remove irrelevant symbols, icons, and irrelevant words from the text extracted from the audio, video, documents, and pictures in the digital file.
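The clean-up described for step one can be illustrated with a minimal sketch. The stop-word list and the function name `clean_text` are assumptions made for this example; the patent does not specify which symbols or words count as irrelevant.

```python
import re

# Hypothetical stop-word list; a real deployment would load a curated one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def clean_text(raw: str) -> list:
    """Strip symbols and icons, then drop irrelevant (stop) words."""
    # Replace every character that is neither a word character nor
    # whitespace with a space. In Python 3, \w is Unicode-aware, so
    # Chinese characters are kept.
    no_symbols = re.sub(r"[^\w\s]", " ", raw)
    tokens = no_symbols.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


print(clean_text("★ The archive of 1987 -- Beijing office ©"))
```

Lower-casing and whitespace tokenization are simplifications; text in languages without word spacing would need a dedicated tokenizer.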

As a further aspect of the invention: in step two, personnel classify the text of different types of digital files by means of a machine learning algorithm.
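The patent does not name a particular classification algorithm. As one possible illustration, the sketch below uses a multinomial naive Bayes classifier with add-one smoothing over tokenized text; the class name, the categories, and the training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> training documents
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for t in tokens:
                count = self.word_counts[label][t] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = NaiveBayesClassifier()
clf.fit(
    [["contract", "payment", "invoice"],
     ["salary", "employee", "contract"],
     ["meeting", "minutes", "agenda"]],
    ["finance", "personnel", "meeting"],
)
print(clf.predict(["invoice", "payment"]))
```

The predefined categories of step two correspond to the label set passed to `fit`.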

As a further aspect of the invention: in step three, rule information and expression information in the digital file are extracted.

As a further aspect of the invention: in step four, personal name information, place name information, and time information in the digital file are marked, so that personnel can quickly find the key information when skimming the file.
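A minimal sketch of such marking is shown here, assuming small lookup lists for personal and place names and an ISO-style date pattern for time information. The lists, the pattern, and the function name `mark_entities` are illustrative assumptions; a production system would more likely use a trained named-entity recognizer.

```python
import re

# Hypothetical gazetteers; the patent does not say how names are recognized.
PERSON_NAMES = {"Li Ming"}
PLACE_NAMES = {"Beijing"}
# Assumed date format YYYY-MM-DD; real archives use many more formats.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def mark_entities(text: str) -> list:
    """Return (start, end, label) spans sorted by position in the text."""
    marks = [(m.start(), m.end(), "TIME") for m in DATE_PATTERN.finditer(text)]
    for label, names in (("PERSON", PERSON_NAMES), ("PLACE", PLACE_NAMES)):
        for name in names:
            for m in re.finditer(re.escape(name), text):
                marks.append((m.start(), m.end(), label))
    return sorted(marks)


sample = "Li Ming filed the report in Beijing on 2023-11-17."
for start, end, label in mark_entities(sample):
    print(label, sample[start:end])
```

Returning character offsets rather than rewriting the text leaves the original file untouched, so a viewer can jump to or highlight each marked position.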

As a further aspect of the invention: in step five, natural language processing and machine learning algorithms, using tools such as TextRank, BERT, and GPT and relying on the quality of the algorithm and the training data, identify the important content in the text and remove irrelevant details, which improves the accuracy of summary extraction from the digital file.
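Of the tools named, TextRank is the graph-based one. The following is a simplified extractive sketch of it, not the patent's implementation: sentences are nodes, edge weights are Jaccard word overlap (an assumption; the original TextRank paper uses a log-normalized overlap), and a PageRank-style iteration scores each sentence.

```python
import re
from itertools import combinations


def split_sentences(text: str) -> list:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two sentences."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def textrank_summary(text: str, top_n: int = 1, damping: float = 0.85,
                     iterations: int = 50) -> list:
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    # Symmetric weight matrix of the sentence graph.
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = similarity(sentences[i], sentences[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(best)]


doc = ("Land records list owners. "
       "The archive stores land records and tax ledgers. "
       "Tax ledgers show payments. "
       "Visitors enjoyed lunch.")
print(textrank_summary(doc, top_n=1))
```

In the sample, the second sentence shares vocabulary with both the first and the third, so it receives the highest score and is chosen as the one-sentence summary.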

As a further aspect of the invention: during file analysis, the digital files classified in step two are extracted, the related information across the multiple digital files is analyzed, and the digital file under analysis is distinguished from the classified digital files.
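The text does not define how the file under analysis is compared with the classified files. One possible reading, sketched below, treats a term as noteworthy when it appears in the file under analysis but in few or none of the reference files of the same category. The function name `unusual_terms` and the `max_share` threshold are assumptions for illustration only.

```python
from collections import Counter


def unusual_terms(new_tokens, reference_docs, max_share=0.0):
    """Terms of the new file found in at most max_share of the reference files."""
    doc_freq = Counter()
    for doc in reference_docs:
        doc_freq.update(set(doc))  # count each term once per reference file
    n = len(reference_docs)
    seen, result = set(), []
    for term in new_tokens:
        if term in seen:
            continue
        seen.add(term)
        share = doc_freq[term] / n if n else 0.0
        if share <= max_share:
            result.append(term)
    return result


reference = [["lease", "tenant", "rent"], ["lease", "rent", "deposit"]]
print(unusual_terms(["lease", "rent", "easement", "rent"], reference))
```

Raising `max_share` widens the net to terms that are merely rare in the category rather than entirely absent.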

As a further aspect of the invention: after the differing-rule information has been analyzed, a search engine is used to look up the information in the digital file that has a different meaning; the search results are then condensed and marked after the information to which they relate, so that personnel can quickly preview the hidden information in the digital file while browsing it.
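The look-up-and-annotate step might look like the sketch below. The `search` argument stands in for a search engine call, which the patent does not specify; here it is stubbed with a dictionary. The bracketed note format and the word limit are likewise assumptions.

```python
import re


def condense(snippet: str, max_words: int = 8) -> str:
    """Shorten a search result to its first few words."""
    words = snippet.split()
    short = " ".join(words[:max_words])
    return short + ("..." if len(words) > max_words else "")


def annotate(text: str, terms, search) -> str:
    """Append a condensed search result after the first occurrence of each term."""
    for term in terms:
        result = search(term)
        if not result:
            continue
        note = f"{term} [{condense(result)}]"
        # A function replacement avoids backslash processing in the note.
        text = re.sub(rf"\b{re.escape(term)}\b", lambda m: note, text, count=1)
    return text


# Stub standing in for a real search engine (hypothetical data).
fake_index = {"easement": "right to use another's land"}
print(annotate("The parcel carries an easement granted in 1987.",
               ["easement"], fake_index.get))
```

Placing the note directly behind the term lets a reader see the explanation without leaving the file, which matches the stated aim of quick preview.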

As a further aspect of the invention: in step seven, the digital files in the database are classified as callable digital files and non-callable digital files, so that the callable digital files can be called during the file analysis of step six; this refines the database and improves the accuracy of digital file analysis.
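The two classes of stored files can be modeled with a single flag column, as in this minimal sketch. It assumes an SQLite database; the table layout, the column name `retrievable`, and the function names are hypothetical and not taken from the patent.

```python
import sqlite3


def open_archive_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE archives ("
        "id INTEGER PRIMARY KEY, title TEXT, category TEXT, "
        "retrievable INTEGER NOT NULL)"
    )
    return conn


def record_file(conn, title: str, category: str, retrievable: bool) -> None:
    """Step seven: store a reviewed file together with its flag."""
    conn.execute(
        "INSERT INTO archives (title, category, retrievable) VALUES (?, ?, ?)",
        (title, category, int(retrievable)),
    )


def files_for_analysis(conn, category: str) -> list:
    """Step six: fetch only the files flagged as usable for analysis."""
    rows = conn.execute(
        "SELECT title FROM archives WHERE category = ? AND retrievable = 1 "
        "ORDER BY id",
        (category,),
    )
    return [row[0] for row in rows]


conn = open_archive_db()
record_file(conn, "Lease 1987", "land", True)
record_file(conn, "Draft memo", "land", False)
print(files_for_analysis(conn, "land"))
```

Keeping both classes in one table with a flag, rather than in two separate databases, is a design choice of this sketch; it makes reclassifying a file a single-column update.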

By adopting the technical scheme, compared with the prior art, the invention has the beneficial effects that:

When a person enters a digital file template into the database, the digital file is flagged either as one that can be extracted during file analysis or as one that cannot. File analysis extracts the information of digital files of the same type from the database, and analysis across the multiple stored files and the file at hand makes it easier to uncover hidden information in that file and for personnel to consult it. The hidden information is looked up with a search engine and the results are marked after the information concerned, so that personnel can quickly consult and understand the hidden information in the digital file;

through steps two, three, four, and five, the digital file is converted into a document, and marking is used to extract the rule information and expression information in the file, so that personnel notice special information promptly when reviewing it. Personal name, place name, and time information are marked so that personnel can jump quickly to the corresponding positions when looking them up. Summary extraction lets personnel preview the whole file quickly and judge whether it is needed, so files can be reviewed quickly;

after file analysis is complete, personnel review the file and mark whether its hidden information is needed. If the hidden information cannot serve as reference material, personnel record the digital file information in the database of information that cannot be called; otherwise it is recorded in the database of information that can be called. During file analysis, the software compares the callable files in the database with the data of the file at hand, which improves the accuracy with which hidden information in the digital file is analyzed.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of the present invention.

Detailed Description

The following describes the embodiments of the present invention further with reference to the drawings. The description of these embodiments is provided to assist understanding of the present invention and is not intended to limit it. In addition, the technical features of the embodiments described below may be combined with each other as long as they do not conflict.

Embodiment 1: referring to fig. 1, the present invention provides a technical solution: an intelligent information mining method for digital archives, comprising the following steps,

step seven, file recording: personnel review the digital file information and upload it to the database while classifying the file accordingly, so that the information can be called from the database during the file analysis of step six, and the rule information of the different files in the database is added to and refined.

Referring to fig. 1, in step five, natural language processing and machine learning algorithms, using tools such as TextRank, BERT, and GPT and relying on the quality of the algorithm and the training data, identify the important content in the text and remove irrelevant details, which improves the accuracy of summary extraction from the digital file. During file analysis, the digital files classified in step two are extracted, the related information across the multiple digital files is analyzed, and the digital file under analysis is distinguished from the classified digital files. After the differing-rule information has been analyzed, a search engine is used to look up the information in the digital file that has a different meaning; the search results are then condensed and marked after the information to which they relate, so that personnel can quickly preview the hidden information in the digital file while browsing it.

In this embodiment, data preprocessing prepares the digital file so that the later classification, extraction, and analysis proceed more smoothly. Personnel classify the digital file according to the categories of information it contains, and file information extraction then makes it easier for personnel to look up file information later. After analysis of the digital file is complete, personnel can, according to the analysis result, classify the digital file as one that can be called or one that cannot be called.

In use, when personnel enter a digital file template into the database, the digital file is flagged either as one that can be extracted during file analysis or as one that cannot. File analysis extracts the information of digital files of the same type from the database, and analysis across the multiple stored files and the file at hand makes it easier to uncover hidden information in that file and for personnel to consult it. The hidden information is looked up with a search engine, and the results are marked after the information concerned.

Embodiment 2: referring to fig. 1, an intelligent information mining method for digital archives comprises the following steps,

Referring to fig. 1, in step one, personnel remove irrelevant symbols, icons, and irrelevant words from the text extracted from the digital file; in step two, personnel classify the text of different types of digital files by means of a machine learning algorithm; in step three, rule information and expression information in the digital file are extracted; in step four, personal name information and place name information in the digital file are marked, so that personnel can quickly find the key information when skimming the file; in step five, natural language processing and machine learning algorithms, using tools such as TextRank, BERT, and GPT and relying on the quality of the algorithm and the training data, identify the important content in the text and remove irrelevant details, which improves the accuracy of summary extraction from the digital file.

In this embodiment, a large number of callable files are stored in the database, and these files are called by software such as TextRank, BERT, and GPT.

In use, the digital file is converted into a document through steps two, three, four, and five, and marking is used to extract the rule information and expression information in the file, so that personnel notice special information promptly when reviewing it. Personal name, place name, and time information are marked so that personnel can jump quickly to the corresponding positions when looking them up, and summary extraction lets personnel preview the whole file quickly and judge whether it is needed.

Embodiment 3: referring to fig. 1, the present invention provides a technical solution: an intelligent information mining method for digital archives, comprising the following steps,

Referring to fig. 1, in step seven, the digital files in the database are classified as callable digital files and non-callable digital files, so that the callable digital files can be called during the file analysis of step six; this refines the database and improves the accuracy of digital file analysis.

In this embodiment, when personnel need to review a particular digital file, they can query it in the database.

During analysis of a digital file, the software can compare it with the existing file data in the database, which improves the accuracy with which hidden information in the digital file is analyzed.

Working principle:

First, personnel submit the digital file for data preprocessing, which removes irrelevant symbols, icons, and irrelevant words from the text extracted from the audio, video, documents, and pictures in the digital file, so that the file can be classified and analyzed afterwards;

second, the processed file is classified and marked: key information in the file is identified and marked according to its category, and file summarization extracts the content of the whole file, so that personnel can quickly review the file's overall information;

finally, the files in the database that belong to the corresponding category are called, so that the large number of stored files can be analyzed against the file at hand and hidden information identified. The hidden information is looked up with a search engine, and the results are condensed and marked after the corresponding hidden information, so that personnel can understand it promptly when they come across differing information. Personnel can then file the document in the database for comparison with other files, which improves the accuracy of later comparisons between the files in the database and the file at hand.
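The order of operations in the working principle can be summarized as a chain of step functions applied to one record. This is only a structural sketch: every step body below is a trivial placeholder, the names `ArchiveRecord` and `run_pipeline` are hypothetical, and in the method as described the final flag is set by personnel rather than by a rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ArchiveRecord:
    raw_text: str
    tokens: List[str] = field(default_factory=list)
    category: str = ""
    summary: str = ""
    retrievable: bool = False


def preprocess(record: ArchiveRecord) -> ArchiveRecord:
    # Placeholder for step one: lower-case and split on whitespace.
    record.tokens = record.raw_text.lower().split()
    return record


def classify(record: ArchiveRecord) -> ArchiveRecord:
    # Placeholder for step two: a single keyword rule.
    record.category = "land" if "parcel" in record.tokens else "other"
    return record


def summarize(record: ArchiveRecord) -> ArchiveRecord:
    # Placeholder for step five: keep the first four tokens.
    record.summary = " ".join(record.tokens[:4])
    return record


def record_file(record: ArchiveRecord) -> ArchiveRecord:
    # Placeholder for step seven: the flag is decided by a rule here.
    record.retrievable = record.category != "other"
    return record


def run_pipeline(record: ArchiveRecord,
                 steps: List[Callable[[ArchiveRecord], ArchiveRecord]]) -> ArchiveRecord:
    for step in steps:
        record = step(record)
    return record


STEPS = [preprocess, classify, summarize, record_file]
result = run_pipeline(ArchiveRecord("Parcel 12 transferred to new owner"), STEPS)
print(result.category, "|", result.summary, "|", result.retrievable)
```

Because each step takes and returns the same record type, steps can be replaced or reordered without touching the driver function.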

The front, rear, left, right, up and down are all based on fig. 1 in the drawings of the specification, the face of the device facing the observer is defined as front, the left side of the observer is defined as left, and so on, according to the viewing angle of the person.

In the description of the present invention, it should be understood that the terms "center," "longitudinal," "lateral," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the scope of the present invention.

It should be noted that the device structure and drawings of the present invention mainly describe the principle of the invention. As regards the design principle, the arrangement of the power mechanism, power supply system, control system, and the like of the device is not fully described; a person skilled in the art who understands the principle of the invention can readily work out the specific details of the power mechanism, power supply system, and control system. The control mode of the application is automatic control by a controller, and the control circuit of the controller can be implemented by simple programming by a person skilled in the art;

the standard parts used in the method can be purchased on the market or customized according to the description and drawings. The parts are connected by mature conventional means in the prior art such as bolts, rivets, and welding; the machines, parts, and equipment are conventional prior-art models; and their structures and principles are known to the skilled person from technical manuals or conventional experimental methods.

The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and yet fall within the scope of the invention.

Claims

1. A digital archive intelligent information mining method is characterized in that: comprises the steps of,

2. The intelligent information mining method for digital archives according to claim 1, wherein: in step one, personnel remove irrelevant symbols, icons, and irrelevant words from the text extracted from the audio, video, documents, and pictures in the digital file.

3. The intelligent information mining method for digital archives according to claim 1, wherein: in step two, personnel classify the text of different types of digital files by means of a machine learning algorithm.

4. The intelligent information mining method for digital archives according to claim 1, wherein: in step three, rule information and expression information in the digital file are extracted.

5. The intelligent information mining method for digital archives according to claim 1, wherein: in step four, personal name information, place name information, and time information in the digital file are marked, so that personnel can quickly find the key information when skimming the file.

6. The intelligent information mining method for digital archives according to claim 1, wherein: in step five, natural language processing and machine learning algorithms, using tools such as TextRank, BERT, and GPT and relying on the quality of the algorithm and the training data, identify the important content in the text and remove irrelevant details, which improves the accuracy of summary extraction from the digital file.

7. The intelligent information mining method for digital archives according to claim 1, wherein: during file analysis, the digital files classified in step two are extracted, the related information across the multiple digital files is analyzed, and the digital file under analysis is distinguished from the classified digital files.

8. The intelligent information mining method for digital archives according to claim 7, wherein: after the differing-rule information has been analyzed, a search engine is used to look up the information in the digital file that has a different meaning; the search results are then condensed and marked after the information to which they relate, so that personnel can quickly preview the hidden information in the digital file while browsing it.

9. The intelligent information mining method for digital archives according to claim 6, wherein: in step seven, the digital files in the database are classified as callable digital files and non-callable digital files, so that the callable digital files can be called during the file analysis of step six; this refines the database and improves the accuracy of digital file analysis.