CN117076692A

CN117076692A - File online management method and system

Info

Publication number: CN117076692A
Application number: CN202310850095.5A
Authority: CN
Inventors: 李冬泉; 刘煜; 刘晓雨; 王龙韬; 李勇; 陈树文; 张露潆; 周聪; 安琪; 周桐
Original assignee: Huaneng Shandong Power Generation Co Ltd; Huaneng Information Technology Co Ltd; Shandong Rizhao Power Generation Co Ltd
Current assignee: Huaneng Shandong Power Generation Co Ltd; Huaneng Information Technology Co Ltd; Shandong Rizhao Power Generation Co Ltd
Priority date: 2023-07-11
Filing date: 2023-07-11
Publication date: 2023-11-17

Abstract

The application discloses an online archive management method and system. The method comprises: acquiring archive text information of an archive to be stored; processing the archive text information to obtain a candidate keyword set of the archive to be stored; determining the weights of the candidate keywords in the candidate keyword set, and determining a keyword set according to those weights; determining the archive category of the archive to be stored according to the keyword set, based on a preset archive category library, and marking the archive to be stored with that category; and storing the archive to be stored in the corresponding archive repository according to the archive category mark. Archives are thereby identified and stored online.

Description

File online management method and system

Technical Field

The present application relates to the technical field of archive management systems, and in particular, to a method and a system for online archive management.

Background

The object of archive management is the archive, and its service object is the archive user. The basic contradiction it must resolve is that archives are scattered, disordered, of uneven quality, large in number, and isolated, whereas society needs archives that are centralized, systematic, of high quality, specialized, and widely usable. How well society's demand for archives is met depends on the level of archive management, and that level must keep pace with growing social demand. The two move continually from mismatch to adaptation, and this drives archive management work forward. Over its history of thousands of years, archive management has evolved from a non-independent system to an independent system, from simple management to complex management, from empirical management to scientific management, from manual management to computer management, and from a closed system to an open system.

An archive management system standardizes the management of documents and archives by establishing unified standards. It may include: standardizing the document management of each business system; building a complete archive resource information sharing service platform; supporting information processing across the whole archive management process (including collection, handover and reception, archiving, storage management, borrowing and utilization, compilation and development, and distribution); realizing functions such as pipelined archive collection; and gradually converting the business management mode into a service-oriented management mode, in which the service model is the basis of business management and the business flows and data flows are built on a system platform modeled on services.

With the rapid development of information technology, the informatization of archive management has become increasingly mature. However, current archive management systems still require the archive category to be judged before an archive can be stored under that category, which makes the process cumbersome and inconvenient. A method and a system for online archive management are therefore needed to identify and store archives online.

Disclosure of Invention

The application aims to provide an online archive management method and system that identify the archive category and store the archive online, making online archiving simple, convenient, and fast.

The application provides an on-line archive management method, which comprises the following steps:

acquiring file text information of a file to be stored;

processing the file text information to obtain a candidate keyword set of the file to be stored;

determining the weight of the candidate keywords in the candidate keyword set, and determining the keyword set according to the weight of the candidate keywords;

determining the file type of the file to be stored according to the keyword set based on a preset file type library, and marking the file type of the file to be stored;

and storing the files to be stored in the corresponding archives according to the file category marks.

In some embodiments of the present application, obtaining archival text information of an archive to be stored includes:

scanning and identifying the file to be stored to obtain all text information of the file to be stored;

and carrying out text noise reduction processing on all the text information to obtain archive text information of the archive to be stored.

In some embodiments of the present application, processing the archive text information to obtain a candidate keyword set of an archive to be stored includes:

presetting a stop word list;

based on the stop word list, word segmentation is carried out on the archive text information, and a word segmentation set is obtained;

and removing repeated word segmentation in the word segmentation set to obtain a candidate keyword set of the file to be stored.

In some embodiments of the present application, determining weights for candidate keywords in the set of candidate keywords comprises:

obtaining the frequency of candidate keywords appearing in the archive text information;

acquiring the inverse text frequency of the candidate keywords;

and determining the weight of the candidate keyword according to the candidate keyword frequency and the inverse text frequency.

In some embodiments of the application, the weights of the candidate keywords are calculated according to the following formula:

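$$w_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log \frac{|D|}{|\{d : t_i \in d\}|}$$

wherein $n_{i,j}$ represents the number of occurrences of candidate keyword i in the archive text information j, $\sum_{k} n_{k,j}$ represents the number of occurrences of all candidate keywords in the archive text information j, $|D|$ represents the number of archives in the archive repository, and $|\{d : t_i \in d\}|$ represents the number of archives containing candidate keyword i.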

In some embodiments of the present application, determining a keyword set based on the weights of the candidate keywords includes:

determining the association degree of the candidate keywords and the archive category according to the weight of the candidate keywords;

comparing the association degree of the candidate keywords and the archive categories with the target association degree;

if the association degree of the candidate keywords and the archive category is larger than the target association degree, determining the candidate keywords as keywords;

and integrating the keywords to obtain a keyword set.

In some embodiments of the present application, a preset weight group W [W1, W2, W3, …, Wm] is set, wherein m is a positive integer, W1 is a first preset weight, W2 is a second preset weight, W3 is a third preset weight, Wm is an m-th preset weight, and W1 < W2 < W3 < … < Wm;

a preset association degree group R [R1, R2, R3, …, Rm] is set, wherein R1 is a first preset association degree, R2 is a second preset association degree, R3 is a third preset association degree, Rm is an m-th preset association degree, and R1 < R2 < R3 < … < Rm;

acquiring the weight w of the candidate keyword, and setting the association degree of the candidate keyword and the archive category according to the relation between the weight w and each preset weight;

when w < W1, setting the first preset association degree R1 as the association degree of the candidate keyword and the archive category;

when W1 ≤ w < W2, setting the second preset association degree R2 as the association degree of the candidate keyword and the archive category;

when W2 ≤ w < W3, setting the third preset association degree R3 as the association degree of the candidate keyword and the archive category;

……；

when Wm-1 ≤ w < Wm, setting the m-th preset association degree Rm as the association degree of the candidate keyword and the archive category.
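The interval lookup described above can be sketched in Python with a binary search. This is only an illustration: the function name `association_degree` and the sample values are hypothetical, and the text does not say what happens when w ≥ Wm, so the sketch assumes the largest degree Rm is reused in that case.

```python
from bisect import bisect_right
from typing import Sequence


def association_degree(w: float, W: Sequence[float], R: Sequence[float]) -> float:
    """Map a keyword weight w to a preset association degree.

    W holds the preset weights in ascending order, R the preset
    association degrees in ascending order, with len(W) == len(R).
    """
    if not W or len(W) != len(R):
        raise ValueError("W and R must be non-empty and of equal length")
    # bisect_right counts the preset weights that are <= w, so
    # w < W1 gives 0 (R1), W1 <= w < W2 gives 1 (R2), and so on.
    k = bisect_right(W, w)
    # Assumption: weights at or above Wm reuse the largest degree Rm.
    return R[min(k, len(R) - 1)]


# Hypothetical preset groups with m = 3.
W = [0.1, 0.2, 0.3]
R = [0.2, 0.5, 0.9]
degree_low = association_degree(0.05, W, R)   # w < W1
degree_mid = association_degree(0.10, W, R)   # W1 <= w < W2
degree_high = association_degree(0.25, W, R)  # W2 <= w < W3
```

A sorted lookup keeps the mapping at O(log m) per keyword and makes the half-open intervals explicit through the choice of `bisect_right`.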

In some embodiments of the application, the archive class library comprises: file category and several keywords;

the archive category corresponds to a plurality of keywords and is used for determining the archive category of the archive to be stored according to the keywords.

In some embodiments of the application, the archive categories include party work categories, administrative management categories, business management categories, production technology management categories, product categories, scientific and technological research categories, infrastructure categories, equipment and instrument categories, accounting archive categories, and staff archive categories.

An archive online management system comprising:

an archive repository for storing archives;

an acquisition unit for acquiring archive text information of an archive to be stored;

a keyword determining unit for determining keywords of the archive text information;

the file type determining unit is used for determining the file type of the file to be stored;

the keyword determining unit processes the archive text information acquired by the acquiring unit to acquire a candidate keyword set of an archive to be stored; determining the weight of the candidate keywords in the candidate keyword set, and determining the keyword set according to the weight of the candidate keywords;

the file type determining unit determines the file type of the file to be stored according to the keyword set based on a preset file type library, and marks the file type of the file to be stored;

and the archive repository stores the archive to be stored in the corresponding archive repository according to the archive category mark.

The application provides an on-line archive management method, which comprises the following steps: acquiring file text information of a file to be stored; processing the file text information to obtain a candidate keyword set of the file to be stored; determining the weight of the candidate keywords in the candidate keyword set, and determining the keyword set according to the weight of the candidate keywords; determining the file type of the file to be stored according to the keyword set based on a preset file type library, and marking the file type of the file to be stored; and storing the files to be stored in the corresponding archives according to the file category marks.

The archive text information of the archive to be stored is processed to obtain the keyword set, and the archive category is determined and the archive stored according to that keyword set, so that the archive category is judged accurately. At the same time, the archive to be stored is marked with its category, so that archives are stored conveniently and quickly.

The technical scheme of the application is further described in detail through the drawings and the embodiments.

Drawings

FIG. 1 is a flow chart of an on-line archive management method according to the present application;

FIG. 2 is a schematic diagram of an on-line archive management system according to the present application.

Detailed Description

The technical scheme of the application is further described below through the attached drawings and the embodiments.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, means, components, and/or combinations thereof, but do not exclude other elements or items. The terms "first," "second," and the like, as used herein, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side", "bottom", etc. refer to the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, and are merely relational terms determined to facilitate description of the structural relationships of the various components or elements of the application, and are not meant to be limiting of the application. Terms such as "fixedly attached," "connected," "coupled," and the like are to be construed broadly and refer to either a fixed connection or an integral or removable connection; can be directly connected or indirectly connected through an intermediate medium. The specific meaning of the terms in the present application can be determined according to circumstances by a person skilled in the relevant art or the art, and is not to be construed as limiting the present application.

Examples

The application provides an on-line archive management method, as shown in FIG. 1, comprising the following steps:

S1, acquiring file text information of a file to be stored.

S2, processing the archive text information to obtain a candidate keyword set of the archive to be stored.

S3, determining the weight of the candidate keywords in the candidate keyword set, and determining the keyword set according to the weight of the candidate keywords.

S4, determining the file category of the file to be stored according to the keyword set based on a preset file category library, and marking the file category of the file to be stored.

S5, storing the files to be stored in the corresponding archives according to the file category marks.
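The data flow through steps S1 to S5 can be sketched as a small Python function that chains one callable per step. This is a minimal illustration, not the implementation of the application: the function `store_archive`, its parameters, and the toy callables in the usage part are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Set


def store_archive(
    document: str,
    acquire_text: Callable[[str], str],                     # S1
    candidate_keywords: Callable[[str], List[str]],         # S2
    select_keywords: Callable[[List[str], str], Set[str]],  # S3
    classify: Callable[[Set[str]], str],                    # S4
    repository: Dict[str, List[str]],                       # S5 target
) -> str:
    """Run steps S1 to S5 and return the archive category mark."""
    text = acquire_text(document)                 # S1: archive text information
    candidates = candidate_keywords(text)         # S2: candidate keyword set
    keywords = select_keywords(candidates, text)  # S3: keyword set
    category = classify(keywords)                 # S4: category mark
    # S5: store the archive under its category.
    repository.setdefault(category, []).append(document)
    return category


# Toy usage; every callable here is a placeholder for the real step.
repo: Dict[str, List[str]] = {}
category = store_archive(
    "Invoice Ledger 2023",
    acquire_text=str.lower,
    candidate_keywords=lambda text: sorted(set(text.split())),
    select_keywords=lambda cands, text: {c for c in cands if c.isalpha()},
    classify=lambda kws: "accounting" if "ledger" in kws else "administrative",
    repository=repo,
)
```

Passing each step as a callable keeps the sketch independent of how scanning, segmentation, weighting, and category lookup are actually realized.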

The archive to be stored is scanned and recognized to obtain all text information of the archive to be stored.

In this embodiment, text noise reduction is used to clean and simplify the original text. Common noise reduction includes symbol noise reduction and text noise reduction.

Symbol noise reduction: 1. converting full-width symbols to half-width symbols, for example a full-width space to a half-width space; 2. replacing special symbols with common symbols, for example replacing "(1) (9) (8) (5)" with "1985"; 3. simplifying symbols, for example replacing tab characters with spaces, unifying variant bracket forms into a single bracket form, and unifying variant comma forms into a single comma; in the extreme, all symbols may be changed into commas and periods for maximum simplification.

Text noise reduction: 1. correcting wrongly written characters, for example changing "gas-water" to "steam-water" and replacing a variant written form of "other" with the standard form; 2. converting between traditional and simplified characters, for example converting the traditional-character form of the word for "country" to its simplified form; 3. unifying terminology, for example changing "Saint Barbara division" to "Saint Bara division".
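A rough Python sketch of these noise reduction rules is shown below. It assumes that Unicode NFKC normalization is an acceptable way to fold full-width forms to half-width and special digit forms (circled digits are used here as a stand-in) to plain digits; NFKC also rewrites other compatibility characters, so it is broader than the rules listed. The correction table `TERM_FIXES` and the function name `denoise` are hypothetical.

```python
import re
import unicodedata

# Hypothetical correction table; a real deployment would maintain its own.
TERM_FIXES = {"gas-water": "steam-water"}


def denoise(text: str) -> str:
    """Apply symbol noise reduction, then text noise reduction."""
    # NFKC folds full-width characters to half-width and circled
    # digits to plain digits.
    text = unicodedata.normalize("NFKC", text)
    # Replace tab characters with spaces, then collapse runs of spaces.
    text = text.replace("\t", " ")
    text = re.sub(r" {2,}", " ", text)
    # Correct wrongly written terms and unify terminology.
    for wrong, right in TERM_FIXES.items():
        text = text.replace(wrong, right)
    return text.strip()


cleaned = denoise("ＡＢＣ\u3000①⑨⑧⑤\tgas-water")
```

Traditional-to-simplified conversion is left out of the sketch because it needs a character mapping table that the standard library does not provide.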

A stop word list is preset.

And based on the stop word list, word segmentation is carried out on the archive text information, and a word segmentation set is obtained.

In this embodiment, stop words are words or characters that, in information retrieval, are automatically filtered out before or after natural language data (or text) is processed, in order to save storage space and improve search efficiency. The stop words are entered manually rather than generated automatically, and the entered stop words form the stop word list.

When the archive text information is segmented, a word is not skipped because it has already appeared, so the resulting word segmentation set contains repeated words. The repeated words are removed so that each word is retained only once.
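The segmentation and de-duplication steps can be sketched in Python as follows. The sketch splits on whitespace purely as a stand-in, since Chinese text would need a dedicated word segmenter; the stop word list `STOP_WORDS` and both function names are hypothetical.

```python
from typing import Iterable, List

# Hypothetical, manually entered stop word list.
STOP_WORDS = {"the", "of", "and", "a", "to"}


def segment(text: str, stop_words: Iterable[str]) -> List[str]:
    """Split text into words and drop stop words; repeats are kept."""
    stop = set(stop_words)
    # Whitespace splitting is an assumption made for illustration only.
    return [token for token in text.lower().split() if token not in stop]


def candidate_keywords(tokens: Iterable[str]) -> List[str]:
    """Remove repeated words, keeping the first occurrence of each."""
    return list(dict.fromkeys(tokens))


tokens = segment("the boiler and the turbine of the boiler", STOP_WORDS)
candidates = candidate_keywords(tokens)
```

Keeping the segmentation set with its repeats is useful in its own right, because the repeat counts are what a later term frequency calculation needs.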

The frequency with which each candidate keyword appears in the archive text information is obtained.

And obtaining the inverse text frequency of the candidate keywords.

In this embodiment, TF-IDF is a statistical method used to evaluate how important a word is to one document in a document set or corpus, where TF is the term frequency and IDF is the inverse document frequency (inverse text frequency). The importance of a word increases in proportion to the number of times it appears in the document, and decreases with the frequency with which it appears across the corpus.

In this embodiment, when the number of archives containing candidate keyword i is 0, the denominator of the inverse text frequency is 0. Therefore, if the candidate keyword does not appear in the archive repository, the denominator is automatically set to 1.
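A minimal Python sketch of this weight calculation, including the zero-denominator rule, is given below. It assumes the standard TF-IDF form (term frequency multiplied by the logarithm of the archive count over the number of archives containing the keyword) and a natural logarithm; the text does not state the log base, and the function name `tf_idf` and sample data are hypothetical.

```python
import math
from typing import Sequence


def tf_idf(keyword: str, tokens: Sequence[str], corpus: Sequence[Sequence[str]]) -> float:
    """Weight of a candidate keyword in one archive text.

    tokens is the segmented text of the archive to be stored (with
    repeats); corpus is the segmented text of each stored archive.
    """
    if not tokens or not corpus:
        return 0.0
    # Term frequency: occurrences of the keyword over all occurrences.
    tf = tokens.count(keyword) / len(tokens)
    # Number of stored archives that contain the keyword.
    containing = sum(1 for doc in corpus if keyword in doc)
    # Rule from the embodiment: a zero denominator is set to 1.
    if containing == 0:
        containing = 1
    # Assumption: natural logarithm.
    return tf * math.log(len(corpus) / containing)


corpus = [["boiler", "report"], ["ledger", "invoice"], ["turbine"], ["ledger"]]
weight_boiler = tf_idf("boiler", ["boiler", "turbine", "boiler", "report"], corpus)
weight_unseen = tf_idf("valve", ["boiler", "valve"], corpus)
```

With four stored archives and "boiler" present in one of them, the weight is 0.5 × ln 4; the unseen keyword "valve" gets the same inverse frequency because its denominator is forced to 1.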

The association degree of each candidate keyword and the archive category is determined according to the weight of the candidate keyword.

The association degree of the candidate keyword and the archive category is compared with the target association degree.

If the association degree of the candidate keyword and the archive category is greater than the target association degree, the candidate keyword is determined to be a keyword.

The keywords are collected to obtain the keyword set.
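The comparison against the target association degree amounts to a strict threshold filter, which can be sketched in Python as follows. The function name `select_keywords` and the sample association degrees are hypothetical.

```python
from typing import Dict, List


def select_keywords(association: Dict[str, float], target: float) -> List[str]:
    """Keep the candidates whose association degree exceeds the target.

    The comparison is strict: a candidate whose degree equals the
    target association degree is not selected.
    """
    return [kw for kw, degree in association.items() if degree > target]


# Hypothetical association degrees for three candidate keywords.
keywords = select_keywords({"boiler": 0.8, "report": 0.5, "valve": 0.3}, target=0.5)
```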

In some embodiments of the present application, a preset weight group W [W1, W2, W3, …, Wm] is set, wherein m is a positive integer, W1 is a first preset weight, W2 is a second preset weight, W3 is a third preset weight, Wm is an m-th preset weight, and W1 < W2 < W3 < … < Wm.

A preset association degree group R [R1, R2, R3, …, Rm] is set, wherein R1 is a first preset association degree, R2 is a second preset association degree, R3 is a third preset association degree, Rm is an m-th preset association degree, and R1 < R2 < R3 < … < Rm.

The weight w of the candidate keyword is acquired, and the association degree of the candidate keyword and the archive category is set according to the relation between the weight w and each preset weight.

When w < W1, the first preset association degree R1 is set as the association degree of the candidate keyword and the archive category.

When W1 ≤ w < W2, the second preset association degree R2 is set as the association degree of the candidate keyword and the archive category.

When W2 ≤ w < W3, the third preset association degree R3 is set as the association degree of the candidate keyword and the archive category.

……。

In some embodiments of the application, the archive class library comprises: archive category and several keywords.
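One way to use such a library is sketched below in Python. The text does not specify how the keyword set is matched against the library, so the sketch assumes the category whose keywords overlap most with the keyword set is chosen; the library contents and the function name `determine_category` are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical archive category library: category -> its keywords.
CATEGORY_LIBRARY: Dict[str, Set[str]] = {
    "accounting archives": {"invoice", "ledger", "voucher"},
    "equipment and instruments": {"boiler", "turbine", "valve"},
}


def determine_category(
    keywords: Iterable[str], library: Dict[str, Set[str]]
) -> Optional[str]:
    """Return the category sharing the most keywords, or None if none match."""
    keyword_set = set(keywords)
    best_category: Optional[str] = None
    best_hits = 0
    for category, vocabulary in library.items():
        # Assumption: matching is by size of the keyword overlap.
        hits = len(keyword_set & vocabulary)
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category


mark = determine_category(["boiler", "valve", "invoice"], CATEGORY_LIBRARY)
```

Returning `None` when nothing matches leaves the handling of unclassifiable archives to the caller, which the text does not address.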

An archive online management system, as shown in fig. 2, includes:

An archive repository, used for storing archives.

An acquisition unit, used for acquiring the archive text information of the archive to be stored.

A keyword determining unit, used for determining keywords of the archive text information.

An archive category determining unit, used for determining the archive category of the archive to be stored.

The keyword determining unit processes the archive text information acquired by the acquiring unit to acquire a candidate keyword set of an archive to be stored; and determining the weight of the candidate keywords in the candidate keyword set, and determining the keyword set according to the weight of the candidate keywords.

The archive category determining unit determines the archive category of the archive to be stored according to the keyword set based on a preset archive category library, and marks the archive category of the archive to be stored.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application and not to limit it. Although the present application has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the application may be modified or equivalently replaced, provided that the modified technical solution does not depart from the spirit and scope of the technical solution of the application.

The system provided in the foregoing embodiment is only exemplified by the division of the foregoing functional modules, and in practical applications, the foregoing functional allocation may be performed by different functional modules according to needs, that is, the modules or steps in the embodiments of the present application are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps related to the embodiments of the present application are merely for distinguishing the respective modules or steps, and are not to be construed as unduly limiting the present application.

Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the programs corresponding to the software modules and method steps may be embodied in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation is not intended to be limiting.

Claims

1. An on-line archive management method is characterized by comprising the following steps:

acquiring file text information of a file to be stored;

2. A method of on-line archive management according to claim 1, wherein obtaining archive text information of an archive to be stored comprises:

3. The archive online management method of claim 1, wherein processing the archive text information to obtain a candidate keyword set of an archive to be stored comprises:

presetting a stop word list;

4. An archive online management method according to claim 1, wherein determining weights of candidate keywords in the candidate keyword set comprises:

acquiring the inverse text frequency of the candidate keywords;

5. An archive online management method according to claim 4, wherein the weights of the candidate keywords are calculated according to the following formula:

；

6. An archive online management method according to claim 1, wherein determining a keyword set according to the weights of the candidate keywords comprises:

and integrating the keywords to obtain a keyword set.

7. An archive online management method according to claim 6, wherein,

a preset weight group W [W1, W2, W3, …, Wm] is set, wherein m is a positive integer, W1 is a first preset weight, W2 is a second preset weight, W3 is a third preset weight, Wm is an m-th preset weight, and W1 < W2 < W3 < … < Wm;

……；

8. An archive online management method according to claim 1, wherein the archive class library comprises: file category and several keywords;

9. An on-line archive management method as claimed in claim 8, wherein the archive categories include party work categories, administrative management categories, business management categories, production technology management categories, product categories, scientific and technological research categories, infrastructure categories, equipment and instrument categories, accounting archive categories, and staff archive categories.

10. An archive online management system, comprising:

an archive repository for storing archives;