CN116166621A - Metadata extraction method and system for block chain data sharing - Google Patents

Metadata extraction method and system for block chain data sharing

Info

Publication number
CN116166621A
Authority
CN
China
Prior art keywords
data
information
extracting
file
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211546321.2A
Other languages
Chinese (zh)
Inventor
鲜开强
王小龙
巫乾军
章劲秋
张才俊
王德玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Co ltd Customer Service Center
State Grid Jiangsu Electric Power Co ltd Marketing Service Center
State Grid Electric Power Research Institute
Original Assignee
State Grid Co ltd Customer Service Center
State Grid Jiangsu Electric Power Co ltd Marketing Service Center
State Grid Electric Power Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Co ltd Customer Service Center, State Grid Jiangsu Electric Power Co ltd Marketing Service Center, State Grid Electric Power Research Institute
Priority to CN202211546321.2A
Publication of CN116166621A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164 File meta data generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/144 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A meta-information extraction method and system for blockchain data sharing, the method comprising the following steps: step 1, acquiring application data and reading personalized information from a user, and filling the application data based on the personalized information of the user; step 2, classifying the filled application data by adopting a predefined classification rule base so as to respectively acquire structured data, semi-structured data and unstructured data; and step 3, extracting keywords from the structured data, extracting features from the semi-structured data and the unstructured data, generating meta-information through the keyword extraction and the feature extraction, and uploading the meta-information to the blockchain. The method is clear in conception and provides specific feature extraction tailored to the characteristics of different types of data.

Description

Metadata extraction method and system for block chain data sharing
Technical Field
The invention relates to the field of data processing, in particular to a meta-information extraction method and system for block chain data sharing.
Background
At present, with the continuous development of blockchain data sharing technology, more and more blockchain data storage networks have been proposed. However, the blocks in these storage and sharing schemes contain only a basic block digest and location information and lack a summary description of the block data content. As a result, block data has poor readability and is difficult to retrieve, which hinders the deeper application of blockchain data storage and sharing.
In power systems, power data often has very specific and special features. For example, for analog data and switching data, the volume of data generated within the same time period is substantially fixed, and constant data and the like have significant structural characteristics. Much of the image, audio, and video data in the power system includes geographic position information or information related to the power equipment.
However, the prior art lacks not only summaries and descriptions for general-purpose data block processing, but also processing tailored to such power data. As a result, power data is difficult to retrieve and trace after being transmitted and uploaded to the chain, its readability is poor, and quick invocation and further analysis of the data are hindered, which greatly reduces the availability of the data and the speed of application processing.
On the other hand, for the integrated processing of general data, the prior-art approach to extracting information from data is mainly the schema-integration middleware method, which consists of a middleware and wrappers: each data source corresponds to one wrapper, and the middleware interacts with each data source through its wrapper to generate data meta-information. The user may issue a query request to the middleware on the basis of a global schema. The middleware approach can integrate data sources that are not in database form and offers good real-time performance. However, it scales poorly, because a different wrapper is needed for each data source system.
In addition, existing blockchain data storage and sharing systems face a variety of data files, including structured, unstructured and semi-structured data. The blockchain system generally only generates a hash fingerprint of the data and does not automatically extract meta-information, so users need to add the meta-information manually, item by item. Furthermore, because too little data is available on the blockchain, data retrieval often requires the participation of multiple blockchain nodes (for example, to perform local data queries), which limits performance.
Therefore, the prior-art methods can neither formulate a reasonable feature extraction mode for personalized data nor provide a more universal feature extraction process for general data content.
In view of the foregoing, there is a need for a method and system for extracting meta information for blockchain data sharing.
Disclosure of Invention
In order to solve the defects in the prior art, the invention provides a metadata extraction method and a system for block chain data sharing, which are used for filling application data through personalized information of a user, realizing data classification by adopting a classification rule base, and extracting characteristics of different types of data so as to realize metadata generation and uplink.
The invention adopts the following technical scheme.
The first aspect of the present invention relates to a meta-information extraction method for blockchain data sharing, which comprises the following steps: step 1, acquiring application data and reading personalized information from a user, and filling the application data based on the personalized information of the user; step 2, classifying the filled application data by adopting a predefined classification rule base so as to respectively acquire structured data, semi-structured data and unstructured data; and step 3, extracting keywords from the structured data, extracting features from the semi-structured data and the unstructured data, generating meta-information through the keyword extraction and the feature extraction, and uploading the meta-information to the blockchain.
Preferably, the personalized information of the user and the filled application data comprise: data file name, data path, data owner, data size, data time, data type, data description, data security level, file suffix, data tag.
Preferably, the predefined classification rule base implements the definition of data types based on MIME standards; and extracting the file header from the filled application data, and checking the file header based on a predefined classification rule base, thereby realizing classification of the filled application data.
Preferably, the method for checking the file header based on the predefined classification rule base comprises the following steps: the association relation among the data suffix, the data type and the data signature is defined in the classification rule base; when it is recognized that the information in the header of the application data corresponds to a certain data signature, the application data is assigned to the data type corresponding to the certain data signature.
Preferably, the data types are respectively matched to structured data, semi-structured data and unstructured data; and, the structured data includes csv files and web form files; the semi-structured data comprises Email files, xml files, html files, json files, log files, yaml files and ini files; unstructured data includes text files, audio files, video files, image files.
Preferably, the feature extraction process of the text file is as follows: constructing a corpus, segmenting the original corpus into words by adopting a bidirectional matching algorithm, and computing the inverse document frequency of each segmented word by adopting an IDF algorithm; calculating in turn the importance of each segmented word contained in the text file to be extracted, and taking the several highest-ranked words as keywords of the text file; and calculating in turn the importance of each sentence contained in the text file to be extracted, and taking the highest-ranked sentences as the abstract of the text.
Preferably, the method for calculating the importance of the segmentation comprises the following steps: calculating the maximum distance and the occurrence frequency of the word in the text file to be extracted, wherein the maximum distance is determined based on the distance between the first occurrence and the last occurrence of the word in the text file to be extracted; the importance of the term is calculated based on the maximum distance, the frequency of occurrence, and the inverse document frequency of the term.
Preferably, the importance $P_i$ of the $i$-th word is given by
Figure BDA0003980095190000031
where $IDF_i$ is the inverse document frequency of the $i$-th word in the corpus; $L_i$ is the maximum distance of the $i$-th word in the text file to be extracted, $L_i = last_i - first_i + 1$, where $last_i$ is the position of the last occurrence of the word and $first_i$ is the position of its first occurrence; $N$ is the total number of words contained in the text file to be extracted; and $F_i$ is the frequency of occurrence of the $i$-th word in the text file to be extracted, $F_i = m_i / N$, where $m_i$ is the number of occurrences of the $i$-th word in the text file to be extracted.
Preferably, the importance $S_j$ of the $j$-th sentence is given by
Figure BDA0003980095190000032
where $n_j$ is the number of words in the $j$-th sentence, and $P_{j,i}$ is the importance, in the text file to be extracted, of the $i$-th word in the $j$-th sentence.
Preferably, in the feature extraction process of the image file, geographic information, resolution and bit depth of the image data are extracted, and the geographic information, resolution and bit depth are stored in an extension field of the image file.
Preferably, in the feature extraction process of the audio file, ID3 data of the audio file is extracted, and the ID3 data is stored in an extension field of the audio file.
Preferably, after the user's modification to the meta-information is obtained, the meta-information is formed into block data and uploaded into the blockchain.
The second aspect of the present invention relates to a meta-information extraction system for blockchain data sharing, where the system is configured to implement the steps of the meta-information extraction method of the first aspect of the present invention. The system comprises a filling module, a classifying module and an extracting module. The filling module is used for collecting application data and reading personalized information from a user, and filling the application data based on the personalized information of the user. The classifying module is used for classifying the filled application data by adopting a predefined classification rule base so as to respectively acquire structured data, semi-structured data and unstructured data. The extracting module is used for extracting keywords from the structured data, extracting features from the semi-structured data and the unstructured data, generating meta-information through the keyword extraction and the feature extraction, and uploading the meta-information to the blockchain.
Compared with the prior art, the metadata extraction method and system for the block chain data sharing can fill application data through personalized information of users, realize data classification by adopting a classification rule base, and then perform feature extraction on different types of data, thereby realizing metadata generation and uplink. The method is clear, ingenious in conception and strong in universality, can provide specific feature extraction aiming at the features of different types of data, greatly improves the accuracy of feature extraction, enables the generation process of metadata to be more independent and efficient, and provides a good foundation for the subsequent data analysis and processing process after the data is uploaded.
The beneficial effects of the invention also include:
1. the method can be applied to structured, unstructured and semi-structured data (especially for audio, text and image data) in the block chain data uplink process, has strong scheme universality, can effectively and automatically extract data meta-information, simplifies the data uplink complexity, enriches the data content on the chain, and is beneficial to improving the block chain data retrieval efficiency and retrieval precision.
2. The invention builds a simple and easy-to-use meta-information automatic extraction mechanism by establishing the data classification rule base and the characteristic reference base, can realize the automatic complement of the application data meta-information under the block chain, improves the block data content, and improves the sharing and searching convenience of the block chain data.
3. The meta information extraction method can quickly generate the user data meta information, is beneficial to improving the accuracy of the on-chain data during inquiry and retrieval, and avoids frequent access to the off-chain data.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a method for extracting meta information for blockchain data sharing according to the present invention;
FIG. 2 is a schematic diagram of extracting characteristics of a Chinese document in a meta-information extraction method for sharing blockchain data according to the present invention;
fig. 3 is a schematic block diagram of a meta information extraction system for blockchain data sharing according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. The described embodiments of the invention are only some, but not all, embodiments of the invention. All other embodiments of the invention not described herein, which are obtained from the embodiments described herein, should be within the scope of the invention by those of ordinary skill in the art without undue effort based on the spirit of the present invention.
Fig. 1 is a schematic diagram illustrating steps of a meta information extraction method for blockchain data sharing according to the present invention. As shown in fig. 1, the first aspect of the present invention relates to a meta information extraction method for block chain data sharing, where the method includes steps 1 to 3.
Step 1: acquiring application data, reading personalized information from a user, and filling the application data based on the personalized information of the user.
It can be understood that the application data in the present invention may be extracted from the various application-side servers or user-side devices that need to upload data to the chain. The application data can come from one application or from several applications. These applications can perform decentralized analysis, storage and computation of the data through the current blockchain network, and transmit and exchange the data over a communication network.
It will be appreciated that the personalized information here may come from an application, in which case the application is regarded as an independent user. Of course, the present invention does not exclude treating the client device of a real user under an application as an independent user, so that the data content can be associated and matched with the personalized information of that user.
Specifically, the personalized information of the user and the filled application data comprise: data file name, data path, data owner, data size, data time, data type, data description, data security level, file suffix, data tag.
The data file name is the file name of the data to be shared, such as "records.txt"; the data path is the storage path of the data to be shared in the user's local storage system, such as https://10.90.X.x:9023/file/ae83ff23; the data owner is the user's ID; the data size is the size of the file to be shared; the data time is the latest modification time of the file to be shared; the data type is the type of the data to be shared, including audio, documents, images, videos, files, other data and the like; the data description is a detailed description text of the data, such as "this file is the data record of company XX for March 2022 ……"; the file suffix is the suffix of the data file, such as ".txt"; the data security level is a user-defined level taking one of three values, 0 (fully open), 1 (authorized sharing), and 2 (private), such as "0"; the data tag is a set of text tags set by the user, such as "business failure, north center". The data description and data tag may be left empty and completed in a subsequent step.
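The field list above can be pictured as a single record type. The following is a minimal sketch only: the class name `MetaRecord`, the attribute names, and the helper `missing_fields` are hypothetical and do not appear in the source; only the ten fields and the three security levels are taken from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetaRecord:
    """Hypothetical container for the ten fields listed in the description."""
    file_name: str          # e.g. "records.txt"
    path: str               # storage path in the user's local storage system
    owner: str              # user ID
    size: int               # file size in bytes (unit is an assumption)
    mtime: str              # latest modification time
    data_type: str          # audio, document, image, video, file, other
    suffix: str             # e.g. ".txt"
    security_level: int = 0                         # 0 open, 1 authorized, 2 private
    description: str = ""                           # may be empty at this stage
    tags: List[str] = field(default_factory=list)   # may be empty at this stage

    def __post_init__(self) -> None:
        if self.security_level not in (0, 1, 2):
            raise ValueError("security_level must be 0, 1 or 2")

    def missing_fields(self) -> List[str]:
        """Return the optional fields that still need to be completed later."""
        missing = []
        if not self.description:
            missing.append("description")
        if not self.tags:
            missing.append("tags")
        return missing
```

A record created without a description or tags reports both as missing, which corresponds to the fields that the later extraction step is meant to complete.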
The invention may also generate a data digest in accordance with the foregoing. Specifically, the data digest may be obtained by means of SHA256 (Secure Hash Algorithm 256). The digest may be computed as follows: a random asymmetric RSA (Rivest, Shamir, Adleman) key pair is generated in advance for the data content, and the public key is sent to the user's local storage module in order to pull the encrypted data stream. After the stream is decrypted with the private key, the SHA256 fingerprint of the data is computed, thereby generating the digest field.
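The fingerprint step can be sketched with Python's standard `hashlib`. This is an illustration under assumptions: the function name `sha256_digest` and the chunk size are hypothetical, and the RSA key exchange and decryption described above are deliberately left out; the stream passed in is assumed to be already decrypted.

```python
import hashlib
import io


def sha256_digest(stream, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 hex digest of a binary stream, reading in chunks."""
    hasher = hashlib.sha256()
    # iter() with a sentinel stops when read() returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


if __name__ == "__main__":
    print(sha256_digest(io.BytesIO(b"example power data record")))
```

Reading in chunks keeps memory use constant, which matters when the shared file is a large video or log.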
Step 2: classifying the filled application data by adopting a predefined classification rule base, so as to respectively obtain structured data, semi-structured data and unstructured data.
The function of the classification rule base generated in advance in the invention is to reasonably classify various application data processed in the step 1, so that the data are distributed into different types, and the invention can realize the process of reasonably extracting the characteristics according to the types of the data.
Preferably, the predefined classification rule base implements the definition of data types based on MIME (Multipurpose Internet Mail Extensions, multipurpose internet mail extension type) standards; and extracting the file header from the filled application data, and checking the file header based on a predefined classification rule base, thereby realizing classification of the filled application data.
In particular, the predefined classification rule base may be defined based on the above standard. Specifically, if the suffix of an original data file is recorded in the classification rule base, the data file can be classified according to the type predefined for that suffix.
Table 1 is the content of the classification rule base in one embodiment. As shown in table 1, the correspondence between different fields is included in the classification rule base. For this embodiment, it performs an associative mapping of the file suffix field suffix, the data type field type, and the data signature field signature. In this way, the classification rule base can implement classification for a plurality of different files.
Figure BDA0003980095190000061
Figure BDA0003980095190000071
Table 1. Classification rule base
Preferably, the method for checking the file header based on the predefined classification rule base comprises the following steps: the association relation among the data suffix, the data type and the data signature is defined in the classification rule base; when it is recognized that the information in the header of the application data corresponds to a certain data signature, the application data is assigned to the data type corresponding to the certain data signature.
It can be understood that in the present invention, the header information of an application data file may be extracted and matched against the signature information in Table 1; if a matching signature is found, the data file is assigned to the corresponding data type listed in Table 1. In addition, the suffix of the data file can be used to look up or verify whether the data belongs to the corresponding data type.
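A header check of this kind might look like the sketch below. The rule entries are assumptions: the contents of the classification rule base are only available as images in the source, so the four rows here use widely known file signatures (PNG, JPEG, PDF, ID3-tagged MP3) purely for illustration, and the names `RULES` and `classify` are hypothetical.

```python
import os
from typing import Optional

# Illustrative rule base: suffix, MIME type, and leading-bytes signature.
# These rows are NOT taken from the patent's table; they are common examples.
RULES = [
    {"suffix": ".png", "type": "image/png", "signature": b"\x89PNG\r\n\x1a\n"},
    {"suffix": ".jpg", "type": "image/jpeg", "signature": b"\xff\xd8\xff"},
    {"suffix": ".pdf", "type": "application/pdf", "signature": b"%PDF-"},
    {"suffix": ".mp3", "type": "audio/mpeg", "signature": b"ID3"},
]


def classify(file_name: str, header: bytes) -> Optional[str]:
    """Return the MIME type for a file, preferring the header signature."""
    # First pass: match the file header against each known signature.
    for rule in RULES:
        if header.startswith(rule["signature"]):
            return rule["type"]
    # Fallback: look the suffix up in the same rule base.
    suffix = os.path.splitext(file_name)[1].lower()
    for rule in RULES:
        if rule["suffix"] == suffix:
            return rule["type"]
    return None
```

Checking the signature before the suffix means a renamed file is still classified by its actual content; the suffix serves only as a fallback.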
Preferably, the data types are respectively matched to structured data, semi-structured data and unstructured data; and, the structured data includes csv files and web form files; the semi-structured data comprises Email files, xml files, html files, json files, log files, yaml files and ini files; unstructured data includes text files, audio files, video files, image files.
It will be appreciated that the data types in the present invention are respectively corresponding to different types of data, while the structured data, the semi-structured data and the unstructured data respectively correspond to different feature extraction modes.
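The three-way grouping can be expressed as a lookup from file suffix to category. This is a sketch under assumptions: the suffixes chosen for Email (`.eml`), audio, video and image files are illustrative, web form files are omitted because the source gives no suffix for them, and the names `CATEGORY_BY_SUFFIX` and `category_of` are hypothetical.

```python
import os

CATEGORY_BY_SUFFIX = {
    # structured
    ".csv": "structured",
    # semi-structured
    ".eml": "semi-structured", ".xml": "semi-structured",
    ".html": "semi-structured", ".json": "semi-structured",
    ".log": "semi-structured", ".yaml": "semi-structured",
    ".ini": "semi-structured",
    # unstructured (text, audio, video, image)
    ".txt": "unstructured", ".mp3": "unstructured",
    ".mp4": "unstructured", ".png": "unstructured",
    ".jpg": "unstructured",
}


def category_of(file_name: str) -> str:
    """Map a file name to structured / semi-structured / unstructured."""
    suffix = os.path.splitext(file_name)[1].lower()
    return CATEGORY_BY_SUFFIX.get(suffix, "unknown")
```

The returned category can then be used to select which extraction routine to run in step 3.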
Step 3: extracting keywords from the structured data, extracting features from the semi-structured data and the unstructured data, generating meta-information through the keyword extraction and the feature extraction, and uploading the meta-information to the blockchain.
It will be appreciated that since structured data is typically stored in the form of a strict data table or the like, extraction of key fields may be performed simply therefrom, and important features of the structured data may be obtained.
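For a csv file, one simple reading of "key field extraction" is to take the column names from the header row. The sketch below assumes that reading; the function name `csv_key_fields` and the returned dictionary layout are hypothetical, and the source does not specify which fields count as key fields.

```python
import csv
import io


def csv_key_fields(text: str) -> dict:
    """Return the column names and the number of data rows of a CSV document."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, [])            # first row is assumed to be the header
    row_count = sum(1 for _ in reader)   # remaining rows are data rows
    return {"fields": [name.strip() for name in header], "row_count": row_count}


if __name__ == "__main__":
    sample = "meter_id,voltage,timestamp\n1,220,2022-03-01\n2,221,2022-03-01\n"
    print(csv_key_fields(sample))
```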
However, the process of feature extraction is more complex for semi-structured and unstructured data. In particular, file-type data such as text, image, audio and video does not have a strict, standardized structure, so accurately extracting the features of such files can greatly improve the subsequent processing and analysis of the data.
Fig. 2 is a schematic diagram of feature extraction of a file in a meta-information extraction method for blockchain data sharing according to the present invention. As shown in fig. 2, the feature extraction process of the text file is preferably as follows: constructing a corpus, segmenting the original corpus into words by adopting a bidirectional matching algorithm, and computing the inverse document frequency of each segmented word by adopting an IDF algorithm; calculating in turn the importance of each segmented word contained in the text file to be extracted, and taking the several highest-ranked words as keywords of the text file; and calculating in turn the importance of each sentence contained in the text file to be extracted, and taking the highest-ranked sentences as the abstract of the text.
In particular, compared with the abundant self-contained meta information tags of audio class and image class data, the meta information of text class data is usually directly contained in the data description, and keywords and topics contained in the text need to be discovered through specific means.
For text data, the invention mainly extracts the keywords of the text and an abstract description text, which are used to complete the data description and data tag fields. In one embodiment of the present invention, power data is taken as an example. First, a dictionary base is constructed from an open-source corpus and a power-industry corpus, the original corpus is segmented into words with a bidirectional matching algorithm, and the global importance of each word, namely its inverse document frequency (IDF), is pre-computed with an optimized TF-IDF (term frequency-inverse document frequency) algorithm. Then, for a new text, the relative importance of each word within the text is calculated, and the k highest-ranked words are taken as keywords. Finally, the average importance of each sentence is calculated from the relative importance of the words it contains, and the sentence with the highest score is taken as the text abstract.
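The source names a bidirectional matching algorithm without spelling it out. The sketch below shows one common form, bidirectional maximum matching: segment with forward and backward maximum matching, then keep the result with fewer words, breaking ties by fewer single-character words and then preferring the backward result. The tie-breaking rules, function names and tiny vocabulary are assumptions, not details from the patent.

```python
from typing import List, Set


def forward_max_match(text: str, vocab: Set[str], max_len: int) -> List[str]:
    """Greedy left-to-right segmentation using the longest dictionary match."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:   # single characters always match
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text: str, vocab: Set[str], max_len: int) -> List[str]:
    """Greedy right-to-left segmentation using the longest dictionary match."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_match(text: str, vocab: Set[str]) -> List[str]:
    """Pick the better of the forward and backward segmentations."""
    max_len = max((len(w) for w in vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    singles_fwd = sum(1 for w in fwd if len(w) == 1)
    singles_bwd = sum(1 for w in bwd if len(w) == 1)
    return fwd if singles_fwd < singles_bwd else bwd


if __name__ == "__main__":
    vocab = {"研究", "研究生", "生命", "起源"}
    print(bidirectional_match("研究生命的起源", vocab))
```

In the usage example the forward pass greedily takes "研究生" and strands "命" as a single character, while the backward pass recovers "研究" and "生命"; the tie-break on single-character words selects the backward result.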
It will be appreciated that the inverse document frequency $IDF_i$ of a word in the present invention may be calculated as
$IDF_i = w_i \cdot \ln(D+1)/(d_i+1)$
where $w_i$ is the weight of the $i$-th word in the proprietary word stock, which can take a value between 0 and 1 and defaults to 1; $D$ is the size of the corpus; and $d_i$ may be the number of times the word appears in the corpus.
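A small numeric sketch of the IDF term follows. One point is an assumption and is flagged in the code: the formula as printed is ambiguous about whether the logarithm covers the whole ratio, and the sketch takes the conventional smoothed reading $\ln\big((D+1)/(d_i+1)\big)$. The function name `idf` and its parameter names are hypothetical.

```python
import math


def idf(corpus_size: int, occurrences: int, weight: float = 1.0) -> float:
    """Weighted, smoothed inverse document frequency.

    ASSUMPTION: the printed formula is read as w * ln((D + 1) / (d + 1)).
    If the intended form is w * ln(D + 1) / (d + 1), change the line below.
    """
    return weight * math.log((corpus_size + 1) / (occurrences + 1))


if __name__ == "__main__":
    # A rare word scores higher than a common one in the same corpus.
    print(idf(9999, 9), idf(9999, 4999))
```

Under this reading, a word that appears everywhere in the corpus gets a value of zero, and the weight $w_i$ scales the result linearly.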
Preferably, the method for calculating the importance of the segmentation comprises the following steps: calculating the maximum distance and the occurrence frequency of the word in the text file to be extracted, wherein the maximum distance is determined based on the distance between the first occurrence and the last occurrence of the word in the text file to be extracted; the importance of the term is calculated based on the maximum distance, the frequency of occurrence, and the inverse document frequency of the term.
Specifically, the importance $P_i$ of the $i$-th word is given by
Figure BDA0003980095190000081
where $IDF_i$ is the inverse document frequency of the $i$-th word in the corpus;
$L_i$ is the maximum distance of the $i$-th word in the text file to be extracted, $L_i = last_i - first_i + 1$, where $last_i$ is the position of the last occurrence of the word and $first_i$ is the position of its first occurrence;
$N$ is the total number of words contained in the text file to be extracted;
$F_i$ is the frequency of occurrence of the $i$-th word in the text file to be extracted, $F_i = m_i / N$, where $m_i$ is the number of occurrences of the $i$-th word in the text file to be extracted.
It is understood that the conventional TF-IDF algorithm considers only the frequency information of a word and does not consider its position information. However, the most important words tend to occur at both the beginning and the end of an article: the wider the range over which a word occurs, the more important it is. The present invention therefore suppresses or boosts the importance of a word according to the relative length of the span over which it occurs. Words with a wider range of occurrence have their weights boosted, while words with a narrower range have their weights suppressed. With this improvement, the invention captures the distribution range of a word within the document and adjusts the word's importance accordingly, making feature extraction more accurate.
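The span $L_i$ and frequency $F_i$ are fully defined in the text and can be computed directly. How they combine with $IDF_i$ into $P_i$ is given only as an image in the source, so the product used below, $IDF_i \cdot F_i \cdot L_i / N$, is an assumption chosen to match the described behaviour (wider span boosts, narrower span suppresses). It should not be read as the patented formula. All function names are hypothetical.

```python
from typing import Dict, List, Tuple


def span_and_frequency(tokens: List[str]) -> Dict[str, Tuple[int, float]]:
    """For each word return (L_i, F_i): span last-first+1 and count / N."""
    first: Dict[str, int] = {}
    last: Dict[str, int] = {}
    count: Dict[str, int] = {}
    for pos, tok in enumerate(tokens):
        first.setdefault(tok, pos)   # remember only the first position
        last[tok] = pos              # overwrite to keep the last position
        count[tok] = count.get(tok, 0) + 1
    n = len(tokens)
    return {t: (last[t] - first[t] + 1, count[t] / n) for t in count}


def importance(tokens: List[str], idf_table: Dict[str, float],
               default_idf: float = 1.0) -> Dict[str, float]:
    """ASSUMED combination: P_i = IDF_i * F_i * L_i / N (not from the source)."""
    n = len(tokens)
    stats = span_and_frequency(tokens)
    return {t: idf_table.get(t, default_idf) * freq * span / n
            for t, (span, freq) in stats.items()}


def top_keywords(tokens: List[str], idf_table: Dict[str, float], k: int) -> List[str]:
    """Return the k words with the highest importance."""
    scores = importance(tokens, idf_table)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With the tokens `["grid", "fault", "at", "grid"]` and a uniform IDF, "grid" spans the whole text and appears twice, so it outranks "fault", which appears once in a single position.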
Preferably, the importance $S_j$ of the $j$-th sentence is given by
Figure BDA0003980095190000091
where $n_j$ is the number of words contained in the $j$-th sentence, and
$P_{j,i}$ is the importance, in the text file to be extracted, of the $i$-th word in the $j$-th sentence.
Besides analyzing the importance of word segmentation, the method also generates the abstract of the text by extracting the most critical sentences, so that the subsequent analysis of the text can be more direct and effective.
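The sentence formula itself is an image in the source, but the embodiment describes the score as the average importance of the words in a sentence. The sketch below assumes that reading, $S_j = \frac{1}{n_j}\sum_{i} P_{j,i}$; the function names and the choice to return summary sentences in document order are assumptions.

```python
from typing import Dict, List


def sentence_scores(sentences: List[List[str]],
                    word_importance: Dict[str, float]) -> List[float]:
    """ASSUMED form: mean importance of the words in each sentence."""
    scores = []
    for words in sentences:
        if not words:                 # avoid division by zero for empty sentences
            scores.append(0.0)
            continue
        total = sum(word_importance.get(w, 0.0) for w in words)
        scores.append(total / len(words))
    return scores


def summarize(sentences: List[List[str]],
              word_importance: Dict[str, float], top: int = 1) -> List[List[str]]:
    """Return the `top` highest-scoring sentences, kept in document order."""
    scores = sentence_scores(sentences, word_importance)
    ranked = sorted(range(len(sentences)), key=lambda j: scores[j], reverse=True)
    return [sentences[j] for j in sorted(ranked[:top])]
```

Averaging rather than summing keeps long sentences from winning simply because they contain more words.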
Compared with a text feature extraction method based on deep learning, the improved TF-IDF has better performance on a cold start system, small demand on original data, high operation efficiency, capability of quickly generating text keywords and abstracts, and higher practicability on block chain lightweight nodes with weaker performance (such as edge internet of things nodes).
In addition, the invention also defines feature extraction methods for image files and audio files.
Specifically, in the feature extraction process for an image file, the geographic information, resolution and bit depth of the image data are extracted and stored in an extension field of the image file. In the feature extraction process for an audio file, the ID3 data of the audio file is extracted and stored in an extension field of the audio file.
Table 2 is an example of feature extraction for image data in the present invention, in which three items of information about one picture are extracted.
Field name | Value | Remarks
geotag | 4.000000E,50.000000N | Photo geographic information
img_size | 1280x720 | Photo resolution
depth | 8 | Bit depth
Table 2. Image data feature table
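As one concrete illustration of how the `img_size` and `depth` fields of Table 2 could be obtained, the sketch below reads them from the IHDR chunk of a PNG file using only the standard library. Restricting the example to PNG is an assumption made for brevity; the source does not name an image format, and the `geotag` field would typically come from EXIF metadata, which is not covered here. The function name `png_features` is hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_features(data: bytes) -> dict:
    """Read resolution and bit depth from the IHDR chunk of PNG bytes.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    then width (4 bytes), height (4 bytes) and bit depth (1 byte), big-endian.
    """
    if len(data) < 25 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height, depth = struct.unpack(">IIB", data[16:25])
    # Field names follow Table 2; the dict is meant for the extension field.
    return {"img_size": f"{width}x{height}", "depth": depth}
```

The returned dictionary can be merged into the extension field of the image's meta-information record.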
Table 3 is an example of feature extraction for audio data in the present invention, in which the relevant features of one piece of audio are extracted.
Field name | Value | Remarks
title | "three years two class" | Audio title
author | "week JJ" | Audio author
year | 2007 | Audio year
type | Jazz | Audio type
comment | "Remarks" | Audio remark information
duration | 5m12s | Audio duration
Table 3. Audio data feature table
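The fields of Table 3 can be illustrated with the simplest ID3 variant, ID3v1, which stores a fixed 128-byte tag at the end of the file. The sketch below is an assumption about which ID3 version is used (the source only says "ID3 data"); the function name is hypothetical, and only a few genre codes are listed. ID3v1 carries no duration, so the `duration` field would have to be derived from the audio stream itself.

```python
# Partial ID3v1 genre table; only a handful of codes are listed here.
ID3V1_GENRES = {0: "Blues", 8: "Jazz", 17: "Rock", 32: "Classical"}


def read_id3v1(data: bytes):
    """Parse the trailing 128-byte ID3v1 tag, or return None if absent.

    Layout: "TAG", title (30), artist (30), album (30), year (4),
    comment (30), genre (1).
    """
    if len(data) < 128 or data[-128:-125] != b"TAG":
        return None
    tag = data[-128:]

    def text(raw: bytes) -> str:
        # Fields are NUL- or space-padded; ID3v1 text is nominally Latin-1.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    # Field names follow Table 3 ("author" holds the ID3 artist field).
    return {
        "title": text(tag[3:33]),
        "author": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
        "comment": text(tag[97:127]),
        "type": ID3V1_GENRES.get(tag[127], "Unknown"),
    }
```

Files tagged with ID3v2 keep their tag at the start of the file in a frame-based layout and would need a different parser.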
Preferably, after the user's modifications to the meta-information are obtained, the meta-information is formed into block data and uploaded to the blockchain.
It can be understood that, in the invention, the user can review and modify the automatically generated meta-information; a block is then generated from the meta-information of the application data and pushed to the blockchain network, which completes putting the data on chain. With this meta information extraction method, meta-information for user data can be generated quickly, the accuracy of querying and retrieving on-chain data is improved, and frequent access to off-chain data is avoided.
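The step of forming block data from the reviewed meta-information can be sketched as follows. The block layout (`prev_hash`, `timestamp`, `meta_info`, `hash`), the use of SHA-256 over canonical JSON, and both function names are assumptions for illustration; the source does not specify a block format, and a real blockchain platform would impose its own transaction and block structure.

```python
import hashlib
import json
import time


def apply_user_edits(meta_info: dict, edits: dict) -> dict:
    """Overlay the user's corrections on the automatically generated fields."""
    return {**meta_info, **edits}


def make_block(meta_info: dict, prev_hash: str, timestamp=None) -> dict:
    """Wrap meta-information in a hash-linked block (hypothetical layout)."""
    body = {
        "prev_hash": prev_hash,
        "timestamp": time.time() if timestamp is None else timestamp,
        "meta_info": meta_info,
    }
    # Canonical JSON (sorted keys) so the same content always hashes the same.
    payload = json.dumps(body, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return {**body, "hash": hashlib.sha256(payload).hexdigest()}
```

Because only the meta-information is hashed into the block, queries can be answered from on-chain fields without fetching the off-chain file.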
Fig. 3 is a schematic block diagram of a meta information extraction system for blockchain data sharing according to the present invention. As shown in Fig. 3, a second aspect of the present invention provides a meta information extraction system for blockchain data sharing, which is configured to implement the steps of the meta information extraction method for blockchain data sharing of the first aspect of the present invention. The system comprises a filling module, a classification module and an extraction module. The filling module is used for collecting application data, reading personalized information from a user, and filling the application data based on the personalized information of the user. The classification module is used for classifying the filled application data by means of a predefined classification rule base, so as to obtain structured data, semi-structured data and unstructured data respectively. The extraction module is used for extracting keywords from the structured data, extracting features from the semi-structured data and the unstructured data, generating meta-information from the keyword extraction and the feature extraction, and uploading the meta-information to the blockchain.
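The three modules can be pictured as three small functions, as in the sketch below. The rule base entries, field names and function names are hypothetical. Checking the file header against known signatures follows the classification approach stated in the claims, while falling back to the file suffix for formats without a signature is an assumption added here. The `extract` function is only a placeholder for the type-specific extractors.

```python
import os

# Hypothetical rule base: (suffix, MIME type, header signature, category).
SIGNATURE_RULES = [
    (".png", "image/png", b"\x89PNG\r\n\x1a\n", "unstructured"),
    (".mp3", "audio/mpeg", b"ID3", "unstructured"),
    (".xml", "application/xml", b"<?xml", "semi-structured"),
]
# ASSUMPTION: plain-text formats have no signature, so fall back to the suffix.
SUFFIX_RULES = {
    ".csv": ("text/csv", "structured"),
    ".json": ("application/json", "semi-structured"),
    ".txt": ("text/plain", "unstructured"),
}


def fill(path: str, data: bytes, user_info: dict) -> dict:
    """Filling module: combine file facts with the user's personalized info."""
    meta = {
        "file_name": os.path.basename(path),
        "path": path,
        "size": len(data),
        "suffix": os.path.splitext(path)[1].lower(),
    }
    meta.update(user_info)
    return meta


def classify(meta: dict, data: bytes):
    """Classification module: match the file header against the rule base."""
    for _suffix, mime, signature, category in SIGNATURE_RULES:
        if data.startswith(signature):
            return mime, category
    return SUFFIX_RULES.get(meta["suffix"], ("application/octet-stream", "unstructured"))


def extract(meta: dict, data: bytes) -> dict:
    """Extraction module placeholder: record the type for later extractors."""
    mime, category = classify(meta, data)
    return {**meta, "data_type": mime, "category": category}
```

Keeping the rule base as data rather than code means new file types can be supported by adding entries instead of changing the modules.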
It may be understood that, in order to implement each function in the method provided in the foregoing embodiment of the present application, the meta information extraction system includes a hardware structure and/or a software module that perform each function. Those of skill in the art will readily appreciate that the algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application may divide the functional modules of the meta information extraction system according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.
The meta information extraction system of the present invention may be implemented by one or more server devices via a local network connection, wherein the server devices comprise at least one processor, a bus system and at least one communication interface. The processor may be a central processing unit (Central Processing Unit, CPU), or may be replaced by a field programmable gate array (Field Programmable Gate Array, FPGA), application-specific integrated circuit (ASIC), or other hardware, or the FPGA or other hardware may be used together with the CPU as a processor.
The memory may be, but is not limited to, read-only memory (ROM) or other type of static storage device that can store static information and instructions, random access memory (random access memory, RAM) or other type of dynamic storage device that can store information and instructions, as well as electrically erasable programmable read-only memory (electrically erasable programmable read-only memory, EEPROM), compact disk read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be stand alone and coupled to the processor via a bus. The memory may also be integrated with the processor.
The hard disk may be a mechanical disk or a solid state disk (Solid State Drive, SSD), etc. The interface card may be a host bus adapter (Host Bus Adapter, HBA), a redundant array of independent disks card (Redundant Array of Independent Disks, RAID), an expander card (Expander), or a network interface controller (Network Interface Controller, NIC), which is not limited by the embodiment of the present invention. The interface card in the hard disk module communicates with the hard disk. The storage node communicates with the interface card of the hard disk module to access the hard disk in the hard disk module.
The interface of the hard disk may be a serial attached small computer system interface (Serial Attached Small Computer System Interface, SAS), a serial advanced technology attachment (Serial Advanced Technology Attachment, SATA) interface, or a high-speed serial computer expansion bus standard (Peripheral Component Interconnect express, PCIe) interface, etc.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Computer program instructions for carrying out operations of the present invention may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer readable program instructions, so that the electronic circuitry can execute the computer readable program instructions.
Compared with the prior art, the meta information extraction method and system for blockchain data sharing of the present invention fill application data with the personalized information of users, classify the data by means of a classification rule base, and then perform feature extraction on the different types of data, thereby generating metadata and uploading it to the blockchain. The method is clearly structured and broadly applicable. It provides feature extraction tailored to the characteristics of each data type, which improves the accuracy of feature extraction, makes metadata generation more independent and efficient, and provides a good foundation for subsequent data analysis and processing after the data is uploaded.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (13)

1. A meta information extraction method for blockchain data sharing, characterized by comprising the following steps:
step 1, collecting application data and reading personalized information from a user, and filling the application data based on the personalized information of the user;
step 2, classifying the filled application data by adopting a predefined classification rule base so as to respectively acquire structured data, semi-structured data and unstructured data;
step 3, extracting keywords from the structured data, extracting features from the semi-structured data and the unstructured data, generating meta-information through the keyword extraction and the feature extraction, and uploading the meta-information to the blockchain.
2. The method for extracting meta information for blockchain-oriented data sharing as in claim 1, wherein:
the personalized information of the user and the filled application data comprise: data file name, data path, data owner, data size, data time, data type, data description, data security level, file suffix, data tag.
3. The method for extracting meta information for blockchain-oriented data sharing as in claim 2, wherein:
the predefined classification rule base defines data types based on the MIME standard; and
a file header is extracted from the filled application data and checked against the predefined classification rule base, so as to classify the filled application data.
4. A method for extracting meta information for blockchain-oriented data sharing as defined in claim 3, wherein:
the method for checking the file header based on the predefined classification rule base comprises the following steps:
the association relation among the data suffix, the data type and the data signature is defined in the classification rule base;
and when the information in the file header of the application data is identified to correspond to a certain data signature, the application data is distributed to the data type corresponding to the certain data signature.
5. The method for extracting meta information for blockchain-oriented data sharing as in claim 4, wherein:
the data types are respectively mapped to structured data, semi-structured data and unstructured data; and
the structured data comprises csv files and web form files;
the semi-structured data comprises Email files, xml files, html files, json files, log files, yaml files and ini files;
the unstructured data includes text files, audio files, video files, image files.
6. The method for extracting meta information for blockchain-oriented data sharing as in claim 5, wherein:
the characteristic extraction process of the text file comprises the following steps:
constructing a corpus, segmenting the original text in the corpus into words by adopting a bi-directional matching algorithm, and obtaining the inverse document frequency of each segmented word by adopting an IDF algorithm;
calculating in turn the importance of each segmented word contained in the text file to be extracted, and taking the several highest-ranked segmented words as the keywords of the text file;
and calculating in turn the importance of each sentence contained in the text file to be extracted, and taking the several highest-ranked sentences as the abstract of the text.
7. The method for extracting meta information for blockchain-oriented data sharing as in claim 6, wherein:
the method for calculating the importance degree of the word segmentation comprises the following steps:
calculating the maximum distance and the occurrence frequency of the word in the text file to be extracted, wherein the maximum distance is determined based on the distance between the first occurrence and the last occurrence of the word in the text file to be extracted;
and calculating importance of the word segmentation based on the maximum distance, the occurrence frequency and the inverse document frequency of the word segmentation.
8. The method for extracting meta information for blockchain-oriented data sharing as in claim 7, wherein:
the importance $P_i$ of the segmented word is:
Figure FDA0003980095180000021
wherein $\mathrm{IDF}_i$ is the inverse document frequency of the i-th segmented word in the corpus,
$L_i$ is the maximum distance of the i-th segmented word in the text file to be extracted, $L_i = \mathrm{last}_i - \mathrm{first}_i + 1$, where $\mathrm{last}_i$ is the position at which the segmented word last occurs and $\mathrm{first}_i$ is the position at which the segmented word first occurs,
$N$ is the total number of segmented words contained in the text file to be extracted,
$F_i$ is the frequency of occurrence of the i-th segmented word in the text file to be extracted, $F_i = m_i / N$, where $m_i$ is the number of occurrences of the i-th segmented word in the text file to be extracted.
9. The method for extracting meta information for blockchain-oriented data sharing as in claim 8, wherein:
the importance $S_j$ of the j-th sentence is:
Figure FDA0003980095180000031
wherein $n_j$ is the number of segmented words contained in the j-th sentence, and
$P_{j,i}$ is the importance of the i-th segmented word of the j-th sentence in the text file to be extracted.
10. The method for extracting meta information for blockchain-oriented data sharing as in claim 5, wherein:
in the feature extraction process for the image file, the geographic information, resolution and bit depth of the image data are extracted and stored in an extension field of the image file.
11. The method for extracting meta information for blockchain-oriented data sharing as in claim 5, wherein:
in the feature extraction process for the audio file, the ID3 data of the audio file is extracted and stored in an extension field of the audio file.
12. The method for extracting meta information for blockchain-oriented data sharing as in claim 1, wherein:
after the user's modification to the meta-information is obtained, the meta-information is formed into block data and uploaded to the blockchain.
13. A meta information extraction system oriented to block chain data sharing is characterized in that:
the system is used for implementing the steps of the meta information extraction method for blockchain data sharing according to any one of claims 1-12; and
the system comprises a filling module, a classification module and an extraction module; and
the filling module is used for collecting application data and reading personalized information from a user, and filling the application data based on the personalized information of the user;
the classification module is used for classifying the filled application data by adopting a predefined classification rule base so as to respectively acquire structured data, semi-structured data and unstructured data;
the extraction module is used for extracting keywords from the structured data, extracting features from the semi-structured data and the unstructured data, generating meta-information through the keyword extraction and the feature extraction, and uploading the meta-information to the blockchain.
CN202211546321.2A 2022-12-05 2022-12-05 Metadata extraction method and system for block chain data sharing Pending CN116166621A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211546321.2A CN116166621A (en) 2022-12-05 2022-12-05 Metadata extraction method and system for block chain data sharing

Publications (1)

Publication Number Publication Date
CN116166621A true CN116166621A (en) 2023-05-26

Family

ID=86410079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211546321.2A Pending CN116166621A (en) 2022-12-05 2022-12-05 Metadata extraction method and system for block chain data sharing

Country Status (1)

Country Link
CN (1) CN116166621A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination