CN111858483A - Software sample hybrid storage system based on multiple databases and file systems - Google Patents
- Publication number
- CN111858483A (application CN202010741675.7A)
- Authority
- CN
- China
- Prior art keywords
- database
- storage system
- retrieval
- data
- relational database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
- G06F16/148—File search processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24552—Database cache management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Library & Information Science (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a software sample hybrid storage system based on multiple databases and a file system. The system comprises a cache module, a retrieval module, a distributed file storage system and a database system. The database system comprises a relational database, a distributed non-relational database and a graph database arranged side by side: the relational database stores structured data, the distributed non-relational database stores fingerprint feature information extracted from software samples, and the graph database stores association relation data. The distributed file storage system stores text information. The cache module is connected to the retrieval module, and the retrieval module is connected to the distributed file storage system, the relational database, the distributed non-relational database and the graph database respectively. The invention addresses the low efficiency of ingesting and retrieving massive software sample feature data, the difficulty of data storage and management, the poor scalability of data services, and the inability to serve real-time data requests from multiple users, and offers classified storage, efficient management and fast retrieval.
Description
Technical Field
The invention mainly relates to the technical field of software storage, in particular to a software sample hybrid storage system based on various databases and file systems.
Background
Massive software samples and their features are the basis of homology analysis tasks such as software piracy detection, malware detection and vulnerability detection. (Software homology analysis determines whether different pieces of software code originate from the same code or were written by the same author or team, and whether they are internally related and similar.) Software samples and their features have many data attributes and come in many types: structured data such as metadata, unstructured data such as attribute values, and graph data such as association relation data, and they include both file-type and database-type data. Existing single-type databases, file systems and limited hybrid storage schemes all suffer from low efficiency in ingesting and retrieving massive software sample feature data, difficult data storage and management, and poor scalability of data services, and they cannot satisfy real-time data requests from multiple users.
Disclosure of Invention
In view of this, the present invention provides a software sample hybrid storage system based on multiple databases and file systems, which addresses the problems in the prior art of low efficiency in ingesting and retrieving massive software sample feature data, difficult data storage and management, poor scalability of data services, and inability to satisfy real-time data requests from multiple users.
The software sample hybrid storage system based on multiple databases and file systems comprises a cache module, a retrieval module, a distributed file storage system and a database system. The cache module is connected to the retrieval module. The distributed file storage system is used for storing text information. The database system comprises a relational database, a distributed non-relational database and a graph database arranged side by side: the relational database is used for storing structured data, the distributed non-relational database is used for storing fingerprint feature information extracted from software samples, and the graph database is used for storing association relation data. The retrieval module is connected to the distributed file storage system, the relational database, the distributed non-relational database and the graph database respectively.
Further, the distributed file storage system adopts an HDFS distributed file system, and the HDFS distributed file system is arranged side by side with the graph database, the distributed non-relational database and the relational database from right to left in sequence.
Furthermore, the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism for fast data query over the distributed file storage system and the database system.
Further, the cache module is a Redis cache database and is used for improving the retrieval efficiency and reducing the response time.
Further, the structured data includes source code and function information; the fingerprint feature information includes dynamic fingerprint, static fingerprint and source code fingerprint information; the association relation data includes association information, function call relation graph and program control flow graph information; and the text information includes security analysis reports, vulnerability information, security-related forum information and security-related blog information.
Further, the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism for fast data query over the distributed file storage system and the database system, which specifically proceeds as follows:
step 1, an external query request is provided;
step 2, inquiring whether cached data exist or not through a Redis cache database, and if yes, executing step 5; otherwise, executing step 3;
step 3, submitting the external query request to a retrieval module for retrieval, performing word segmentation processing on query request statements by the retrieval module, then performing fragment retrieval in indexes of a relational database, a distributed non-relational database, a graph database and an HDFS distributed file system, acquiring structured, non-structured and graph relational data and software sample files as required, and aggregating returned retrieval results by the retrieval module;
step 4, caching the query result in a Redis cache database while returning the query result, and providing high-efficiency data service for subsequent repeated use of the same data;
step 5, returning the retrieval result.
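The five steps above follow what is commonly called a cache-aside pattern. The following is a minimal Python sketch of that control flow only, not the patented implementation: a plain dictionary stands in for the Redis cache database, a caller-supplied function stands in for the retrieval module, and the class name `QueryService`, the key normalisation and the `backend_calls` counter are all hypothetical additions for illustration.

```python
import hashlib
import json
from typing import Callable, Dict, List


class QueryService:
    """Cache-aside lookup: check the cache first, fall back to retrieval."""

    def __init__(self, retrieve: Callable[[str], List[dict]]):
        self._cache: Dict[str, str] = {}  # stand-in for the Redis cache database
        self._retrieve = retrieve         # stand-in for the retrieval module
        self.backend_calls = 0            # counts cache misses (illustration only)

    @staticmethod
    def _key(query: str) -> str:
        # Assumption: normalise case and whitespace so that trivially
        # different spellings of a query share one cache entry.
        normalised = " ".join(query.lower().split())
        return "query:" + hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def query(self, text: str) -> List[dict]:
        key = self._key(text)                   # step 1: external request arrives
        cached = self._cache.get(key)           # step 2: is there cached data?
        if cached is not None:
            return json.loads(cached)           # step 5: return the cached result
        self.backend_calls += 1
        results = self._retrieve(text)          # step 3: retrieval across the stores
        self._cache[key] = json.dumps(results)  # step 4: cache the result
        return results                          # step 5: return the retrieval result


def fake_retrieve(query: str) -> List[dict]:
    # Hypothetical retrieval function used only for this demonstration.
    return [{"source": "demo", "match": query}]


service = QueryService(fake_retrieve)
first = service.query("static fingerprint")
second = service.query("Static   Fingerprint")  # served from the cache
```

In this sketch the second call never reaches the retrieval function, which is the effect step 4 aims for when the same data is requested repeatedly. A real deployment would also need an expiry or invalidation policy for cached entries, which the text does not specify.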
The invention provides a software sample hybrid storage system based on multiple databases and a file system, comprising a cache module, a retrieval module, a distributed file storage system and a database system, wherein the database system comprises a relational database, a distributed non-relational database and a graph database arranged side by side. The cache module is used for caching data; the retrieval module is used for fast data query over the distributed file storage system and the database system; the distributed file storage system is used for storing text information; the relational database is used for storing structured data; the distributed non-relational database is used for storing fingerprint feature information extracted from software samples; and the graph database is used for storing association relation data. The cache module is connected to the retrieval module, and the retrieval module is connected to the distributed file storage system, the relational database, the distributed non-relational database and the graph database respectively. Compared with the prior art, the invention addresses the low efficiency of ingesting and retrieving massive software sample feature data, the difficulty of data storage and management, the poor scalability of data services, and the inability to satisfy real-time data requests from multiple users, and realizes classified storage, efficient management and fast retrieval of software sample data.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a block diagram of a hybrid storage system for software samples based on various databases and file systems according to an embodiment of the present invention;
FIG. 2 is a flowchart of the retrieval module implementing data query and retrieval through Elasticsearch according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In the present invention, the orientations such as "left" and "right" are used with reference to the view shown in fig. 1.
The invention provides a software sample hybrid storage system based on multiple databases and a file system, comprising a cache module, a retrieval module, a distributed file storage system and a database system. The database system comprises a relational database (MySQL), a distributed non-relational database (Cassandra) and a graph database (Neo4j) arranged side by side. Specifically, the cache module caches data, which improves retrieval efficiency and reduces response time; the retrieval module provides fast data query over the distributed file storage system and the database system; the distributed file storage system stores text information; MySQL stores structured data; Cassandra stores fingerprint feature information extracted from software samples; and Neo4j stores association relation data. The cache module is connected to the retrieval module for bidirectional data transmission, and the retrieval module is connected to the distributed file storage system, MySQL, Cassandra and Neo4j respectively, likewise for bidirectional data transmission. With this arrangement, on the one hand, the respective strengths of the distributed file storage system and of each database can be fully exploited to suit the different application scenarios of the system; on the other hand, efficient management and fast retrieval of software sample data can be achieved.
As a preferred embodiment of the present invention, the distributed file storage system adopts an HDFS distributed file system, the retrieval module uses an Elasticsearch to establish an efficient retrieval mechanism, and the HDFS distributed file system is arranged side by side with Neo4j, Cassandra, and MySQL from right to left in sequence, specifically referring to fig. 1.
In a further technical solution, the structured data includes source code and function information; the fingerprint feature information includes dynamic fingerprint, static fingerprint and source code fingerprint information; the association relation data includes association information, function call relation graph and program control flow graph information; and the text information includes security analysis reports, vulnerability information, security-related forum information and security-related blog information, but is not limited thereto. Classified storage of software sample data is thereby realized: source code and function information are stored in MySQL; dynamic fingerprint, static fingerprint and source code fingerprint information are stored in Cassandra; and association information, function call relation graph and program control flow graph information are stored in Neo4j.
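The classified storage described above amounts to a lookup from data category to backing store. The following Python sketch illustrates that idea under stated assumptions: the category keys (such as `static_fingerprint`) and the `route` helper are hypothetical names chosen here, and the mapping of text information to HDFS follows the preferred embodiment in which the distributed file storage system is HDFS.

```python
from enum import Enum


class Store(Enum):
    MYSQL = "relational database (MySQL)"
    CASSANDRA = "distributed non-relational database (Cassandra)"
    NEO4J = "graph database (Neo4j)"
    HDFS = "distributed file system (HDFS)"


# Hypothetical category labels for the data kinds listed in the text.
ROUTING = {
    # structured data
    "source_code": Store.MYSQL,
    "function_info": Store.MYSQL,
    # fingerprint feature information
    "dynamic_fingerprint": Store.CASSANDRA,
    "static_fingerprint": Store.CASSANDRA,
    "source_code_fingerprint": Store.CASSANDRA,
    # association relation data
    "association_info": Store.NEO4J,
    "function_call_graph": Store.NEO4J,
    "control_flow_graph": Store.NEO4J,
    # text information
    "security_analysis_report": Store.HDFS,
    "vulnerability_info": Store.HDFS,
    "forum_security_info": Store.HDFS,
    "security_blog_info": Store.HDFS,
}


def route(category: str) -> Store:
    """Return the store responsible for a data category."""
    try:
        return ROUTING[category]
    except KeyError:
        raise ValueError(f"no store configured for category {category!r}") from None


target = route("static_fingerprint")
```

Because the text states that the listed data kinds are "not limited thereto", the sketch raises an error for unknown categories rather than guessing a destination; extending the system would mean adding entries to the table.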
Meanwhile, FIG. 2 shows the flow by which the retrieval module performs data query and retrieval through Elasticsearch, which specifically includes the following steps:
step 1, an external query request is provided;
step 2, inquiring whether cached data exist or not through a Redis cache database, and if yes, executing step 5; otherwise, executing step 3;
step 3, submitting the external query request to Elasticsearch for retrieval, Elasticsearch performing word segmentation on the query request statement, then performing fragment retrieval in the indexes of the relational database, the distributed non-relational database, the graph database and the HDFS distributed file system, acquiring structured, unstructured and graph relation data and text information as required, and aggregating the returned retrieval results;
step 4, caching the query result in a Redis cache database while returning the query result, and providing high-efficiency data service for subsequent repeated use of the same data;
step 5, returning the retrieval result.
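Step 3 above combines three operations: word segmentation of the query, retrieval in the index of each store, and aggregation of the returned results. The following Python sketch illustrates those operations with a small in-memory inverted index per store. It does not use or imitate the Elasticsearch API; the `tokenize`, `SourceIndex` and `retrieve` names, the regular-expression tokenizer, the match-count scoring and the sample documents are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    # Simplistic word segmentation: lowercase runs of letters, digits, underscores.
    return re.findall(r"[a-z0-9_]+", text.lower())


class SourceIndex:
    """Tiny inverted index standing in for the search index of one store."""

    def __init__(self, name: str):
        self.name = name
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for token in tokenize(text):
            self._postings[token].add(doc_id)

    def search(self, tokens: List[str]) -> Dict[str, int]:
        # Score each document by the number of distinct query tokens it contains.
        hits: Dict[str, int] = defaultdict(int)
        for token in set(tokens):
            for doc_id in self._postings.get(token, ()):
                hits[doc_id] += 1
        return dict(hits)


def retrieve(query: str, indexes: List[SourceIndex]) -> List[Tuple[str, str, int]]:
    tokens = tokenize(query)
    merged = [
        (index.name, doc_id, score)
        for index in indexes
        for doc_id, score in index.search(tokens).items()
    ]
    # Aggregate: best matches first; ties broken by source, then id, for stable output.
    return sorted(merged, key=lambda hit: (-hit[2], hit[0], hit[1]))


mysql = SourceIndex("mysql")
mysql.add("fn:42", "parse_header function in libfoo")
neo4j = SourceIndex("neo4j")
neo4j.add("edge:7", "parse_header calls read_bytes")
hdfs = SourceIndex("hdfs")
hdfs.add("report:1", "vulnerability report for libfoo parse_header")

results = retrieve("libfoo parse_header", [mysql, neo4j, hdfs])
```

Each result carries the name of the store it came from, so a caller can fetch the full record (structured row, graph relation or text file) from the corresponding backend as required.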
In summary, through the cache module, the retrieval module, the distributed file storage system and the database system, the invention addresses the low efficiency of ingesting and retrieving massive software sample feature data, the difficulty of data storage and management, the poor scalability of data services, and the inability to satisfy real-time data requests from multiple users, and realizes classified storage, efficient management and fast retrieval of software sample data.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (6)
1. The software sample hybrid storage system based on multiple databases and a file system is characterized by comprising a cache module, a retrieval module, a distributed file storage system and a database system, wherein the cache module is connected with the retrieval module, the distributed file storage system is used for storing text information, the database system comprises a relational database, a distributed non-relational database and a graph database which are arranged side by side, the relational database is used for storing structural data, the distributed non-relational database is used for storing fingerprint characteristic information extracted from a software sample, the graph database is used for storing incidence relation data, and the retrieval module is respectively connected with the distributed file storage system, the relational database, the distributed non-relational database and the graph database.
2. The software sample hybrid storage system based on multiple databases and file systems according to claim 1, wherein the distributed file storage system employs an HDFS distributed file system, and the HDFS distributed file system is arranged side by side with the graph database, the distributed non-relational database, and the relational database from right to left in sequence.
3. The software sample hybrid storage system based on multiple databases and file systems according to claim 2, wherein the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism for implementing fast data query of the distributed file storage system and database system.
4. The software sample hybrid storage system based on multiple databases and file systems according to claim 3, wherein the cache module is a Redis cache database for improving retrieval efficiency and reducing response time.
5. The multi-database and file system based software sample hybrid storage system according to claim 4, wherein the structural data comprises source code and function information, the fingerprint feature information comprises dynamic fingerprint, static fingerprint and source code fingerprint information, the association relation data comprises association information, function call relation graph and program control flow graph information, and the text information comprises security analysis report, vulnerability information, security related forum information and security related blog information.
6. The software sample hybrid storage system based on multiple databases and file systems according to claim 5, wherein the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism for implementing fast data query of the distributed file storage system and the database system, which is embodied as:
step 1, an external query request is provided;
step 2, inquiring whether cached data exist or not through a Redis cache database, and if yes, executing step 5; otherwise, executing step 3;
step 3, submitting the external query request to a retrieval module for retrieval, performing word segmentation processing on query request sentences by the retrieval module, then performing fragment retrieval in indexes of a relational database, a distributed non-relational database, a graph database and an HDFS distributed file system, acquiring structured, non-structured, graph relational data and text information as required, and aggregating returned retrieval results by the retrieval module;
step 4, caching the query result in a Redis cache database while returning the query result, and providing high-efficiency data service for subsequent repeated use of the same data;
step 5, returning the retrieval result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010741675.7A CN111858483A (en) | 2020-07-29 | 2020-07-29 | Software sample hybrid storage system based on multiple databases and file systems |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010741675.7A CN111858483A (en) | 2020-07-29 | 2020-07-29 | Software sample hybrid storage system based on multiple databases and file systems |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111858483A (en) | 2020-10-30 |
Family
ID=72948372
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010741675.7A Pending CN111858483A (en) | 2020-07-29 | 2020-07-29 | Software sample hybrid storage system based on multiple databases and file systems |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111858483A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103927360A (en) * | 2014-04-18 | 2014-07-16 | 北京大学 | Software project semantic information presentation and retrieval method based on graph model |
CN106611046A (en) * | 2016-12-16 | 2017-05-03 | 武汉中地数码科技有限公司 | Big data technology-based space data storage processing middleware framework |
CN106708993A (en) * | 2016-12-16 | 2017-05-24 | 武汉中地数码科技有限公司 | Spatial data storage processing middleware framework realization method based on big data technology |
CN109189752A (en) * | 2018-10-12 | 2019-01-11 | 国网山东省电力公司电力科学研究院 | Power marketing knowledge base system based on intelligent Search Technique |
CN109492040A (en) * | 2018-11-06 | 2019-03-19 | 深圳航天智慧城市系统技术研究院有限公司 | A kind of system suitable for data center's magnanimity short message data processing |
CN109783599A (en) * | 2018-12-29 | 2019-05-21 | 北京航天云路有限公司 | Knowledge mapping search method and system based on multi storage |
CN111078765A (en) * | 2019-11-13 | 2020-04-28 | 北京中盾安全技术开发公司 | View base system based on Hadoop system architecture and construction method thereof |
CN111460236A (en) * | 2020-04-26 | 2020-07-28 | 天津七一二通信广播股份有限公司 | Big data acquisition administers quick retrieval system based on data lake |
-
2020
- 2020-07-29 CN CN202010741675.7A patent/CN111858483A/en active Pending
Non-Patent Citations (1)
Title |
---|
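曹祺 (Cao Qi): "Systems Analysis and Design of Library Information Systems in the Big Data Era" (大数据时代图书馆信息系统的系统分析与设计), 31 May 2020, Wuhan University Press (武汉大学出版社), pages: 2 * |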
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113609228A (en) * | 2021-08-13 | 2021-11-05 | 东南数字经济发展研究院 | Exercise health cross-modal data distributed storage and retrieval system |
CN114218234A (en) * | 2022-02-22 | 2022-03-22 | 深圳市一号互联科技有限公司 | Method and system for storing data of native map |
CN114218234B (en) * | 2022-02-22 | 2022-04-29 | 深圳市一号互联科技有限公司 | Raw map data storage method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5814989B2 (en) | Method and system for high performance integration, processing and search of structured and unstructured data using coprocessors | |
US11620397B2 (en) | Methods and apparatus to provide group-based row-level security for big data platforms | |
US7644107B2 (en) | System and method for batched indexing of network documents | |
US8799291B2 (en) | Forensic index method and apparatus by distributed processing | |
EP2605158A1 (en) | Mixed join of row and column database tables in native orientation | |
US8924373B2 (en) | Query plans with parameter markers in place of object identifiers | |
CN102930060B (en) | A kind of method of database quick indexing and device | |
US10860562B1 (en) | Dynamic predicate indexing for data stores | |
KR20060044563A (en) | Method for duplicate detection and suppression | |
CN112000773B (en) | Search engine technology-based data association relation mining method and application | |
US8745062B2 (en) | Systems, methods, and computer program products for fast and scalable proximal search for search queries | |
CN102819592A (en) | Lucene-based desktop searching system and method | |
CN111858483A (en) | Software sample hybrid storage system based on multiple databases and file systems | |
KR101544560B1 (en) | An online analytical processing system for big data by caching the results and generating 2-level queries by SQL parsing | |
US20160171053A1 (en) | Adaptive index leaf block compression | |
JP4109305B1 (en) | Database query processing system using multi-operation processing | |
US8805820B1 (en) | Systems and methods for facilitating searches involving multiple indexes | |
CN115080684B (en) | Network disk document indexing method and device, network disk and storage medium | |
CN108228101B (en) | Method and system for managing data | |
CN116894022A (en) | Improving accuracy and efficiency of database auditing using structured audit logs | |
CN112416626B (en) | Data processing method and device | |
Ragavan et al. | A Novel Big Data Storage Reduction Model for Drill Down Search. | |
Cha | An effective and efficient indexing scheme for audio fingerprinting | |
CN114510605A (en) | Data storage method and device, electronic equipment and storage medium | |
Vishnoi et al. | Novel table based air indexing technique for full text search |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||