CN111858483A

CN111858483A - Software sample hybrid storage system based on multiple databases and file systems

Info

Publication number: CN111858483A
Application number: CN202010741675.7A
Authority: CN
Inventors: 肖哲锋
Original assignee: Hunan Panlian Xin'an Information Technology Co ltd
Current assignee: Hunan Panlian Xin'an Information Technology Co ltd
Priority date: 2020-07-29
Filing date: 2020-07-29
Publication date: 2020-10-30

Abstract

The invention discloses a software sample hybrid storage system based on multiple databases and a file system. The system comprises a cache module, a retrieval module, a distributed file storage system and a database system. The database system comprises a relational database, a distributed non-relational database and a graph database arranged side by side: the relational database stores structured data, the distributed non-relational database stores fingerprint feature information extracted from software samples, and the graph database stores association relationship data. The distributed file storage system stores text information. The cache module is connected to the retrieval module, and the retrieval module is connected to each of the distributed file storage system, the relational database, the distributed non-relational database and the graph database. The invention addresses the low efficiency of ingesting and retrieving feature data for massive software samples, the difficulty of data storage and management, the poor scalability of data services, and the inability to serve real-time data requests from multiple users, and it offers classified storage, efficient management and fast retrieval.

Description

Software sample hybrid storage system based on multiple databases and file systems

Technical Field

The invention relates to the technical field of software storage, and in particular to a software sample hybrid storage system based on multiple databases and file systems.

Background

Massive software samples and their features are the basis of homology analyses such as software piracy detection, malware detection and vulnerability detection. (Software homology analysis asks whether different pieces of software code originate from the same code or were written by the same author or team, and whether they are intrinsically related and similar.) Because software samples and their features have many data attributes and come in many types, they include structured data such as metadata, unstructured data such as attribute values, and graph data such as association relationship data, and they comprise both file-type and database-type data. Existing single-type databases, file systems and limited hybrid storage schemes all suffer from low efficiency in ingesting and retrieving feature data for massive software samples, difficult data storage and management, poor scalability of data services, and an inability to serve real-time data requests from multiple users.

Disclosure of Invention

In view of this, the present invention provides a software sample hybrid storage system based on multiple databases and file systems. It addresses the following problems of the prior art: low efficiency in ingesting and retrieving feature data for massive software samples, difficult data storage and management, poor scalability of data services, and an inability to serve real-time data requests from multiple users.

The software sample hybrid storage system based on multiple databases and file systems comprises a cache module, a retrieval module, a distributed file storage system and a database system. The cache module is connected to the retrieval module. The distributed file storage system is used for storing text information. The database system comprises a relational database, a distributed non-relational database and a graph database arranged side by side: the relational database is used for storing structured data, the distributed non-relational database is used for storing fingerprint feature information extracted from software samples, and the graph database is used for storing association relationship data. The retrieval module is connected to each of the distributed file storage system, the relational database, the distributed non-relational database and the graph database.

Further, the distributed file storage system adopts an HDFS distributed file system, and the HDFS distributed file system is arranged side by side with the graph database, the distributed non-relational database and the relational database from right to left in sequence.

Furthermore, the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism that implements fast data query over the distributed file storage system and the database system.

Further, the cache module is a Redis cache database and is used for improving the retrieval efficiency and reducing the response time.

Further, the structured data includes source code and function information; the fingerprint feature information includes dynamic fingerprints, static fingerprints and source code fingerprint information; the association relationship data includes association information, function call relationship graphs and program control flow graph information; and the text information includes security analysis reports, vulnerability information, security-related forum information and security-related blog information.

Further, the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism that implements fast data query over the distributed file storage system and the database system, specifically as follows:

Step 1: an external query request is received;

Step 2: the Redis cache database is checked for cached data; if cached data exists, step 5 is executed; otherwise, step 3 is executed;

Step 3: the external query request is submitted to the retrieval module, which performs word segmentation on the query statement, then performs sharded retrieval in the indexes of the relational database, the distributed non-relational database, the graph database and the HDFS distributed file system, obtaining structured data, unstructured data, graph relationship data and software sample files as required, and aggregates the returned retrieval results;

Step 4: the query result is cached in the Redis cache database as it is returned, so that subsequent requests for the same data can be served efficiently;

Step 5: the retrieval result is returned.
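
The five steps above follow a cache-first lookup pattern. The following is a minimal Python sketch of that flow, not the patented implementation: a plain dictionary stands in for the Redis cache database, and `QueryService`, `fake_retrieve` and `backend_calls` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List


class QueryService:
    """Cache-first query flow; a dict stands in for the Redis cache database."""

    def __init__(self, retrieve: Callable[[str], List[str]]) -> None:
        self._cache: Dict[str, List[str]] = {}  # assumption: in-memory stand-in for Redis
        self._retrieve = retrieve               # assumption: stand-in for the retrieval module
        self.backend_calls = 0                  # counts how often step 3 actually runs

    def query(self, request: str) -> List[str]:
        # Step 1: the external query request arrives as `request`.
        # Step 2: check the cache; on a hit, jump straight to step 5.
        if request in self._cache:
            return self._cache[request]
        # Step 3: cache miss, so hand the request to the retrieval module.
        self.backend_calls += 1
        result = self._retrieve(request)
        # Step 4: cache the result for later identical requests.
        self._cache[request] = result
        # Step 5: return the retrieval result.
        return result


def fake_retrieve(request: str) -> List[str]:
    # Hypothetical retrieval module that returns a placeholder hit list.
    return [f"hit for {request!r}"]
```

Calling `query` twice with the same request invokes the retrieval stand-in only once; the second call is answered from the cache. Expiry and invalidation of cached entries are not described in the source and are left out of the sketch.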

The invention provides a software sample hybrid storage system based on multiple databases and a file system. The system comprises a cache module, a retrieval module, a distributed file storage system and a database system, and the database system comprises a relational database, a distributed non-relational database and a graph database arranged side by side. The cache module caches data; the retrieval module implements fast data query over the distributed file storage system and the database system; the distributed file storage system stores text information; the relational database stores structured data; the distributed non-relational database stores fingerprint feature information extracted from software samples; and the graph database stores association relationship data. The cache module is connected to the retrieval module, and the retrieval module is connected to each of the distributed file storage system, the relational database, the distributed non-relational database and the graph database. Compared with the prior art, the invention addresses the low efficiency of ingesting and retrieving feature data for massive software samples, the difficulty of data storage and management, the poor scalability of data services, and the inability to serve real-time data requests from multiple users, and it achieves classified storage, efficient management and fast retrieval of software sample data.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a block diagram of a hybrid storage system for software samples based on various databases and file systems according to an embodiment of the present invention;

FIG. 2 is a flowchart of the retrieval module implementing data query retrieval through Elasticsearch according to an embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and the features of the embodiments may be combined with each other provided they do not conflict. The present invention is described in detail below with reference to the embodiments and the attached drawings.

In the present invention, orientations such as "left" and "right" refer to the view shown in FIG. 1.

The invention provides a software sample hybrid storage system based on multiple databases and a file system. The system comprises a cache module, a retrieval module, a distributed file storage system and a database system, and the database system comprises a relational database (MySQL), a distributed non-relational database (Cassandra) and a graph database (Neo4j) arranged side by side. Specifically, the cache module caches data, which improves retrieval efficiency and reduces response time; the retrieval module implements fast data query over the distributed file storage system and the database system; the distributed file storage system stores text information; MySQL stores structured data; Cassandra stores fingerprint feature information extracted from software samples; and Neo4j stores association relationship data. The cache module is connected to the retrieval module for bidirectional data transfer, and the retrieval module is connected to each of the distributed file storage system, MySQL, Cassandra and Neo4j, likewise for bidirectional data transfer. With this arrangement, on the one hand, the respective strengths of the distributed file storage system and of each database can be fully exploited to suit the different application scenarios of the system; on the other hand, software sample data can be managed efficiently and retrieved quickly.

As a preferred embodiment of the present invention, the distributed file storage system adopts an HDFS distributed file system, the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism, and the HDFS distributed file system is arranged side by side with Neo4j, Cassandra and MySQL, in that order from right to left, as shown in FIG. 1.

In a further technical solution, the structured data includes source code and function information; the fingerprint feature information includes dynamic fingerprints, static fingerprints and source code fingerprint information; the association relationship data includes association information, function call relationship graphs and program control flow graph information; and the text information includes security analysis reports, vulnerability information, security-related forum information and security-related blog information, but none of these categories is limited to the items listed. Software sample data is thereby stored by category: source code and function information are stored in MySQL; dynamic fingerprints, static fingerprints and source code fingerprint information are stored in Cassandra; and association information, function call relationship graphs and program control flow graph information are stored in Neo4j.
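
The classified storage described above amounts to a lookup from data category to backing store. A minimal Python sketch under stated assumptions follows: the category keys, the `Store` enum and the `route` function are hypothetical labels chosen for illustration, and the mapping of text information to HDFS follows the preferred embodiment in which the distributed file storage system is HDFS.

```python
from enum import Enum


class Store(Enum):
    MYSQL = "MySQL"          # relational database: structured data
    CASSANDRA = "Cassandra"  # distributed non-relational database: fingerprints
    NEO4J = "Neo4j"          # graph database: association relationship data
    HDFS = "HDFS"            # distributed file storage system: text information


# Hypothetical category labels mapped to the store named in the description.
ROUTING = {
    "source_code": Store.MYSQL,
    "function_info": Store.MYSQL,
    "dynamic_fingerprint": Store.CASSANDRA,
    "static_fingerprint": Store.CASSANDRA,
    "source_code_fingerprint": Store.CASSANDRA,
    "association_info": Store.NEO4J,
    "function_call_graph": Store.NEO4J,
    "control_flow_graph": Store.NEO4J,
    "security_analysis_report": Store.HDFS,
    "vulnerability_info": Store.HDFS,
    "forum_security_info": Store.HDFS,
    "security_blog_info": Store.HDFS,
}


def route(category: str) -> Store:
    """Return the store that holds the given data category."""
    try:
        return ROUTING[category]
    except KeyError:
        raise ValueError(f"no store configured for category {category!r}") from None
```

For example, `route("static_fingerprint")` yields `Store.CASSANDRA`. Because the description says the listed categories are not exhaustive, a table of this kind would need to be extended for any additional data types.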

Referring to FIG. 2, the flow by which the retrieval module implements data query retrieval through Elasticsearch comprises the following steps:

Step 1: an external query request is received;

Step 2: the Redis cache database is checked for cached data; if cached data exists, step 5 is executed; otherwise, step 3 is executed;

Step 3: the external query request is submitted to Elasticsearch, which performs word segmentation on the query statement, then performs sharded retrieval in the indexes of the relational database, the distributed non-relational database, the graph database and the HDFS distributed file system, obtaining structured data, unstructured data, graph relationship data and text information as required, and aggregates the returned retrieval results;

Step 4: the query result is cached in the Redis cache database as it is returned, so that subsequent requests for the same data can be served efficiently;

Step 5: the retrieval result is returned.
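
The retrieval step (segment the query, search the index of each store, aggregate the hits) can be sketched in Python as follows. This is an illustration under stated assumptions, not Elasticsearch itself: whitespace splitting stands in for word segmentation, small in-memory dictionaries stand in for the per-store indexes, shards are not modeled, and `INDEXES`, `segment`, `search_index` and `retrieve` are hypothetical names with made-up sample documents.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory indexes, one per backing store; keys are document ids.
INDEXES: Dict[str, Dict[str, str]] = {
    "mysql": {"fn-1": "function parse_header in libfoo source"},
    "cassandra": {"fp-7": "static fingerprint of libfoo binary"},
    "neo4j": {"cg-3": "call graph edge parse_header to read_block"},
    "hdfs": {"rep-9": "security analysis report for libbar"},
}


def segment(query: str) -> List[str]:
    # Stand-in for word segmentation: lowercase and split on whitespace.
    return query.lower().split()


def search_index(index: Dict[str, str], terms: List[str]) -> List[Tuple[str, int]]:
    # Score each document by the number of query terms it contains.
    hits = []
    for doc_id, text in index.items():
        words = set(text.lower().split())
        score = sum(1 for term in terms if term in words)
        if score:
            hits.append((doc_id, score))
    return hits


def retrieve(query: str) -> List[Tuple[str, str, int]]:
    terms = segment(query)
    merged = []
    # Search every store's index and tag each hit with its source.
    for source, index in INDEXES.items():
        for doc_id, score in search_index(index, terms):
            merged.append((source, doc_id, score))
    # Aggregate: best score first, ties broken by source and id for stable output.
    merged.sort(key=lambda hit: (-hit[2], hit[0], hit[1]))
    return merged
```

With the sample data, `retrieve("libfoo parse_header")` ranks the MySQL document first because it matches both terms, followed by the Cassandra and Neo4j documents with one match each; the HDFS document does not match and is omitted.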

In summary, by means of the cache module, the retrieval module, the distributed file storage system and the database system, the invention addresses the low efficiency of ingesting and retrieving feature data for massive software samples, the difficulty of data storage and management, the poor scalability of data services, and the inability to serve real-time data requests from multiple users, and it achieves classified storage, efficient management and fast retrieval of software sample data.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A software sample hybrid storage system based on multiple databases and a file system, characterized by comprising a cache module, a retrieval module, a distributed file storage system and a database system, wherein the cache module is connected with the retrieval module, the distributed file storage system is used for storing text information, the database system comprises a relational database, a distributed non-relational database and a graph database which are arranged side by side, the relational database is used for storing structured data, the distributed non-relational database is used for storing fingerprint feature information extracted from software samples, the graph database is used for storing association relationship data, and the retrieval module is respectively connected with the distributed file storage system, the relational database, the distributed non-relational database and the graph database.

2. The software sample hybrid storage system based on multiple databases and file systems according to claim 1, wherein the distributed file storage system employs an HDFS distributed file system, and the HDFS distributed file system is arranged side by side with the graph database, the distributed non-relational database, and the relational database from right to left in sequence.

3. The software sample hybrid storage system based on multiple databases and file systems according to claim 2, wherein the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism for implementing fast data query of the distributed file storage system and the database system.

4. The software sample hybrid storage system based on multiple databases and file systems according to claim 3, wherein the cache module is a Redis cache database for improving retrieval efficiency and reducing response time.

5. The software sample hybrid storage system based on multiple databases and file systems according to claim 4, wherein the structured data comprises source code and function information, the fingerprint feature information comprises dynamic fingerprints, static fingerprints and source code fingerprint information, the association relationship data comprises association information, function call relationship graphs and program control flow graph information, and the text information comprises security analysis reports, vulnerability information, security-related forum information and security-related blog information.

6. The software sample hybrid storage system based on multiple databases and file systems according to claim 5, wherein the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism for implementing fast data query of the distributed file storage system and the database system, which is embodied as:

step 1, an external query request is received;

step 2, the Redis cache database is checked for cached data; if cached data exists, step 5 is executed; otherwise, step 3 is executed;

step 3, the external query request is submitted to the retrieval module, which performs word segmentation on the query statement, then performs sharded retrieval in the indexes of the relational database, the distributed non-relational database, the graph database and the HDFS distributed file system, obtaining structured data, unstructured data, graph relationship data and text information as required, and aggregates the returned retrieval results;

step 4, the query result is cached in the Redis cache database as it is returned, so that subsequent requests for the same data can be served efficiently;

step 5, the retrieval result is returned.