CN111858483A - Software sample hybrid storage system based on multiple databases and file systems - Google Patents

Info

Publication number
CN111858483A
Authority
CN
China
Prior art keywords
database
storage system
retrieval
data
relational database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010741675.7A
Other languages
Chinese (zh)
Inventor
肖哲锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Panlian Xin'an Information Technology Co ltd
Original Assignee
Hunan Panlian Xin'an Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Panlian Xin'an Information Technology Co ltd
Priority to CN202010741675.7A
Publication of CN111858483A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a software sample hybrid storage system based on multiple databases and a file system. The system comprises a cache module, a retrieval module, a distributed file storage system and a database system. The database system comprises a relational database, a distributed non-relational database and a graph database arranged side by side: the relational database stores structured data, the distributed non-relational database stores fingerprint feature information extracted from software samples, and the graph database stores association relationship data. The distributed file storage system stores text information. The cache module is connected with the retrieval module, and the retrieval module is connected with each of the distributed file storage system, the relational database, the distributed non-relational database and the graph database. The invention addresses the low efficiency of storing and retrieving feature data for massive numbers of software samples, the difficulty of data storage and management, the poor scalability of data services, and the inability to serve real-time data requests from multiple users, and it offers classified storage, efficient management and fast retrieval.

Description

Software sample hybrid storage system based on multiple databases and file systems
Technical Field
The invention relates to the technical field of software storage, and in particular to a software sample hybrid storage system based on multiple databases and file systems.
Background
Massive collections of software samples and their features are the basis of homology analysis tasks such as software piracy detection, malware detection and vulnerability detection. (Software homology analysis asks whether different pieces of software code derive from the same code or were written by the same author or team, and whether they are internally related and similar.) Software samples and their features have many data attributes and come in many types: structured data such as metadata, unstructured data such as attribute values, and graph data such as association relationship data, held both as files and as database records. Existing storage schemes based on a single type of database, on a file system, or on a limited mix of the two all suffer from low efficiency in storing and retrieving feature data for massive numbers of software samples, difficult data storage and management, and poor scalability of data services, and they cannot serve real-time data requests from multiple users.
Disclosure of Invention
In view of this, the present invention provides a software sample hybrid storage system based on multiple databases and file systems, which addresses the following problems of the prior art: low efficiency in storing and retrieving feature data for massive numbers of software samples, difficult data storage and management, poor scalability of data services, and the inability to serve real-time data requests from multiple users.
The software sample hybrid storage system based on multiple databases and file systems comprises a cache module, a retrieval module, a distributed file storage system and a database system. The cache module is connected with the retrieval module. The distributed file storage system is used for storing text information. The database system comprises a relational database, a distributed non-relational database and a graph database arranged side by side: the relational database is used for storing structured data, the distributed non-relational database is used for storing fingerprint feature information extracted from software samples, and the graph database is used for storing association relationship data. The retrieval module is connected with each of the distributed file storage system, the relational database, the distributed non-relational database and the graph database.
Further, the distributed file storage system adopts an HDFS distributed file system, and the HDFS distributed file system is arranged side by side with the graph database, the distributed non-relational database and the relational database from right to left in sequence.
Furthermore, the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism that enables fast data queries against the distributed file storage system and the database system.
Further, the cache module is a Redis cache database and is used for improving the retrieval efficiency and reducing the response time.
Further, the structured data includes source code and function information; the fingerprint feature information includes dynamic fingerprint, static fingerprint and source code fingerprint information; the association relationship data includes association information, function call relation graph and program control flow graph information; and the text information includes security analysis reports, vulnerability information, security-related forum information and security-related blog information.
Further, the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism that enables fast data queries against the distributed file storage system and the database system, specifically as follows:
step 1, an external query request is received;
step 2, the Redis cache database is queried for cached data; if cached data exists, step 5 is executed; otherwise, step 3 is executed;
step 3, the external query request is submitted to the retrieval module, which performs word segmentation on the query statement, then performs sharded retrieval in the indexes of the relational database, the distributed non-relational database, the graph database and the HDFS distributed file system, obtains structured data, unstructured data, graph relationship data and software sample files as required, and aggregates the returned retrieval results;
step 4, the query result is cached in the Redis cache database at the same time as it is returned, so that subsequent requests for the same data are served efficiently;
step 5, the retrieval result is returned.
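The five steps above follow a cache-aside pattern, which can be sketched as below. This is a minimal illustration under stated assumptions, not the patent's implementation: a plain dict stands in for the Redis cache database, `search_backend` is a hypothetical callable standing in for the retrieval module, and the key scheme in `cache_key` is an assumption that does not appear in the source.

```python
import hashlib
import json
from typing import Callable, Dict, List

# Hypothetical stand-ins: a dict plays the role of the Redis cache database,
# and a callable plays the role of the retrieval module.
Cache = Dict[str, str]
SearchFn = Callable[[str], List[dict]]


def cache_key(query: str) -> str:
    """Derive a stable key from the query text (assumed scheme, not from the source)."""
    normalized = " ".join(query.lower().split())
    return "query:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def handle_query(query: str, cache: Cache, search_backend: SearchFn) -> List[dict]:
    # Step 1: an external query request arrives as `query`.
    key = cache_key(query)

    # Step 2: check the cache; on a hit, go straight to step 5.
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # Step 3: on a miss, hand the request to the retrieval module.
    results = search_backend(query)

    # Step 4: cache the result so repeated requests skip the retrieval module.
    cache[key] = json.dumps(results)

    # Step 5: return the retrieval result.
    return results
```

In this sketch a second request with the same normalized text never reaches `search_backend`, which is the effect the cache module is described as providing. A real deployment would also need an expiry or invalidation policy, which the source does not specify.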
The invention provides a software sample hybrid storage system based on multiple databases and a file system, which comprises a cache module, a retrieval module, a distributed file storage system and a database system. The database system comprises a relational database, a distributed non-relational database and a graph database arranged side by side. The cache module is used for storing data; the retrieval module enables fast data queries against the distributed file storage system and the database system; the distributed file storage system is used for storing text information; the relational database is used for storing structured data; the distributed non-relational database is used for storing fingerprint feature information extracted from software samples; and the graph database is used for storing association relationship data. The cache module is connected with the retrieval module, and the retrieval module is connected with each of the distributed file storage system, the relational database, the distributed non-relational database and the graph database. Compared with the prior art, the invention addresses the low efficiency of storing and retrieving feature data for massive numbers of software samples, the difficulty of data storage and management, the poor scalability of data services, and the inability to serve real-time data requests from multiple users, and it achieves classified storage, efficient management and fast retrieval of software sample data.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a block diagram of a hybrid storage system for software samples based on various databases and file systems according to an embodiment of the present invention;
FIG. 2 is a flowchart of the retrieval module performing data query and retrieval through Elasticsearch according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In the present invention, the orientations such as "left" and "right" are used with reference to the view shown in fig. 1.
The invention provides a software sample hybrid storage system based on multiple databases and a file system, which comprises a cache module, a retrieval module, a distributed file storage system and a database system. The database system comprises a relational database (MySQL), a distributed non-relational database (Cassandra) and a graph database (Neo4j) arranged side by side. Specifically, the cache module stores data, which improves retrieval efficiency and reduces response time; the retrieval module enables fast data queries against the distributed file storage system and the database system; the distributed file storage system stores text information; MySQL stores structured data; Cassandra stores fingerprint feature information extracted from software samples; and Neo4j stores association relationship data. The cache module is connected with the retrieval module for bidirectional data transfer, and the retrieval module is connected with each of the distributed file storage system, MySQL, Cassandra and Neo4j, likewise for bidirectional data transfer. With this arrangement, on the one hand, the respective strengths of the distributed file storage system and of each database can be fully exploited to suit the different application scenarios of the system; on the other hand, software sample data can be managed efficiently and retrieved quickly.
As a preferred embodiment of the present invention, the distributed file storage system adopts an HDFS distributed file system, the retrieval module uses an Elasticsearch to establish an efficient retrieval mechanism, and the HDFS distributed file system is arranged side by side with Neo4j, Cassandra, and MySQL from right to left in sequence, specifically referring to fig. 1.
In a further technical solution, the structured data includes source code and function information; the fingerprint feature information includes dynamic fingerprint, static fingerprint and source code fingerprint information; the association relationship data includes association information, function call relation graph and program control flow graph information; and the text information includes security analysis reports, vulnerability information, security-related forum information and security-related blog information. These lists are not limiting. Software sample data is thus stored by category: source code and function information are stored in MySQL; dynamic fingerprint, static fingerprint and source code fingerprint information are stored in Cassandra; and association information, function call relation graphs and program control flow graph information are stored in Neo4j.
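The classified storage described above amounts to a lookup from data category to backing store, which can be sketched as a routing table. The category labels (`source_code`, `static_fingerprint`, and so on) and the `route` function are hypothetical names chosen for illustration; the category-to-store assignments follow the description, with text information sent to HDFS as in the preferred embodiment.

```python
from enum import Enum


class Store(Enum):
    MYSQL = "MySQL"          # relational database: structured data
    CASSANDRA = "Cassandra"  # distributed non-relational database: fingerprints
    NEO4J = "Neo4j"          # graph database: association relationship data
    HDFS = "HDFS"            # distributed file system: text information


# Hypothetical category labels mapped to the stores named in the description.
ROUTING = {
    "source_code": Store.MYSQL,
    "function_info": Store.MYSQL,
    "dynamic_fingerprint": Store.CASSANDRA,
    "static_fingerprint": Store.CASSANDRA,
    "source_code_fingerprint": Store.CASSANDRA,
    "association_info": Store.NEO4J,
    "function_call_graph": Store.NEO4J,
    "control_flow_graph": Store.NEO4J,
    "security_analysis_report": Store.HDFS,
    "vulnerability_info": Store.HDFS,
    "forum_security_info": Store.HDFS,
    "blog_security_info": Store.HDFS,
}


def route(category: str) -> Store:
    """Return the store responsible for a data category, or fail loudly."""
    try:
        return ROUTING[category]
    except KeyError:
        raise ValueError(f"unknown data category: {category!r}") from None
```

Keeping the mapping in one table means that adding a new category is a one-line change, which fits the statement that the lists are not limiting.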
Meanwhile, FIG. 2 shows the flow by which the retrieval module performs data query and retrieval through Elasticsearch, which comprises the following steps:
step 1, an external query request is received;
step 2, the Redis cache database is queried for cached data; if cached data exists, step 5 is executed; otherwise, step 3 is executed;
step 3, the external query request is submitted to Elasticsearch, which performs word segmentation on the query statement, then performs sharded retrieval in the indexes of the relational database, the distributed non-relational database, the graph database and the HDFS distributed file system, obtains structured data, unstructured data, graph relationship data and text information as required, and aggregates the returned retrieval results;
step 4, the query result is cached in the Redis cache database at the same time as it is returned, so that subsequent requests for the same data are served efficiently;
step 5, the retrieval result is returned.
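Step 3 (word segmentation, retrieval in one index per data source, aggregation) can be illustrated with a toy in-memory inverted index. This sketch does not use the Elasticsearch API and ignores sharding and relevance scoring; `ToyIndex`, `tokenize`, `retrieve` and the index names in `SOURCES` are all hypothetical, and the tokenizer is a simple split on non-alphanumeric characters rather than a real analyzer.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical index names, one per data source named in step 3.
SOURCES = ("mysql", "cassandra", "neo4j", "hdfs")


def tokenize(text: str) -> List[str]:
    """Toy word segmentation: lowercase, split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


class ToyIndex:
    """A minimal inverted index mapping each token to a set of document ids."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, tokens: List[str]) -> Set[str]:
        """Return ids of documents that contain every query token."""
        if not tokens:
            return set()
        sets = [self.postings.get(t, set()) for t in tokens]
        return set.intersection(*sets)


def retrieve(query: str, indexes: Dict[str, ToyIndex]) -> List[dict]:
    """Segment the query, search each source's index, and aggregate the hits."""
    tokens = tokenize(query)
    aggregated: List[dict] = []
    for source in SOURCES:
        index = indexes.get(source)
        if index is None:
            continue
        for doc_id in sorted(index.search(tokens)):
            aggregated.append({"source": source, "id": doc_id})
    return aggregated
```

Each hit carries its source so that a caller could then fetch the full record from the corresponding store; how the actual system resolves hits back to MySQL, Cassandra, Neo4j or HDFS is not detailed in the source.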
In summary, through the cache module, the retrieval module, the distributed file storage system and the database system, the invention addresses the low efficiency of storing and retrieving feature data for massive numbers of software samples, the difficulty of data storage and management, the poor scalability of data services, and the inability to serve real-time data requests from multiple users, and it achieves classified storage, efficient management and fast retrieval of software sample data.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. A software sample hybrid storage system based on multiple databases and a file system, characterized by comprising a cache module, a retrieval module, a distributed file storage system and a database system, wherein the cache module is connected with the retrieval module, the distributed file storage system is used for storing text information, the database system comprises a relational database, a distributed non-relational database and a graph database which are arranged side by side, the relational database is used for storing structured data, the distributed non-relational database is used for storing fingerprint feature information extracted from software samples, the graph database is used for storing association relationship data, and the retrieval module is connected with each of the distributed file storage system, the relational database, the distributed non-relational database and the graph database.
2. The software sample hybrid storage system based on multiple databases and file systems according to claim 1, wherein the distributed file storage system employs an HDFS distributed file system, and the HDFS distributed file system is arranged side by side with the graph database, the distributed non-relational database, and the relational database from right to left in sequence.
3. The software sample hybrid storage system based on multiple databases and file systems according to claim 2, wherein the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism for implementing fast data query of the distributed file storage system and the database system.
4. The software sample hybrid storage system based on multiple databases and file systems according to claim 3, wherein the cache module is a Redis cache database for improving retrieval efficiency and reducing response time.
5. The software sample hybrid storage system based on multiple databases and file systems according to claim 4, wherein the structured data comprises source code and function information, the fingerprint feature information comprises dynamic fingerprint, static fingerprint and source code fingerprint information, the association relationship data comprises association information, function call relation graph and program control flow graph information, and the text information comprises security analysis reports, vulnerability information, security-related forum information and security-related blog information.
6. The software sample hybrid storage system based on multiple databases and file systems according to claim 5, wherein the retrieval module uses Elasticsearch to establish an efficient retrieval mechanism for implementing fast data query of the distributed file storage system and the database system, which is embodied as:
step 1, an external query request is received;
step 2, the Redis cache database is queried for cached data; if cached data exists, step 5 is executed; otherwise, step 3 is executed;
step 3, the external query request is submitted to the retrieval module, which performs word segmentation on the query statement, then performs sharded retrieval in the indexes of the relational database, the distributed non-relational database, the graph database and the HDFS distributed file system, obtains structured data, unstructured data, graph relationship data and text information as required, and aggregates the returned retrieval results;
step 4, the query result is cached in the Redis cache database at the same time as it is returned, so that subsequent requests for the same data are served efficiently;
step 5, the retrieval result is returned.
CN202010741675.7A 2020-07-29 2020-07-29 Software sample hybrid storage system based on multiple databases and file systems Pending CN111858483A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010741675.7A CN111858483A (en) 2020-07-29 2020-07-29 Software sample hybrid storage system based on multiple databases and file systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010741675.7A CN111858483A (en) 2020-07-29 2020-07-29 Software sample hybrid storage system based on multiple databases and file systems

Publications (1)

Publication Number Publication Date
CN111858483A true CN111858483A (en) 2020-10-30

Family

ID=72948372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010741675.7A Pending CN111858483A (en) 2020-07-29 2020-07-29 Software sample hybrid storage system based on multiple databases and file systems

Country Status (1)

Country Link
CN (1) CN111858483A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609228A (en) * 2021-08-13 2021-11-05 东南数字经济发展研究院 Exercise health cross-modal data distributed storage and retrieval system
CN114218234A (en) * 2022-02-22 2022-03-22 深圳市一号互联科技有限公司 Method and system for storing data of native map

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927360A (en) * 2014-04-18 2014-07-16 北京大学 Software project semantic information presentation and retrieval method based on graph model
CN106611046A (en) * 2016-12-16 2017-05-03 武汉中地数码科技有限公司 Big data technology-based space data storage processing middleware framework
CN106708993A (en) * 2016-12-16 2017-05-24 武汉中地数码科技有限公司 Spatial data storage processing middleware framework realization method based on big data technology
CN109189752A (en) * 2018-10-12 2019-01-11 国网山东省电力公司电力科学研究院 Power marketing knowledge base system based on intelligent Search Technique
CN109492040A (en) * 2018-11-06 2019-03-19 深圳航天智慧城市系统技术研究院有限公司 A kind of system suitable for data center's magnanimity short message data processing
CN109783599A (en) * 2018-12-29 2019-05-21 北京航天云路有限公司 Knowledge mapping search method and system based on multi storage
CN111078765A (en) * 2019-11-13 2020-04-28 北京中盾安全技术开发公司 View base system based on Hadoop system architecture and construction method thereof
CN111460236A (en) * 2020-04-26 2020-07-28 天津七一二通信广播股份有限公司 Big data acquisition administers quick retrieval system based on data lake

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609228A (en) * 2021-08-13 2021-11-05 东南数字经济发展研究院 Exercise health cross-modal data distributed storage and retrieval system
CN114218234A (en) * 2022-02-22 2022-03-22 深圳市一号互联科技有限公司 Method and system for storing data of native map
CN114218234B (en) * 2022-02-22 2022-04-29 深圳市一号互联科技有限公司 Raw map data storage method

Similar Documents

Publication Publication Date Title
JP5814989B2 (en) Method and system for high performance integration, processing and search of structured and unstructured data using coprocessors
US11620397B2 (en) Methods and apparatus to provide group-based row-level security for big data platforms
US7644107B2 (en) System and method for batched indexing of network documents
US8799291B2 (en) Forensic index method and apparatus by distributed processing
EP2605158A1 (en) Mixed join of row and column database tables in native orientation
US8924373B2 (en) Query plans with parameter markers in place of object identifiers
EP1585073A1 (en) Method for duplicate detection and suppression
CN102930060B (en) A kind of method of database quick indexing and device
US20110072008A1 (en) Query Optimization with Awareness of Limited Resource Usage
US10860562B1 (en) Dynamic predicate indexing for data stores
US8745062B2 (en) Systems, methods, and computer program products for fast and scalable proximal search for search queries
CN102819592A (en) Lucene-based desktop searching system and method
CN111858483A (en) Software sample hybrid storage system based on multiple databases and file systems
KR101544560B1 (en) An online analytical processing system for big data by caching the results and generating 2-level queries by SQL parsing
JP4109305B1 (en) Database query processing system using multi-operation processing
US20160004749A1 (en) Search system and search method
US10366067B2 (en) Adaptive index leaf block compression
US8805820B1 (en) Systems and methods for facilitating searches involving multiple indexes
CN115080684B (en) Network disk document indexing method and device, network disk and storage medium
CN108228101B (en) Method and system for managing data
CN112416626B (en) Data processing method and device
Ragavan et al. A Novel Big Data Storage Reduction Model for Drill Down Search.
Cha An effective and efficient indexing scheme for audio fingerprinting
US11275786B2 (en) Implementing enhanced DevOps process for cognitive search solutions
CN108984720B (en) Data query method and device based on column storage, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination