CN113190548A

CN113190548A - HBASE-based archive library design method

Info

Publication number: CN113190548A
Application number: CN202011553605.5A
Authority: CN
Inventors: 任伟
Original assignee: Wuhan Fiberhome Digtal Technology Co Ltd
Current assignee: Wuhan Fiberhome Digtal Technology Co Ltd
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2021-07-30

Abstract

An HBASE-based archive library design method comprises the following steps: classifying and merging the original data sources according to the entity characteristics of the archive library and actual service requirements, and extracting characteristic data items, so that the original data sources are classified and stored; defining the data index of each type of entity table through the row keys of the HBASE wide table, according to the data characteristics and the storage requirements of the HBASE wide table, so that keywords are indexed; and, in cluster mode, querying data through the row keys designed from the keyword index and quickly returning the required data columns by filtering on column keys, so that queries are efficient. Compared with conventional archive data stored in a relational database, this technology can store PB-level data, respond to retrieval requests with adequate performance, and scale horizontally by combining inexpensive machines, which makes it suitable for managing and storing large amounts of data. The invention gathers scattered data into archives, which improves convenience for the user.

Description

HBASE-based archive library design method

Technical Field

The invention relates to the field of databases, in particular to a method and a system for designing an archive library based on HBASE.

Background

As informatization construction in the security field continues to intensify, the security department stated in a speech at the 2018 national meeting of security chiefs that it would vigorously implement the strong-police strategy and the security big data strategy, deeply promote safe city construction and rule-of-law security construction, and comprehensively modernize security work. Demand for informatization in the field has increased accordingly, and data storage and management are essential infrastructure work for making efficient use of information technology.

Data sets in the security field come from many source channels and are highly diverse, and structured data coexists with unstructured or semi-structured data, for example population collection, civil affairs, case information, vehicle management, travel, accommodation, entertainment venues, public facilities, enterprises, and talent data. By applying data analysis and data mining to this information, detailed archives of individual entities are linked together through simple keyword information such as identity numbers and license plate numbers, and track clues and association relationships are in turn linked through the archive information, providing data and technical support for security agencies.

HBASE is one of the solutions for storing and processing PB-level and even EB-level data at low cost and with high efficiency. It relies on the HADOOP ecosystem, namely the MapReduce computing engine, the HDFS file system, and a BigTable-style storage system, and serves as a pioneer and leader among big data processing technologies.

Disclosure of Invention

In view of the above, the present invention has been made to provide a method for designing an HBASE-based archive that overcomes or at least partially solves the above-mentioned problems.

In order to solve the technical problem, the embodiment of the application discloses the following technical scheme:

A design method of an HBASE-based archive library comprises the following steps:

S100, classifying and merging the original data sources according to the entity characteristics of the archive library and actual service requirements, and extracting characteristic data items, so that the original data sources are classified and stored;

S200, defining the data index of each type of entity table through the row keys of the HBASE wide table, according to the data characteristics and the storage requirements of the HBASE wide table, so that keywords are indexed;

S300, in cluster mode, querying data through the row keys designed from the keyword index, and quickly returning the required data columns by filtering on column keys, so that queries are efficient.

Further, in S100, the archive library is classified into: personnel archives, vehicle archives, and case archives.

Further, the personnel archive comprises at least: personal basic information, whether the person belongs to a high-risk group, whether the person belongs to a special group, social behavior information, personnel activity track information, and personnel intimacy information.

Further, the vehicle archive comprises at least: vehicle basic information, vehicle activity track information, vehicle case involvement record information, and vehicle behavior record information.

Further, the case archive comprises at least: basic information: case serial number, case name, case summary, case time, case unit, reason, time, and clues; case-related information: persons involved in the case, units involved in the case, articles involved in the case, and the like; and case handling process: documents and legal documents.

Further, in S100, each data set includes a feature value, a keyword, and an update time; the keyword is a unique value, and the update time is the data acquisition time.

Further, in S200, the row key design rule is determined by the keyword. The keyword may be a string of digits or a combination of characters that forms a unique value with practical meaning, or it may be computed by combining certain strings.

Further, the keyword is reversed and placed at the front of the row key, which ensures that all HBASE regions have an even chance of receiving data.

Further, in S200, when the string is a timestamp, because rows are sorted and stored in natural order, the difference between the maximum sequence value and the timestamp is calculated and stored, so that the latest data ranks at the top.

Further, in S300, efficient querying is performed by the MapReduce (MR) computing engine.

The technical solution provided by the embodiments of the invention has at least the following beneficial effects:

The invention discloses a design method of an HBASE-based archive library. The original data sources are classified and merged according to the entity characteristics of the archive library and actual service requirements, and characteristic data items are extracted, so that the original data sources are classified and stored. According to the data characteristics and the storage requirements of the HBASE wide table, the row keys of the HBASE wide table define the data index of each type of entity table, so that keywords are indexed. In cluster mode, data is queried through the row keys designed from the keyword index, and the required data columns are quickly returned by filtering on column keys, so that queries are efficient. Compared with conventional archive data stored in a relational database, the BigTable storage model adopted by the invention can store data at the PB level and above, respond to retrieval requests with adequate performance, and scale horizontally by combining inexpensive machines, which makes it well suited to managing and storing large amounts of data. The design matches the requirements of the stored data structure, so that scattered and complicated data is gathered into archives, which improves convenience for the user.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of an HBASE-based archive library design method in embodiment 1 of the present invention;

FIG. 2 is a schematic diagram illustrating the classification of archive information in embodiment 1 of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

In order to solve the problems in the prior art, embodiments of the present invention provide a method for designing an archive based on HBASE.

Example 1

The embodiment discloses a design method of an archive library based on HBASE, as shown in FIG. 1, which includes:

S100, classifying and merging the original data sources according to the entity characteristics of the archive library and actual service requirements, and extracting characteristic data items, so that the original data sources are classified and stored.

In this embodiment, each data set includes a feature value, a keyword, and an update time; the keyword is a unique value, and the update time is the data acquisition time.
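The shape of one such data set entry can be sketched as a small record type. This is a minimal illustration only: the class name `ArchiveRecord`, its field names, and the sample values are assumptions made for the example and do not appear in the source.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ArchiveRecord:
    """One extracted data item: a unique keyword, feature values, and an update time."""

    keyword: str           # unique value, e.g. an identity number or a license plate number
    features: dict         # extracted characteristic data items (field name -> value)
    update_time: datetime  # data acquisition time

    def __post_init__(self):
        # The text requires the keyword to be a unique value, so an empty one is rejected.
        if not self.keyword:
            raise ValueError("keyword must be a non-empty unique value")


# Hypothetical sample entry.
record = ArchiveRecord(
    keyword="ID-0001",
    features={"name": "example"},
    update_time=datetime(2020, 12, 24, tzinfo=timezone.utc),
)
```

Uniqueness across records is not enforced by the type itself; in the design described here it would follow from using the keyword as the leading part of the row key.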

In this embodiment, based on the large number of data sources and the earlier design requirements, the archive information is divided into three categories: a personnel archive, a vehicle archive, and a case archive, as shown in FIG. 2.

Specifically, the personnel archive clearly reflects the basic attributes, social activity characteristics, and travel information of a person, so that relevant staff can use it as a basis for investigating a target.

The personnel-class archive library is designed to be divided into: basic information, certificates (passports, driver's licenses, property ownership certificates, Hong Kong and Macao travel permits), passport visas, passport endorsements, school enrollment records, parking information, work units, mobile phone number lists, religious beliefs, social insurance, members, three persons, household registration change records, marital information, disability records, security records, management and control records, drug-related information clues, compulsory drug rehabilitation, reimbursement records, suspected reimbursement records, blacklists, and help edge personnel. Data collection favors data relevant to the business model.

The common characteristic of these data sets is the citizen identity number; the citizen identity number is used as the keyword in classified storage, and the acquisition time or warehousing time of each type of data is used as the timestamp.

The vehicle archive clearly reflects the basic attributes of a vehicle, such as brand, place of production, parameters, transaction records, violations, and driving track, so that relevant staff can use it as a basis for investigating a target.

The vehicle archive includes basic information: vehicle brand, vehicle type, vehicle color, manufacturer, frame number, license plate number, and the like; activity track: track information from checkpoints, parking lots, and the like; case involvement records: robbed or stolen; traffic violations: violations of laws and regulations, enforcement, and the like; and behavior records: rental operation, maintenance, and the like.

The vehicle-class archive library is designed to be divided into: basic information (vehicle, owner), case involvement records (robbery, etc.), violations of laws and regulations, taxi, and 4S maintenance records.

The common characteristic of these data sets is the license plate number (or the frame number); the license plate number is used as the keyword in classified storage, and the acquisition time or warehousing time of each type of data is used as the timestamp.

The case archive clearly reflects the basic information of a case, the persons involved, the units involved, the articles involved, and the like, so that relevant staff can use it as a basis for investigating a target.

The case archive includes basic information: case serial number, case name, case summary, case time, case unit, reason, time, and clues; case-related information: persons involved in the case, units involved in the case, articles involved in the case, and the like; and case handling process: documents, legal documents, and the like.

The case-class archive library is designed to be divided into: basic information, inquiry records, case-related persons, case-related articles, case-related units, legal documents, and cases with broken tapes.

The common characteristic of these data sets is the case number (or the alarm number); the case number is used as the keyword in classified storage, and the acquisition time or warehousing time of each type of data is used as the timestamp.
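The classification in S100 can be sketched as a lookup from data source to archive class, followed by grouping on that class's keyword. The three keywords (citizen identity number, license plate number, case number) come from the text above; the source names, the field names, and the function `group_by_archive` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Keyword field used by each archive class, as described in the text.
KEYWORD_FIELD = {
    "personnel": "citizen_id_number",
    "vehicle": "license_plate_number",
    "case": "case_number",
}

# Hypothetical assignment of original data sources to archive classes.
SOURCE_TO_CLASS = {
    "household_change_records": "personnel",
    "maintenance_records": "vehicle",
    "legal_documents": "case",
}


def group_by_archive(rows):
    """Group (source_name, record) pairs as {archive_class: {keyword: [records]}}."""
    archives = defaultdict(lambda: defaultdict(list))
    for source_name, record in rows:
        archive_class = SOURCE_TO_CLASS[source_name]
        keyword = record[KEYWORD_FIELD[archive_class]]
        archives[archive_class][keyword].append(record)
    return archives


# Hypothetical sample rows from three different sources.
rows = [
    ("household_change_records", {"citizen_id_number": "P1", "change": "moved"}),
    ("maintenance_records", {"license_plate_number": "V1", "service": "oil"}),
    ("maintenance_records", {"license_plate_number": "V1", "service": "tires"}),
    ("legal_documents", {"case_number": "C1", "title": "notice"}),
]
archives = group_by_archive(rows)
```

Records from different sources that share a keyword end up under the same archive entry, which is the linking effect the classification is meant to achieve.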

S200, according to the data characteristics and the storage requirements of the HBASE wide table, the row keys of the HBASE wide table define the data indexes of each type of entity table, and the purpose of indexing key words is achieved.

In this embodiment, the design principle of the row key is closely tied to the business model and basically follows these rules. The first string is the keyword, or unique value: usually a string of digits or a combination of characters that forms a unique value with practical meaning, although it may also be computed by combining certain strings. The keyword is reversed and placed at the front of the row key, which ensures that all HBASE regions have an even chance of receiving data. Another special string is the timestamp: because HBASE stores rows sorted in natural order, the difference between the maximum sequence value and the timestamp is stored, so that the latest data naturally ranks at the top. The last string set consists of fields derived or determined from the business; it differs between business models, and its value is chosen according to actual requirements.
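The effect of reversing the keyword can be illustrated with a small experiment. HBASE regions each cover a contiguous range of sorted row keys, so keys that share a long common prefix tend to land in the same region. The sketch below uses the leading character of each key as a rough proxy for region assignment; that proxy, the sample keywords, and the helper name `leading_char_histogram` are assumptions for illustration, and real distribution depends on the table's actual split points.

```python
from collections import Counter


def leading_char_histogram(keys):
    """Count how many keys start with each character."""
    return Counter(key[0] for key in keys)


# Hypothetical keywords that share a common prefix and differ only in trailing digits.
keywords = [f"420100{n:04d}" for n in range(100)]

# Without reversal, every key starts with the same character.
plain = leading_char_histogram(keywords)

# With reversal, the varying trailing digit becomes the leading character.
reversed_first = leading_char_histogram(k[::-1] for k in keywords)
```

Here `plain` contains a single bucket holding all 100 keys, while `reversed_first` spreads them over ten buckets of ten keys each.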

Specifically, the keyword index may be designed by referring to the following formula:

String rowKey = REVERSE(keyword) + DELIMITER + otherSequence + DELIMITER + (Long.MAX_VALUE - TIMESTAMP(time))
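The formula above can be sketched in a few lines. This is an illustrative reading, not a definitive implementation: the delimiter character, the millisecond resolution of the timestamp, and the zero-padding of the inverted timestamp are assumptions that the source does not specify. Padding is added so that comparing the keys as strings agrees with comparing the numbers.

```python
from datetime import datetime, timezone

LONG_MAX = 2**63 - 1  # maximum value of a signed 64-bit long
DELIMITER = "|"       # assumption: the source does not name the delimiter


def build_row_key(keyword: str, other_sequence: str, time: datetime) -> str:
    """Build REVERSE(keyword) + DELIMITER + otherSequence + DELIMITER + (LONG_MAX - timestamp)."""
    timestamp_ms = int(time.timestamp() * 1000)  # assumption: epoch milliseconds
    inverted = LONG_MAX - timestamp_ms
    # Zero-pad to 19 digits so that string order matches numeric order.
    return DELIMITER.join([keyword[::-1], other_sequence, f"{inverted:019d}"])


# Hypothetical keyword and business field.
older = build_row_key("ABC123", "basic", datetime(2020, 1, 1, tzinfo=timezone.utc))
newer = build_row_key("ABC123", "basic", datetime(2021, 1, 1, tzinfo=timezone.utc))
latest_first = sorted([older, newer])
```

Because the stored value shrinks as the timestamp grows, `latest_first[0]` is the key of the newer record, so an ascending scan over one keyword returns the most recent data first.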

The column family parameters may follow this reference configuration:

{NAME => 'D', BLOOMFILTER => 'ROW', VERSIONS => '3', COMPRESSION => 'SNAPPY'}

That is, the column family is named D, the Bloom filter is set to row level, 3 versions are retained, the maximum expiration time is permanent (or a retention time is set as required), and the compression algorithm is SNAPPY.
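For illustration, the column family parameters can be held as a mapping and rendered into an HBASE shell `create` statement. The attribute names and values are those given in the text; the table name `person_archive` and the helper `render_create_statement` are hypothetical.

```python
# Column family attributes taken from the reference configuration in the text.
COLUMN_FAMILY = {
    "NAME": "D",
    "BLOOMFILTER": "ROW",
    "VERSIONS": "3",
    "COMPRESSION": "SNAPPY",
}


def render_create_statement(table_name: str, family: dict) -> str:
    """Render an HBASE shell `create` statement for a table with one column family."""
    attributes = ", ".join(f"{key} => '{value}'" for key, value in family.items())
    return f"create '{table_name}', {{{attributes}}}"


# Hypothetical table name.
statement = render_create_statement("person_archive", COLUMN_FAMILY)
```

No TTL attribute is rendered, which matches the permanent retention described above; a retention period would be added as a further attribute if required.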

S300, in cluster mode, data is queried through the row keys designed from the keyword index, and the required data columns are quickly returned by filtering on column keys, so that queries are efficient. Preferably, the invention performs efficient queries through the MapReduce (MR) computing engine.
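The query pattern in S300 can be sketched with an in-memory stand-in for a table whose rows are kept sorted by row key. It shows only the row-key prefix lookup and the column filtering; it does not involve a cluster or the MR computing engine. The class `SortedTable`, its methods, and the sample keys and column names are all hypothetical.

```python
from bisect import bisect_left


class SortedTable:
    """In-memory stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []  # row keys in ascending order
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key: str, columns: dict) -> None:
        if row_key not in self._rows:
            self._keys.insert(bisect_left(self._keys, row_key), row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def scan_prefix(self, prefix: str, wanted_columns=None):
        """Yield (row_key, columns) for keys starting with prefix, in key order."""
        start = bisect_left(self._keys, prefix)  # first key >= prefix
        for row_key in self._keys[start:]:
            if not row_key.startswith(prefix):
                break  # matching keys are contiguous, so the scan can stop here
            row = self._rows[row_key]
            if wanted_columns is not None:
                row = {c: v for c, v in row.items() if c in wanted_columns}
            yield row_key, row


table = SortedTable()
table.put("321CBA|basic|0002", {"D:name": "a", "D:color": "red"})
table.put("321CBA|basic|0001", {"D:name": "b", "D:color": "blue"})
table.put("999ZZZ|basic|0001", {"D:name": "c"})

results = list(table.scan_prefix("321CBA|", wanted_columns={"D:name"}))
```

The scan jumps directly to the first matching key and stops at the first non-matching one, so its cost depends on the number of rows for that keyword rather than on the size of the table.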

In the design method of the HBASE-based archive library disclosed in this embodiment, the original data sources are classified and merged according to the entity characteristics of the archive library and actual service requirements, and characteristic data items are extracted, so that the original data sources are classified and stored. According to the data characteristics and the storage requirements of the HBASE wide table, the row keys of the HBASE wide table define the data index of each type of entity table, so that keywords are indexed. In cluster mode, data is queried through the row keys designed from the keyword index, and the required data columns are quickly returned by filtering on column keys, so that queries are efficient. Compared with conventional archive data stored in a relational database, the BigTable storage model adopted by the invention can store data at the PB level and above, respond to retrieval requests with adequate performance, and scale horizontally by combining inexpensive machines, which makes it well suited to managing and storing large amounts of data. The design matches the requirements of the stored data structure, so that scattered and complicated data is gathered into archives, which improves convenience for the user.

It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.

In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.

For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.

What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".

Claims

1. A design method of an archive library based on HBASE is characterized by comprising the following steps:

2. The method for designing an HBASE-based archive according to claim 1, wherein, in S100, the archive library is classified into: personnel archives, vehicle archives, and case archives.

3. A method for designing an HBASE-based archive as claimed in claim 2, characterized in that the personnel archive comprises at least: personal basic information, whether the person is a high risk group, whether the person is a special group, social behavior information, personnel activity track information and personnel intimacy degree information.

4. A method for designing an HBASE-based archive as claimed in claim 2, characterized in that the vehicle archive comprises at least: vehicle basic information, vehicle activity track information, vehicle case involvement record information, and vehicle behavior record information.

5. A method for designing an HBASE-based archive as claimed in claim 2, characterized in that the case archive comprises at least: basic file information, case-related information and case handling process information.

6. The method for designing an HBASE-based archive according to claim 1, wherein, in S100, each data set includes a feature value, a keyword, and an update time; the keyword is a unique value, and the update time is the data acquisition time.

7. A method for designing an HBASE-based archive as claimed in claim 1, wherein in S200 the row key design rule is determined by the keyword, which may be a string of digits or a combination of characters that forms a unique value with practical meaning, or may be computed by combining certain strings.

8. A method of designing an HBASE-based archive as claimed in claim 7, characterized in that the keyword is reversed and placed at the front of the row key, to ensure that all HBASE regions have an even chance of receiving data.

9. The HBASE-based archive design method as claimed in claim 1, wherein, in S200, when the string is a timestamp, because rows are sorted and stored in natural order, the difference between the maximum sequence value and the timestamp is calculated and stored, so that the latest data ranks at the top.

10. The method for designing an HBASE-based archive according to claim 1, wherein the efficient query is performed by an MR computation engine in S300.