CN106682148A

CN106682148A - Method and device based on Solr data search

Info

Publication number: CN106682148A
Application number: CN201611199422.1A
Authority: CN
Inventors: 于洪勇; 刘晓帅
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2016-12-22
Filing date: 2016-12-22
Publication date: 2017-05-17

Abstract

An embodiment of the invention discloses a method and a device for data search based on Solr. The method includes: selecting HBase or MongoDB as the underlying storage database and generating a rowkey from the fields of each record; tuning the memory settings of the JVM (Java virtual machine), and optimizing the memory configuration, disk usage and transaction log of Solr; and, at query time, fetching the corresponding data from the underlying storage according to the index created by Solr. Through the design of the underlying storage and the data structure, the data depends only weakly on Solr and remains independent of it; search performance is improved through Solr optimization and schema design; JVM tuning improves the query efficiency of the Solr cluster and reduces the probability of socket timeouts caused by Full GC in Solr; and transaction control is applied to index creation and to storing the data in the database, which keeps the searched data and the queried data in sync.

Description

Method and device based on Solr data search

Technical field

The embodiments of the present invention relate to the technical field of massive data search, and more particularly to a method and device based on Solr data search.

Background art

Since 2012, the term "big data" has been mentioned more and more often. People use it to describe and define the massive amount of information produced in the age of the information explosion, and to name the related technological developments and innovations. A column in The New York Times in February 2012 declared that the age of "big data" had arrived: in business, economics and other fields, decisions will increasingly be based on data and analysis rather than on experience and intuition.

Big data is commonly used to describe the large amounts of unstructured and semi-structured data that a company creates, data that would take too much time and money to load into a relational database for analysis. Big data analysis is often associated with cloud computing, because analyzing large data sets in real time requires a framework such as Spark to distribute the work across tens, hundreds or even thousands of computers.

How big is big data, exactly? A set of figures titled "One Day on the Internet" shows that, in a single day, the content generated on the internet could fill 168 million DVDs; 294 billion emails are sent (equivalent to two years of paper mail in the United States); 2 million community posts are published (equivalent to 770 years of text in Time magazine); and 378,000 mobile phones are sold, more than the 371,000 babies born worldwide each day.

By 2012, data volumes had risen from the TB level (1024 GB = 1 TB) to the PB (1024 TB = 1 PB), EB (1024 PB = 1 EB) and even ZB (1024 EB = 1 ZB) level. Research by International Data Corporation (IDC) shows that the volume of data produced worldwide was 0.49 ZB in 2008 and 0.8 ZB in 2009, grew to 1.2 ZB in 2010, and reached 1.82 ZB in 2011, equivalent to more than 200 GB of data for every person on Earth. As of 2012, all the printed material ever produced by humankind amounted to 200 PB of data, and all the words ever spoken by humankind amounted to about 5 EB. Research by IBM claims that 90% of all the data obtained throughout human civilization was produced in the past two years, and that by 2020 the volume of data produced worldwide will be 44 times what it is today. Every day, more than 500 million pictures are uploaded worldwide, and 20 hours of video are shared every minute. Yet even the full amount of information people create each day, including voice calls, emails and messages as well as all uploaded pictures, videos and music, cannot match the amount of digital information created each day about people themselves.

Big data is so important that its acquisition, storage, search, sharing, analysis and even visual presentation have all become important research topics.

Summary of the invention

The purpose of the embodiments of the present invention is to propose a method and device based on Solr data search, intended to solve the problems of storing large volumes of data and of optimizing queries over massive data.

To this end, the embodiments of the present invention adopt the following technical solutions:

In a first aspect, a method based on Solr data search is provided, the method including:

selecting HBase or MongoDB as the underlying storage database, and generating a rowkey according to the fields of each record;

tuning the memory settings of the JVM, and optimizing the memory configuration, disk usage and transaction log of Solr;

at query time, obtaining the corresponding data from the underlying storage according to the index created by Solr.

Preferably, tuning the memory settings of the JVM includes:

allocating to Solr the memory it needs plus a cache of a preset size.

Preferably, optimizing the memory configuration of Solr includes:

configuring the cache sizes and the choice of eviction policy of Solr;

the caches include the autowarming cache, the filter cache, the document cache, the query result cache and/or the field value cache;

the eviction policy is chosen as follows: use FieldCache; reduce mergeFactor so that fewer segments are kept in the index; turn off the compound file format for the index; when creating the index, select NIOFSDirectory to use NIO and work with direct memory; and select a suitable segment merge policy to avoid generating too many segments.

Preferably, optimizing the disk usage of Solr includes:

where relevance is not used, limiting the use of term vectors (Term Vector);

when designing the schema, selecting a suitable document granularity and setting stored fields selectively;

if Solr's unique key is used to locate a record in the database, setting the stored attribute to false for all fields;

for fields that do not take part in scoring, setting omitNorms to true;

for date and numeric types, reducing the precision step precisionStep.

Preferably, optimizing the transaction log includes:

using the transaction log to support near-real-time retrieval of data and atomic updates; to decouple write durability from the commit process; and to support replica synchronization with the shard leader in SolrCloud; and balancing the length of the transaction log against the frequency of hard commits.

In a second aspect, a device based on Solr data search is provided, the device including:

a first acquisition module, configured to select HBase or MongoDB as the underlying storage database and to generate a rowkey according to the fields of each record;

an optimization module, configured to tune the memory settings of the JVM and to optimize the memory configuration, disk usage and transaction log of Solr;

a second acquisition module, configured to obtain, at query time, the corresponding data from the underlying storage according to the index created by Solr.

Preferably, the optimization module is specifically configured to:

Preferably, the optimization module is further specifically configured to:

configure the cache sizes and the choice of eviction policy of Solr;

Preferably, the optimization module is further specifically configured to:

where relevance is not used, limit the use of term vectors (Term Vector);

for fields that do not take part in scoring, set omitNorms to true;

for date and numeric types, reduce the precision step precisionStep.

Preferably, the optimization module is further specifically configured to:

With the method and device based on Solr data search provided by the embodiments of the present invention, HBase or MongoDB is selected as the underlying storage database and a rowkey is generated according to the fields of each record; the memory settings of the JVM are tuned, and the memory configuration, disk usage and transaction log of Solr are optimized; at query time, the corresponding data is obtained from the underlying storage according to the index created by Solr. In this way, the design of the underlying storage and the data structure makes the data only weakly dependent on Solr and keeps it independent of Solr; Solr optimization and schema design improve search performance, while JVM tuning improves the query efficiency of the Solr cluster and reduces the likelihood of socket timeouts caused by Full GC in Solr; and transaction control over index creation and over storing the data in the database ensures that the searched data and the queried data stay in sync.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a method based on Solr data search provided by an embodiment of the present invention;

Fig. 2 is a functional block diagram of a device based on Solr data search provided by an embodiment of the present invention.

Detailed description of the embodiments

The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here are only used to explain the embodiments of the present invention and do not limit them. It should also be noted that, for ease of description, the drawings show only the parts related to the embodiments of the present invention rather than the entire structure.

Referring to Fig. 1, Fig. 1 is a schematic flowchart of a method based on Solr data search provided by an embodiment of the present invention.

As shown in Fig. 1, the method based on Solr data search includes:

Step 101: select HBase or MongoDB as the underlying storage database, and generate a rowkey according to the fields of each record;

Specifically, a search engine returns results ranked by relevance, whereas a relational database can only return results according to the rows in a table. That is, if relevance is not a requirement, an in-memory database synchronized with the relational database can be used to improve query performance, and no search engine is needed. A search engine is not a place to store data, unless the data is useful for querying and for displaying results. A search engine should therefore not be used as a database.

In-memory databases such as Memcache and Redis depend on memory. Given the cost, and given the time an in-memory database takes to load data at startup when the data volume is large, in-memory databases are not considered as an alternative here.

Both HBase and MongoDB are suitable choices. Meanwhile, to decouple the underlying storage from the search engine, there must be a means of obtaining the data without going through Solr. This in turn requires that the database design avoid a unique key that depends entirely on Solr for its generation, especially for HBase. To avoid full table scans, an exact record must be fetched by its unique rowkey, so the design of the rowkey is the key to the Solr + HBase architecture. The rowkey is derived in some defined way from the fields of the stored record, rather than by generating a unique random value to serve as both the rowkey and the Solr unique key.
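The rowkey idea above can be illustrated with a minimal Python sketch. The patent does not specify how the rowkey is derived from the record fields, so everything here is an assumption made for illustration: the function name `make_rowkey`, the field names, the `|` separator, and the hash-derived bucket prefix (a common way to spread HBase writes, not something the source prescribes).

```python
import hashlib

def make_rowkey(record, key_fields, salt_buckets=16):
    """Build a deterministic rowkey from selected fields of a record.

    Hypothetical helper: the same record always yields the same key,
    so the record can be fetched without asking Solr for an ID.
    """
    # Join the chosen field values in a fixed order.
    natural_key = "|".join(str(record[f]) for f in key_fields)
    # Derive a small bucket prefix from a hash of the natural key
    # (an assumed technique to avoid write hotspots).
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:4], 16) % salt_buckets
    return f"{bucket:02d}-{natural_key}"

# Illustrative record; the field names are made up.
record = {"source": "sms", "sender": "13800000000", "ts": "20161222103000"}
rowkey = make_rowkey(record, ["source", "sender", "ts"])
```

Because the key is a pure function of the record's fields, the same value can serve as the Solr unique key and as the HBase rowkey, and it can be recomputed from the record alone if the index is lost.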

Step 102: tune the memory settings of the JVM, and optimize the memory configuration, disk usage and transaction log of Solr;

Here, JVM stands for Java Virtual Machine, and Solr is a full-text search application.

Specifically, Solr wraps Lucene at its lowest level, and Lucene uses an inverted index to improve search efficiency. The invention of MapReduce also benefited from the inverted index.
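As a rough illustration of what an inverted index is, the following Python sketch maps each term to the set of documents that contain it. It is a toy model under simple assumptions (whitespace tokenization, lowercase terms, AND semantics) and does not reflect Lucene's actual data structures; the function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the postings of the first term, then intersect.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "Solr wraps Lucene",
    2: "Lucene builds an inverted index",
    3: "HBase stores rows",
}
index = build_inverted_index(docs)
```

A lookup touches only the postings of the query terms instead of scanning every document, which is why this structure speeds up search.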

NoSQL designs are all denormalized, which avoids the series of join statements that a relational database must use at query time in order to keep storage compact. Although such a design introduces a great deal of redundancy in the stored data, it also improves query efficiency. This gives a design guideline: whether designing the database or the Solr index records, the unit of design should be the record, rather than the table as in a relational database.

Specifically, the main principle for JVM settings is to allocate only the memory Solr needs plus a small amount of cache, so that garbage collection (GC) does not take too long. Because a Solr cluster relies on ZooKeeper for coordination, a Stop-The-World pause can cause a ZooKeeper timeout that affects the ZooKeeper cluster. In particular, when the cluster is built with HBase as the underlying storage, an overly long Full GC in the HBase cluster may cause the HMaster node of HBase to lose contact; worse, the standby node may lose contact as well, in which case the entire underlying storage becomes unavailable. JVM tuning is therefore essential.
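The sizing rule above (the memory Solr needs plus a small cache margin) can be sketched as simple arithmetic. The numbers below are placeholders, not measurements, and setting the initial and maximum heap to the same value is an assumption of this sketch rather than something the source states; `-Xms` and `-Xmx` are the standard JVM heap flags.

```python
def jvm_heap_flags(needed_mb, headroom_mb):
    """Return JVM heap flags sized to what Solr needs plus a small margin.

    Both inputs are assumed to come from observing the running service.
    """
    heap_mb = needed_mb + headroom_mb
    # Assumption: pin initial and maximum heap to the same size.
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]

# Placeholder figures: 6 GB observed working set, 2 GB margin.
flags = jvm_heap_flags(needed_mb=6144, headroom_mb=2048)
```

Keeping the heap no larger than necessary bounds the amount of memory a Full GC has to walk, which is the motivation the text gives for avoiding ZooKeeper timeouts.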

Preferably, optimizing the memory configuration of Solr includes:

configuring the cache sizes and the choice of eviction policy of Solr;

Specifically, besides JVM tuning, Solr itself has a large number of settings that can be adjusted for memory optimization so as to suit various production environments, for example adjusting cache sizes and choosing an eviction policy. The configurable caches include the autowarming cache, the filter cache, the document cache, the query result cache, the field value cache, and so on. It is recommended to use the field cache (FieldCache), to reduce mergeFactor so that fewer segments are kept in the index, and to turn off the compound file format for the index. In addition, NIOFSDirectory (an implementation of the Directory interface in Solr) can be selected when creating the index so that non-blocking I/O (NIO) and direct memory are used, which avoids competing for JVM heap memory. A suitable segment merge policy should be selected to avoid generating too many segments.

To avoid the performance problems that arise once the data volume of a single schema reaches a certain magnitude, the schema can be split: schemas are generated dynamically per day or per month, and at query time a Collection alias combines all schemas that share the same data structure, so that no single large schema is queried. Increasing the number of replicas (Replication) can likewise improve response speed. However, a bigger cluster is not always better: Solr can in theory be scaled without limit, but in practice it still has its limitations.
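The split-by-month idea can be sketched as follows. The snippet only builds collection names and a request URL for the Solr Collections API `CREATEALIAS` action; it does not contact a server. The naming scheme (`records_YYYYMM`), the alias name and the base URL are assumptions chosen for illustration.

```python
from datetime import date
from urllib.parse import urlencode

def monthly_collections(prefix, start, end):
    """List one collection name per month between start and end, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

def create_alias_url(base_url, alias, collections):
    """Build a Collections API URL that points one alias at many collections."""
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })
    return f"{base_url}/admin/collections?{params}"

# Hypothetical names and host.
names = monthly_collections("records", date(2016, 11, 1), date(2017, 1, 31))
url = create_alias_url("http://localhost:8983/solr", "records_all", names)
```

Queries then target the alias, so adding a new month only requires creating one collection and re-pointing the alias rather than changing client code.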

Preferably, optimizing the disk usage of Solr includes:

where relevance is not used, limiting the use of term vectors (Term Vector);

when designing the schema (specifically, the schema.xml configuration file of Solr), selecting a suitable document granularity and setting stored fields selectively;

for fields that do not take part in scoring, setting omitNorms to true;

for date and numeric types, reducing the precision step precisionStep.

Specifically, the Solr configuration likewise offers room for optimization, in both memory usage and disk usage. For example, where relevance is not used, the use of Term Vector can be limited to reduce disk usage. When designing the schema, select a suitable document granularity and set stored fields selectively; if Solr's unique key is mainly used to locate a record in the database, the stored attribute can be set to false for all fields, which reduces disk pressure. For fields that are not intended to take part in scoring, omitNorms can be set to true. For date and numeric types, the precision step precisionStep can be set somewhat smaller as appropriate.
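To make the field settings above concrete, here is a small Python sketch that renders a `<field>` element of the kind found in schema.xml. The attribute names (`indexed`, `stored`, `omitNorms`, `termVectors`) are standard Solr field properties; the helper `field_xml`, its defaults, and the field and type names are assumptions for illustration. The `precisionStep` setting belongs to the numeric field type definition and is not shown.

```python
import xml.etree.ElementTree as ET

def field_xml(name, field_type, scored=False, relevance=False, stored=False):
    """Render a schema.xml <field> element with disk-saving defaults.

    Hypothetical helper: by default the field is indexed but not stored,
    carries no norms and no term vectors.
    """
    attrs = {
        "name": name,
        "type": field_type,
        "indexed": "true",
        # The record itself lives in HBase/MongoDB, so do not store it again.
        "stored": "true" if stored else "false",
        # Norms are only needed when the field takes part in scoring.
        "omitNorms": "false" if scored else "true",
        # Term vectors are only needed for relevance-related features.
        "termVectors": "true" if relevance else "false",
    }
    return ET.tostring(ET.Element("field", attrs), encoding="unicode")

line = field_xml("sender", "string")
```

With `stored` set to false everywhere except the unique key, Solr returns only the key, and the full record is read from the underlying storage.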

Preferably, optimizing the transaction log includes:

using the transaction log to support near-real-time retrieval of data and atomic updates; to decouple write durability from the commit process; and to support replica synchronization with the shard leader in SolrCloud (the Solr cluster); and balancing the length of the transaction log against the frequency of hard commits.

Specifically, the transaction log ensures that uncommitted updates are not lost. It serves three main purposes: 1. supporting near-real-time (NRT) retrieval of data and atomic updates; 2. decoupling write durability from the commit process; 3. supporting replica synchronization with the shard leader in SolrCloud. Tuning the transaction log is a matter of balancing its length (how many uncommitted updates it holds) against the frequency of hard commits. If the transaction log is too large, a restart will take longer to replay the updates.
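The trade-off described above can be put into rough numbers. The sketch assumes a steady update rate and that the log of uncommitted updates is reset at each hard commit; both function names and all figures are illustrative, not taken from the source.

```python
def uncommitted_updates(updates_per_second, hard_commit_interval_s):
    """Estimate how many updates accumulate between two hard commits."""
    return updates_per_second * hard_commit_interval_s

def max_commit_interval(updates_per_second, max_log_entries):
    """Longest hard-commit interval that keeps the log under a target size."""
    return max_log_entries / updates_per_second

# Placeholder workload: 2000 updates per second.
backlog = uncommitted_updates(2000, 15)
interval = max_commit_interval(2000, 60000)
```

A shorter interval keeps the replay work after a restart small at the cost of more frequent commits; a longer one does the opposite.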

Step 103: at query time, obtain the corresponding data from the underlying storage according to the index created by Solr.

Specifically, Solr does not provide document-level security, and transaction control must be added as circumstances require, depending on the database selected.

For HBase, transaction frameworks such as Haeinsa and Tephra can add transactions on top of HBase, while for MongoDB the most popular way to add transaction control at present is to simulate transactions with a message queue. If transaction control is also applied to index creation in Solr, then index creation in Solr and the inserts and deletes of database data can be treated as a single unit within one transaction. When throughput is not particularly high, RabbitMQ can be used to simulate transactions; when throughput is very high, Kafka is recommended, since RabbitMQ is not comparable to Kafka in throughput or in transactions/requests per second (TPS). However, Kafka was originally designed for log processing and can be regarded as a log system; being so specialized, it lacks some of the features a mature message queue (MQ) should have. RabbitMQ is more mature than Kafka, and surpasses it in availability, stability and reliability.
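The following Python sketch illustrates the two ideas of this step with in-memory stand-ins: a write that treats the database insert and the index update as one unit, and a read that first asks the index for rowkeys and then fetches the records from storage. It does not use Solr, HBase, MongoDB or a message queue; the class `SearchStore` and the rollback-on-failure approach are assumptions that substitute a simple compensation step for the queue-based transaction simulation mentioned in the text.

```python
class SearchStore:
    """Toy model: a dict stands in for HBase/MongoDB, another for the Solr index."""

    def __init__(self):
        self.store = {}   # rowkey -> record
        self.index = {}   # term -> set of rowkeys

    def _index_record(self, rowkey, record):
        for value in record.values():
            for term in str(value).lower().split():
                self.index.setdefault(term, set()).add(rowkey)

    def save(self, rowkey, record, fail_index=False):
        """Write to storage and index as one unit; undo the write on failure."""
        previous = self.store.get(rowkey)
        self.store[rowkey] = record
        try:
            if fail_index:
                # Simulated indexing error, used only to show the rollback.
                raise RuntimeError("simulated indexing failure")
            self._index_record(rowkey, record)
        except Exception:
            # Compensate: restore the storage to its state before the write.
            if previous is None:
                self.store.pop(rowkey, None)
            else:
                self.store[rowkey] = previous
            raise

    def query(self, term):
        """Phase 1: look up rowkeys in the index. Phase 2: fetch from storage."""
        rowkeys = sorted(self.index.get(term.lower(), set()))
        return [self.store[k] for k in rowkeys if k in self.store]
```

For example, after `s = SearchStore()` and `s.save("01-a", {"text": "hello solr"})`, calling `s.query("solr")` returns the stored record, while a save with `fail_index=True` raises and leaves no orphaned row in storage.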

With the method based on Solr data search provided by the embodiments of the present invention, HBase or MongoDB is selected as the underlying storage database and a rowkey is generated according to the fields of each record; the memory settings of the JVM are tuned, and the memory configuration, disk usage and transaction log of Solr are optimized; at query time, the corresponding data is obtained from the underlying storage according to the index created by Solr. In this way, the design of the underlying storage and the data structure makes the data only weakly dependent on Solr and keeps it independent of Solr; Solr optimization and schema design improve search performance, while JVM tuning improves the query efficiency of the Solr cluster and reduces the likelihood of socket timeouts caused by Full GC in Solr; and transaction control over index creation and over storing the data in the database ensures that the searched data and the queried data stay in sync.

Referring to Fig. 2, Fig. 2 is a functional block diagram of a device based on Solr data search provided by an embodiment of the present invention.

As shown in Fig. 2, the device includes:

a first acquisition module 201, configured to select HBase or MongoDB as the underlying storage database and to generate a rowkey according to the fields of each record;

an optimization module 202, configured to tune the memory settings of the JVM and to optimize the memory configuration, disk usage and transaction log of Solr;

a second acquisition module 203, configured to obtain, at query time, the corresponding data from the underlying storage according to the index created by Solr.

Preferably, the optimization module 202 is specifically configured to:

Preferably, the optimization module 202 is further specifically configured to:

configure the cache sizes and the choice of eviction policy of Solr;

Preferably, the optimization module 202 is further specifically configured to:

where relevance is not used, limit the use of term vectors (Term Vector);

for fields that do not take part in scoring, set omitNorms to true;

for date and numeric types, reduce the precision step precisionStep.

Preferably, the optimization module 202 is further specifically configured to:

With the device based on Solr data search provided by the embodiments of the present invention, HBase or MongoDB is selected as the underlying storage database and a rowkey is generated according to the fields of each record; the memory settings of the JVM are tuned, and the memory configuration, disk usage and transaction log of Solr are optimized; at query time, the corresponding data is obtained from the underlying storage according to the index created by Solr. In this way, the design of the underlying storage and the data structure makes the data only weakly dependent on Solr and keeps it independent of Solr; Solr optimization and schema design improve search performance, while JVM tuning improves the query efficiency of the Solr cluster and reduces the likelihood of socket timeouts caused by Full GC in Solr; and transaction control over index creation and over storing the data in the database ensures that the searched data and the queried data stay in sync.

The technical principles of the embodiments of the present invention have been described above with reference to specific embodiments. These descriptions are intended only to explain the principles of the embodiments of the present invention and shall not be construed in any way as limiting their scope of protection. Based on the explanation herein, those skilled in the art can conceive of other specific implementations of the embodiments of the present invention without creative effort, and such implementations fall within the scope of protection of the embodiments of the present invention.

Claims

1. A method based on Solr data search, characterized in that the method includes:

tuning the memory settings of the JVM, and optimizing the memory configuration, disk usage and transaction log of Solr;

2. The method according to claim 1, characterized in that tuning the memory settings of the JVM includes:

3. The method according to claim 1, characterized in that optimizing the memory configuration of Solr includes:

configuring the cache sizes and the choice of eviction policy of Solr;

the caches include the autowarming cache, the filter cache, the document cache, the query result cache and/or the field value cache;

the eviction policy is chosen as follows: use FieldCache; reduce mergeFactor so that fewer segments are kept in the index; turn off the compound file format for the index; when creating the index, select NIOFSDirectory to use NIO and work with direct memory; and select a suitable segment merge policy to avoid generating too many segments.

4. The method according to claim 1, characterized in that optimizing the disk usage of Solr includes:

where relevance is not used, limiting the use of term vectors (Term Vector);

if Solr's unique key is used to locate a record in the database, setting the stored attribute to false for all fields;

for fields that do not take part in scoring, setting omitNorms to true;

for date and numeric types, reducing the precision step precisionStep.

5. The method according to claim 1, characterized in that optimizing the transaction log includes:

using the transaction log to support near-real-time retrieval of data and atomic updates; to decouple write durability from the commit process; and to support replica synchronization with the shard leader in SolrCloud; and balancing the length of the transaction log against the frequency of hard commits.

6. A device based on Solr data search, characterized in that the device includes:

a first acquisition module, configured to select HBase or MongoDB as the underlying storage database and to generate a rowkey according to the fields of each record;

an optimization module, configured to tune the memory settings of the JVM and to optimize the memory configuration, disk usage and transaction log of Solr;

a second acquisition module, configured to obtain, at query time, the corresponding data from the underlying storage according to the index created by Solr.

7. The device according to claim 6, characterized in that the optimization module is specifically configured to:

8. The device according to claim 6, characterized in that the optimization module is further specifically configured to:

configure the cache sizes and the choice of eviction policy of Solr;

9. The device according to claim 6, characterized in that the optimization module is further specifically configured to:

where relevance is not used, limit the use of term vectors (Term Vector);

for fields that do not take part in scoring, set omitNorms to true;

for date and numeric types, reduce the precision step precisionStep.

10. The device according to claim 6, characterized in that the optimization module is further specifically configured to: