CN107943922B

CN107943922B - Method and device for retrieving information based on solr

Info

Publication number: CN107943922B
Application number: CN201711164079.1A
Authority: CN
Inventors: 谢永恒; 孟宪奎; 火一莽; 万月亮
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2017-11-21
Filing date: 2017-11-21
Publication date: 2020-08-25
Anticipated expiration: 2037-11-21
Also published as: CN107943922A

Abstract

An embodiment of the invention discloses a Solr-based information retrieval method and device. The method receives an information retrieval request, obtains the parameters in the request, and parses and identifies them; starts distributed query control, and starts row interruption control or query timeout control according to a trigger condition; and loads data by reverse loading of segment files and reverse loading of the inverted list, executes custom criteria scoring, and responds to the request. The key points of the method and device are avoiding excessive index file loading and applying the principle of spatial locality, which ensures effective use of system resources and improves information retrieval performance.

Description

Method and device for retrieving information based on solr

Technical Field

Embodiments of the invention relate to the technical field of information retrieval, and in particular to a Solr-based information retrieval method and device.

Background

As big data technology is adopted across industries, querying massive data sets faces unprecedented challenges. In the big data field, only NoSQL databases such as HBase satisfy the requirements of high concurrency, high performance, and large storage capacity. However, HBase can only be queried by rowkey, so it cannot accommodate changing business requirements while still guaranteeing performance, concurrency, and storage. The design of the rowkey depends strongly on the specific business. Among secondary-index designs, one architectural pattern uses the Solr search engine as the query entry point and HBase as the storage layer, which addresses the overly strong rowkey constraint of HBase.

In the conventional way of using Solr, results are sorted by relevance by default. Sorting by relevance inevitably requires the Solr search engine to load all index files and score them. With massive data, index files are read frequently, system memory is reclaimed excessively often, CPU utilization is too high, and the system load stays above the warning level for long periods, so overall query performance and concurrency cannot be effectively improved. In particular, when Solr data is split into tables, the number of concurrent threads is directly determined by the number of tables. Furthermore, Solr cluster nodes frequently drop offline, so the system cannot be used normally.

As the conventional use of Solr shows, loading too many index files is the root cause of the failure to improve performance and concurrency effectively.

Disclosure of Invention

Embodiments of the invention aim to provide a Solr-based information retrieval method and device that improve information retrieval efficiency.

To achieve this purpose, the embodiments of the invention adopt the following technical solutions:

In a first aspect, a method for Solr-based information retrieval comprises:

receiving a request for information retrieval, obtaining the parameters in the request, and parsing and identifying the parameters;

starting distributed query control, and starting row interruption control or query timeout control according to a trigger condition;

and loading data by reverse loading of segment files and reverse loading of the inverted list, executing custom criteria scoring, and responding to the request.

Optionally, the distributed query control includes:

during index writing, data is distributed evenly across the shards of the table by hashing;

over time, the total amount of data written to each shard is approximately equal.

Optionally, the query timeout control includes:

if Solr is in its default configuration, setting a timeout period;

starting a timer and timing the query;

checking the execution time;

if the query has timed out, interrupting the query;

if the query has not timed out, executing the complete query.

Optionally, the reverse loading of segment files includes:

using horizontal table partitioning to physically isolate data horizontally and, relying on the business characteristic that data is read according to how it was written, querying data by loading the data of the latest table in combination with data read interruption;

in Solr's internal default processing logic, segment files are loaded in order from small to large; by extending Solr's default implementation interface, segment files are instead loaded in order from large to small.

Optionally, the data read interruption comprises:

defining the hit count of a collector and, while iterating to collect documents, intercepting to check whether the defined expected count has been reached;

if so, executing interruption control and responding to the request directly, so that the next segment file is not scanned.

Optionally, the reverse loading of the inverted list includes:

during document collection, using a priority min-heap queue with a defined queue size, placing each matching record in the queue, and moving records in and out of the queue through the priority algorithm;

after scanning a segment, returning directly if the record count is satisfied;

if the record count is not satisfied, continuing to scan the next segment until the set expected count is reached.

Optionally, executing the custom criteria scoring comprises:

on the premise that relevance scoring is not applied, extending Solr's scoring through a custom similarity, weight, or scorer, which is managed through the singleton pattern.
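As a rough illustration of a scoring component managed through the singleton pattern, the following Python sketch defines a scorer that skips relevance computation and returns a fixed score. The class name `ConstantScorer`, its `score` method, and the fixed value 1.0 are hypothetical and are not part of Solr or of the patent text; the source does not specify what the custom scoring computes.

```python
import threading


class ConstantScorer:
    """Hypothetical scorer that does not compute relevance at all."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking so that only one instance is ever created.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance

    def score(self, doc_id, term_freq):
        # Assumption: every hit receives the same score, so no index
        # statistics need to be loaded to rank documents.
        return 1.0


scorer_a = ConstantScorer()
scorer_b = ConstantScorer()
same_instance = scorer_a is scorer_b  # both names refer to one object
```

Sharing one instance avoids re-creating the scorer for every query, which is one plausible reason for the singleton management mentioned above.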

In a second aspect, an apparatus for Solr-based information retrieval comprises:

an analysis module, configured to receive a request for information retrieval, obtain the parameters in the request, and parse and identify the parameters;

a starting module, configured to start distributed query control and to start row interruption control or query timeout control according to a trigger condition;

and a loading module, configured to load data by reverse loading of segment files and reverse loading of the inverted list, execute custom criteria scoring, and respond to the request.

Optionally, the starting module is specifically configured to:

over time, the total amount of data written to each shard is approximately equal;

if Solr is in its default configuration, setting a timeout period;

starting a timer and timing the query;

checking the execution time;

if the query has timed out, interrupting the query;

if the query has not timed out, executing the complete query.

Optionally, the loading module is specifically configured to:

in Solr's internal default processing logic, segment files are loaded in order from small to large, and Solr's default implementation interface is extended so that segment files are loaded in order from large to small;

the data read interruption comprises:

if so, executing interruption control and responding to the request directly, so that the next segment file is not scanned;

if the record count is not satisfied, continuing to scan the next segment until the set expected count is reached.

The beneficial effects of the embodiments of the invention are as follows: the key points of the method and device are avoiding excessive index file loading and applying the principle of spatial locality, which ensures effective use of system resources and improves information retrieval performance.

Drawings

Fig. 1 is a schematic flowchart of a method for retrieving information based on solr according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the functional modules of an apparatus for retrieving information based on solr according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad invention. It should be further noted that, for convenience of description, only some structures, not all structures, relating to the embodiments of the present invention are shown in the drawings.

Referring to fig. 1, fig. 1 is a schematic flowchart of a method for retrieving information based on solr according to an embodiment of the present invention. As shown in fig. 1, the method includes:

Step 110, receiving a request for information retrieval, obtaining the parameters in the request, and parsing and identifying the parameters;

Step 120, starting distributed query control, and starting row interruption control or query timeout control according to a trigger condition;

Wherein the distributed query control comprises:

Illustratively, during index writing, data is distributed evenly across the shards of the table by hashing, so that over time the total amount of data written to each shard is approximately equal. Reading applies the reverse process: the number of rows requested by each query is divided evenly among the shards, and each shard reads only its share of the requested rows, that is, the row count divided by the number of shards (rounded up). In this way each shard queries relatively few records and less data is transmitted over the network, which greatly improves query performance and reduces resource consumption.
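The write-side hashing and the read-side row split can be sketched as follows. This is a minimal Python illustration, not Solr's routing code: the function names `shard_for` and `rows_per_shard` are hypothetical, CRC32 stands in for whatever hash the system actually uses, and the per-shard share is assumed to be the requested row count divided by the number of shards, rounded up.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard for a document by hashing its id (write side)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # CRC32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def rows_per_shard(rows: int, num_shards: int) -> int:
    """Number of rows each shard should return (read side), rounded up."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return (rows + num_shards - 1) // num_shards


# Example: a query asking for 10 rows over 4 shards asks each shard for 3.
per_shard = rows_per_shard(10, 4)

# Example: count how many of 10,000 synthetic ids land on each shard.
counts = [0, 0, 0, 0]
for i in range(10_000):
    counts[shard_for(f"doc-{i}", 4)] += 1
```

Because of the rounding up, the shards together may return slightly more rows than requested; the merging node would then trim the surplus.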

The query timeout control includes:

if Solr is in its default configuration, setting a timeout period;

starting a timer and timing the query;

checking the execution time;

if the query has timed out, interrupting the query;

if the query has not timed out, executing the complete query.
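The timeout steps listed above can be sketched in Python as a cooperative deadline check inside the scan loop. The function `run_with_timeout`, its parameters, and the choice to return the hits gathered so far on timeout are assumptions made for illustration; the source does not state how the interruption is implemented inside Solr.

```python
import time


def run_with_timeout(segments, matches, timeout_s, clock=time.monotonic):
    """Scan segments until done or until timeout_s seconds have elapsed.

    Returns (hits, timed_out). On timeout the scan is interrupted and the
    hits collected so far are returned (an assumption of this sketch).
    """
    start = clock()  # start the timer
    hits = []
    for segment in segments:
        for doc in segment:
            # Check the execution time before handling each document.
            if clock() - start > timeout_s:
                return hits, True  # timed out: interrupt the query
            if matches(doc):
                hits.append(doc)
    return hits, False  # not timed out: the complete query was executed


# Example: a generous timeout lets the whole query finish.
all_hits, timed_out = run_with_timeout(
    [[1, 2, 3], [4, 5]], lambda doc: doc % 2 == 1, timeout_s=60.0
)
```

The `clock` parameter exists only so that the timing logic can be exercised with a fake clock instead of real elapsed time.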

Taking into account factors such as row interruption and reverse reading, the invention extends the functions of the system to combine timeout with interruption, so that both are controlled together. Under the dual control of row interruption and timeout interruption, valid query data is guaranteed to be available.

Step 130, loading data by reverse loading of segment files and reverse loading of the inverted list, executing custom criteria scoring, and responding to the request.

Wherein the reverse loading of segment files includes:

Illustratively, the purpose of horizontal table partitioning is to physically isolate data horizontally, which keeps any single table from growing too large and makes horizontal scaling easier. At the same time, relying on the business characteristic (data is read according to how it was written), queries load the data of the latest table and apply the data read interruption technique. In Solr's internal default processing logic, segment files are loaded in order from small to large, which conflicts with this business requirement; therefore, Solr's default implementation interface is extended so that segment files are loaded in order from large to small. The detailed process is shown in fig. 2. Together, these two strategies ensure that the latest data is read.
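The change of loading order can be illustrated with a small Python sketch. The source does not say what "large" and "small" measure; this sketch assumes a per-segment ordinal that grows with write time, so that loading from large to small visits the newest segment first. The `Segment` type and `reverse_load_order` function are hypothetical and do not correspond to Solr or Lucene classes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    ordinal: int            # assumption: a higher ordinal was written later
    docs: Tuple[str, ...]   # document ids stored in this segment


def reverse_load_order(segments: List[Segment]) -> List[Segment]:
    """Return segments ordered from largest ordinal to smallest."""
    return sorted(segments, key=lambda seg: seg.ordinal, reverse=True)


# Example: the default order would be 1, 2, 3; the reversed order is 3, 2, 1.
segments = [
    Segment(1, ("a", "b")),
    Segment(3, ("e",)),
    Segment(2, ("c", "d")),
]
load_order = [seg.ordinal for seg in reverse_load_order(segments)]
```

Visiting the newest segment first is what allows a later interruption to stop the scan while still returning the most recent data.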

Wherein the data read interruption comprises:

Illustratively, data read interruption is implemented by extending the collectors that Solr provides. The main design principle is to define the hit count of the collector and, while iterating to collect documents, to intercept and check whether the defined expected count has been reached. If so, interruption control is executed and the request is answered directly, so that the next segment file is not scanned.

During document collection, the read and query control of the inverted list is also extended. The main reason lies in the structure of the inverted list: the most recently written data is placed at the end of the list, while reading starts from the beginning (in the latest version of Solr, the inverted list cannot be read from the end). To obtain the latest data, the whole inverted list must therefore be read and the last matching documents, up to the expected count, retained. The implementation uses a priority min-heap queue with a defined queue size: each matching record is placed in the queue, and records enter and leave the queue according to the priority algorithm (the min-heap queue provided by Solr is used). After one segment is scanned, the result is returned directly if the record count is satisfied; otherwise scanning continues with the next segment until the set expected count is reached.
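The combination of a bounded min-heap and interruption between segments might look like the following Python sketch. It uses the standard `heapq` module rather than the queue provided by Solr, and it assumes that segments arrive newest first and that a larger document id means a more recent write. The function name `collect_latest` and its parameters are hypothetical.

```python
import heapq


def collect_latest(segments, matches, expected):
    """Collect up to `expected` of the most recent matching document ids.

    Each segment's postings are read front to back (they cannot be read
    from the end), so a bounded min-heap keeps only the largest ids seen.
    """
    collected = []
    for postings in segments:
        need = expected - len(collected)
        if need <= 0:
            break  # expected count reached: do not scan the next segment
        heap = []  # min-heap; heap[0] is the oldest id currently kept
        for doc_id in postings:
            if not matches(doc_id):
                continue
            if len(heap) < need:
                heapq.heappush(heap, doc_id)
            elif doc_id > heap[0]:
                # Evict the oldest kept id in favour of the newer one.
                heapq.heapreplace(heap, doc_id)
        collected.extend(sorted(heap, reverse=True))
    return collected


# Example: three segments, newest first; only even ids match.
latest = collect_latest([[7, 8, 9, 10], [4, 5, 6], [1, 2, 3]],
                        lambda doc_id: doc_id % 2 == 0, expected=3)
```

In this example the first two segments already supply three hits, so the third segment is never read, which is the interruption behaviour described above.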

Wherein the reverse loading of the inverted list includes:

Wherein executing the custom criteria scoring comprises:

Referring to fig. 2, fig. 2 is a functional module schematic diagram of an apparatus for retrieving information based on solr according to an embodiment of the present invention. As shown in fig. 2, the apparatus includes:

the analysis module 210 is configured to receive a request for information retrieval, obtain the parameters in the request, and parse and identify the parameters;

the starting module 220 is configured to start distributed query control and to start row interruption control or query timeout control according to a trigger condition;

and the loading module 230 is configured to load data by reverse loading of segment files and reverse loading of the inverted list, execute custom criteria scoring, and respond to the request.

Optionally, the starting module 220 is specifically configured to:

if Solr is in its default configuration, setting a timeout period;

starting a timer and timing the query;

checking the execution time;

if the query has timed out, interrupting the query;

if the query has not timed out, executing the complete query.

Optionally, the loading module 230 is specifically configured to:

the data read interruption comprises:

The technical principle of the embodiment of the present invention is described above in conjunction with the specific embodiments. The description is only intended to explain the principles of embodiments of the invention and should not be taken in any way as limiting the scope of the embodiments of the invention. Based on the explanations herein, those skilled in the art will be able to conceive of other embodiments of the present invention without inventive step, and these embodiments will fall within the scope of the present invention.

Claims

1. A method for solr-based information retrieval, the method comprising:

loading data by reverse loading of segment files and reverse loading of the inverted list, executing custom criteria scoring, and responding to the request;

the reverse loading of the segment file comprises the following steps:

2. The method of claim 1, wherein the distributed query control comprises:

3. The method of claim 1, wherein the query timeout control comprises:

if Solr is in its default configuration, setting a timeout period;

starting a timer and timing the query;

checking the execution time;

if the query has timed out, interrupting the query;

if the query has not timed out, executing the complete query.

4. The method of claim 1, wherein the data read interruption comprises:

5. The method of claim 1, wherein the reverse loading of the inverted list comprises:

6. The method of claim 1, wherein executing the custom criteria scoring comprises:

7. An apparatus for solr-based information retrieval, the apparatus comprising:

the loading module is configured to load data by reverse loading of segment files and reverse loading of the inverted list, execute custom criteria scoring, and respond to the request;

the loading module is specifically configured to:

8. The apparatus according to claim 7, wherein the starting module is specifically configured to:

if Solr is in its default configuration, setting a timeout period;

starting a timer and timing the query;

checking the execution time;

if the query has timed out, interrupting the query;

if the query has not timed out, executing the complete query.

9. The apparatus of claim 7, wherein the data read interruption comprises: