CN105320746A

CN105320746A - Big data based index acquisition method and system

Info

Publication number: CN105320746A
Application number: CN201510622636.4A
Authority: CN
Inventors: 龚建新; 王周松; 郑平贺
Original assignee: Beijing VRV Software Corp Ltd
Current assignee: Beijing VRV Software Corp Ltd
Priority date: 2015-09-25
Filing date: 2015-09-25
Publication date: 2016-02-10

Abstract

The present invention provides a big data based index acquisition method and system. The big data based index acquisition method comprises: performing a first analysis on data, to acquire a keyword of the data; classifying the data according to the keyword, storing the classified data into a database, acquiring a rowkey corresponding to the classified data; and establishing an index according to the rowkey corresponding to the classified data and the keyword. According to the big data based index acquisition method and system provided by the present invention, the index is established by using the keyword in the data and the rowkey generated when the data is stored, and a mapping relationship between the data in the database and the rowkey in the index is established, so that in subsequent retrieval, data corresponding to the rowkey can be acquired only by acquiring the rowkey, and a retrieval speed in massive data is improved.

Description

A big-data-based index acquisition method and system

Technical field

The present invention relates to the field of data retrieval, and in particular to a big-data-based index acquisition method and system.

Background art

With the advance of informatization, society has entered the big data era. Data volumes are large, and storing the data and searching it in full text have become bottlenecks that hinder the development of informatization. At present, much data is kept in relational databases such as SQLServer, Mysql and Oracle, and retrieval depends on the database's own indexes, partitioning, table splitting and the like. Query speed is acceptable while the data volume is small, but as the volume grows, the storage and retrieval efficiency of the database system drops sharply, until the database breaks down. The root cause is that these relational databases were not designed for big data.

For the storage and full-text search of big data, the prior art proposes using NoSQL databases: combining MongoDB with Solr relieves the pressure on the relational database and greatly improves the speed of data storage and retrieval. However, as massive amounts of data are stored, this approach also reaches its performance bottleneck, and storage and search become slower and slower.

Summary of the invention

In view of the defects in the prior art, the present invention provides a big-data-based index acquisition method. The method builds an index from the keyword in the data and the rowkey generated when the data is stored, so that data can be retrieved efficiently from massive data sets.

The present invention provides a big-data-based index acquisition method, comprising:

performing a first parse on data to obtain a keyword of the data;

classifying the data according to the keyword, storing the classified data in a database, and obtaining the rowkey corresponding to the classified data;

building an index according to the rowkey corresponding to the classified data and the keyword.

Optionally, before the first parse is performed on the data, the method comprises:

obtaining the URLs of a plurality of pieces of data to be acquired;

matching the URL of each piece of data to be acquired against the URLs in the URL history store of an Hbase cluster and, if the URL of the data to be acquired is a new URL, placing the new URL in a to-be-crawled queue, until the URLs of all the data to be acquired have been matched;

taking the URLs from the to-be-crawled queue in turn, and acquiring the data to be acquired according to the URLs in the to-be-crawled queue.

Optionally, before the first parse is performed on the data, the method comprises:

determining whether the data to be acquired has been obtained;

when the data to be acquired has not been obtained, continuing to acquire it according to the URL in the to-be-crawled queue, and incrementing the number of acquisition attempts by 1;

if the number of acquisition attempts reaches a preset number and the data to be acquired still has not been obtained, storing the URL corresponding to the data that could not be obtained in the error store of the Hbase cluster.

Optionally, before the classified data is stored in the database, the method comprises:

packing and compressing the classified data according to a preset strategy;

performing a second parse on the packed and compressed file, and storing the data obtained from the second parse in the database.

Optionally, the method further comprises a step of acquiring data by means of the index;

the step of acquiring data by means of the index comprises:

obtaining a keyword input by a user, and looking up, in the index of a search server, the rowkey corresponding to the keyword input by the user;

acquiring, according to the rowkey corresponding to the keyword, the data corresponding to that rowkey in the database.

The present invention also proposes a big-data-based index acquisition system, comprising:

a first parsing module, configured to perform a first parse on data to obtain a keyword of the data;

a first acquisition module, configured to classify the data according to the keyword, store the classified data in a database, and obtain the rowkey corresponding to the classified data;

an establishing module, configured to build an index according to the rowkey corresponding to the classified data and the keyword.

Optionally, the system further comprises:

a second acquisition module, configured to obtain the URLs of a plurality of pieces of data to be acquired;

a matching module, configured to match the URL of each piece of data to be acquired against the URLs in the URL history store of an Hbase cluster and, if the URL of the data to be acquired is a new URL, place the new URL in a to-be-crawled queue, until the URLs of all the data to be acquired have been matched;

a third acquisition module, configured to take the URLs from the to-be-crawled queue in turn and acquire the data to be acquired according to the URLs in the to-be-crawled queue.

Optionally, the system further comprises:

a judging module, configured to determine whether the data to be acquired has been obtained;

Optionally, the system further comprises:

a packing module, configured to pack and compress the classified data according to a preset strategy;

a second parsing module, configured to perform a second parse on the packed and compressed file and store the data obtained from the second parse in the database.

Optionally, the system further comprises:

a fourth acquisition module, configured to obtain a keyword input by a user and look up, in the index of a search server, the rowkey corresponding to the keyword input by the user;

a fifth acquisition module, configured to acquire, according to the rowkey corresponding to the keyword, the data corresponding to that rowkey in the database.

As can be seen from the above technical solution, the big-data-based index acquisition method of the present invention builds an index from the keyword in the data and the rowkey generated when the data is stored, so that a correspondence is established between the data in the database and the rowkey in the index. In subsequent retrieval, only the rowkey needs to be obtained in order to acquire the data corresponding to that rowkey, which improves retrieval speed in massive data.

Brief description of the drawings

The features and advantages of the present invention can be understood more clearly by reference to the accompanying drawings, which are schematic and should not be construed as limiting the present invention in any way. In the drawings:

Fig. 1 shows a flowchart of the big-data-based index acquisition method provided by one embodiment of the present invention;

Fig. 2 shows a flowchart of the big-data-based index acquisition method provided by another embodiment of the present invention;

Fig. 3 shows a schematic structural diagram of the big-data-based index acquisition system provided by one embodiment of the present invention.

Detailed description

In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 shows a flowchart of the big-data-based index acquisition method provided by one embodiment of the present invention. With reference to Fig. 1, the big-data-based index acquisition method of this embodiment comprises:

Step 101: perform a first parse on data to obtain a keyword of the data;

Step 102: classify the data according to the keyword, store the classified data in a database, and obtain the rowkey corresponding to the classified data;

Step 103: build an index according to the rowkey corresponding to the classified data and the keyword.

The index is built from the keyword in the data and the rowkey generated when the data is stored, so that a correspondence is established between the data in the database and the rowkey in the index. In subsequent retrieval, only the rowkey needs to be obtained in order to acquire the data corresponding to that rowkey, which improves retrieval speed in massive data.
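Steps 101 to 103 can be illustrated with a minimal Python sketch. It is not the implementation of the invention: a plain dict stands in for the database, a second dict stands in for the index, and the names `store` and `build_index`, the record layout and the counter-based rowkey format are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import count

# Assumption: a simple counter stands in for database-generated rowkeys.
_rowkey_counter = count(1)


def store(database, record):
    """Store a record and return the rowkey assigned to it."""
    rowkey = f"row-{next(_rowkey_counter):06d}"
    database[rowkey] = record
    return rowkey


def build_index(records, database, index):
    """Store each parsed record, then map each of its keywords to its rowkey."""
    for record in records:
        rowkey = store(database, record)
        for keyword in record["keywords"]:  # keywords come from the first parse
            index[keyword].append(rowkey)
    return index


database = {}
index = defaultdict(list)
build_index(
    [
        {"category": "news", "keywords": ["hbase", "index"], "content": "first page"},
        {"category": "blog", "keywords": ["index"], "content": "second page"},
    ],
    database,
    index,
)
```

After this runs, `index` maps each keyword to a list of rowkeys, and every rowkey in the index is also a key of `database`, which is the correspondence the paragraph above describes.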

In order to improve the efficiency of data acquisition, the present invention obtains the URLs of a plurality of pieces of data to be acquired; matches the URL of each piece of data to be acquired against the URLs in the URL history store of the Hbase cluster and, if the URL of the data to be acquired is a new URL, places the new URL in the to-be-crawled queue, until the URLs of all the data to be acquired have been matched; and takes the URLs from the to-be-crawled queue in turn and acquires the data to be acquired according to the URLs in the to-be-crawled queue.

The URLs may be taken from the to-be-crawled queue in first-in, first-out order.
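The URL matching and first-in, first-out queue described above might look like the following sketch. A Python set stands in for the URL history store of the Hbase cluster and a `deque` stands in for the to-be-crawled queue; the function name `enqueue_new_urls` and the example URLs are hypothetical. Recording each new URL in the history as it is queued follows the second embodiment described later in the text.

```python
from collections import deque


def enqueue_new_urls(candidate_urls, url_history, crawl_queue):
    """Queue only URLs that are not yet in the history, and record them."""
    for url in candidate_urls:
        if url in url_history:
            continue  # already known: discard the duplicate
        url_history.add(url)
        crawl_queue.append(url)
    return crawl_queue


url_history = {"http://example.com/a"}  # stand-in for the URL history store
crawl_queue = deque()                   # FIFO: append on the right, pop on the left
enqueue_new_urls(
    ["http://example.com/a", "http://example.com/b", "http://example.com/c",
     "http://example.com/b"],
    url_history,
    crawl_queue,
)
next_url = crawl_queue.popleft()  # first in, first out
```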

In order to further improve the efficiency of data acquisition and storage, the present invention also checks the crawl operation before the first parse is performed on the data, to determine whether the data to be acquired has been obtained. When the data to be acquired has not been obtained, acquisition continues according to the URL in the to-be-crawled queue, and the number of acquisition attempts is incremented by 1. If the number of acquisition attempts reaches the preset number and the data to be acquired still has not been obtained, the URL corresponding to the data that could not be obtained is stored in the error store of the Hbase cluster.
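A hedged sketch of this retry logic is given below. The function `fetch_with_retry`, the default of three attempts, the use of `OSError` to signal a failed fetch and the dict used as the error store are assumptions for illustration; the text only requires a preset number of attempts and an error store in the Hbase cluster.

```python
def fetch_with_retry(url, fetch, error_store, max_attempts=3):
    """Try to fetch a URL up to max_attempts times (max_attempts >= 1).

    On final failure, record the URL and the last error in error_store.
    """
    attempts = 0
    last_error = ""
    while attempts < max_attempts:
        attempts += 1  # count this acquisition attempt
        try:
            return fetch(url)
        except OSError as exc:
            last_error = str(exc)
    error_store[url] = last_error  # stand-in for the Hbase error store
    return None


# Hypothetical fetchers used only to exercise the sketch.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("timeout")
    return "page body"


def dead_fetch(url):
    raise OSError("unreachable")


errors = {}
ok = fetch_with_retry("http://example.com/a", flaky_fetch, errors)
failed = fetch_with_retry("http://example.com/b", dead_fetch, errors)
```

Here the retries happen in a loop for one URL at a time. An alternative with the same bound on attempts is to place a failed URL back at the end of the queue and keep a per-URL attempt count.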

In order to improve the efficiency of data storage, before the classified data is stored in the database, the present invention also assembles the classified data into an XML file and packs and compresses it according to a preset strategy; a second parse is then performed on the packed and compressed file, and the data obtained from the second parse is stored in the database.
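The assemble, compress and re-parse round trip can be sketched with the Python standard library. The XML layout (a `records` root with one `record` element per item), the choice of gzip as the compression and the function names `pack` and `unpack` are assumptions; the text does not specify the XML schema or the preset strategy.

```python
import gzip
import xml.etree.ElementTree as ET


def pack(records):
    """Assemble classified records into one XML document and gzip it."""
    root = ET.Element("records")
    for rec in records:
        node = ET.SubElement(root, "record", category=rec["category"])
        for field, value in rec["fields"].items():
            ET.SubElement(node, field).text = value
    return gzip.compress(ET.tostring(root, encoding="utf-8"))


def unpack(blob):
    """Second parse: decompress the package and rebuild the records."""
    root = ET.fromstring(gzip.decompress(blob))
    return [
        {
            "category": node.get("category"),
            "fields": {child.tag: child.text for child in node},
        }
        for node in root
    ]


records = [
    {"category": "news", "fields": {"title": "HBase indexing", "author": "Li"}},
    {"category": "blog", "fields": {"title": "Index design", "author": "Wang"}},
]
blob = pack(records)
restored = unpack(blob)
```

In this sketch the field names must be valid XML element names and the field values must be non-empty strings for the round trip to be exact.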

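After the above index has been obtained, the method further comprises acquiring data according to the index. The specific steps are as follows:

obtaining a keyword input by a user, and looking up, in the index of the search server, the rowkey corresponding to the keyword input by the user;

acquiring, according to the rowkey corresponding to the keyword, the data corresponding to that rowkey in the database.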

The keyword input by the user is looked up in the above index to obtain the rowkey corresponding to the keyword, and the corresponding data is then found in the database by means of this rowkey, so that retrieval in a massive database is carried out efficiently.
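The two-stage lookup, keyword to rowkey and then rowkey to data, can be sketched as follows. Both stores are plain dicts standing in for the search server index and the database, and the function name `search` and the sample rows are hypothetical.

```python
def search(keyword, index, database):
    """Resolve a keyword to rowkeys via the index, then fetch the rows."""
    rowkeys = index.get(keyword, [])  # stage 1: index lookup
    return [database[rk] for rk in rowkeys if rk in database]  # stage 2: rowkey get


index = {"index": ["row-1", "row-2"], "hbase": ["row-1"]}
database = {
    "row-1": {"title": "HBase indexing"},
    "row-2": {"title": "Index design"},
}
hits = search("index", index, database)
misses = search("solr", index, database)
```

The second stage never scans the database: each row is fetched directly by its rowkey, which is where the claimed speed-up comes from.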

Fig. 2 is a flowchart of the data acquisition method provided by another embodiment of the present invention. With reference to Fig. 2, the index acquisition method and the method of acquiring data based on this index are described in detail below:

Step 201: obtain, based on a database, the URLs of a plurality of pieces of data to be acquired from the internet;

Step 202: determine whether the URL of each piece of data to be acquired already exists by matching it against the URLs in the URL history store of the Hbase cluster; obtain the URLs that do not exist in the URL history store and import them in turn into the to-be-crawled queue and the URL history store; and discard those URLs of data to be acquired that duplicate URLs in the URL history store;

Step 203: select the URLs in the to-be-crawled queue in turn and acquire the data at the selected URL; when the data to be acquired is not obtained, continue to acquire it according to that URL in the to-be-crawled queue, and increment the number of acquisition attempts by 1;

If the number of acquisition attempts is less than three, the selected URL is placed back in the to-be-crawled queue; if the number of failed crawls reaches three, the selected URL and the exception information are stored in the error store of the Hbase cluster.

The preset number of three in step 203 is used only to make this technical solution easier to understand; the actual value may be chosen according to circumstances.

Step 204: obtain the parsing rules from a rule base, and perform a first parse on the acquired data according to the parsing rules to obtain the keywords in the data;

Here, the keywords are words from a common dictionary, such as time, title, content and author;

Step 205: classify the data according to the parsed keywords, assemble it into an XML file, and pack and compress it according to a preset strategy;

Step 206: the data loading middleware in the server performs a second parse on the packed and compressed file and stores the data obtained from the second parse in the database, and the database automatically generates a rowkey that uniquely corresponds to the data; meanwhile, an index is built from the keyword in the data and the rowkey that uniquely corresponds to the data, and the index is imported into the ElasticSearch cluster of the search server;

As data is continually stored, the database builds up a rowkey sequence, and the same rowkey sequence also exists in the search server;

Step 207: the service interface middleware of the server receives the query keyword that the user inputs through a client and, according to the keyword, looks up the rowkey corresponding to the input keyword in the index in the search server;

It will be understood that if several pieces of data correspond to the input keyword, a rowkey list is obtained;

Step 208: based on the obtained rowkey or rowkey list, the service interface middleware acquires the data corresponding to the rowkey or rowkey list from the database.
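Step 204 above relies on parsing rules taken from a rule base. The text does not say how those rules are expressed, so the following sketch simply assumes that each rule is a regular expression keyed by the field it extracts (title, author, time); the rule set `RULES`, the function `first_parse` and the sample page are all hypothetical.

```python
import re

# Assumption: the rule base maps a field name to a regular expression.
RULES = {
    "title": re.compile(r"<title>(.*?)</title>", re.S),
    "author": re.compile(r'<meta name="author" content="(.*?)"'),
    "time": re.compile(r'<meta name="date" content="(.*?)"'),
}


def first_parse(page, rules=RULES):
    """Apply every rule to the page and collect the fields that match."""
    fields = {}
    for name, pattern in rules.items():
        match = pattern.search(page)
        if match:
            fields[name] = match.group(1).strip()
    return fields


page = (
    "<html><head><title> Index design </title>"
    '<meta name="author" content="Li">'
    "</head><body>text</body></html>"
)
fields = first_parse(page)
```

Fields whose rule does not match, such as `time` for this sample page, are left out of the result.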

Fig. 3 is a schematic structural diagram of the big-data-based index acquisition system provided by one embodiment of the present invention. With reference to Fig. 3, the present invention also proposes a big-data-based index acquisition system, which comprises:

a first parsing module 31, configured to perform a first parse on data to obtain a keyword of the data;

a first acquisition module 32, configured to classify the data according to the keyword, store the classified data in a database, and obtain the rowkey corresponding to the classified data;

an establishing module 33, configured to build an index according to the rowkey corresponding to the classified data and the keyword.

In order to improve the efficiency of data acquisition, the present invention performs preprocessing before the data is passed to the first parsing module, and the system further comprises:

a second acquisition module 34, configured to obtain the URLs of a plurality of pieces of data to be acquired;

a matching module 35, configured to match the URL of each piece of data to be acquired against the URLs in the URL history store of the Hbase cluster and, if the URL of the data to be acquired is a new URL, place the new URL in the to-be-crawled queue, until the URLs of all the data to be acquired have been matched;

a crawling module 36, configured to take the URLs from the to-be-crawled queue in turn and acquire the data to be acquired according to the URLs in the to-be-crawled queue.

In order to further improve the efficiency of data acquisition and storage, the system further comprises:

a judging module 37, configured to determine whether the crawl has succeeded;

In order to improve the efficiency of data storage, the system further comprises:

a packing module 38, configured to pack and compress the classified data according to a preset strategy;

a second parsing module 39, configured to perform a second parse on the packed and compressed file and store the data obtained from the second parse in the database.

The system further comprises:

a fourth acquisition module 40, configured to obtain a keyword input by a user and look up, in the index of the search server, the rowkey corresponding to the keyword input by the user;

a fifth acquisition module 41, configured to acquire, according to the rowkey corresponding to the keyword, the data corresponding to that rowkey in the database.

This system builds an index from the keyword in the data and the rowkey generated when the data is stored, so that a correspondence is established between the data in the database and the rowkey in the index. In subsequent retrieval, only the rowkey needs to be obtained in order to acquire the data corresponding to that rowkey, which improves retrieval speed in massive data.

Although embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the present invention, and all such modifications and variations fall within the scope defined by the appended claims.

Claims

1. A big-data-based index acquisition method, characterized by comprising:

performing a first parse on data to obtain a keyword of the data;

2. The method according to claim 1, characterized in that, before the first parse is performed on the data, the method comprises:

obtaining the URLs of a plurality of pieces of data to be acquired;

3. The method according to claim 2, characterized in that, before the first parse is performed on the data, the method comprises:

determining whether the data to be acquired has been obtained;

4. The method according to claim 1, characterized in that, before the classified data is stored in the database, the method comprises:

packing and compressing the classified data according to a preset strategy;

5. The method according to any one of claims 1 to 4, characterized in that the method further comprises a step of acquiring data by means of the index;

the step of acquiring data by means of the index comprises:

6. A big-data-based index acquisition system, characterized by comprising:

7. The system according to claim 6, characterized by comprising:

a third acquisition module, configured to take the URLs from the to-be-crawled queue in turn and acquire the data to be acquired according to the URLs in the to-be-crawled queue.

8. The system according to claim 7, characterized by comprising:

a judging module, configured to determine whether the data to be acquired has been obtained;

9. The system according to claim 6, characterized by comprising:

10. The system according to any one of claims 6 to 9, characterized by comprising: