CN103678691A - Universal index creating method and system based on hadoop

Info

Publication number: CN103678691A
Application number: CN201310738719.0A
Authority: CN
Inventors: 王冬杰
Original assignee: Century Light Technology Development (beijing) Co Ltd
Current assignee: Century Light Technology Development (beijing) Co Ltd
Priority date: 2013-12-26
Filing date: 2013-12-26
Publication date: 2014-03-26
Anticipated expiration: 2033-12-26

Abstract

The invention provides a Hadoop-based method for creating indexes. A business-side service loads the data to be indexed into the HDFS file system, and indexes are then created on the cluster in a distributed manner according to the configuration of the data and its indexing mode. The method comprises the following steps: retrieving and storing the data; configuring the data and specifying the data path; importing the index configuration file and the data parsing format; reading the data and writing the indexes in a distributed manner; and merging the indexes. The method increases data reading speed and index creation speed, and removes the need to deploy a separate service and develop a separate index for each kind of data.

Description

A general-purpose index creation method and system based on Hadoop

Technical field

The present invention relates to the field of data management, and in particular to a Hadoop-based index creation method and system.

Background technology

As informatization accelerates, traditional centralized data storage and processing methods can no longer meet the storage and query-processing demands of massive spatial data. Cloud computing is a highly scalable technology for large-scale computation that has emerged in recent years. It is built on a distributed file system running on large numbers of low-cost machines and can provide high-throughput data access. The MapReduce parallel computing framework distributes large volumes of data operations across compute nodes for parallel processing, which raises the processing capacity of the whole cloud computing platform. With its high extensibility, high scalability, high fault tolerance, and strong parallel processing capability, cloud computing is well suited to storing and processing massive data efficiently. As an open-source cloud computing platform, Hadoop offers strong expansion capability, low cost, high efficiency, high reliability, free availability, and good portability, and it has quickly become a mainstream distributed computing framework and mass data storage solution. As a cloud storage platform, Hadoop consists of one NameNode and multiple DataNodes: the NameNode manages the file system namespace and controls access by external clients, and the DataNodes store the data.

As e-commerce grows more complex, the many kinds of massive data on a platform must be made searchable for different application objects. As the kinds of information to be retrieved increase, a new retrieval-system project has to be built for each kind of data. Moreover, for each kind of data to be searchable in its own retrieval system, an index must be built separately for each search system. Building an index requires supporting information, such as the type and name of each original data field, which is hand-coded into the index. Developers of a search service project must therefore understand the business data, its characteristics, and the retrieval requirements.

Building the index independently for each kind of data has inherent drawbacks. The indexing process requires developing search functionality, intruding into the business system, and understanding business characteristics, data characteristics, and the logical relationships within the data. Indexing each type of data requires deploying a separate service and developing a separate indexing project, and business complexity greatly affects both indexing and search development. In addition, when there are many business types and each involves massive data, more servers must be deployed, and the indexing projects and servers for the different kinds of data cannot be fully shared, which wastes resources.

Summary of the invention

To solve the above problems of the prior art, the present invention proposes a general-purpose index creation method and a system built on that method. By calling and using the services of the retrieval system, standardized data is prepared according to the requirements of the general index system and placed in Hadoop's HDFS file system. The indexing mode of each column of the data is then configured in a configuration file, and the general index service creates the index using Hadoop's distributed processing.

The present invention adopts the following technical scheme: a Hadoop-based index creation method in which a business-side service loads the data to be indexed into the HDFS file system, and the index is created on that system in a distributed manner according to the configuration of said data and its indexing mode.

Preferably, the method comprises the following steps:

Step 1, retrieving and storing data;

Step 2, configuring the data and specifying the data path;

Step 3, importing the index configuration file and data parsing format;

Step 4, reading data in a distributed manner and writing the index;

Step 5, merging the index.

Preferably, said retrieving and storing data comprises:

The business-side service, through Hadoop, periodically gathers from the business data the data needed by the search service,

The retrieved information is aggregated into single records and stored in a general JSON format in a predefined storage directory in the Hadoop file system,

Within said storage directory, all data is divided into multiple parts according to some logic and stored in subdirectories of the storage directory, so that the subsequent index creation service can run as multiple tasks.
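
The partitioning described above can be sketched as follows. The text only says that the data is divided "according to some logic", so the CRC32-based rule, the function name `shard_dir`, and the shard count are assumptions made for illustration; the base path is the example directory used in the embodiment. Python is used here for brevity, although the described system itself works with Java classes.

```python
import zlib


def shard_dir(base_dir: str, record_key: str, num_shards: int = 4) -> str:
    """Return the sub-directory a record is stored in (hypothetical rule)."""
    # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
    shard = zlib.crc32(record_key.encode("utf-8")) % num_shards
    # Two-digit names such as "00" follow the example path in the embodiment.
    return f"{base_dir.rstrip('/')}/{shard:02d}/"


if __name__ == "__main__":
    base = "/user/search/fse/proinfo/out"
    for key in ["990171911", "990171912", "990171913"]:
        print(key, "->", shard_dir(base, key))
```

Any deterministic rule would serve; the point is only that each sub-directory can later be handed to its own indexing subtask.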

Preferably, the JSON format of said data can be determined by the actual amount of information in the data.

Preferably, said configuring the data and specifying the data path comprises configuring, in the general index system, the data information and the indexing mode of the information in the data.

Preferably, said configuring, in the general index system, the data information and the indexing mode of the information in the data comprises configuring the indexing mode of the imported data, specifically:

According to the search attributes of each column in a single row of data, configuring the corresponding attribute field information in a schema file, including configuring constraints that describe how the data information is written to the index,

Completing the conversion from file system data to index data,

Uploading the completed configuration to the Hadoop file system for use in building the index.

Preferably, the constraints on writing said data information to the index comprise:

A name item, describing the name of the attribute field of the data, for example ic,

A type item, describing the data type of the indexed field,

An indexed item, describing whether the attribute field is indexed,

A stored item, describing whether the attribute field is stored,

A required item, describing whether the attribute field is mandatory.

Preferably, said configuring, in the general index system, the data information and the indexing mode of the information in the data further comprises configuring the schema file path and the Java class path of the general JSON parser, so that the system parses the JSON automatically and obtains the data storage directory and the file that maps data to the index.

Preferably, said importing the index configuration file and data parsing format comprises:

For each different type of data, performing the configuration of said step 2 separately, and placing the configuration files for the different data in separate directories of the general system, so that the system can build indexes for multiple kinds of data,

Before index building starts for a particular type of data, importing the configuration for that data type into the execution environment of the system.

Preferably, said reading data in a distributed manner and writing the index comprises:

Creating multiple concurrent subtasks through Hadoop, each subtask reading data from its assigned directory according to the above configuration of the data information and indexing mode,

After the data is read, converting the JSON data into a Java object according to that configuration, so that one complete record in the file system is loaded into one Java object and each attribute of the object corresponds to one field of the record;

According to the above schema file configuration, using the name attribute configured for each row of the configuration file to get the corresponding attribute value from the Java object;

Creating the field information of the record according to the attributes configured in each row, and writing it to the index file;

Writing the small index files produced by the indexing process to the corresponding directory under the data directory given in the configuration file.
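
As a rough local illustration of the subtask pattern above, the sketch below runs one worker per data directory. It is not Hadoop code: a `ThreadPoolExecutor` stands in for Hadoop's concurrent subtasks, local temporary directories stand in for HDFS, and the function names, the `*.json` file naming, and the layout of the small index file (`index/part.json`) are all assumptions made for this example.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def index_one_directory(data_dir: Path) -> int:
    """One subtask: read every record in its directory, write a small index file."""
    records = []
    for data_file in sorted(data_dir.glob("*.json")):
        with data_file.open(encoding="utf-8") as fh:
            # One JSON record per line, as in the single-record format above.
            records.extend(json.loads(line) for line in fh if line.strip())
    # Hypothetical small index: ic value -> pname, stored under the data directory.
    index_dir = data_dir / "index"
    index_dir.mkdir(exist_ok=True)
    small_index = {r["ic"]: r.get("pname", "") for r in records}
    (index_dir / "part.json").write_text(json.dumps(small_index), encoding="utf-8")
    return len(records)


def run_subtasks(data_dirs: List[Path]) -> int:
    """Run one subtask per directory concurrently; return the total record count."""
    with ThreadPoolExecutor(max_workers=max(1, len(data_dirs))) as pool:
        return sum(pool.map(index_one_directory, data_dirs))


def make_demo_dirs(root: Path) -> List[Path]:
    """Create two sub-directories ("00", "01") holding two demo records each."""
    dirs = []
    for i in range(2):
        d = root / f"{i:02d}"
        d.mkdir()
        lines = [json.dumps({"ic": f"{i}{j}", "pname": f"item {i}{j}"}) for j in range(2)]
        (d / "data.json").write_text("\n".join(lines), encoding="utf-8")
        dirs.append(d)
    return dirs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_subtasks(make_demo_dirs(Path(tmp))))
```

Because each worker touches only its own directory, no coordination is needed until the later merge step.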

Preferably, when writing the index file,

If the required item in a row of the configuration is true and the object has no value for that attribute, the record is not loaded;

If the required item is false, the field information of the record is created from the type, indexed, and stored items configured in that row and written to the index file.

Preferably, said merging the index comprises:

After index creation in said step 4 is complete, a separate task merges the small index files under the index storage directory, and the merged large index is sent to the front-end retrieval servers.

Another aspect of the present invention provides a Hadoop-based index creation system, comprising: a unit for retrieving and storing data; a unit for configuring the data and specifying the data path; a unit for importing the index configuration file and data parsing format; a unit for reading data in a distributed manner and writing the index; and a unit for merging the index.

Preferably, the unit for retrieving and storing data comprises:

A receiving unit, configured to aggregate the required information that the business-side service retrieves from the business data into single records, and to store them in a general JSON format in a predefined storage directory in the Hadoop file system,

A directory setting unit, configured to divide all data within said storage directory into multiple parts according to some logic and store them in subdirectories of the storage directory, so that the subsequent index creation service can run as multiple tasks.

Preferably, said unit for configuring the data and specifying the data path comprises:

A schema configuration unit, configured to configure, in a schema file, the attribute field information corresponding to the search attributes of each column in a single row of data, including constraints that describe how the data information is written to the index, thereby completing the conversion from file system data to index data,

An uploading unit, configured to upload the completed configuration to the Hadoop file system for use in building the index.

Preferably, said unit for importing the index configuration file and data parsing format comprises:

An independent configuration unit, configured to perform, separately for each type of data, the configuration of said unit for configuring the data and specifying the data path, and to place the configuration files for the different data in separate directories of the general system, so that the system can build indexes for multiple kinds of data,

An importing unit, configured to import the configuration for a particular data type into the execution environment of the system before index building starts for that type.

Preferably, said unit for reading data in a distributed manner and writing the index comprises:

A subtask creating unit, configured to create multiple concurrent subtasks through Hadoop, each subtask reading data from its assigned directory according to the above configuration of the data information and indexing mode,

An object converting unit, configured to convert the JSON data into a Java object after the data is read, according to the above configuration of the data information and indexing mode, so that one complete record in the file system is loaded into one Java object and each attribute of the object corresponds to one field of the record;

An attribute acquiring unit, configured to use, according to the above schema file configuration, the name attribute configured for each row of the configuration file to get the corresponding attribute value from the Java object;

An index writing unit, configured to create the field information of the record according to the attributes configured in each row, to write it to the index file, and to write the small index files produced by the indexing process to the corresponding directory under the data directory given in the configuration file.

Preferably, said unit for merging the index is specifically configured so that:

After index creation is complete, a separate task merges the small index files under the index storage directory, and the merged large index is sent to the front-end retrieval servers.

The present invention improves the index building process. The whole index creation process, and the data it needs, reside in Hadoop's HDFS file system, so reading data and creating the index are both fast, and considerable speed can be reached even on a Hadoop cluster built from servers of ordinary configuration.

Compared with the prior art, the technical scheme of the present invention does not need to intrude into the business system or understand business characteristics. Multiple types of business that access the same data system do not need separately deployed services or separately developed indexes, and the scheme is not affected by business complexity. It also places no demanding requirements on the performance of the deployed servers and fully shares existing servers, which reduces wasted resources.

Brief description of the drawings

Fig. 1 is a flowchart of the index creation method according to an embodiment of the present invention.

Embodiment

The present invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer-readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or as a specific component that is manufactured to perform the task.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details.

The object of the present invention is to provide a Hadoop-based index creation method, and a system built on that method, that overcome the problems of mass data management in the prior art.

The index creation method provided by the present invention comprises:

Step 1, the front-end system prepares the data;

Step 2, configuring the data and specifying the data path;

Step 3, importing the index configuration file and data parsing format;

Step 4, using the configured information to read data in a distributed manner and write the index;

Step 5, merging the index.

For a better understanding of the technical scheme disclosed by the present invention, a specific implementation is described below with reference to a specific embodiment:

Fig. 1 is a flowchart of the index creation method according to an embodiment of the present invention. As shown in Fig. 1, the specific embodiment of the invention is implemented as follows:

Step 1: the front-end system prepares the data

The business-side service, through Hadoop or another job, periodically gathers from the business data the data needed by the search service, aggregates the retrieved information into single records, and stores them in a general JSON format in a directory in the Hadoop file system, for example /user/search/fse/proinfo/out/00/. Within the data storage directory, all data can be divided into several parts according to some logic and stored in subdirectories of that directory, so that the subsequent indexing service can run as multiple tasks and be more efficient.

The JSON of a record can be formed according to how much information the data carries, for example:

{"datatype":"ProInfo","ic":"990171911","pname":"2pcs/lot TV N95 Dual SIM Card Phone With TV \u0026 Bluetooth Function","pdesc":"Free shipping+2pcs/lot TV N95 Dual SIM Card Phone With TV \u0026 Bluetooth Function","sid":"1a0084d7011a04a57f4c6600","sidl2":"00","istate":"2","cfid":"100002","cidp":"1335001","cidd":"105001","lineid":"119960","isfs":"0","adesc":"0","stype":"9","opt":"20090204","srht":"20090204","et":"20090218","ct":"20080526","lrf":"0","onedf":"0","spzf":"0","vipf":"0","gprf":"0","ppunid":"0","isff":"0","offtype":"1"}
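
To make the single-record format concrete, the following sketch parses a shortened copy of the example record above with Python's standard `json` module. The shortening (only four of the keys are kept) is done here for readability and is not part of the original; note that every value in the example is a string, and that `\u0026` is simply the JSON escape for `&`.

```python
import json

# Shortened copy of the example record; the raw string keeps \u0026 as written.
line = (
    r'{"datatype":"ProInfo","ic":"990171911",'
    r'"pname":"2pcs/lot TV N95 Dual SIM Card Phone With TV \u0026 Bluetooth Function",'
    r'"istate":"2"}'
)

record = json.loads(line)   # one line of the data file -> one dict
print(record["datatype"])   # ProInfo
print(record["pname"])      # ends with "With TV & Bluetooth Function"
```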

Step 2: configure the data and specify the data path

Configure, in the general index system, the data information and the indexing mode of each kind of information in the data.

2.1 Configure the indexing mode of the imported data:

According to the search characteristics of each column in a single row of data, configure the corresponding information in the schema file, for example: <field name="ic" type="string" indexed="true" stored="true" required="true"/>

This line of configuration describes the constraints applied when the data information is written to the index: the name item has the value ic, meaning that the attribute field in the data is named ic; the type item gives the index data type as string; the indexed item states that the field is indexed; the stored item states that the field is stored; and the required item states that the field is mandatory (every record must have a value for this attribute).

This file is configured to drive the conversion from file system data to index data. Once complete, the configuration is uploaded to the file system for use by the indexing process.
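
A minimal sketch of how such a schema line could be read is given below, using Python's standard `xml.etree.ElementTree`. The function name `parse_field` and the choice to treat a missing flag as false are assumptions for illustration; the text does not say how absent attributes are handled.

```python
import xml.etree.ElementTree as ET


def parse_field(xml_line: str) -> dict:
    """Turn one <field .../> schema line into a dict of its five items."""
    elem = ET.fromstring(xml_line)

    def flag(name: str) -> bool:
        # Assumption: a missing flag counts as "false".
        return elem.get(name, "false").lower() == "true"

    return {
        "name": elem.get("name"),
        "type": elem.get("type"),
        "indexed": flag("indexed"),
        "stored": flag("stored"),
        "required": flag("required"),
    }


spec = parse_field(
    '<field name="ic" type="string" indexed="true" stored="true" required="true"/>'
)
print(spec)
```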

2.2 Configure the schema file path and the Java class path of the general JSON parser:

<DATAINFO_CLASS>com.dhgate.search.fse.po.ProInfo</DATAINFO_CLASS>

<SCHEMA fullpath="/user/search/fse/proinfo/schema.xml"/>

The first line tells the system to parse the JSON automatically using this class.

The second line lets the system obtain the data storage directory and the file that maps data to the index.
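
The two configuration lines above can be read with a few lines of code, sketched below. Two things are assumptions of this example rather than statements from the text: the enclosing `<config>` root element (added only so the fragment is a well-formed XML document) and the rule that the data directory is the parent directory of the schema file.

```python
import xml.etree.ElementTree as ET

# The <config> wrapper is hypothetical; the two inner lines are from the text.
CONFIG = """
<config>
  <DATAINFO_CLASS>com.dhgate.search.fse.po.ProInfo</DATAINFO_CLASS>
  <SCHEMA fullpath="/user/search/fse/proinfo/schema.xml"/>
</config>
"""


def read_job_config(text: str) -> dict:
    """Extract the record class name and schema path from the job config."""
    root = ET.fromstring(text.strip())
    schema_path = root.find("SCHEMA").get("fullpath")
    return {
        "record_class": root.findtext("DATAINFO_CLASS"),
        "schema_path": schema_path,
        # Assumption: data lives under the directory that holds schema.xml.
        "data_dir": schema_path.rsplit("/", 1)[0] + "/",
    }


cfg = read_job_config(CONFIG)
print(cfg)
```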

Step 3: import the index configuration file and data parsing format

So that this general system can build indexes for multiple kinds of data, the configuration of step 2 above is performed separately for each type of data, and the configuration files for the different data are placed in separate directories of the general system. Before index building starts for a given type of data, the configuration for that type is imported into the execution environment of the general system.

Step 4: use the configured information to read data in a distributed manner and write the index

When the general index service creates an index, it creates multiple concurrent subtasks through Hadoop. Each subtask reads data from the directory specified by the configuration in 2.2. After reading, it converts each JSON record into a Java object according to the configuration in 2.2. At this point one complete record from the file system has been loaded into one Java object, and each attribute of the object corresponds to one field of the record.
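
The record-to-object conversion can be illustrated as follows. The `ProInfo` dataclass is a Python stand-in for the Java class named in 2.2; its four attributes are picked from the example record, and the real class's members are not given in the text. Ignoring unknown keys and leaving missing ones as `None` are likewise assumptions of this sketch.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ProInfo:
    """Stand-in for com.dhgate.search.fse.po.ProInfo (attribute list is hypothetical)."""
    datatype: Optional[str] = None
    ic: Optional[str] = None
    pname: Optional[str] = None
    pdesc: Optional[str] = None


def from_json(line: str) -> ProInfo:
    """Load one JSON record into one object, one attribute per field."""
    raw = json.loads(line)
    known = {f.name for f in fields(ProInfo)}
    # Keys without a matching attribute are ignored; missing keys stay None.
    return ProInfo(**{k: v for k, v in raw.items() if k in known})


obj = from_json('{"datatype":"ProInfo","ic":"990171911","pname":"phone","istate":"2"}')
print(obj)
```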

Next, according to the schema file configuration in 2.1, the name attribute configured in each row of the configuration file is used to get the corresponding attribute value from the Java object. If the required item in that row is true and the object has no value for the attribute, the record is not loaded. If the value exists, or the required item is false, the field information of the record is created from the type, indexed, and stored items configured in that row and written to the index file.
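
The required-item rule can be sketched as a small function. The names `build_fields` and `SCHEMA` are hypothetical, and the record is represented as a plain dict rather than a Java object. One point is an interpretation: when an optional attribute has no value, this sketch writes nothing for that field, which the text does not spell out.

```python
from typing import List, Optional

# Hypothetical two-row schema: ic is mandatory, pdesc is optional.
SCHEMA = [
    {"name": "ic", "type": "string", "indexed": True, "stored": True, "required": True},
    {"name": "pdesc", "type": "string", "indexed": True, "stored": False, "required": False},
]


def build_fields(record: dict, schema: List[dict]) -> Optional[List[dict]]:
    """Return the index fields for one record, or None if the record is skipped."""
    out = []
    for spec in schema:
        value = record.get(spec["name"])
        if value is None:
            if spec["required"]:
                return None  # required attribute missing: do not load the record
            continue         # optional attribute missing: nothing to write (assumption)
        out.append({
            "name": spec["name"],
            "value": value,
            "type": spec["type"],
            "indexed": spec["indexed"],
            "stored": spec["stored"],
        })
    return out


print(build_fields({"ic": "990171911", "pdesc": "Free shipping"}, SCHEMA))
print(build_fields({"pdesc": "no ic here"}, SCHEMA))
```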

Because the whole index creation process and the data it needs reside in Hadoop's HDFS file system, reading data and creating the index are both fast: on a Hadoop cluster built from 4 servers of ordinary configuration, index creation can reach 10,000 records per second. In addition, no index merging is performed during this stage, which greatly increases write speed; the small index files produced by the indexing process are written to the corresponding directory under the data directory given in the configuration file.

Step 5: merge the index

After the indexing process of step 4 has finished, a separate task merges the small index files that step 4 wrote under the index storage directory, completing the merge for all data directories. After merging, the merged large index is sent to the front-end retrieval servers.
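
The merge step can be illustrated with a toy model. The text does not state the on-disk index format or the library used, so the sketch below assumes each small index is an in-memory inverted index (term to list of record identifiers); a real implementation would merge index files with whatever index library the system is built on.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def merge_indexes(small_indexes: Iterable[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Merge per-directory inverted indexes (term -> record ids) into one."""
    merged = defaultdict(set)
    for small in small_indexes:
        for term, record_ids in small.items():
            merged[term].update(record_ids)
    # Sort terms and ids so the merged index is deterministic.
    return {term: sorted(ids) for term, ids in sorted(merged.items())}


part_00 = {"phone": ["990171911", "990171912"]}
part_01 = {"phone": ["990171913"], "tv": ["990171913"]}
print(merge_indexes([part_00, part_01]))
```

Deferring this merge to a separate task is what lets the step 4 subtasks write their small index files without coordinating with each other.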

According to another aspect of the present invention, a Hadoop-based index creation system is provided, comprising: a unit for retrieving and storing data; a unit for configuring the data and specifying the data path; a unit for importing the index configuration file and data parsing format; a unit for reading data in a distributed manner and writing the index; and a unit for merging the index.

The unit for retrieving and storing data comprises:

Said unit for configuring the data and specifying the data path comprises:

Said unit for importing the index configuration file and data parsing format comprises:

Said unit for reading data in a distributed manner and writing the index comprises:

Said unit for merging the index is specifically configured so that:

In summary, the present invention improves the method and system for building indexes. By calling and using the services of the retrieval system, standardized data is prepared according to the requirements of the general index system and placed in Hadoop's HDFS file system. The indexing mode of each column of the data is then configured in a configuration file, and the general index service creates the index using Hadoop's distributed processing.

Compared with the prior art, the technical scheme of the present invention has the following advantages: the whole index creation process and the data it needs reside in Hadoop's HDFS file system, so reading data and creating the index are both fast, and considerable speed can be reached even on a Hadoop cluster built from servers of ordinary configuration.

The technical scheme of the present invention does not need to intrude into the business system or understand business characteristics. Multiple types of business that access the same data system do not need separately deployed services or separately developed indexes, and the scheme is not affected by business complexity. It also places no demanding requirements on the performance of the deployed servers and fully shares existing servers, which reduces wasted resources.

The above discloses only preferred embodiments of the present invention, and the protection scope of the invention is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. The protection scope of the invention shall therefore be determined by the protection scope of the claims.

Claims

1. A Hadoop-based index creation method, characterized in that,

A business-side service loads the data to be indexed into the HDFS file system, and the index is created on that system in a distributed manner according to the configuration of said data and its indexing mode.

2. The method according to claim 1, comprising the following steps:

Step 1, retrieving and storing data;

Step 2, configuring the data and specifying the data path;

Step 3, importing the index configuration file and data parsing format;

Step 4, reading data in a distributed manner and writing the index;

Step 5, merging the index.

3. The method according to claim 2, wherein said retrieving and storing data comprises:

4. The method according to claim 3, wherein:

The JSON format of said data can be determined by the actual amount of information in the data.

5. The method according to claim 2, wherein said configuring the data and specifying the data path comprises configuring, in the general index system, the data information and the indexing mode of the information in the data.

6. The method according to claim 5, wherein said configuring, in the general index system, the data information and the indexing mode of the information in the data comprises configuring the indexing mode of the imported data, specifically:

Completing the conversion from file system data to index data,

7. The method according to claim 6, wherein the constraints on writing said data information to the index comprise:

A name item, describing the name of the attribute field of the data, for example ic,

A type item, describing the data type of the indexed field,

A required item, describing whether the attribute field is mandatory.

8. The method according to claim 7, wherein said configuring, in the general index system, the data information and the indexing mode of the information in the data further comprises configuring the schema file path and the Java class path of the general JSON parser, so that the system parses the JSON automatically and obtains the data storage directory and the file that maps data to the index.

9. The method according to claim 2, wherein said importing the index configuration file and data parsing format comprises:

10. The method according to claim 8, wherein said reading data in a distributed manner and writing the index comprises:

11. The method according to claim 10, wherein, when writing the index file,

12. The method according to claim 2 or 10, wherein said merging the index comprises:

13. A Hadoop-based index creation system, characterized in that it comprises:

A unit for retrieving and storing data;

A unit for configuring the data and specifying the data path;

A unit for importing the index configuration file and data parsing format;

A unit for reading data in a distributed manner and writing the index;

A unit for merging the index.

14. The system according to claim 13, wherein the unit for retrieving and storing data comprises:

15. The system according to claim 13 or 14, wherein said unit for configuring the data and specifying the data path comprises:

16. The system according to claim 15, wherein said unit for importing the index configuration file and data parsing format comprises:

17. The system according to claim 15, wherein said unit for reading data in a distributed manner and writing the index comprises:

18. The system according to claim 17, wherein said unit for merging the index is specifically configured so that: