CN103678691A - Universal index creating method and system based on hadoop - Google Patents

Publication number
CN103678691A
Authority
CN
China
Prior art keywords
data
index
configuration
file
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310738719.0A
Other languages
Chinese (zh)
Other versions
CN103678691B (en)
Inventor
王冬杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Century Light Technology Development (beijing) Co Ltd
Original Assignee
Century Light Technology Development (beijing) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Century Light Technology Development (beijing) Co Ltd filed Critical Century Light Technology Development (beijing) Co Ltd
Priority to CN201310738719.0A priority Critical patent/CN103678691B/en
Priority claimed from CN201310738719.0A external-priority patent/CN103678691B/en
Publication of CN103678691A publication Critical patent/CN103678691A/en
Application granted granted Critical
Publication of CN103678691B publication Critical patent/CN103678691B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures

Abstract

The invention provides a method for creating indexes based on Hadoop. A business-side service loads the data to be indexed into the HDFS file system, and indexes are then created in a distributed manner within that system according to the configuration of the data and the indexing mode. The method comprises the following steps: retrieving and storing the data; configuring the data and specifying the data path; importing the index configuration file and the data parsing format; reading the data and writing the indexes in a distributed manner; and merging the indexes. The method speeds up data reading and index creation, and it removes the need to deploy a separate service or develop a separate index for each kind of data.

Description

A general index creation method and system based on Hadoop
Technical field
The present invention relates to the field of data management, and in particular to a Hadoop-based index creation method and system.
Background
As informatization accelerates, traditional centralized data storage and processing methods can no longer meet the demands of massive spatial data storage and query processing. Cloud computing is a recently emerged technology for large-scale computation with good scalability. It builds on distributed file systems that run on large numbers of low-cost machines and provide high-throughput data access. The MapReduce parallel computing framework spreads large volumes of data operations across computing nodes for parallel processing, which raises the processing capacity of the cloud platform as a whole. Its high scalability, high fault tolerance, and strong parallel processing capability make cloud computing well suited to the efficient storage and processing of massive data. As an open-source cloud computing platform, Hadoop has rapidly become a mainstream distributed computing framework and mass data storage solution thanks to its capacity for expansion, low cost, high efficiency, high reliability, free availability, and good portability. As a cloud storage platform, Hadoop consists of one NameNode and multiple DataNodes: the NameNode manages the file system namespace and controls access by external clients, and the DataNodes store the data.
As e-commerce grows more complex, the many kinds of massive data on a platform must support information retrieval for different applications, and as the kinds of information to be retrieved increase, a new retrieval system project keeps having to be built for each kind of data. Moreover, for each kind of data to be searchable in different retrieval systems, an index must be built separately for the search system of each kind of data. Building an index also requires supporting information, such as hand-coded types and names of the raw data fields in the index. Developers of a search service project therefore need to understand the business data, its characteristics, and the retrieval requirements.
Building indexes independently for each kind of data has inherent drawbacks. The indexing process requires developing the search function, intruding into the business system, and understanding the business characteristics, the data characteristics, and the logical relationships among the data. Indexing each type of data requires a separately deployed service and a separately developed indexing project, and business complexity heavily affects both indexing and search development. In addition, when there are many business types and each involves massive data, more servers must be deployed, and the indexing projects and servers for the different kinds of data cannot be fully shared, which wastes resources.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a general index creation method and a system built on that method. A service that calls and uses the retrieval system prepares standard data according to the requirements of the general indexing system and places it in Hadoop's HDFS file system. The indexing mode of each field of the data is then configured in a configuration file, and the general index service creates the index in Hadoop's distributed manner.
The present invention adopts the following technical solution: a Hadoop-based index creation method, in which a business-side service loads the data to be indexed into the HDFS file system, and the index is created in a distributed manner within that system according to the configuration of the data and the indexing mode.
Preferably, the method comprises the following steps:
Step 1: retrieving and storing data;
Step 2: configuration and data path specification;
Step 3: importing the index configuration file and the data parsing format;
Step 4: reading data and writing the index in a distributed manner;
Step 5: merging the index.
Preferably, retrieving and storing data comprises:
the business-side service periodically collects, through Hadoop, the data required by the search service from the business data;
the retrieved data information is aggregated into individual records and stored in a general JSON format in a predefined storage directory in the Hadoop file system;
within the storage directory, all data is divided into multiple parts according to some logic and stored in sub-directories under the storage directory, so that the subsequent index creation service can run as multiple tasks.
Preferably, the JSON format of the data can be determined by the actual amount of information in the data.
Preferably, the configuration and data path specification comprises configuring, in the general indexing system, the data information and the indexing mode of the information in the data.
Preferably, configuring the data information and the indexing mode of the information in the data in the general indexing system comprises configuring the indexing mode of the imported data, which specifically comprises:
configuring, in a schema file, the attribute field information corresponding to the search attributes of each field of a single record, including the constraints that apply when the data information is written to the index;
completing the mapping from file system data to index data;
uploading the completed configuration to the Hadoop file system for use in building the index.
Preferably, the constraints that apply when the data information is written to the index comprise:
a name item, describing the name of the attribute field in the data, for example ic;
a type item, describing the index type of the data;
an indexed item, describing whether the attribute field is indexed;
a stored item, describing whether the attribute field is stored;
a required item, describing whether the attribute field is mandatory.
Preferably, configuring the data information and the indexing mode of the information in the data in the general indexing system further comprises configuring the schema file path and the path of the Java class used for general JSON parsing, so that the system parses the JSON automatically and obtains the storage directory of the data and the file that defines the data-to-index mapping.
Preferably, importing the index configuration file and the data parsing format comprises:
performing the configuration of step 2 separately for each type of data, and placing the configuration files for the different data in separate directories of the general system, so that the system can build indexes for multiple kinds of data;
before starting to build the index for data of a particular type, importing the corresponding configuration for that type of data into the execution environment of the system.
Preferably, reading data and writing the index in a distributed manner comprises:
creating multiple concurrent sub-tasks through Hadoop, each of which reads data from the specified directory according to the configured data information and indexing mode;
after the data is read, converting the JSON data into a Java object according to the configured data information and indexing mode, so that one complete record in the file system is loaded into one Java object, each attribute of which corresponds to one field of the data;
according to the schema file configuration, using the name of the attribute field configured in each line of the configuration file to fetch the value of the corresponding attribute from the Java object;
creating the field information of the record according to the attribute information configured in each line, and writing it to an index file;
writing the small index files produced during indexing to the corresponding directory under the data directory given in the configuration file.
Preferably, when writing the index file:
if the required item in the configuration line is true and the object has no such attribute information, the record is not loaded;
if the required item is false, the field information of the record is created from the type, indexed, and stored items of the configuration line and written to the index file.
Preferably, merging the index comprises:
after the index creation of step 4 completes, another task merges the small index files under the index storage directory and sends the merged large index to the front-end retrieval server.
Another aspect of the present invention provides a Hadoop-based index creation system, comprising: a unit for retrieving and storing data; a unit for configuration and data path specification; a unit for importing the index configuration file and the data parsing format; a unit for reading data and writing the index in a distributed manner; and a unit for merging the index.
Preferably, the unit for retrieving and storing data comprises:
a receiving unit, for aggregating the required data information that the business-side service retrieves from the business data into individual records and storing them in a general JSON format in a predefined storage directory in the Hadoop file system;
a directory setting unit, for dividing all data in the storage directory into multiple parts according to some logic and storing them in sub-directories under the storage directory, so that the subsequent index creation service can run as multiple tasks.
Preferably, the unit for configuration and data path specification comprises:
a schema configuration unit, for configuring, in a schema file, the attribute field information corresponding to the search attributes of each field of a single record, including the constraints that apply when the data information is written to the index, thereby completing the mapping from file system data to index data;
an uploading unit, for uploading the completed configuration to the Hadoop file system for use in building the index.
Preferably, the unit for importing the index configuration file and the data parsing format comprises:
an independent configuration unit, for performing the configuration of the unit for configuration and data path specification separately for each type of data, and placing the configuration files for the different data in separate directories of the general system, so that the system can build indexes for multiple kinds of data;
an importing unit, for importing, before index building starts for data of a particular type, the corresponding configuration for that type of data into the execution environment of the system.
Preferably, the unit for reading data and writing the index in a distributed manner comprises:
a sub-task creating unit, for creating multiple concurrent sub-tasks through Hadoop, each of which reads data from the specified directory according to the configured data information and indexing mode;
an object converting unit, for converting, after the data is read, the JSON data into a Java object according to the configured data information and indexing mode, so that one complete record in the file system is loaded into one Java object, each attribute of which corresponds to one field of the data;
an attribute acquiring unit, for using, according to the schema file configuration, the name of the attribute field configured in each line of the configuration file to fetch the value of the corresponding attribute from the Java object;
an index writing unit, for creating the field information of the record according to the attribute information configured in each line, writing it to an index file, and writing the small index files produced during indexing to the corresponding directory under the data directory given in the configuration file.
Preferably, the unit for merging the index is specifically configured so that:
after index creation completes, another task merges the small index files under the index storage directory and sends the merged large index to the front-end retrieval server.
The present invention improves the index building process. Because the whole index creation process and the data it needs all reside in Hadoop's HDFS file system, reading data and creating the index are both fast, and a considerable speed can be reached even on a Hadoop cluster built from commodity servers.
Compared with the prior art, the technical solution of the present invention does not need to intrude into the business system or understand the business characteristics. For multiple types of business that access the same data system, it requires neither a separately deployed service nor a separately developed index, and it is unaffected by business complexity. It also places no harsh demands on server performance and makes full use of existing servers, reducing wasted resources.
Brief description of the drawings
Fig. 1 is a flowchart of the index creation method according to an embodiment of the present invention.
Detailed description
The invention can be implemented in various ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer-readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or as a specific component that is manufactured to perform the task.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details.
The object of the present invention is to provide a Hadoop-based index creation method, and a system built on that method, that overcome the problems of prior-art mass data management.
The index creation method provided by the invention comprises:
Step 1: the front-end system prepares data;
Step 2: configuration and data path specification;
Step 3: importing the index configuration file and the data parsing format;
Step 4: using the configured information to read data and write the index in a distributed manner;
Step 5: merging the index.
For a better understanding of the disclosed technical solution, a specific implementation of the invention is described below with reference to a specific embodiment.
Fig. 1 is a flowchart of the index creation method according to an embodiment of the present invention. As shown in Fig. 1, the embodiment is carried out as follows:
Step 1: the front-end system prepares data
The business-side service periodically collects, through Hadoop or another job, the data required by the search service from the business data, aggregates the retrieved information into individual records, and stores them in a general JSON format in a directory in the Hadoop file system, for example under /user/search/fse/proinfo/out/00/. Within the storage directory, all data can be divided into several parts according to some logic and stored in sub-directories of that directory. This lets the subsequent indexing service run as multiple tasks, which is more efficient.
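The partitioning described above can be sketched as follows. This is a minimal illustration in Python rather than the Java the patent's classes imply, and several details are assumptions: the patent does not say which "logic" divides the data, so a CRC32 hash of the ic value is used here; the shard count of 4, the file name part.json, and the function names are hypothetical; and a local temporary directory stands in for the HDFS path.

```python
import json
import tempfile
import zlib
from pathlib import Path

NUM_SHARDS = 4  # hypothetical; the patent only says the data is split into several parts


def shard_for(record_id: str, num_shards: int = NUM_SHARDS) -> str:
    """Map a record id to a two-digit sub-directory name such as '00'."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return f"{zlib.crc32(record_id.encode('utf-8')) % num_shards:02d}"


def write_records(records, out_root: Path, num_shards: int = NUM_SHARDS) -> None:
    """Append each record as one JSON line to a file in its shard directory."""
    for record in records:
        shard_dir = out_root / shard_for(record["ic"], num_shards)
        shard_dir.mkdir(parents=True, exist_ok=True)
        with open(shard_dir / "part.json", "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")


# A local temporary directory stands in for /user/search/fse/proinfo/out/ on HDFS.
out_root = Path(tempfile.mkdtemp()) / "out"
write_records([{"datatype": "ProInfo", "ic": str(i)} for i in range(20)], out_root)
```

Any deterministic split would serve equally well; the point is only that each sub-directory can later be handed to its own indexing task.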
The JSON of a record can be composed according to how much information the record carries. An example follows:
{"datatype":"ProInfo","ic":"990171911","pname":"2pcs/lot TV N95 Dual SIM Card Phone With TV \u0026 Bluetooth Function","pdesc":"Free shipping+2pcs/lot TV N95 Dual SIM Card Phone With TV \u0026 Bluetooth Function","sid":"1a0084d7011a04a57f4c6600","sidl2":"00","istate":"2","cfid":"100002","cidp":"1335001","cidd":"105001","lineid":"119960","isfs":"0","adesc":"0","stype":"9","opt":"20090204","srht":"20090204","et":"20090218","ct":"20080526","lrf":"0","onedf":"0","spzf":"0","vipf":"0","gprf":"0","ppunid":"0","isff":"0","offtype":"1"}
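To show how one such JSON record might be loaded into a typed object, here is a small Python sketch. The ProInfo dataclass is a hypothetical stand-in that declares only four of the fields in the example; the real com.dhgate.search.fse.po.ProInfo class is not shown in the patent, and the shortened sample record is made up for brevity.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class ProInfo:
    """Hypothetical stand-in for the patent's ProInfo class (a few fields only)."""
    datatype: str
    ic: str
    pname: str
    istate: str


def parse_record(line: str) -> ProInfo:
    """Parse one JSON line and keep only the keys the dataclass declares."""
    raw = json.loads(line)
    wanted = {f.name for f in fields(ProInfo)}
    return ProInfo(**{k: v for k, v in raw.items() if k in wanted})


# Shortened sample; \\u0026 in this Python literal is the JSON escape \u0026 for "&".
sample = (
    '{"datatype": "ProInfo", "ic": "990171911", '
    '"pname": "TV N95 Dual SIM Card Phone With TV \\u0026 Bluetooth Function", '
    '"istate": "2", "cfid": "100002"}'
)
record = parse_record(sample)
```

In this sketch a record that lacks one of the declared keys raises a TypeError; keys that the dataclass does not declare, such as cfid, are ignored.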
Step 2: configuration and data path specification
Configure, in the general indexing system, the data information and the indexing mode of each piece of information in the data.
2.1 Configuring the indexing mode of the imported data:
According to the search characteristics of each field of a single record, configure the corresponding information in the schema file, for example: <field name="ic" type="string" indexed="true" stored="true" required="true"/>
This configuration line describes the constraints that apply when the data information is written to the index: the name item has the value ic, meaning that the attribute field in the data is named ic; the type item gives the index type of the data as string; the indexed item states that the field is indexed; the stored item states that the field is stored; and the required item states that the field is mandatory (every record must carry a value for this attribute).
This file configuration defines the mapping from file system data to index data. Once complete, the configuration is uploaded to the file system for use in the indexing process.
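A schema line of this kind can be read into a small structure with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree; the FieldSpec name is hypothetical, and treating a missing indexed, stored, or required attribute as false is an assumption, since the patent does not state defaults.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """The five constraints carried by one <field .../> line of the schema file."""
    name: str
    type: str
    indexed: bool
    stored: bool
    required: bool


def parse_field(xml_line: str) -> FieldSpec:
    """Parse a single <field .../> element into a FieldSpec."""
    elem = ET.fromstring(xml_line)

    def flag(key: str) -> bool:
        # Assumption: an absent attribute means false.
        return elem.get(key, "false").strip().lower() == "true"

    return FieldSpec(
        name=elem.get("name"),
        type=elem.get("type"),
        indexed=flag("indexed"),
        stored=flag("stored"),
        required=flag("required"),
    )


spec = parse_field(
    '<field name="ic" type="string" indexed="true" stored="true" required="true"/>'
)
```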
2.2 Configuring the schema file path and the path of the Java class used for general JSON parsing:
<DATAINFO_CLASS>com.dhgate.search.fse.po.ProInfo</DATAINFO_CLASS>
<SCHEMA fullpath="/user/search/fse/proinfo/schema.xml"/>
The first line tells the system to parse the JSON automatically with this class.
The second line lets the system obtain the storage directory of the data and the file that defines the data-to-index mapping.
Step 3: importing the index configuration file and the data parsing format
So that the general system can build indexes for multiple kinds of data, the configuration of step 2 is performed separately for each type of data, and the configuration files for the different data are placed in separate directories of the general system. Before index building starts for data of a given type, the corresponding configuration for that type is imported into the execution environment of the general system.
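One way to picture the per-type configuration is as a registry keyed by data type, from which the matching entry is loaded before an indexing run. The Python sketch below is illustrative only: IndexConfig, CONFIGS, and import_config are hypothetical names, the in-memory dictionary stands in for the separate configuration directories, and the ShopInfo entry is invented to show a second data type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexConfig:
    """The two settings from section 2.2 for one data type."""
    datainfo_class: str  # class used to parse the JSON records
    schema_path: str     # full path of the schema file


# In the patent each type's configuration lives in its own directory;
# a dictionary stands in for that layout here.
CONFIGS = {
    "ProInfo": IndexConfig(
        "com.dhgate.search.fse.po.ProInfo",
        "/user/search/fse/proinfo/schema.xml",
    ),
    # Hypothetical second data type, not taken from the patent.
    "ShopInfo": IndexConfig(
        "com.example.ShopInfo",
        "/user/search/fse/shopinfo/schema.xml",
    ),
}


def import_config(datatype: str) -> IndexConfig:
    """Return the configuration to load before indexing data of this type."""
    try:
        return CONFIGS[datatype]
    except KeyError:
        raise ValueError(f"no index configuration registered for {datatype!r}") from None


active = import_config("ProInfo")
```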
Step 4: using the configured information to read data and write the index in a distributed manner
When the general index service creates an index, it creates multiple concurrent sub-tasks through Hadoop. Each sub-task reads data from the directory specified by the configuration in 2.2 and then, also according to the configuration in 2.2, converts the JSON data into a Java object. At this point one complete record from the file system has been loaded into one Java object, and each attribute of the object corresponds to one field of the data.
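The one-sub-task-per-directory pattern can be imitated on a single machine. In the Python sketch below a thread pool stands in for Hadoop's concurrent sub-tasks, which is a substitution made purely for illustration; the function names, the worker count, and the tiny dictionary "segment" keyed by ic are likewise assumptions and not the patent's index format.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def index_shard(shard_dir: Path) -> dict:
    """One sub-task: read every JSON line in a shard and build a small segment."""
    segment = {}
    for path in sorted(shard_dir.glob("*.json")):
        for line in path.read_text(encoding="utf-8").splitlines():
            record = json.loads(line)
            segment[record["ic"]] = record
    return segment


def index_all(data_root: Path, workers: int = 4) -> list:
    """Run one sub-task per sub-directory and collect the resulting segments."""
    shard_dirs = sorted(p for p in data_root.iterdir() if p.is_dir())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as shard_dirs.
        return list(pool.map(index_shard, shard_dirs))


# Build two small shards in a temporary directory and index them concurrently.
data_root = Path(tempfile.mkdtemp())
for shard, ids in {"00": ["1", "2"], "01": ["3"]}.items():
    (data_root / shard).mkdir()
    lines = [json.dumps({"ic": i}) for i in ids]
    (data_root / shard / "part.json").write_text("\n".join(lines), encoding="utf-8")

segments = index_all(data_root)
```

Each sub-task produces its own small segment and never touches another task's output, which is why no merging is needed while the sub-tasks run.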
Next, according to the schema file configured in 2.1, the name attribute configured in each line of the configuration file is used to fetch the value of the corresponding attribute from the Java object. If the required item of the line is true and the object has no such attribute information, the record is not loaded. If the information exists, or the required item is false, the field information of the record is created from the type, indexed, and stored items of the line and written to the index file.
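The required-item rule can be expressed as a short function. This Python sketch makes assumptions worth stating: records and schema lines are plain dictionaries, the function name build_document and the "text" type on the second schema line are hypothetical, and when an optional attribute is missing the field is simply omitted, a case the patent does not spell out.

```python
from typing import Optional

# Two schema lines in dictionary form; the second is a hypothetical optional field.
SCHEMA = [
    {"name": "ic", "type": "string", "indexed": True, "stored": True, "required": True},
    {"name": "pname", "type": "text", "indexed": True, "stored": True, "required": False},
]


def build_document(record: dict, schema: list) -> Optional[list]:
    """Return the index fields for one record, or None if the record is skipped."""
    doc = []
    for field in schema:
        value = record.get(field["name"])
        if value is None:
            if field["required"]:
                return None  # required attribute missing: do not load the record
            continue  # assumption: a missing optional attribute is left out
        doc.append(
            {
                "name": field["name"],
                "value": value,
                "type": field["type"],
                "indexed": field["indexed"],
                "stored": field["stored"],
            }
        )
    return doc


complete = build_document({"ic": "990171911", "pname": "Phone"}, SCHEMA)
partial = build_document({"ic": "990171911"}, SCHEMA)
skipped = build_document({"pname": "Phone"}, SCHEMA)
```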
Because the whole index creation process and the data it needs all reside in Hadoop's HDFS file system, reading data and creating the index are both fast: on a Hadoop cluster built from 4 commodity servers, index creation can reach about 10,000 records per second. In addition, no index merging needs to be performed during indexing at this stage, which greatly improves write speed; the small index files produced during indexing are written to the corresponding directory under the data directory given in the configuration file.
Step 5: merging the index
After the indexing process of step 4 has finished, another task merges the small index files under the index storage directory produced in step 4, completing the merge for all data directories. After merging, the merged large index is sent to the front-end retrieval server.
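As a rough picture of the merge step, suppose each small index file is a mapping from a term to the ids of the records that contain it. That representation is an assumption chosen for illustration, as are the function name and the sample terms; the patent does not describe the on-disk index format or the merge algorithm.

```python
from collections import defaultdict


def merge_segments(segments: list) -> dict:
    """Merge several small term -> record-id mappings into one large index."""
    merged = defaultdict(set)
    for segment in segments:
        for term, doc_ids in segment.items():
            merged[term].update(doc_ids)
    # Sort terms and ids so the merged index is deterministic.
    return {term: sorted(ids) for term, ids in sorted(merged.items())}


small_segments = [
    {"phone": ["990171911"], "tv": ["990171911"]},
    {"phone": ["990171912"]},
]
big_index = merge_segments(small_segments)
```

Deferring this work to a separate task after all sub-tasks finish matches the description above: the writers stay fast because none of them has to coordinate with the others.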
According to a further aspect of the invention, a Hadoop-based index creation system is provided, comprising: a unit for retrieving and storing data; a unit for configuration and data path specification; a unit for importing the index configuration file and the data parsing format; a unit for reading data and writing the index in a distributed manner; and a unit for merging the index.
The unit for retrieving and storing data comprises:
a receiving unit, for aggregating the required data information that the business-side service retrieves from the business data into individual records and storing them in a general JSON format in a predefined storage directory in the Hadoop file system;
a directory setting unit, for dividing all data in the storage directory into multiple parts according to some logic and storing them in sub-directories under the storage directory, so that the subsequent index creation service can run as multiple tasks.
The unit for configuration and data path specification comprises:
a schema configuration unit, for configuring, in a schema file, the attribute field information corresponding to the search attributes of each field of a single record, including the constraints that apply when the data information is written to the index, thereby completing the mapping from file system data to index data;
an uploading unit, for uploading the completed configuration to the Hadoop file system for use in building the index.
The unit for importing the index configuration file and the data parsing format comprises:
an independent configuration unit, for performing the configuration of the unit for configuration and data path specification separately for each type of data, and placing the configuration files for the different data in separate directories of the general system, so that the system can build indexes for multiple kinds of data;
an importing unit, for importing, before index building starts for data of a particular type, the corresponding configuration for that type of data into the execution environment of the system.
The unit for reading data and writing the index in a distributed manner comprises:
a sub-task creating unit, for creating multiple concurrent sub-tasks through Hadoop, each of which reads data from the specified directory according to the configured data information and indexing mode;
an object converting unit, for converting, after the data is read, the JSON data into a Java object according to the configured data information and indexing mode, so that one complete record in the file system is loaded into one Java object, each attribute of which corresponds to one field of the data;
an attribute acquiring unit, for using, according to the schema file configuration, the name of the attribute field configured in each line of the configuration file to fetch the value of the corresponding attribute from the Java object;
an index writing unit, for creating the field information of the record according to the attribute information configured in each line, writing it to an index file, and writing the small index files produced during indexing to the corresponding directory under the data directory given in the configuration file.
The unit for merging the index is specifically configured so that:
after index creation completes, another task merges the small index files under the index storage directory and sends the merged large index to the front-end retrieval server.
In summary, the present invention improves the method and system for building indexes. A service that calls and uses the retrieval system prepares standard data according to the requirements of the general indexing system and places it in Hadoop's HDFS file system. The indexing mode of each field of the data is then configured in a configuration file, and the general index service creates the index in Hadoop's distributed manner.
Compared with the prior art, the technical solution of the present invention has the following advantages. Because the whole index creation process and the data it needs all reside in Hadoop's HDFS file system, reading data and creating the index are both fast, and a considerable speed can be reached even on a Hadoop cluster built from commodity servers.
The technical solution of the present invention does not need to intrude into the business system or understand the business characteristics. For multiple types of business that access the same data system, it requires neither a separately deployed service nor a separately developed index, and it is unaffected by business complexity. It also places no harsh demands on server performance and makes full use of existing servers, reducing wasted resources.
The above discloses only preferred embodiments of the present invention, but the scope of protection of the invention is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims (18)

1. a method for the establishment index based on hadoop, is characterized in that,
The service of business end will treat that index data is written into HDFS file system, according to the configuration to described data and indexed mode, and distributed establishment index in this system.
2. method according to claim 1, comprises the following steps:
Step 1, retrieval and storage data;
Step 2, configuration and data routing are specified;
Step 3, importing configuration index configuration file and Data Analysis form;
Step 4, distributed reading out data and write index;
Step 5, merging index.
3. The method according to claim 2, wherein said retrieving and storing data comprises:
the business-side service periodically collects, through hadoop, the data that the search service needs from the business data;
the retrieved data information is aggregated and organized into single records, which are stored in a general JSON format in a predefined storage directory of the hadoop file system;
within said storage directory, all data is divided into multiple parts according to a certain logic and stored in sub-directories under said storage directory, so as to support multi-task execution by the subsequent index creation service.
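The partitioning step in claim 3 can be illustrated with a minimal sketch. The claim only says the data is divided "according to a certain logic", so the hash-based rule below is an assumption, as are the names `shard_for`, `store_records`, the `part-<n>` directory layout, and the `id` key. A local directory stands in for the hadoop file system.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def shard_for(record_id: str, num_shards: int) -> int:
    """Pick a sub-directory number from a stable hash of the record id (assumed rule)."""
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def store_records(records, root: Path, num_shards: int = 4) -> None:
    """Append each record as one JSON line under root/part-<n>/data.json."""
    for record in records:
        shard = shard_for(str(record["id"]), num_shards)
        shard_dir = root / f"part-{shard:05d}"
        shard_dir.mkdir(parents=True, exist_ok=True)
        with open(shard_dir / "data.json", "a", encoding="utf-8") as out:
            out.write(json.dumps(record, ensure_ascii=False) + "\n")


# Usage: a temporary local directory plays the role of the predefined storage directory.
storage_root = Path(tempfile.mkdtemp())
store_records([{"id": i, "title": f"doc {i}"} for i in range(20)], storage_root)
```

Because every sub-directory holds an independent slice of the records, each later indexing subtask can be pointed at one sub-directory without coordinating with the others.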
4. The method according to claim 3, wherein:
the JSON format of said data may be determined according to the actual amount of information in the data.
5. The method according to claim 2, wherein said configuration and data path specification comprises configuring, in the general directory system, the data information and the indexing mode of the information in the data.
6. The method according to claim 5, wherein said configuring, in the general directory system, the data information and the indexing mode of the information in the data comprises configuring the indexing mode of the imported data, specifically:
according to the retrieval attributes of each row in a single-file data, configuring the corresponding attribute field information in a schema file, including configuring constraints that describe how the data information is written to the index,
thereby completing the conversion from file system data to index data;
uploading the completed configuration to the hadoop file system for use in building the index.
7. The method according to claim 6, wherein the constraints on writing said data information to the index comprise:
a name item, describing the name of the data attribute field;
a type item, describing the type of the data index;
an indexed item, describing whether the data attribute field participates in indexing;
a stored item, describing whether the data attribute field participates in storage;
a required item, describing whether the attribute field is mandatory.
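The five constraint items in claim 7 map naturally onto one small record type per attribute field. The sketch below is illustrative only: the claims do not show the layout of the schema file, so the XML form used here, the `FieldSpec` class, the `load_schema` helper, and the default values for omitted attributes are all assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FieldSpec:
    name: str        # name of the data attribute field
    type: str        # type of the data index
    indexed: bool    # whether the field participates in indexing
    stored: bool     # whether the field participates in storage
    required: bool   # whether the field is mandatory


def _flag(element, attr: str, default: str) -> bool:
    """Read a true/false attribute, falling back to an assumed default."""
    return element.get(attr, default).strip().lower() == "true"


def load_schema(xml_text: str) -> List[FieldSpec]:
    """Parse every <field> element of the (assumed) schema layout."""
    root = ET.fromstring(xml_text.strip())
    return [
        FieldSpec(
            name=el.attrib["name"],
            type=el.attrib["type"],
            indexed=_flag(el, "indexed", "true"),
            stored=_flag(el, "stored", "false"),
            required=_flag(el, "required", "false"),
        )
        for el in root.iter("field")
    ]


SCHEMA_XML = """
<schema>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="body" type="text" indexed="true" stored="false"/>
</schema>
"""
schema = load_schema(SCHEMA_XML)
```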
8. The method according to claim 7, wherein said configuring, in the general directory system, the data information and the indexing mode of the information in the data further comprises configuring the schema file path and the path of the general-purpose Java class that parses JSON, so that the system parses the JSON automatically and obtains the storage directory of the data and the parsing file that maps data to the index.
9. The method according to claim 2, wherein said importing the index configuration file and the data parsing format comprises:
performing the configuration operation of said step 2 separately for each different type of data, and placing the configuration files of the different data in other directories of the general system, so that the system can build indexes for multiple kinds of data;
before index building starts for data of a particular type, importing the corresponding configuration of that data type into the execution environment of the system.
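As a rough illustration of claim 9, the sketch below keeps one configuration file per data type in its own directory and loads the matching one just before an indexing run. The file name `index-config.json`, the functions `save_config` and `import_config`, and the configuration keys are hypothetical; the claims do not specify how the configuration is serialized.

```python
import json
import tempfile
from pathlib import Path

CONFIG_FILE = "index-config.json"  # assumed file name


def save_config(conf_root: Path, data_type: str, config: dict) -> Path:
    """Place the configuration for one data type in its own directory."""
    type_dir = conf_root / data_type
    type_dir.mkdir(parents=True, exist_ok=True)
    path = type_dir / CONFIG_FILE
    path.write_text(json.dumps(config), encoding="utf-8")
    return path


def import_config(conf_root: Path, data_type: str) -> dict:
    """Load the configuration for the data type that is about to be indexed."""
    path = conf_root / data_type / CONFIG_FILE
    if not path.is_file():
        raise FileNotFoundError(f"no index configuration for data type {data_type!r}")
    return json.loads(path.read_text(encoding="utf-8"))


# Usage: two data types configured side by side, one imported for the next run.
conf_root = Path(tempfile.mkdtemp())
save_config(conf_root, "news", {"schema_path": "/conf/news/schema.xml", "data_dir": "/data/news"})
save_config(conf_root, "products", {"schema_path": "/conf/products/schema.xml", "data_dir": "/data/products"})
active = import_config(conf_root, "news")
```

Keeping the configurations apart means adding a new kind of data only requires a new directory, not a change to the indexing code.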
10. The method according to claim 8, wherein said distributed reading of data and writing of the index comprises:
creating multiple concurrent subtasks through hadoop, each subtask reading data from its assigned directory according to the above configuration of the data information and indexing mode;
after the data is read, converting the data in JSON into a Java object according to the above configuration of the data information and indexing mode, so that one complete data record in the file system is loaded into one Java object, each attribute of the object corresponding to the information of one field in the data;
according to the above schema file configuration, using the attributes configured for the attribute field in each row of the configuration file to fetch the information of the corresponding attribute from the Java object;
creating the field information of the record according to the attribute information configured for each row, and writing it to the index file;
writing the small index files produced by the indexing process to the corresponding directory under the data directory specified in the configuration file.
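The flow of claim 10 can be sketched end to end under heavy simplification. Everything here is an assumption made for illustration: a thread pool stands in for hadoop subtasks, a Python dict stands in for the Java object, a toy inverted index stored as JSON stands in for the real index file, and the names `index_shard`, `index_all`, and `SCHEMA` are hypothetical.

```python
import json
import tempfile
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumed schema: one entry per attribute field.
SCHEMA = [
    {"name": "id", "type": "string", "indexed": False, "stored": True},
    {"name": "title", "type": "text", "indexed": True, "stored": True},
]


def index_shard(shard_dir: Path, index_root: Path, schema=SCHEMA) -> Path:
    """One subtask: read a sub-directory and write its small index file."""
    postings = defaultdict(set)  # "field:term" -> ids of records containing it
    stored = {}                  # record id -> values of stored fields
    with open(shard_dir / "data.json", encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue
            record = json.loads(line)  # one JSON record -> one in-memory object
            doc_id = str(record["id"])
            for field in schema:
                value = record.get(field["name"])
                if value is None:
                    continue
                if field["indexed"]:
                    for term in str(value).lower().split():
                        postings[f'{field["name"]}:{term}'].add(doc_id)
                if field["stored"]:
                    stored.setdefault(doc_id, {})[field["name"]] = value
    out_dir = index_root / shard_dir.name
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "index.json"
    payload = {"postings": {k: sorted(v) for k, v in postings.items()}, "stored": stored}
    out_path.write_text(json.dumps(payload), encoding="utf-8")
    return out_path


def index_all(data_root: Path, index_root: Path):
    """Run one subtask per sub-directory concurrently; results keep shard order."""
    shards = sorted(p for p in data_root.iterdir() if p.is_dir())
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda shard: index_shard(shard, index_root), shards))
```

Each subtask touches only its own input and output directories, which is what lets the subtasks run without locking.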
11. The method according to claim 10, wherein, when writing the index file:
if the required item in the row's configuration is true and the object has no such attribute information, the record is not loaded;
if the required item is configured as false, the field information of the record is created according to the type item, indexed item, and stored item configured for the row, and is written to the index file.
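The required-item rule of claim 11 reduces to a short check per record. In this minimal sketch, `build_fields` and the sample `SCHEMA` are hypothetical names. The claim does not say what happens when an optional attribute is absent, so skipping that single field is an assumption.

```python
from typing import List, Optional

# Assumed schema entries carrying the five constraint items.
SCHEMA = [
    {"name": "id", "type": "string", "indexed": True, "stored": True, "required": True},
    {"name": "title", "type": "text", "indexed": True, "stored": True, "required": False},
    {"name": "body", "type": "text", "indexed": True, "stored": False, "required": False},
]


def build_fields(record: dict, schema: List[dict] = SCHEMA) -> Optional[List[dict]]:
    """Return the index fields for one record, or None if the record is rejected."""
    fields = []
    for spec in schema:
        value = record.get(spec["name"])
        if value is None:
            if spec["required"]:
                return None  # required attribute missing: do not load this record
            continue         # optional attribute missing: skip the field (assumption)
        fields.append({
            "name": spec["name"],
            "type": spec["type"],
            "indexed": spec["indexed"],
            "stored": spec["stored"],
            "value": value,
        })
    return fields
```

Rejecting the whole record, instead of writing a partial one, keeps the index free of entries that lack a mandatory field such as an identifier.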
12. The method according to claim 2 or 10, wherein said merging the index comprises:
after index creation in said step 4 is complete, another task merges the small index files under the index storage directory, and the merged large index is sent to the front-end retrieval server.
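The merge step can be pictured with a toy example. This sketch assumes each small index file is a JSON document with a `postings` map (term to record ids) and a `stored` map (record id to stored values); that layout, the `*/index.json` naming, and the function `merge_indexes` are illustrative assumptions, and a real index library would supply its own merge operation. Sending the result to the front-end retrieval server is not shown.

```python
import json
import tempfile
from pathlib import Path


def merge_indexes(index_root: Path, merged_path: Path) -> dict:
    """Combine every small index under index_root into one large index file."""
    postings = {}
    stored = {}
    for part in sorted(index_root.glob("*/index.json")):
        small = json.loads(part.read_text(encoding="utf-8"))
        for term, ids in small["postings"].items():
            postings.setdefault(term, set()).update(ids)  # union of posting lists
        stored.update(small["stored"])
    merged = {
        "postings": {term: sorted(ids) for term, ids in sorted(postings.items())},
        "stored": stored,
    }
    merged_path.parent.mkdir(parents=True, exist_ok=True)
    merged_path.write_text(json.dumps(merged), encoding="utf-8")
    return merged
```

Running the merge as a separate task after all subtasks finish means the subtasks never contend for a shared index file.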
13. A system for creating an index based on hadoop, characterized by comprising:
a unit for retrieving and storing data;
a unit for configuration and data path specification;
a unit for importing the index configuration file and the data parsing format;
a unit for distributed reading of data and writing of the index;
a unit for merging the index.
14. The system according to claim 13, wherein the unit for retrieving and storing data comprises:
a receiving unit, for aggregating the required data information that the business-side service retrieves from the business data, organizing it into single records, and storing them in a general JSON format in a predefined storage directory of the hadoop file system;
a directory setting unit, for dividing all data in said storage directory into multiple parts according to a certain logic and storing them in sub-directories under said storage directory, so as to support multi-task execution by the subsequent index creation service.
15. The system according to claim 13 or 14, wherein said unit for configuration and data path specification comprises:
a schema configuration unit, for configuring the corresponding attribute field information in a schema file according to the retrieval attributes of each row in a single-file data, including configuring constraints that describe how the data information is written to the index, thereby completing the conversion from file system data to index data;
an uploading unit, for uploading the completed configuration to the hadoop file system for use in building the index.
16. The system according to claim 15, wherein said unit for importing the index configuration file and the data parsing format comprises:
an independent configuration unit, for performing the configuration of said unit for configuration and data path specification separately for each different type of data, and placing the configuration files of the different data in other directories of the general system, so that the system can build indexes for multiple kinds of data;
an import unit, for importing the corresponding configuration of a data type into the execution environment of the system before index building starts for data of that type.
17. The system according to claim 15, wherein said unit for distributed reading of data and writing of the index comprises:
a subtask creating unit, for creating multiple concurrent subtasks through hadoop, each subtask reading data from its assigned directory according to the above configuration of the data information and indexing mode;
an object converting unit, for converting, after the data is read, the data in JSON into a Java object according to the above configuration of the data information and indexing mode, so that one complete data record in the file system is loaded into one Java object, each attribute of the object corresponding to the information of one field in the data;
an attribute acquiring unit, for using, according to the above schema file configuration, the attributes configured for the attribute field in each row of the configuration file to fetch the information of the corresponding attribute from the Java object;
an index writing unit, for creating the field information of the record according to the attribute information configured for each row, writing it to the index file, and writing the small index files produced by the indexing process to the corresponding directory under the data directory specified in the configuration file.
18. The system according to claim 17, wherein said unit for merging the index is specifically configured to:
after index creation is complete, merge, by another task, the small index files under the index storage directory, and send the merged large index to the front-end retrieval server.
CN201310738719.0A 2013-12-26 Universal index creating method and system based on hadoop Active CN103678691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310738719.0A CN103678691B (en) 2013-12-26 Universal index creating method and system based on hadoop

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310738719.0A CN103678691B (en) 2013-12-26 Universal index creating method and system based on hadoop

Publications (2)

Publication Number Publication Date
CN103678691A (en) 2014-03-26
CN103678691B (en) 2016-11-30

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102426609A (en) * 2011-12-28 2012-04-25 厦门市美亚柏科信息股份有限公司 Index generation method and index generation device based on MapReduce programming architecture
CN103207889A (en) * 2013-01-31 2013-07-17 重庆大学 Method for retrieving massive face images based on Hadoop

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FHQLLT: "Summary of solr schema.xml configuration", 《HTTP://FHQLT.ITEYE.COM/BLOG/1716338》 *
董长春 (Dong Changchun): "Research on an inverted index technique based on hadoop", 《China Master's Theses Full-text Database》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354251A (en) * 2015-10-19 2016-02-24 国家电网公司 Hadoop based power cloud data management indexing method in power system
CN108268614A (en) * 2017-12-29 2018-07-10 郑州轻工业学院 A kind of distribution management method of forest reserves spatial data
CN108268614B (en) * 2017-12-29 2020-08-18 郑州轻工业学院 Distributed management method for forest resource spatial data

Similar Documents

Publication Publication Date Title
CN101620609B (en) Multi-tenant data storage and access method and device
US9020802B1 (en) Worldwide distributed architecture model and management
CN104102710A (en) Massive data query method
US10338958B1 (en) Stream adapter for batch-oriented processing frameworks
CN105786808B (en) A kind of method and apparatus for distributed execution relationship type computations
US9330161B2 (en) Creating global aggregated namespaces for storage management
CN102999537A (en) System and method for data migration
CN102567495B (en) Mass information storage system and implementation method
CN101694626B (en) Script execution system and method
CN103106249B (en) A kind of parallel data processing system based on Cassandra
CN109086325A (en) Data processing method and device based on block chain
CN105138661A (en) Hadoop-based k-means clustering analysis system and method of network security log
CN102937964B (en) Intelligent data service method based on distributed system
CN106471501A (en) The method of data query, the storage method data system of data object
CN104050248A (en) File storage system and storage method
CN109766206A (en) A kind of log collection method and system
CN103927331A (en) Data querying method, data querying device and data querying system
CN107888666A (en) A kind of cross-region data-storage system and method for data synchronization and device
CN103399894A (en) Distributed transaction processing method on basis of shared storage pool
CN106055678A (en) Hadoop-based panoramic big data distributed storage method
CN103823846A (en) Method for storing and querying big data on basis of graph theories
CN104410666A (en) Method and system for implementing heterogeneous storage resource management under cloud computing
CN102779160A (en) Mass data information indexing system and indexing construction method
CN105550351B The extemporaneous inquiry system of passenger's run-length data and method
CN103365740A (en) Data cold standby method and device

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB02 Change of applicant information

Address after: 100088 B3, Haidian District, Beijing, Huayuan Road

Applicant after: Century Light Technology Development (Beijing) Co., Ltd.

Address before: 100088 B3, Haidian District, Beijing, Huayuan Road

Applicant before: Century Light Technology Development (Beijing) Co., Ltd.

COR Change of bibliographic data

Free format text: CORRECT: APPLICANT; FROM: SHIJI HEGUANG TECHNOLOGY DEVELOPMENT (BEIJING) CO., LTD. TO: SHIJI HEGUANG TECHNOLOGY DEVELOPMENT (BEIJING) LTD.

C14 Grant of patent or utility model
GR01 Patent grant