CN106611013A

CN106611013A - Information searching method and system

Info

Publication number: CN106611013A
Application number: CN201510705372.9A
Authority: CN
Inventors: 吴强; 王福荣; 王丽清; 胡华伟; 周裕峰
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2015-10-27
Filing date: 2015-10-27
Publication date: 2017-05-03

Abstract

The invention discloses an information searching method and system. The information searching method comprises: after a search statistical request from a searching user is received, performing task decomposition on the request to obtain corresponding MapReduce tasks; reading data from the corresponding distributed data storage nodes in a distributed file system according to the obtained MapReduce tasks, wherein data in a Hive data warehouse in the distributed file system is stored in the RCFile format; performing distributed calculation on the data read from each distributed data storage node; merging the calculation results of the respective distributed data storage nodes to obtain a searching result; and providing the searching result to the searching user. With the method and system provided by the invention, cloud resource computing nodes are added automatically through automatic resource adaptation, so that the efficiency of searching and analyzing mobile Internet access logs is improved.

Description

Information query method and system

Technical field

The present invention relates to the field of mobile communications, and more particularly to an information query method and system.

Background technology

In the mobile Internet, mobile terminals such as mobile phones and PAD terminals connect wirelessly through a telecom operator to gain network access. To safeguard public network security, telecom operators retain trace data for Internet access made via CTNET, CTWAP, or WLAN. There are two main types of trace data: traces left during the authentication and login process when a mobile user connects to the Internet, and traces of the user's Internet access after connecting.

With the rapid development of the mobile Internet and the spread of smartphones, the volume of retained mobile Internet access trace data has grown from the GB level to the TB level. Taking the 9 million C-network users of Fujian Telecom as an example, the raw Internet access trace data generated per day in January 2014 was 700G. Under the Ministry of Industry and Information Technology requirement to keep such data for at least 3 months, the total data volume is about 70T and is still growing.

The existing solution architecture correlates and matches the Internet access trace data and then loads it into a relational database to support queries and statistical analysis of user Internet behavior. When the volume of data traversed by a query exceeds 10TB, the centralized relational database system locates data slowly: retrieving one week of trace data for a single user takes nearly 6 hours, and macro-level analysis of user network behavior cannot be completed under the existing architecture. Even though the Internet industry already widely applies Hadoop big data technology to user behavior analysis, cloud computing and storage resources still have to be adjusted manually and reactively as rapid business growth drives massive increases in the volume of data to analyze.

Therefore, an efficient method for retaining and retrieving mobile Internet access trace data is needed to solve the above technical problems.

Summary of the invention

The technical problem to be solved by the present disclosure is how to provide an efficient method for retaining and retrieving mobile Internet access trace data, so as to solve the problems of mass data storage and retrieval in the prior art.

The present disclosure provides an information query method, including: after receiving a query statistics request sent by a querying user, decomposing the query statistics request into tasks to obtain corresponding MapReduce tasks; reading data from the corresponding distributed data storage nodes in a distributed file system according to the obtained MapReduce tasks, wherein data in the Hive data warehouse in the distributed file system is stored in the RCFile format; performing distributed computation on the data read from each distributed data storage node; merging the computation results of the distributed data storage nodes to obtain a query result; and providing the query result to the querying user.

Further, the method includes: collecting the Internet access trace data of mobile users in real time; and loading the collected Internet access trace data into the Hive data warehouse in the distributed file system.

Further, the step of loading the collected Internet access trace data into the Hive data warehouse in the distributed file system also includes:

when creating a data table in the Hive data warehouse, determining the number of buckets according to the number of tasks into which the query statistics request is decomposed and the system capacity.

Further, the method includes: using the formula

Buckets=min (data_total_size/dfs.block.size, map_count)

to calculate the number of buckets Buckets, where min () is the minimum function, data_total_size is the total volume of Internet access trace data, dfs.block.size is the file block size configured in the distributed file system, and map_count is the number of tasks into which the query statistics request is decomposed.

Further, the Internet access trace data includes the authentication information and Internet access information uploaded by DPI devices, the authentication information and Internet access information uploaded by WAP gateways, and the NAT address translation information uploaded by the SYSLOG log server of the firewall.

The present invention also provides an information query system, including an interface unit, a query driving unit, a data processing unit, and a distributed file system, wherein: the interface unit is configured to receive a query statistics request sent by a querying user; the query driving unit is configured to, after the interface unit receives the query statistics request sent by the querying user, decompose the query statistics request into tasks to obtain corresponding MapReduce tasks, and to provide the obtained MapReduce tasks to the data reading unit; the data processing unit is configured to read data from the corresponding distributed data storage nodes in the distributed file system according to the obtained MapReduce tasks, perform distributed computation on the data read from each distributed data storage node, merge the computation results of the distributed data storage nodes to obtain a query result, and instruct the interface unit to provide the query result to the querying user; and the distributed file system is configured to store data in a distributed manner, wherein data in the Hive data warehouse in the distributed file system is stored in the RCFile format.

Further, the system also includes a collection unit and a data loading unit, wherein:

the collection unit is configured to collect the Internet access trace data of mobile users in real time;

the data loading unit is configured to load the Internet access trace data collected by the collection unit into the Hive data warehouse in the distributed file system.

Further, when creating a data table in the Hive data warehouse, the data loading unit determines the number of buckets according to the number of tasks into which the query statistics request is decomposed and the system capacity.

Further, the data loading unit calculates the number of buckets using the formula

Buckets=min (data_total_size/dfs.block.size, map_count)

Further, the Internet access trace data includes the authentication information and Internet access information uploaded by DPI devices, the authentication information and Internet access information uploaded by WAP gateways, and the NAT address translation information uploaded by the SYSLOG log server of the firewall.

The information query method and system provided by the present disclosure add cloud resource computing nodes automatically through automatic resource adaptation and improve the efficiency of querying and analyzing mobile Internet access logs.

Description of the drawings

Fig. 1 shows a flow chart of an information query method according to one embodiment of the invention.

Fig. 2 shows a schematic structural diagram of an information query system according to one embodiment of the invention.

Fig. 3 shows a schematic flow diagram of an information query method according to one embodiment of the invention.

Fig. 4 shows a diagram of the results achieved by an information query method according to one embodiment of the invention.

Fig. 5 shows a structural block diagram of an information query system according to one embodiment of the invention.

Fig. 6 shows a structural block diagram of an information query system according to another embodiment of the invention.

Specific embodiment

The present invention is described more fully below with reference to the accompanying drawings, which illustrate exemplary embodiments of the invention.

Fig. 1 shows a flow chart of an information query method according to one embodiment of the invention. As shown in Fig. 1, the method mainly includes:

Step 100, after a query statistics request sent by a querying user is received, the query statistics request is decomposed into tasks to obtain corresponding MapReduce tasks.

Step 102, according to the obtained MapReduce tasks, data is read from the corresponding distributed data storage nodes in the distributed file system, wherein data in the Hive data warehouse in the distributed file system is stored in the RCFile format.

In one embodiment, the Internet access trace data of mobile users is collected in real time, and the collected Internet access trace data is loaded into the Hive data warehouse in the distributed file system.

In one embodiment, the Internet access trace data includes the authentication information and Internet access information uploaded by DPI devices, the authentication information and Internet access information uploaded by WAP gateways, and the NAT address translation information uploaded by the SYSLOG log server of the firewall.

In one embodiment, when a data table is created in the Hive data warehouse, the number of buckets is determined according to the number of tasks into which the query statistics request is decomposed and the system capacity.

In one embodiment, the formula

Buckets=min (data_total_size/dfs.block.size, map_count)

may be used to calculate the number of buckets Buckets, where min () is the minimum function, data_total_size is the total volume of Internet access trace data, dfs.block.size is the file block size configured in the distributed file system, and map_count is the number of tasks into which the query statistics request is decomposed. In this way, cloud resource computing nodes are added automatically through automatic resource adaptation, which facilitates later retrieval.
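The bucket-number formula can be illustrated with a minimal Python sketch. The function name bucket_count, the ceiling rounding, the lower bound of one bucket, and the sample sizes are all assumptions made for illustration; the source gives only the formula itself and does not say how the quotient is rounded.

```python
def bucket_count(data_total_size: int, dfs_block_size: int, map_count: int) -> int:
    """Buckets = min(data_total_size / dfs.block.size, map_count).

    Sizes are in bytes. Rounding the quotient up and returning at least
    one bucket are assumptions; the source formula does not specify them.
    """
    if dfs_block_size <= 0 or map_count <= 0:
        raise ValueError("dfs_block_size and map_count must be positive")
    blocks = -(-data_total_size // dfs_block_size)  # ceiling division
    return max(1, min(blocks, map_count))


# Hypothetical values, not taken from the source.
MIB = 1024 ** 2
GIB = 1024 ** 3

# Large table: the block count (5600) exceeds map_count, so map_count caps it.
large = bucket_count(700 * GIB, 128 * MIB, 64)

# Small table: only 8 blocks exist, so the block count is the limit.
small = bucket_count(1 * GIB, 128 * MIB, 64)

print(large, small)
```

Taking the minimum keeps the bucket count from exceeding either the number of file blocks the data actually occupies or the number of tasks the query is decomposed into.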

Step 104, distributed computation is performed on the data read from each distributed data storage node. Specifically, the storage nodes formed by the bucketing algorithm can carry out the distributed computation, which improves the efficiency of querying and analyzing mobile Internet access logs.

Step 106, the computation results of the distributed data storage nodes are merged to obtain a query result.

Step 108, the query result is provided to the querying user.
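Steps 104 and 106 can be sketched in Python as a per-bucket computation followed by a merge. This is an illustrative sketch under stated assumptions, not the patented implementation: Hive assigns rows to buckets by hashing the bucketing column modulo the bucket count, and CRC32 stands in here for Hive's own hash function. The record layout (user, url), the per-user count query, and all function names are hypothetical.

```python
import zlib
from collections import Counter


def bucket_of(key: str, buckets: int) -> int:
    # Stable hash modulo the bucket count (CRC32 is a stand-in, see lead-in).
    return zlib.crc32(key.encode("utf-8")) % buckets


def load_into_buckets(records, buckets):
    # Loading step: place each record in the bucket chosen by its user key.
    storage = [[] for _ in range(buckets)]
    for user, url in records:
        storage[bucket_of(user, buckets)].append((user, url))
    return storage


def map_bucket(bucket):
    # Step 104: each node computes a partial result over its own bucket.
    return Counter(user for user, _ in bucket)


def merge(partials):
    # Step 106: merge the partial results of all nodes into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


records = [
    ("u1", "a.example"),
    ("u2", "b.example"),
    ("u1", "c.example"),
    ("u3", "a.example"),
]
storage = load_into_buckets(records, 3)
result = merge(map_bucket(bucket) for bucket in storage)
print(dict(result))
```

Because all records of one user land in the same bucket, a query about a single user only needs to read one bucket instead of scanning the whole table.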

The information query method provided by the embodiments of the present invention makes full use of the technical advantages of both the cloud computing resource pool and Hadoop distributed big data processing. It applies an optimized data bucketing algorithm in the Hive data warehouse and stores Internet access trace data in the RCFile compressed format. Compared with the native TextFile format in Hadoop, the Hive RCFile compressed storage format saves 2/3 of the storage space for the same amount of data; compared with native unbucketed storage in Hive, the optimized data bucketing algorithm substantially improves service query time.

Fig. 2 shows a schematic structural diagram of an information query system according to one embodiment of the invention. The present invention provides a Hadoop distributed big data processing system built on an x86-architecture cloud computing resource pool, which adds cloud resource computing nodes automatically through automatic resource adaptation and improves the efficiency of querying and analyzing mobile Internet access logs. As shown in Fig. 2, the system includes: a log collection module 27, an x86-architecture cloud computing resource pool 21, an HDFS distributed file system 22, MapReduce 23, a PIC query module 25, and a HIVE statistical analysis module 26. The log collection module 27 is responsible for collecting mobile Internet access trace data. It is deployed on physical equipment with high-speed network access, and it collects and correlates the authentication information and Internet access information uploaded by packet-domain DPI devices, the authentication information and Internet access information (including proxy information) uploaded by WAP gateways, and the NAT address translation information uploaded by the SYSLOG log server of the firewall.

The big data query and analysis module is deployed on the x86-architecture cloud computing resource pool 21 and makes full use of the elastic computing resource scheduling of cloud computing IaaS. It loads the trace data that has been correlated in the mobile Internet access trace data collection module into the Hive data warehouse, where data is stored in the RCFile format. File storage formats commonly used in the Hadoop platform include TextFile, which supports text, and SequenceFile, which supports binary data; both are row-oriented storage formats. The RCFile (Record Columnar File) storage structure follows the design principle of "partition horizontally first, then partition vertically".

First, RCFile offers data loading and workload adaptability comparable to row storage. Second, the read optimization of RCFile avoids reading unnecessary columns when a table is scanned, and tests show that in most cases it performs better than other structures. Third, RCFile compresses along the column dimension and therefore effectively improves storage utilization. In general, compared with the native TextFile format in Hadoop, the Hive RCFile compressed storage format saves 2/3 of the storage space for the same amount of data.
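The "partition horizontally first, then partition vertically" layout can be pictured with a small Python sketch. It models only the ordering of values, not the real RCFile on-disk format (which also carries sync markers, metadata headers, and per-column compression); the function names, the group size, and the sample rows are hypothetical.

```python
def to_row_groups(rows, group_size):
    # Horizontal partition: split the table into fixed-size row groups.
    return [rows[i:i + group_size] for i in range(0, len(rows), group_size)]


def to_columns(row_group):
    # Vertical partition: inside a row group, store each column contiguously.
    return [list(column) for column in zip(*row_group)]


def scan_column(layout, col_index):
    # A scan of one column touches only that column's values in each group.
    return [value for group in layout for value in group[col_index]]


rows = [
    ("u1", "a.example", 200),
    ("u2", "b.example", 404),
    ("u3", "c.example", 200),
]
layout = [to_columns(group) for group in to_row_groups(rows, 2)]

print(layout)
print(scan_column(layout, 2))
```

Keeping every row inside a single row group preserves the loading behaviour of row storage, while the column-wise order inside each group is what allows unneeded columns to be skipped and similar values to be compressed together.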

When a data table is created, the number of buckets is calculated with the following formula:

Buckets=min (data_total_size/dfs.block.size, map_count)

where Buckets is the number of buckets; data_total_size is the total data size; dfs.block.size is the file block size configured in HDFS; and map_count is the number of tasks into which the service query is decomposed.

Determining the number of buckets by the above formula when creating data tables in the Hive data warehouse takes full account of the match between subsequent service query decomposition and the existing system capacity. Repeated tests on equal volumes of data show that this reaches a good balance between time and resources and effectively shortens the time needed for subsequent retrieval.

Fig. 3 shows a schematic flow diagram of an information query method according to one embodiment of the invention. As shown in Fig. 3, the method includes:

Step 301, the user sends a query statistics request through Web interface 31.

The query statistics request sent by the user through the Web interface is processed and forwarded to Hive Drive 32.

Step 302, Hive Drive 32 acts as the task parsing engine.

Specifically, Hive Drive 32 decomposes and translates the query statistics request into MapReduce tasks.

Step 303, Map reduce 33 executes the various MapReduce tasks according to the dependencies between them.

Specifically, each MapReduce task is serialized into a plan.xml file, which is then loaded into the job cache. Each component parses plan.xml (deserialization) and performs the corresponding operations. The results are placed in a temporary location and then moved to the specified location by DML (data manipulation language).

Step 304, HDFS (Hadoop Distributed File System) retrieves the data of the distributed data storage nodes for distributed computation, wherein the data of the distributed storage nodes is data obtained according to the bucketing algorithm.

Step 305, Map reduce 33 merges the computation results of the nodes.

Step 306, Hive Drive 32 returns the display result to Web interface 31, which presents it to the querying user.

Fig. 4 shows a diagram of the results achieved by an information query method according to one embodiment of the invention, taking the collection and analysis of Fujian Telecom's mobile Internet access trace data as an example. Fujian Telecom has 9 million C-network users, and the raw Internet access trace data generated per day in January 2014 was 700G. Under the Ministry of Industry and Information Technology requirement to keep such data for at least 3 months, the total data volume is about 70T. Under the architecture of the present invention, the log collection module is interconnected with each data source by high-speed optical fiber and runs on physical servers to aggregate and correlate the data.

The main core modules of the present invention are hosted on the Fujian Telecom service cloud computing resource pool, using the VMware vSphere virtualized computing platform with 6 processing nodes enabled. With data stored using the bucketing algorithm of the embodiments of the present invention, behavior analysis is carried out with models such as a user roaming access model, a basket proximity model, and a most-visited destination model. In a query over 10,000,000,000 data items, the measured query time was 1201 seconds, a substantial improvement over the former centralized relational database processing system.
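The figures quoted in this example can be sanity-checked with a short Python calculation. The inputs come from the text above; treating "3 months" as 90 days, using decimal units (1 TB = 1000 GB), and reading the 10,000,000,000 figure as a count of records are assumptions.

```python
# Inputs taken from the example above.
daily_volume_gb = 700            # raw trace data generated per day
queried_items = 10_000_000_000   # data items covered by the test query
query_seconds = 1201             # measured query time

# Assumptions: "3 months" is about 90 days, and 1 TB = 1000 GB.
retention_days = 90

retained_tb = daily_volume_gb * retention_days / 1000
items_per_second = queried_items / query_seconds

print(f"retained volume: about {retained_tb:.0f} TB")
print(f"scan rate: about {items_per_second / 1e6:.1f} million items per second")
```

The retained volume comes out at 63 TB for exactly 90 days, the same order of magnitude as the roughly 70T stated in the text, and the test query corresponds to a scan rate of roughly 8.3 million items per second across the 6 processing nodes.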

Fig. 5 shows a structural block diagram of an information query system according to one embodiment of the invention. The system 500 includes an interface unit 501, a query driving unit 502, a data processing unit 503, and a distributed file system 504, wherein: the interface unit 501 is configured to receive a query statistics request sent by a querying user; the query driving unit 502 is configured to, after the interface unit receives the query statistics request sent by the querying user, decompose the query statistics request into tasks to obtain corresponding MapReduce tasks, and to provide the obtained MapReduce tasks to the data reading unit; the data processing unit 503 is configured to read data from the corresponding distributed data storage nodes in the distributed file system according to the obtained MapReduce tasks, perform distributed computation on the data read from each distributed data storage node, merge the computation results of the distributed data storage nodes to obtain a query result, and instruct the interface unit to provide the query result to the querying user; and the distributed file system 504 is configured to store data in a distributed manner, wherein data in the Hive data warehouse in the distributed file system is stored in the RCFile format.

In one embodiment, the interface unit 501 may be a Web interface, the query driving unit 502 may be Hive Drive, the data processing unit may be Map reduce, and the distributed file system may be HDFS.

In one embodiment, the system also includes a collection unit 505 and a data loading unit 506, wherein: the collection unit 505 is configured to collect the Internet access trace data of mobile users in real time; and the data loading unit 506 is configured to load the Internet access trace data collected by the collection unit into the Hive data warehouse in the distributed file system.

In one embodiment, when creating a data table in the Hive data warehouse, the data loading unit 506 determines the number of buckets according to the number of tasks into which the query statistics request is decomposed and the system capacity.

In one embodiment, the data loading unit 506 calculates the number of buckets using the formula

Buckets=min (data_total_size/dfs.block.size, map_count)

Fig. 6 shows a structural block diagram of an information query system according to another embodiment of the invention. The information query system 600 may be a host server with computing capability, a personal computer (PC), a portable computer, a mobile terminal, or another terminal. The specific embodiments of the present invention do not limit the specific implementation of the computing node.

The information query system 600 includes a processor 601, a communications interface 602, a memory 603, and a bus 604. The processor 601, the communications interface 602, and the memory 603 communicate with one another over the bus 604.

The communications interface 602 is used to communicate with network devices, which include, for example, a virtual machine management center and shared storage.

The processor 601 is used to execute a program. The processor 601 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.

The memory 603 is used to store files. The memory 603 may include high-speed RAM and may also include non-volatile memory, for example at least one disk memory. The memory 603 may also be a memory array. The memory 603 may also be divided into blocks, and the blocks may be combined into virtual volumes according to certain rules.

In one embodiment, the above program may be program code including computer operation instructions. The program may specifically be used to: after receiving a query statistics request sent by a querying user, decompose the query statistics request into tasks to obtain corresponding MapReduce tasks; read data from the corresponding distributed data storage nodes in the distributed file system according to the obtained MapReduce tasks, wherein data in the Hive data warehouse in the distributed file system is stored in the RCFile format; perform distributed computation on the data read from each distributed data storage node; merge the computation results of the distributed data storage nodes to obtain a query result; and provide the query result to the querying user.

In a specific embodiment, the method also includes: collecting the Internet access trace data of mobile users in real time; and loading the collected Internet access trace data into the Hive data warehouse in the distributed file system.

In a specific embodiment, the step of loading the collected Internet access trace data into the Hive data warehouse in the distributed file system also includes: when creating a data table in the Hive data warehouse, determining the number of buckets according to the number of tasks into which the query statistics request is decomposed and the system capacity.

In a specific embodiment, the number of buckets is calculated using the formula

Buckets=min (data_total_size/dfs.block.size, map_count)

In a specific embodiment, the Internet access trace data includes the authentication information and Internet access information uploaded by DPI devices, the authentication information and Internet access information uploaded by WAP gateways, and the NAT address translation information uploaded by the SYSLOG log server of the firewall.

Those of ordinary skill in the art will appreciate that the exemplary units and algorithm steps in the embodiments described herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.

If the functions are implemented in the form of computer software and sold or used as an independent product, then to a certain extent all or part of the technical solution of the present invention (for example, the part that contributes over the prior art) can be regarded as embodied in the form of a computer software product. The computer software product is generally stored in a computer-readable non-volatile storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The description of the present invention is given for the sake of example and description, and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be obvious to those of ordinary skill in the art. The embodiments were selected and described in order to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and so design various embodiments with various modifications suited to particular uses.

Claims

1. An information query method, characterized by including:

after receiving a query statistics request sent by a querying user, decomposing the query statistics request into tasks to obtain corresponding MapReduce tasks;

reading data from the corresponding distributed data storage nodes in a distributed file system according to the obtained MapReduce tasks, wherein data in the Hive data warehouse in the distributed file system is stored in the RCFile format;

performing distributed computation on the data read from each distributed data storage node;

merging the computation results of the distributed data storage nodes to obtain a query result;

providing the query result to the querying user.

2. The method according to claim 1, characterized by also including:

collecting the Internet access trace data of mobile users in real time;

loading the collected Internet access trace data into the Hive data warehouse in the distributed file system.

3. The method according to claim 2, characterized in that

the step of loading the collected Internet access trace data into the Hive data warehouse in the distributed file system also includes:

4. The method according to claim 3, characterized in that

the following formula is used:

Buckets=min (data_total_size/dfs.block.size, map_count)

5. The method according to claim 2, characterized in that

the Internet access trace data includes the authentication information and Internet access information uploaded by DPI devices, the authentication information and Internet access information uploaded by WAP gateways, and the NAT address translation information uploaded by the SYSLOG log server of the firewall.

6. An information query system, characterized by including an interface unit, a query driving unit, a data processing unit, and a distributed file system, wherein:

the interface unit is configured to receive a query statistics request sent by a querying user;

the query driving unit is configured to, after the interface unit receives the query statistics request sent by the querying user, decompose the query statistics request into tasks to obtain corresponding MapReduce tasks, and to provide the obtained MapReduce tasks to the data reading unit;

the data processing unit is configured to read data from the corresponding distributed data storage nodes in the distributed file system according to the obtained MapReduce tasks, perform distributed computation on the data read from each distributed data storage node, merge the computation results of the distributed data storage nodes to obtain a query result, and instruct the interface unit to provide the query result to the querying user;

the distributed file system is configured to store data in a distributed manner, wherein data in the Hive data warehouse in the distributed file system is stored in the RCFile format.

7. The system according to claim 6, characterized by also including a collection unit and a data loading unit, wherein:

8. The system according to claim 7, characterized in that

when creating a data table in the Hive data warehouse, the data loading unit determines the number of buckets according to the number of tasks into which the query statistics request is decomposed and the system capacity.

9. The system according to claim 8, characterized in that

the data loading unit uses the formula

Buckets=min (data_total_size/dfs.block.size, map_count)

10. The system according to claim 7, characterized in that