CN107766572A

CN107766572A - Distributed extraction and visual analysis method and system based on economic field data

Info

Publication number: CN107766572A
Application number: CN201711113558.0A
Authority: CN
Inventors: 刘银; 林杨阳; 刘建华; 丁文豪
Original assignee: Beijing State Xin Hong Number Science And Technology Co Ltd
Current assignee: Beijing State Xin Hong Number Science And Technology Co Ltd
Priority date: 2017-11-13
Filing date: 2017-11-13
Publication date: 2018-03-06

Abstract

This application provides a distributed extraction and visual analysis method and system based on economic field data. The method comprises: a distributed data extraction step, in which a back-end server receives a user's instruction to extract big data and sends it to a master node; the master node, according to the received instruction, splits the large extraction task into small tasks along one or more field dimensions of the task and distributes the small tasks to different processing nodes; each processing node issues requests to a text retrieval system according to its assigned small tasks; and the master node stores the generated small tasks in a database and synchronizes task state during execution; a distributed storage step; a distributed computation and analysis step; a data loading and caching step; and a result visualization step. By these means the application improves the efficiency of big data extraction and lowers the threshold for users to perform big data analysis.

Description

Distributed extraction and visual analysis method and system based on economic field data

Technical field

The present application relates to the technical field of data processing, and in particular to a distributed extraction and visual analysis method and system based on economic field data.

Background technology

In the current era of rapidly expanding data volume, big data has become a dominant buzzword. The big data era has no shortage of quantity; what matters is discovering overall patterns in that quantity, which is the purpose of analyzing big data.

Making use of big data has become a key factor in improving core competitiveness. Decision-making in every industry is shifting from "business-driven" to "data-driven". Big data analysis lets retailers track market trends in real time and respond quickly; it provides decision support for merchants to formulate more precise and effective marketing strategies; it helps enterprises give consumers more timely and personalized service; in the medical field it can improve diagnostic accuracy and drug effectiveness; and in the public utilities field big data has begun to play an important role in promoting economic development and maintaining social stability. Applied to daily life and production, data can help people or enterprises judge information more accurately and take appropriate action. Data analysis is the process by which an organization purposefully collects data, analyzes it, and turns it into information; that is, the process in which an individual or enterprise applies analytical methods to data in order to solve problems such as decision-making or marketing in life and production.

Take macroeconomic field data as an example. Its analysis covers internet content from WeChat official accounts, academic think-tank websites, financial news websites, industry portals and the like, and provides data analysis support for the positioning of government departments' online services, their service strategies, and the influence of internet evaluation on their professional work. With the rapid development of technology, people use big data visual analysis to present data more intuitively and to observe it from different dimensions, enabling deeper observation and analysis.

Big data visual analysis refers to man-machine interaction methods and techniques that, alongside automated big data analysis and mining methods, use user interfaces supporting information visualization and support the analysis process. It effectively combines the computing power of computers with human cognitive ability to gain insight into large, complex data sets.

Designing and implementing a visual analysis platform based on macroeconomic field data currently runs into several technical difficulties. The first is the extraction requirement for macroeconomic field data: hit data must be extracted from a text retrieval system according to configured query conditions (the essential conditions being keyword group + time range). The second is the technical threshold: how can a user who does not understand SQL query statements still perform big data analysis? The last concerns user interaction, presentation effect and presentation mode (customizable display effects, flexibility, and diverse forms of presentation).

At present, the data synchronization tools for heterogeneous big data environments, DataX and Sqoop, both address the problem of data exchange between heterogeneous environments. DataX is a tool for high-speed data exchange between heterogeneous databases and file systems; it implements data exchange between arbitrary data processing systems (RDBMS/HDFS/local filesystem) and was developed by Taobao's data platform department. Sqoop is a tool for transferring data between Hadoop and relational databases: it can import data from a relational database (e.g. MySQL, Oracle, Postgres) into Hadoop's HDFS, and can also export HDFS data into a relational database.

However, these tools have the following problem. The extraction requirement for macroeconomic field data is that hit data be extracted from a text retrieval system according to configured query conditions (keyword group + time range). The biggest difference between macroeconomic field data and general public-opinion data is that the macroeconomic field mainly studies macroeconomic trends and the various factors that influence the macroeconomy. The extracted data sources therefore span a long time, involve many industry sectors, and have a large data volume. Open-source tools such as DataX and Sqoop cannot meet these extraction requirements, so a distributed extraction method needs to be designed to solve this problem.

A distributed system is a set of machines connected by a network that serve upper layers as a single whole. Specifically, a problem that needs massive computing power is split into many small pieces, the pieces are distributed to different compute nodes within the same system for processing, and, where necessary, the separately computed results are merged into a final result; such a system is called a distributed system. A node is an individual program that can independently complete a set of logic according to the distributed protocol; in engineering practice it usually means a process. Nodes are fully independent and isolated from one another, and their only means of communication is an unreliable network.

Hive is a data warehouse tool built on the distributed system Hadoop. It can map structured data files to database tables and provides simple SQL query functionality by converting SQL statements into MapReduce jobs. Its strength is that simple MapReduce statistics can be implemented quickly through SQL-like statements without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse. Spark SQL, part of the Apache Spark big data framework, is mainly used for structured data processing and for running SQL-like queries on Spark data. With Spark SQL, ETL operations can be performed on data in different formats (such as JSON, Parquet, or databases), followed by specific query operations.

However, the user interaction features provided by open-source tools such as Hive and Spark SQL require the user to have some SQL background. To enable analysts without an SQL background to still carry out data analysis, a new distributed visual analysis method needs to be designed.

Summary of the invention

The application provides a distributed extraction and visual analysis method and system based on economic field data, to solve the problems in the prior art that big data extraction is inefficient, that operation is too difficult for users, and that data analysis is not intuitive enough.

The distributed extraction and visual analysis method based on economic field data disclosed in the present application comprises:

Distributed data extraction step: a back-end server receives a user's instruction to extract big data and sends it to a master node; the master node, according to the received instruction, splits the large extraction task into small tasks along one or more field dimensions of the task and distributes the small tasks to different processing nodes; each processing node issues requests to a text retrieval system according to its assigned small tasks; the master node stores the generated small tasks in a database and synchronizes task state during execution;

Distributed storage step: the processing nodes store the data sets returned by the text retrieval system in a database cluster;

Distributed computation and analysis step: the back-end server receives the user's instruction, loads the required data sets from the database cluster accordingly, filters the data, analyzes the data and performs statistical analysis, and then writes the result set to the database cluster;

Data loading and caching step: after receiving a client's request to load data, the back-end server reads the relevant metadata of the task from the database according to the request, creates an in-memory table, loads the data from the database cluster into the in-memory table according to the parameters, and returns the result once loading is complete;

Result visualization step: the data is presented in intuitive forms such as charts.

Preferably, in the distributed data extraction step, the generated small tasks are assigned priorities in a certain proportion; tasks with higher priority run first, and tasks of the same level are executed under a FIFO (first come, first scheduled) policy; tasks of different priorities are given to different processing nodes in proportion according to the configuration parameters of the processing nodes; after the receiving thread of a processing node receives tasks, the scheduling thread adds the received tasks to the task queue using a scheduling algorithm that combines priority scheduling, FIFO scheduling and fair scheduling, and data extraction is performed and data received according to the task parameters.

Preferably, in the distributed computation and analysis step, after an analysis task instruction carrying query parameters is received from the user, the parameters are parsed according to the mapping between table fields and entity fields and spliced into an SQL query statement.

Preferably, in the visualization step, the data needed at the current stage is requested through front-end on-demand loading, and data already requested is cached by a front-end caching mechanism.

Preferably, the visualization step comprises the following sub-steps:

when the user drags an analysis field onto the dimension axis or the value axis, sending a request to the back end to obtain the data corresponding to the field;

displaying the data in tabular form once it has been obtained;

determining and displaying the selectable chart types according to the number of dimension-axis fields and the number of value-axis fields;

displaying the configurable parameters of the chart type selected by the user, and generating and displaying the chart according to the parameters configured by the user.

The distributed extraction and visual analysis system based on economic field data disclosed in the present application comprises:

Distributed data extraction module: for receiving a user's instruction to extract big data and sending it to a master node; the master node, according to the received instruction, splits the large extraction task into small tasks along one or more field dimensions of the task and distributes the small tasks to different processing nodes; each processing node issues requests to a text retrieval system according to its assigned small tasks; the master node stores the generated small tasks in a database and synchronizes task state during execution;

Distributed data storage module: for storing the data sets returned by the text retrieval system in a database cluster;

Distributed data computation and analysis module: for receiving the user's instruction, loading the required data sets from the database cluster accordingly, filtering the data, analyzing the data and performing statistical analysis, and then writing the result set to the database cluster;

Data loading and caching module: for reading, after a client's request to load data is received, the relevant metadata of the task from the database according to the request, creating an in-memory table, loading the data from the database cluster into the in-memory table according to the parameters, and returning the result once loading is complete;

Result visualization module: for presenting the data in intuitive forms such as charts.

Preferably, in the distributed data extraction module, the generated small tasks are assigned priorities in a certain proportion; tasks with higher priority run first, and tasks of the same level are executed under a FIFO (first come, first scheduled) policy; tasks of different priorities are given to different processing nodes in proportion according to the configuration parameters of the processing nodes; after the receiving thread of a processing node receives tasks, the scheduling thread adds the received tasks to the task queue using a scheduling algorithm that combines priority scheduling, FIFO scheduling and fair scheduling, and data extraction is performed and data received according to the task parameters.

Preferably, in the distributed data computation and analysis module, after an analysis task instruction carrying query parameters is received from the user, the parameters are parsed according to the mapping between table fields and entity fields and spliced into an SQL query statement.

Preferably, in the visualization module, the data needed at the current stage is requested through front-end on-demand loading, and data already requested is cached by a front-end caching mechanism.

Preferably, in the visualization module:

(1) when the user drags an analysis field onto the dimension axis or the value axis, a request is sent to the back end to obtain the data corresponding to the field;

(2) the data is displayed in tabular form once it has been obtained;

(3) the selectable chart types are determined and displayed according to the number of dimension-axis fields and the number of value-axis fields;

(4) the configurable parameters of the chart type selected by the user are displayed, and the chart is generated and displayed according to the parameters configured by the user.

Compared with the prior art, the application has the following advantages:

The application provides a distributed data extraction method, and a method based on a big data architecture by which users can perform big data visual analysis without writing SQL query statements. (1) The task splitting and allocation algorithm makes distributed extraction of large data volumes possible; (2) pre-partitioning of database tables, dynamic storage and filter optimization speed up parallel processing; (3) dynamically splicing SQL query statements reduces the difficulty of analysis for users; (4) simple operations in the user interface preserve flexibility for business analysts, and customizable chart presentation makes the visualized analysis results friendlier.

The application is applicable to distributed extraction and visualization of big data, and is particularly suitable for distributed extraction and visualization of economic field data.

Brief description of the drawings

The drawings are only for the purpose of illustrating preferred embodiments and are not to be considered as limiting the application. Throughout the drawings, identical parts are denoted by the same reference numerals. In the drawings:

Fig. 1 is a schematic diagram of a first embodiment of the distributed extraction and visual analysis method based on economic field data of the present application;

Fig. 2 is a schematic diagram of distributed data extraction in the first embodiment of the distributed extraction and visual analysis method based on economic field data of the present application;

Fig. 3 is a schematic diagram of a second embodiment, the distributed extraction and visual analysis system based on economic field data of the present application.

Embodiment

To make the above objects, features and advantages of the application clearer and easier to understand, the application is described in further detail below with reference to the drawings and specific embodiments.

In the description of the application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. "Multiple" means two or more, unless specifically defined otherwise. The terms "comprise", "include" and similar terms are to be understood as open-ended, i.e. "including but not limited to". The term "based on" means "based at least in part on". The term "an embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment". Definitions of other terms are given in the description below.

Referring to Fig. 1, the flow of the first embodiment of the distributed extraction and visual analysis method based on economic field data of the present application is shown. The whole process of data extraction and visual analysis involves data extraction, storage, computation and presentation. All four aspects are indispensable, and together they form an analysis system covering the whole process from extraction to presentation. The main steps are: distributed data extraction, distributed storage, distributed computation and statistical analysis, data loading and caching, and result visualization.

This preferred method embodiment comprises the following steps:

Step S101: an instruction from the user to extract macroeconomic field data is received (for example, the query condition entered by the user is keyword group + time range) and sent to the master node. According to the received instruction, the master node splits the large extraction task into small tasks along one or more field dimensions of the task and distributes the small tasks to different processing nodes. Each processing node issues requests to the text retrieval system according to its assigned small tasks. The master node stores the generated small tasks in a database and synchronizes task state during execution.

Specifically, the text retrieval system is a system comprising a Solr or Elasticsearch full-text search cluster that stores massive data. Elasticsearch is a real-time distributed search and analytics engine built on the full-text search engine library Apache Lucene. It can process large-scale data at high speed and can be used for full-text search, structured search and analytics, or any combination of the three. Solr is the open-source enterprise search platform of the Apache Lucene project. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g. Word, PDF) handling.

As shown in Fig. 2 the framework of distributed data extraction includes three parts：Between host node, processing node, node Communication.Host node and processing node contacts are got up to be formed the entirety for externally providing service by the communication between node.

Host node（master）Function include：The metadata of maintenance and management middle table（Table name, literary name section and type）, Including mapping relations and record information；Generation and distribution task, monitor task running status, the management of history log are abnormal to accuse It is alert；The communication information of calculation procedure is handled, updates task status.

Handle node（slave）Function include：Specifying for task is received and read, efficiently and stably performs task；Deposit Store up in data to data storehouse.

Communication between node is primarily referred to as the socket communications based on tcp agreements and sends with realizing high-performance, receives life Order and data.Wherein, the message of communication needs definition command format standard.Also realize the work(such as heartbeat detection and disconnection reconnecting in addition Energy.

The key point of this step is task cutting and distribution and tasks carrying strategy.Specifically, as shown in Fig. 2 This step includes following fine division step：

(1) Split the task evenly: after the master node receives the user's data extraction instruction (for example, a task to extract macroeconomic field data), it uses the fact that the query condition of the task consists of a keyword group and a time range (start time and end time). Because the time range is a continuous, divisible dimension, the large task is cut into small tasks along the time dimension at equal time intervals (the minimum time granularity, for example seven days). The time ranges of the small tasks are all different, do not overlap, and leave no gaps between tasks; the other dimensions stay the same as in the large task. The ranges of all small tasks together equal the range of the large task.
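The splitting rule of sub-step (1) can be illustrated with a minimal sketch. It assumes day-level granularity and half-open ranges [start, end); the class name `TimeRangeSplitter`, the `SmallTask` type and the sample keyword group are hypothetical and do not come from the application. The sketch also assumes that the last slice may be shorter than the interval when the range is not an exact multiple of it.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not from the application): splits one large task along the time dimension. */
class TimeRangeSplitter {

    /** A small task: same keyword group as the large task, time range [start, end). */
    static final class SmallTask {
        final String keywords;
        final LocalDate start; // inclusive
        final LocalDate end;   // exclusive

        SmallTask(String keywords, LocalDate start, LocalDate end) {
            this.keywords = keywords;
            this.start = start;
            this.end = end;
        }
    }

    static List<SmallTask> split(String keywords, LocalDate start, LocalDate end, int intervalDays) {
        if (intervalDays <= 0 || !start.isBefore(end)) {
            throw new IllegalArgumentException("need intervalDays > 0 and start < end");
        }
        List<SmallTask> tasks = new ArrayList<>();
        LocalDate cursor = start;
        while (cursor.isBefore(end)) {
            LocalDate next = cursor.plusDays(intervalDays);
            if (next.isAfter(end)) {
                next = end; // assumption: the last slice may be shorter than the interval
            }
            tasks.add(new SmallTask(keywords, cursor, next));
            cursor = next; // the next slice starts where this one ends: no gap, no overlap
        }
        return tasks;
    }

    public static void main(String[] args) {
        for (SmallTask t : split("GDP growth", LocalDate.of(2017, 1, 1), LocalDate.of(2017, 1, 20), 7)) {
            System.out.println(t.keywords + " " + t.start + " .. " + t.end);
        }
    }
}
```

Because each slice starts exactly where the previous one ends, the union of the slices equals the range of the large task, which is the property the text requires.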

(2) The generated small tasks are assigned running priorities in a certain proportion. There are five priority levels, namely 1, 3, 5, 7 and 9; the higher the level number, the earlier the task runs. Tasks of the same level are executed under a FIFO (first come, first scheduled) policy.

(3) According to the configuration parameters of the processing nodes (by number of CPU cores), tasks of different priorities are assigned to different nodes in proportion, which completes task assignment. The small tasks generated by the master node are stored in a relational database; during execution their state must be synchronized and logged. If a processing node or the master node fails or goes down, the task information is not lost.
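The text does not spell out the allocation formula, so the following is only one possible reading of "in proportion to CPU cores": each task goes to the node whose load per core would be lowest after receiving it. The class name `CoreWeightedAssigner` and the node names are hypothetical, and priority handling is deliberately left out of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: spreads tasks over nodes in proportion to their CPU core counts. */
class CoreWeightedAssigner {

    /**
     * @param taskIds     ids of the small tasks to place
     * @param coresByNode node name -> number of CPU cores (must be positive)
     * @return node name -> task ids assigned to that node
     */
    static Map<String, List<Integer>> assign(List<Integer> taskIds, Map<String, Integer> coresByNode) {
        if (coresByNode.isEmpty()) {
            throw new IllegalArgumentException("at least one processing node is required");
        }
        Map<String, List<Integer>> plan = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : coresByNode.entrySet()) {
            if (e.getValue() <= 0) {
                throw new IllegalArgumentException("core count must be positive: " + e.getKey());
            }
            plan.put(e.getKey(), new ArrayList<>());
        }
        for (Integer taskId : taskIds) {
            String best = null;
            double bestLoad = Double.MAX_VALUE;
            for (Map.Entry<String, Integer> e : coresByNode.entrySet()) {
                // Load per core if this node also took the current task.
                double load = (plan.get(e.getKey()).size() + 1) / (double) e.getValue();
                if (load < bestLoad) {
                    bestLoad = load;
                    best = e.getKey();
                }
            }
            plan.get(best).add(taskId);
        }
        return plan;
    }
}
```

With a 4-core node and a 2-core node, six tasks end up split 4 to 2. A real master node would additionally persist the plan to the relational database, as the text requires.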

(4) Send and receive tasks: the master node distributes tasks (in batches) to each processing node; each processing node listens on a port and receives the tasks.

(5) After the receiving thread of a processing node receives a batch of tasks, it puts them into a candidate task pool; the scheduling thread uses a scheduling algorithm combining priority scheduling, FIFO scheduling and fair scheduling to select a small number of tasks from the candidate pool and add them to the task queue.

Specifically, with the total number of processing threads fixed, the processing node divides its threads in a certain proportion among tasks of different levels; threads serving the same level form a thread group. Each thread group claims tasks from the task queue of its level and executes them. For example, with high, medium and low priorities in a 3:2:1 ratio, 3 threads process high-priority tasks, 2 process medium-priority tasks and 1 processes low-priority tasks. After receiving tasks, the processing node sorts them by priority; high-priority tasks are added first to the high-level task queue, and lower-priority tasks are added afterwards to the queue of their level. Each queue has a length threshold (default 10). Tasks of the same level are added to the task queue (where they wait for processing) in order of generation time, the earliest first. When the queue is full, the scheduler sleeps for a while (seconds or milliseconds) and then rechecks whether a task can be added, or waits until the queue has a free slot. When several large tasks need to extract data in the same period, they produce small tasks with different generation times. If FIFO scheduling alone were used, the small tasks of some large task (generated earlier than the others) would always be processed first while another large task might go unprocessed for a long time, causing starvation. To prevent starvation, when adding tasks to the task queue the scheduler polls all large tasks that still have unprocessed small tasks, selects a small number of small tasks from them in a certain proportion across the large tasks, and then adds these to the queue.
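The anti-starvation selection described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the scheduler of the application: it keeps a single candidate pool instead of one queue per priority level, takes one small task per large task per turn (a 1:1 proportion), and orders the small tasks of each large task by priority and then by generation time. The names `FairTaskPicker`, `SmallTask`, `offer` and `pick` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch: round-robin over large tasks, priority then FIFO within each large task. */
class FairTaskPicker {

    static final class SmallTask {
        final long parentId;  // id of the large task this small task belongs to
        final int priority;   // higher value runs first (e.g. 1, 3, 5, 7, 9)
        final long createdAt; // generation time, used for FIFO ordering

        SmallTask(long parentId, int priority, long createdAt) {
            this.parentId = parentId;
            this.priority = priority;
            this.createdAt = createdAt;
        }
    }

    // Higher priority first; among equal priorities, earlier generation time first.
    private static final Comparator<SmallTask> ORDER =
            Comparator.comparingInt((SmallTask t) -> -t.priority).thenComparingLong(t -> t.createdAt);

    private final Map<Long, PriorityQueue<SmallTask>> byParent = new HashMap<>();
    private final Deque<Long> turn = new ArrayDeque<>(); // large tasks with pending work, in turn order

    /** Adds a received small task to the candidate pool. */
    void offer(SmallTask task) {
        PriorityQueue<SmallTask> pending = byParent.get(task.parentId);
        if (pending == null) {
            pending = new PriorityQueue<>(ORDER);
            byParent.put(task.parentId, pending);
            turn.addLast(task.parentId);
        }
        pending.add(task);
    }

    /** Picks up to freeSlots tasks, one per large task per turn, so no large task starves. */
    List<SmallTask> pick(int freeSlots) {
        List<SmallTask> picked = new ArrayList<>();
        while (picked.size() < freeSlots && !turn.isEmpty()) {
            Long parent = turn.pollFirst();
            PriorityQueue<SmallTask> pending = byParent.get(parent);
            picked.add(pending.poll());
            if (pending.isEmpty()) {
                byParent.remove(parent);
            } else {
                turn.addLast(parent); // go to the back of the line for the next turn
            }
        }
        return picked;
    }
}
```

In use, the scheduling thread would call `pick` with the number of free slots in the bounded task queue (default length 10 in the text) and sleep or wait when there are none.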

(6) A processing thread takes a task from the queue and performs the data extraction according to the task parameters (sending query statements, paging through results). It receives the business data set returned by the full-text search cluster (raw text data collected from the internet and converted by preprocessing into structured or semi-structured business data) and writes it to the database. When the business data of a task has been fully extracted, the thread proceeds to the next task. A sending thread asynchronously reports the status of each task to the master node.

(7) The master node receives the running-state messages for small tasks, aggregates the small-task states and the extracted data volumes reported by each processing node, and determines whether all small tasks of a large task have completed; if so, it updates the state and total count of the large task.

(8) If the master node receives a new data extraction instruction, return to step (1).

This step meets the requirement of extracting data efficiently, and provides a method and design scheme for distributed extraction systems with similar requirements, giving it strong practical value.

Step S102: the processing nodes store the data sets returned by the text retrieval system in the database cluster, using pre-partitioning of database tables and dynamic storage to speed up parallel processing.

Specifically, the table is pre-partitioned according to a preset number of partitions, and each data record is written to the partition designated by a pseudo-random value derived from its primary key.

Take HBase as an example. One HBase table corresponds to one kind of data source or one function. The content of all fields in an HBase table is stored in JSON format: all fields and values of a record are serialized into a single JSON string stored in one column of the table, which facilitates dynamic parsing. The HBase table is pre-partitioned. By default the row keys of the data of a task are stored with the prefix 0#; when the data volume reaches a certain level, the row key must carry a prefix of the form [1-9]|[a-z]+#.

The HBase row key format is: prefix code + "#" + task id + "@" + record key.

Here the prefix code is a single digit or letter, "#" and "@" are separator symbols, the task id is the primary key of a data extraction task, and the record key is the primary key (MD5) of one record of that task.

For example: a#1000@1d128b617546ee05e126ad0b33381248

The prefix code is generated by the data location formula preCode = remarksCode[hash(key) % partionNum].

The prefix code array is:

String[] remarksCode = {"0","1","2","3","4","5","6","7","8","9","a","b",
    "c","d","e","f","g","h","i","j","k","l","m","n",
    "o","p","q","r","s","t","u","v","w","x","y","z"};

In the data location formula, preCode is the prefix code, remarksCode is the array holding the prefix codes, key is the primary key of the record, partionNum is the number of partitions required, and hash refers to hashing as used in a hash table, a data structure accessed directly by key. A hash table accesses a record by mapping its key to a position in the table, which speeds up lookup. The mapping function is called the hash function, and the array holding the records is called the hash table. The key is converted to an integer by the hash function, the integer is taken modulo the array length, and the remainder is used as the array index at which the value is stored. On lookup, the hash function is applied again to convert the key into the corresponding array index, and the value is fetched from that slot, so the fast indexed access of arrays is fully exploited. Here the hash value of the key is taken directly modulo the number of partitions to locate the prefix code of the partition. The method for computing the hash value can be chosen freely, for example CRC32, MD5, or even a built-in hash such as Java's hashCode. This algorithm is simple and gives a good random distribution.
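The data location formula and the row key format can be sketched together as follows. CRC32 is used only because it is one of the hash choices the text names; the class name `RowKeys` and its method names are hypothetical. The prefix codes are held in a string whose characters are the entries of remarksCode, and partionNum keeps the spelling used in the formula.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical sketch of preCode = remarksCode[hash(key) % partionNum] and the row key layout. */
class RowKeys {

    // The 36 prefix codes of remarksCode ("0".."9", "a".."z"), one character each.
    static final String REMARKS_CODE = "0123456789abcdefghijklmnopqrstuvwxyz";

    /** Returns the one-character prefix code for a record key. */
    static String prefixCode(String key, int partionNum) {
        if (partionNum < 1 || partionNum > REMARKS_CODE.length()) {
            throw new IllegalArgumentException("partionNum must be between 1 and 36");
        }
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is a non-negative long, so the remainder is a valid index.
        int index = (int) (crc.getValue() % partionNum);
        return String.valueOf(REMARKS_CODE.charAt(index));
    }

    /** Builds "prefix code # task id @ record key". */
    static String rowKey(long taskId, String recordKey, int partionNum) {
        return prefixCode(recordKey, partionNum) + "#" + taskId + "@" + recordKey;
    }

    public static void main(String[] args) {
        System.out.println(rowKey(1000L, "1d128b617546ee05e126ad0b33381248", 36));
    }
}
```

The same record key always maps to the same prefix, so a reader that knows the task id, the record key and partionNum can rebuild the row key without a lookup.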

Because the table is pre-partitioned by prefix and records are written to partitions according to the hash of their MD5 primary keys, hotspots (uneven data writes and reads) are effectively avoided.

Because a data source is usually very large (millions of records or more), by default data is stored evenly across 36 partitions (or more; a partition is split in two automatically when its storage exceeds a threshold). The data produced in the analysis stage, however, usually has fewer records than the data source. Since the result data of an analysis task is comparatively small, small data sets can be stored in only some of the partitions rather than in all 36 differently prefixed partitions; this dynamic partition storage benefits subsequent data loading. For example, when the data volume is less than 50,000 records, only 5 partitions are used, so partionNum = 5, and the prefix code computed by the data location formula is one of 0 to 4.
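A minimal sketch of the dynamic choice of partionNum follows. Only the two data points given in the text are encoded (36 partitions by default, 5 partitions below 50,000 records); any finer-grained tiers would be a design decision the text does not specify. The class name `PartitionPlanner` is hypothetical, and Java's hashCode stands in for the freely selectable hash function.

```java
/** Hypothetical sketch: chooses the number of partitions from the expected record count. */
class PartitionPlanner {

    static final int DEFAULT_PARTITIONS = 36;      // default from the text
    static final long SMALL_RESULT_ROWS = 50_000L; // example threshold from the text
    static final int SMALL_RESULT_PARTITIONS = 5;  // example partition count from the text

    /** Small result sets use few partitions; everything else uses the default. */
    static int partitionsFor(long expectedRows) {
        return expectedRows < SMALL_RESULT_ROWS ? SMALL_RESULT_PARTITIONS : DEFAULT_PARTITIONS;
    }

    /** Index into the prefix code array; floorMod keeps negative hash codes in range. */
    static int partitionIndex(String key, int partionNum) {
        return Math.floorMod(key.hashCode(), partionNum);
    }
}
```

With partionNum = 5 the index is always between 0 and 4, so a small result set is read back by scanning five prefixes instead of thirty-six.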

The HBase table design and dynamic storage method applied in this step solve the storage hotspot problem of databases such as HBase and the problem of reading data quickly.

Step S103: The background server receives the user's instruction and loads the required data set from the database cluster accordingly, then filters the data, analyzes the data and performs statistical analysis, and finally writes the result set to the database cluster.

Specifically, the received user instruction conforms to a defined specification. The specification includes: (1) the meta-attributes of a field: field Chinese name, field English name, field data type; (2) the defined data types: date, datetime, long, double, string, text; (3) the defined filter computation methods: equal to, not equal to, greater than, less than, greater than or equal to, less than or equal to, range, is empty, is not empty, contains, does not contain, regular expression, etc.

Specifically, the conditions for filtering data support combined AND, OR and NOT expressions between fields, and also support regular expressions on fields. The purpose of filtering is to remove, after the data are extracted from the distributed database and before analysis starts, the records that do not need to take part in the analysis, which effectively reduces the data set to be analyzed.

Data types are processed in the following priority order:

integer (int) > floating point (double) > date > datetime > string > long text (text).

Integer fields are processed first, and long-text fields are processed last.

Fields are filtered in priority order according to their data types before statistical analysis is performed, which effectively speeds up computation and reduces memory usage.

The reason for priority filtering is that data of different types consume different amounts of CPU when the same filter condition is executed, and when the fields in a filter condition are related by OR or NOT, unnecessary CPU overhead can be avoided. When the filter conditions on field 1 and field 2 are related by OR, a record is a hit (a record that needs to take part in the analysis) as soon as the content of either field satisfies its condition; if none of the conditions is satisfied, the record is filtered out as a miss (a record that does not need to take part in the analysis). A miss requires both comparisons to be made in sequence (field 1 and field 2), whereas a hit may need fewer comparisons than a miss, so a hit also consumes less resources and time. When field 1 has data type long with the filter method "greater than", and field 2 has data type text with the filter method "regular expression", filtering the content of field 1 costs less resources and less time than filtering field 2. When a large volume of data is processed, the accumulated difference in overhead and time becomes significant.

For example, when the same regular expression is applied to a string field and a text field, the content of the string field is shorter than that of the text field, so far fewer CPU clock cycles are needed.

Therefore, filtering fields in priority order by data type before statistical analysis effectively speeds up computation and reduces memory usage.
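The effect of priority filtering on OR-combined conditions can be sketched as follows. The priority table follows the order given in the text; placing long at the same rank as int, and the names `order_conditions`, `matches_any` and `filter_records`, are assumptions made for this illustration.

```python
import re
from typing import Any, Callable, Dict, List, Tuple

# Lower number = cheaper type = evaluated earlier.
TYPE_PRIORITY = {"int": 0, "long": 0, "double": 1, "date": 2,
                 "datetime": 3, "string": 4, "text": 5}

# A condition is (field name, field data type, predicate on the field value).
Condition = Tuple[str, str, Callable[[Any], bool]]


def order_conditions(conditions: List[Condition]) -> List[Condition]:
    """Sort conditions so that cheap data types are tested first."""
    return sorted(conditions, key=lambda c: TYPE_PRIORITY[c[1]])


def matches_any(record: Dict[str, Any], ordered: List[Condition]) -> bool:
    """OR semantics: stop at the first condition that is satisfied."""
    return any(pred(record[field]) for field, _type, pred in ordered)


def filter_records(records: List[Dict[str, Any]],
                   conditions: List[Condition]) -> List[Dict[str, Any]]:
    """Keep the records that hit at least one condition."""
    ordered = order_conditions(conditions)
    return [r for r in records if matches_any(r, ordered)]
```

With a long "greater than" test ordered before a text regular expression, every record that already hits on the numeric field never reaches the regular expression at all.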

Data analysis uses the mapping between business table fields (field English names) and entity fields. Data analysis processes fields of the development language rather than table fields in the database, so the data in the database must be converted into a collection of entity classes. During data exchange (reading data from the database and writing data to it), the fields in JSON format are parsed and the value of each field is assigned to the corresponding field of the entity class. Uniformly specifying the mapping between business table fields and entity fields helps standardize field management. In short, table fields and entity fields correspond one to one.

In a traditional relational database, the most basic SQL query statement, such as SELECT field A, field B, field C FROM table A WHERE field A > 10, consists of three parts: the projection (field A, field B, field C), the data source (table A) and the filter (field A > 10, the condition). These correspond respectively to the result, the data source and the operation of the SQL query. That is, an SQL statement is written in the order result -> data source -> operation, but it is actually executed in the order operation -> data source -> result.

In the complete query statement form "select fields from table where query-condition group by grouping-condition order by sort-condition", the parts "where query-condition", "group by grouping-condition" and "order by sort-condition" are all optional; only "select fields from table" is required. Following this principle of how SQL query statements work, statements that conform to standard SQL query syntax are spliced dynamically.

Specifically, after the background program receives the user's analysis task instruction (with query parameters), it parses the parameters in JSON format (which contain the three parts of the query statement) and splices them into an SQL query statement. During splicing, the program must distinguish which of the user's settings are expressions, which are result columns, which are data sources and which filter conditions exist, and then convert the corresponding parameter content, according to the mapping between table fields and entity fields, into syntactically correct, well-formed statement fragments that together form the complete statement the user intended. Most filter conditions (on fields of the data source) can be moved forward to the filter stage and need not be spliced into the query statement, which effectively reduces the resources used during analysis. The query statement is then executed with the Spark SQL component and the result is written to the HBase database. In this way, the distributed advantages of Spark are fully exploited and the shortcomings of Spark SQL are worked around.
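The dynamic splicing of a query statement from JSON parameters can be sketched as follows. The JSON layout (`source`, `fields`, `filters`, `group_by`, `order_by`), the field mapping `FIELD_MAP` and the operator codes are hypothetical; the source does not specify them. The sketch only builds the statement text and does not execute it with Spark SQL.

```python
import json
from typing import Any, Dict, List

# Hypothetical mapping: business table field -> entity field.
FIELD_MAP: Dict[str, str] = {
    "region_name": "regionName",
    "gdp_value": "gdpValue",
    "stat_year": "statYear",
}
OPERATORS = {"eq": "=", "ne": "<>", "gt": ">", "lt": "<", "ge": ">=", "le": "<="}


def _column(name: str) -> str:
    # Unknown fields raise KeyError, so only mapped names reach the statement.
    return FIELD_MAP[name]


def _literal(value: Any) -> str:
    if isinstance(value, bool):
        raise TypeError("boolean literals are not handled in this sketch")
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def build_query(params_json: str) -> str:
    """Splice 'select ... from ...' plus the optional where/group by/order by."""
    params = json.loads(params_json)
    source = params["source"]
    if not source.isidentifier():
        raise ValueError(f"invalid data source name: {source!r}")
    sql = f"SELECT {', '.join(_column(f) for f in params['fields'])} FROM {source}"
    filters: List[Dict[str, Any]] = params.get("filters", [])
    if filters:
        sql += " WHERE " + " AND ".join(
            f"{_column(c['field'])} {OPERATORS[c['op']]} {_literal(c['value'])}"
            for c in filters)
    if params.get("group_by"):
        sql += " GROUP BY " + ", ".join(_column(f) for f in params["group_by"])
    if params.get("order_by"):
        sql += " ORDER BY " + ", ".join(_column(f) for f in params["order_by"])
    return sql
```

Only the projection and the data source are mandatory here, mirroring the rule that "select fields from table" is the only required part of the statement.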

In this step, the user does not need to write SQL statements. The user only needs to select one or more data sources in the front-end page (data sets returned by data extraction and stored in the database), select the fields to analyze, select one or more fields as the dimensions for grouped statistics, and optionally select filter conditions (regular expressions are supported) to filter the data. Analysis and statistics are then performed and the result is shown in near real time, which lowers the threshold of big data analysis.

In addition, the data filtering method of this step solves complex data filtering problems.

Step S104: After the background server receives the client's request to load data, it parses the parameters, reads the associated metadata of the task from the relational database, creates a memory table, loads the data from the database cluster into the memory table according to the parameters, and returns the result once loading is complete.

Specifically, when the background server receives the show-result command sent by a client user, it queries, through the JDBC interface, the metadata of the result set of a given analysis task (whose analysis has completed) in the relational database. It then queries the result set data of that task in the HBase cluster through the HBase client interface. After the result set is returned, the web server parses the JSON data according to the metadata and sends them to the front end.

Specifically, this step also adopts a metadata caching technique, a distributed data loading method, an interface that provides a query progress bar, and a preloading technique.

The metadata caching technique addresses the limited memory of the server. Because not all data can be kept in memory, a least-recently-used algorithm retains a certain number of data sets, or the data sets accessed within a certain period of time. When a user sends repeated requests, the server checks whether the data and the metadata (the table information) have been deleted, and reloads the data if they have. The server periodically checks whether any data meet the deletion condition; if so, it starts a processing thread to perform the deletion task and deletes the metadata and the corresponding data within a transaction, which ensures that no abnormal conditions occur when clients access the data. Caching the metadata in the program's memory improves the speed of accessing metadata.
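A minimal sketch of such a cache, combining a least-recently-used limit with an idle-time limit. The class name `DataSetCache`, its parameters and the `loader` callback are assumptions for illustration; the transactional deletion and the background thread mentioned in the text are not modeled.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, List, Tuple


class DataSetCache:
    """Keeps at most max_entries data sets, dropping the least recently used."""

    def __init__(self, max_entries: int = 3, max_idle_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_entries = max_entries
        self._max_idle = max_idle_seconds
        self._clock = clock
        # task_id -> (last access time, metadata, data); oldest entry first.
        self._entries: "OrderedDict[str, Tuple[float, Any, Any]]" = OrderedDict()

    def get(self, task_id: str,
            loader: Callable[[str], Tuple[Any, Any]]) -> Tuple[Any, Any]:
        """Return (metadata, data), reloading if the entry is gone or stale."""
        now = self._clock()
        entry = self._entries.pop(task_id, None)
        if entry is None or now - entry[0] > self._max_idle:
            metadata, data = loader(task_id)
        else:
            _, metadata, data = entry
        self._entries[task_id] = (now, metadata, data)  # most recent at the end
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)           # evict least recent
        return metadata, data

    def evict_expired(self) -> List[str]:
        """Periodic cleanup: delete entries idle longer than the limit."""
        now = self._clock()
        stale = [k for k, (t, _, _) in self._entries.items()
                 if now - t > self._max_idle]
        for key in stale:
            del self._entries[key]
        return stale
```

The metadata and the data are stored and removed together in one entry, so a reader never sees metadata for a data set that has already been dropped.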

The distributed data loading method works as follows. When the metadata of a task indicate multiple partitions (that is, the number of partitions of the HBase table in which the task's data are stored), Spark reads the data from the HBase table in a distributed manner (reading the specified data of each partition) and then loads the data in a distributed manner into the memory table of the relational database; this method is fast and highly concurrent. When there is only one partition (the data volume is small), the data are loaded by a local program. The reason for not using Spark in that case is that when Spark requests the metadata of the HBase table, connecting to ZooKeeper takes a certain amount of time, and communication between Spark nodes also takes time, which hinders completing the data loading task quickly and efficiently. When the background program of the server starts, the Spark context is prepared in advance (each compute node has been allocated resources and the executor processes have been started), which speeds up the first data load.

The interface that provides the query progress bar feeds back the progress of the loading process in real time.

Preloading means loading the data into the memory table in advance, before the chart function is used, rather than loading the data only when they are needed.

Step S105: The data are presented in intuitive forms such as charts.

The focus of macro big data visualization is to present digitized economic information to data users through clear and simple charts, tables and other forms of presentation. Two points need attention in this process: first, how to request and display large volumes of data quickly; second, how to handle the different presentation needs of various data users, so that the existing statistics can be displayed flexibly. To meet these two needs, the front-end visualization adopts a large-data caching mechanism and a flexible visualization method based on dynamic dragging.

1. Large-data caching mechanism

All data requested for big data visualization are returned in JSON form. During a request, the front end on the one hand requests only the data needed at the current stage through on-demand loading, which reduces the time spent on the request and improves loading speed. On the other hand, the front-end caching mechanism caches the data that have already been requested, which reduces the number of repeated requests within a certain period of time and thereby improves data loading speed.

Shortcomings of existing basic caching: the browser itself applies a cache policy to HTTP requests, but this way of caching has two defects:

(1) Only GET requests can be cached.

(2) The cache settings are all specified in the headers of the back-end response. Much business code logic is now concentrated in the front end, which makes this way of caching difficult for front-end developers to use.

The web front-end caching mechanism used in this embodiment has several aspects:

(1) Locally loaded js files are cached. On the one hand, js data are cached locally by setting the parameters cache: true and dataType: "script" in the jQuery ajax method. On the other hand, the Application Cache mechanism is used in addition to the browser cache. It caches in units of files, and the files have an update mechanism. The specific method is to reference a file ending in appcache through the manifest attribute at the top of the HTML document. AppCache has two key points: the manifest attribute and the manifest file. The HTML file that references the manifest file and the files listed in the manifest file to be cached are all eventually cached by the browser.

(2) Requested data are kept in the front-end page by setting hidden fields in the page. Subsequent requests within a certain period of time process data in real time by reading the data in the hidden fields, which reduces requests to the back end.

(3) Locally requested files are compressed and compiled to reduce file size.

The above caching mechanisms on the one hand reduce redundant data transfer and save traffic, and on the other hand relieve momentary congestion and reduce the load on the origin server.
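The idea of reusing already requested data for a certain period of time can be sketched, independently of the browser, as a small time-limited cache. This is a language-neutral illustration written in Python rather than front-end js; the class name `RequestCache`, the `ttl_seconds` parameter and the `request` callback are assumptions and do not appear in the source.

```python
import time
from typing import Any, Callable, Dict, Tuple


class RequestCache:
    """Reuses a response for ttl_seconds instead of repeating the request."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (time, value)

    def fetch(self, key: str, request: Callable[[], Any]) -> Any:
        """Return the cached value if still fresh, else call request()."""
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        value = request()
        self._store[key] = (now, value)
        return value
```

Unlike the browser's own HTTP cache, a cache of this kind is controlled entirely by the calling code, so it can cover any kind of request, not only GET.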

2. Flexible visualization method based on dynamic dragging

Data visualization is based on the dimensions and the values of the data to be presented. Every data source to be visualized has multiple corresponding fields. During visualization, the field to be analyzed is dragged to the dimension axis or the value axis. When the drag completes, the information of that field is sent to the back end, the data to be displayed for that field are requested in real time, and the data are first parsed and presented in the page as a table, which gives the data user a basic visualization. Each field drag retrieves the information of that field in real time.

Fields dragged to the dimension axis or the value axis go through a data-type check. The existing field data types are date, datetime, long, double, string and text. Fields of different types have different drop-down options that are specific to the field. By requesting verification data in real time during dragging, this method greatly reduces the amount of data processing. It can dynamically and flexibly display different table styles for different data types.

To present the serialized data in various styles, such as line charts, bar charts, stacked charts, pie charts and maps, different visualizations are realized by dynamically configuring options through js on top of a visualization control. This method uses the echart visualization control during visualization and adds multiple configurable options to the front-end page, so that users can flexibly select the functions they want without writing code, which broadens the group of potential users. While customizing the configuration, users can choose for themselves and remove redundant displays, which yields a clear visualization.

After data charts are stored in a dashboard, they form a thematic visualization. Each visualized chart is an independent module that can be laid out arbitrarily by drag and drop, with different layouts according to what the user wants to emphasize in the economic data. This method dynamically computes the position of the dragged module relative to the browser and the width and height of the element itself, calculates the storage position after dragging, and sends a real-time request to save the data when the mouse is released.

The chart editing steps are as follows:

(1) Drag the analysis field to the dimension axis or the value axis. During the drag, a field check is requested and the available drop-down options are listed. When the drag completes and the mouse is released, a request is sent to obtain the data corresponding to the field.

(2) After the JSON data are obtained, they are converted into table rows and columns and displayed in table form. The table is the most basic form of visualization.

(3) After the fields to be analyzed are chosen, the number of dimension-axis fields and the number of value-axis fields determine which chart types can display the data at this stage; the applicable chart forms are highlighted to show that they can be used. After the user selects one presentation type, the configurable parameters for that form appear, for example the maximum and minimum of an axis, whether to add auxiliary lines, whether to add a zoom function, and whether to add labels.

(4) The generated chart is saved to the dashboard of the folder that contains the data source, forming a data visualization chart. The saved data are stored in the back-end database. A generated visualization chart can also be modified; on modification, the display data of the chart are requested back by data ID and reproduced in the chart editing work area.
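Step (3) above, deciding which chart forms to offer from the number of dimension-axis and value-axis fields, can be sketched as follows. The concrete rules in `available_charts` are purely hypothetical, since the source does not state which field counts enable which chart; only the idea of choosing from the two counts, with the table always available as the basic form, comes from the text.

```python
from typing import List


def available_charts(dimension_fields: int, value_fields: int) -> List[str]:
    """Return the chart forms to highlight for the given field counts."""
    if dimension_fields < 0 or value_fields < 0:
        raise ValueError("field counts must not be negative")
    charts = ["table"]  # the table is always available as the basic form
    if dimension_fields == 1 and value_fields == 1:
        # Assumed rule: one category axis and one series.
        charts += ["line", "bar", "pie"]
    elif dimension_fields == 1 and value_fields > 1:
        # Assumed rule: several series over one category axis.
        charts += ["line", "bar", "stacked"]
    return charts
```

A front end could call such a function after every drag and enable only the returned chart buttons.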

This step has the following advantages:

(1) Different visualized charts can be generated for different fields of one data source, showing different points of emphasis in the analysis of the economic field.

(2) Chart modules can be dragged freely and laid out arbitrarily, highlighting the emphasis of the visualization theme.

(3) The flexible visualization layout is stored in real time, which reduces the user's operating steps and is convenient and quick.

(4) The threshold of big data analysis is lowered.

For simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions, but those skilled in the art should know that the application is not limited by the described order of actions, because according to the application some steps may be performed in another order or simultaneously. Those skilled in the art should also know that the above method embodiments are preferred embodiments, and the actions and modules involved are not necessarily required by the application.

Referring to Fig. 3, a structural block diagram of an embodiment of the distributed extraction and visual analysis system of the application based on economic field data is shown, including:

Module 301, distributed data extraction module: receives the user's instruction to extract big data and sends it to the master node. According to the received instruction, the master node cuts the big data-extraction task into small tasks along one or more field dimensions of the task. The generated small tasks are assigned priorities in a certain proportion; the higher the priority of a task, the earlier it runs, and tasks of the same level are executed under a first-in-first-out (FIFO) scheduling policy. According to the configuration parameters of the processing nodes, tasks of different priorities are distributed proportionally to different processing nodes. After the receiving thread of a processing node receives a task, the scheduling thread adds the received task to the task queue using a scheduling algorithm that combines priority scheduling, FIFO scheduling and fair scheduling, and, according to the parameters of the task, initiates a request to the text retrieval system and receives the data. The master node stores the generated small tasks in the database and synchronizes the task state during running;

Module 302, distributed data storage module: stores the data sets returned by the text retrieval system in the database cluster;

Module 303, distributed data computation and analysis module: receives the user's analysis task command with query parameters, parses the parameters according to the mapping between table fields and entity fields and splices them into an SQL query statement, loads the required data set from the database cluster according to the SQL query statement, then filters the data, analyzes the data and performs statistical analysis, and writes the result set to the database;

Module 304, data loading and caching module: after receiving the client's request to load data, parses the parameters, reads the associated metadata of the task from the relational database, creates a memory table, loads the data from the database into the memory table according to the parameters, and returns the result once loading is complete;

Module 305, result visualization display module: requests the data needed at the current stage through front-end on-demand loading, caches the data that have already been requested through the front-end caching mechanism, and performs the following functions:

(2) After the data are obtained, they are displayed in table form;

It should be noted that the above device embodiment is a preferred embodiment, and the units and modules involved are not necessarily required by the application.
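The task handling described for module 301, cutting a large extraction task into small tasks and running higher-priority tasks first with FIFO order among equal priorities, can be sketched as follows. The names `split_task` and `TaskQueue` and the dictionary layout of a task are assumptions for illustration; fair scheduling and the distribution of tasks across processing nodes are not modeled.

```python
import heapq
import itertools
from typing import Any, Dict, List, Tuple


def split_task(field: str, values: List[str]) -> List[Dict[str, str]]:
    """Cut one large task into small tasks along a single field dimension."""
    return [{"field": field, "value": v} for v in values]


class TaskQueue:
    """Higher priority runs first; equal priorities run in arrival order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._arrival = itertools.count()  # tie-breaker that preserves FIFO

    def add(self, task: Any, priority: int) -> None:
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._arrival), task))

    def pop(self) -> Any:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

For example, splitting a task by a hypothetical `year` field yields one small task per year, and each can be queued with its own priority before the processing thread pops them in order.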

The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be cross-referenced. Because the device embodiment of the application is substantially similar to the method embodiment, its description is relatively brief; for related parts, refer to the description of the method embodiment. The device embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

The ... method and apparatus provided by the application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the above embodiments is only intended to help understand the method of the application and its core idea. Meanwhile, for those of ordinary skill in the art, the specific implementation and the scope of application may change according to the idea of the application. In summary, the content of this specification should not be construed as limiting the application.

Claims

1. A distributed extraction and visual analysis method based on economic field data, characterized by comprising:

2. The distributed extraction and visual analysis method based on economic field data according to claim 1, characterized in that, in the distributed data extraction step, the generated small tasks are assigned priorities in a certain proportion; the higher the priority of a task, the earlier it runs, and tasks of the same level are executed under a first-in-first-out (FIFO) scheduling policy; according to the configuration parameters of the processing nodes, tasks of different priorities are distributed proportionally to different processing nodes; after the receiving thread of a processing node receives a task, the scheduling thread adds the received task to the task queue using a scheduling algorithm that combines priority scheduling, FIFO scheduling and fair scheduling, performs the data extraction operation according to the parameters of the task, and receives the data.

3. The distributed extraction and visual analysis method based on economic field data according to claim 1 or 2, characterized in that, in the distributed computation and analysis step, after the user's analysis task instruction with query parameters is received, the parameters are parsed according to the mapping between table fields and entity fields and spliced into an SQL query statement.

4. The distributed extraction and visual analysis method based on economic field data according to claim 1 or 2, characterized in that, in the visual presentation step, the data needed at the current stage are requested through front-end on-demand loading, and the data that have already been requested are cached through the front-end caching mechanism.

5. The distributed extraction and visual analysis method based on economic field data according to claim 1 or 2, characterized in that the visual presentation step comprises the following sub-steps:

After the data are obtained, they are displayed in table form;

6. A distributed extraction and visual analysis system based on economic field data, characterized by comprising:

7. The distributed extraction and visual analysis system based on economic field data according to claim 6, characterized in that, in the distributed data extraction module, the generated small tasks are assigned priorities in a certain proportion; the higher the priority of a task, the earlier it runs, and tasks of the same level are executed under a first-in-first-out (FIFO) scheduling policy; according to the configuration parameters of the processing nodes, tasks of different priorities are distributed proportionally to different processing nodes; after the receiving thread of a processing node receives a task, the scheduling thread adds the received task to the task queue using a scheduling algorithm that combines priority scheduling, FIFO scheduling and fair scheduling, performs the data extraction operation according to the parameters of the task, and receives the data.

8. The distributed extraction and visual analysis system based on economic field data according to claim 6 or 7, characterized in that, in the distributed data computation and analysis module, after the user's analysis task instruction with query parameters is received, the parameters are parsed according to the mapping between table fields and entity fields and spliced into an SQL query statement.

9. The distributed extraction and visual analysis system based on economic field data according to claim 6 or 7, characterized in that, in the visual presentation module, the data needed at the current stage are requested through front-end on-demand loading, and the data that have already been requested are cached through the front-end caching mechanism.

10. The distributed extraction and visual analysis system based on economic field data according to claim 6 or 7, characterized in that, in the visual presentation module,

(2) After the data are obtained, they are displayed in table form;