CN105205104A - Cloud platform data acquisition method - Google Patents

Info

Publication number
CN105205104A
Authority
CN
China
Prior art keywords
data
query
user
index
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510531172.6A
Other languages
Chinese (zh)
Inventor
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Original Assignee
BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority to CN201510531172.6A
Publication of CN105205104A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cloud platform data acquisition method. The method comprises: integrating multiple query methods in a distributed environment, with unstructured query and structured data query both serving as execution units, so as to provide the user with a unified query interface; converting the user's query request into formats that the multiple member query methods can recognize; and finally returning the query results to the user in a given format. The cloud platform financial data query method provided by the invention overcomes the shortcomings of conventional structured data query in flexibility and practicality, lowers the technical threshold for non-specialists to query a database, and makes better use of the value of business data.

Description

A cloud platform data acquisition method
Technical field
The present invention relates to financial data processing, and in particular to a cloud platform data acquisition method.
Background
Financial data is an important basis on which investors make investment decisions and on which the investment research departments of securities firms conduct research. Providing corporate clients and investment research departments with timely, accurate, and easy-to-use financial data has long been one of the arduous challenges facing the relevant departments. With the growth of networked information and the arrival of the big data era, financial data now contains large amounts of both structured and unstructured information, and the volume of new data is huge. While cloud computing technology advances rapidly, databases must be set up as carriers to store this data so that useful information is not lost. However, data retrieval in the cloud computing environment currently suffers from inconsistent specifications for the data being retrieved. This leads to differing interpretations of the retrieved content, and the resulting divergence in requirements leads to non-standard functional design, which directly hampers vertical integration between upper-level and lower-level applications. Existing query methods also manage new and changed requirements poorly, and when data structures are extended or changed it is difficult for them to extend the boundary of query applications.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a cloud platform data acquisition method for data retrieval and query in a cloud-computing-based financial data retrieval system, comprising:
integrating multiple query methods in a distributed environment, with unstructured query and structured data query both serving as execution units, so as to provide the user with a unified query interface; converting the user's query request into formats that the multiple member query methods can recognize; and finally returning the query results to the user in a given format.
Preferably, for the unstructured query, the cloud-computing-based retrieval system provides resource management, data integration, and index storage, and an unstructured data query service system is built on it. The system adopts the open-source Hadoop framework and relies on the ZooKeeper mechanism for distributed coordination and for maintaining cluster metadata and configuration. The retrieval layer provides index update, index deletion, query, word segmentation, index library, and external interface modules; the data collection layer provides management modules for infrastructure and data resources; the inter-layer interface coordinates data interaction and service delivery between the two layers. The index library is designed according to the business format standard. Document content is divided by manual preprocessing to generate the text segments corresponding to different key terms, which serve as the original input for building the index library. Interface functions provided by open-source Servlet technology are used to create, add to, update, delete from, and query the index, forming an inverted index from user-input keyword to key term to document, and an HTTP calling interface is provided externally through customized secondary development.
For the structured query, keyword query is applied to a relational database. The database structure is modeled, and a graph is used to characterize the topology of the database, forming a structured data schema graph, so that the query problem is converted into a graph query problem. The structured data schema graph is an undirected graph G=(V, E), where V is the set of vertices, each vertex corresponding to a relation table in the database, and E is the set of edges, each edge corresponding to a foreign key relationship between data tables. The concrete query process comprises:
Step 1: Create the node index table. The node index table characterizes the index structure of the keywords contained in each vertex of the structured data schema graph. It is created as follows: the fields of each row of a data table (relation table) are concatenated into a document, keywords are extracted from the document, and an inverted index from keyword to table name and column name is formed;
Step 2: Locate relation tables by keyword. For each keyword entered by the user, the node index table is queried to locate the vertices of the schema graph that contain the keyword;
Step 3: Perform the data query centered on the keywords. Taking the vertices located in step 2 as centers, expand them to generate candidate data query patterns; each query pattern is a subgraph of the structured data schema graph and contains all the keywords. Query patterns are expanded by breadth-first traversal, as follows:
1) Define queues Q and V, and add all generated center nodes to Q and V as initial patterns;
2) Take a pattern P out of Q and add the associated patterns {P_1, P_2, ..., P_n} of P to Q and V, where each associated pattern P_i (i=1, 2, ..., n) satisfies the following conditions: (a) |P_i| = |P| + 1, where |P_i| is the number of vertices contained in P_i; (b) P_i is a connected graph and is not already present in V;
3) Traverse the patterns in Q in turn until Q is empty, and select as output the query patterns that satisfy the following conditions:
(a) the output pattern contains all the keywords;
(b) every leaf vertex contains at least one keyword;
(c) the number of vertices in the output pattern is less than a predetermined maximum S_max;
4) Assemble Structured Query Language (SQL) statements from the query patterns. An SQL query statement is assembled for each candidate query pattern: the node index table is queried with the user's keywords to obtain the table name and column name information, which is written into the SQL statement, and SQL is then used to query the database and return the query results.
Preferably, the financial data retrieval system comprises a business server, an application server, a data server, an integration server, and the individual databases. The business server retrieves information by calling the application server and uses the data to provide push services. The application server indexes and maintains the data in a unified way. The integration server integrates structured and unstructured data, uses a duplicate-checking mechanism and data push technology to classify, aggregate, and normalize the data, and provides information services to users through front-end pages and to the business server through protocol interfaces;
The integration server integrates the financial data scattered across the various database systems, file systems, and the internet, collects and cleans the data, and, following a data integration strategy based on business domains, integrates the data from the different source entities to form the data server. The main process of the data integration service is as follows: a query request is first delivered to the data extraction module in XML Schema format; the data extraction module converts the XML into SQL query statements and extracts data according to the query results; the extracted result set is then converted into XML and passed to the integration processing module. Unstructured data must likewise be converted into XML. The integration processing module then integrates the XML documents and finally generates the unified data server;
A text duplicate-checking mechanism based on paragraph topics is used: the topic information of the text data is used to compare similarity, so that financial data with the same topic and the same content can be grouped. One feature value is produced from each paragraph of a text, so that a text is represented as a set of feature values based on paragraph topics. The similarity of two texts is calculated by comparing their paragraph feature values, and when the final similarity exceeds a set threshold the texts are considered duplicates and are deduplicated. The overall structure of the data duplicate-checking framework comprises three parts: the duplicate-checking component, duplicate-checking management, and duplicate-checking result analysis. In the duplicate-checking component, a semantic analysis engine performs word segmentation on the data content, and a feature value generator generates the feature values of the data from the segmentation result. Each 64-bit feature value is divided equally into 4 groups by the same rule for index storage. During feature value comparison, dimensionality reduction is first applied to the calculation, and the comparison results are computed for which the Hamming distance between the data's feature value and the feature values in the feature value library is greater than or equal to 3. Duplicate-checking management records logs of the duplicate-checking results and allows the results to be inspected;
In addition, the data push system of the retrieval system uses a push algorithm based on user behavior clustering to provide personalized data push services. A bipartite relation between users and data is established, and the similarity of user behavior is used to discover the objects each user is potentially interested in, which are then pushed in a personalized way. The data push system consists of three parts: a user behavior logging module, a user preference model analysis module, and a push algorithm module. The behavior logging module records the user's behavior at each business contact point, including page dwell time, click sequences, content browsing records, the user's personal information and transaction history (from the centralized trading system), and market-quote browsing history (from the market data system). The user preference model analysis module analyzes the user behavior logs, computes and scores user attributes from multiple angles, builds a multi-attribute description for each user, and uses business knowledge and data mining tools to cluster users by their attribute ratings so that users with similar behavior patterns are grouped together. The push algorithm module uses a combined algorithm to compute, in real time and from the data server, each user's degree of interest in each data item according to the classified and graded user model, and returns the items to the business front end for consolidated display.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a cloud platform financial data acquisition method that overcomes the shortcomings of traditional structured data query in flexibility and practicality, lowers the technical threshold for non-specialists to query a database, and makes better use of the value of business data.
Brief description of the drawings
Fig. 1 is a flowchart of the cloud platform data acquisition method according to an embodiment of the present invention.
Detailed description
A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details.
Fig. 1 is a flowchart of the cloud platform data acquisition method according to an embodiment of the present invention. The cloud-computing-based financial data retrieval system of the invention mainly comprises the following parts: a business server, an application server, a data server, an integration server, and the individual databases. The business server retrieves information by calling the application server and uses the data to provide push services. The application server has retrieval and indexing capabilities and indexes and maintains the data in a unified way. The integration server is able to integrate structured and unstructured data; it uses a duplicate-checking mechanism and data push technology to classify, aggregate, and normalize the data, and provides information services to users through front-end pages and to the business server through protocol interfaces.
The integration server integrates the financial data scattered across the various database systems, file systems, and the internet, collects and cleans the data, and, following a data integration strategy based on business domains, integrates the data from the different source entities to form the data server. The main process of the data integration service is as follows: a query request is first delivered to the data extraction module in XML Schema format; the data extraction module converts the XML into SQL query statements and extracts data according to the query results; the extracted result set is then converted into XML and passed to the integration processing module. Unstructured data must likewise be converted into XML. The integration processing module then integrates the XML documents and finally generates the unified data server.
Financial industry systems hold very large volumes of data and place very high demands on data security. The Hadoop framework uses the Hadoop Distributed File System (HDFS) as its underlying storage. HDFS provides a highly fault-tolerant, high-throughput solution for mass data storage, and features such as dynamic capacity expansion without downtime and automatic data checking and replication give the platform a way to handle big data access and to keep data highly secure. Because HDFS stores files in blocks, when system capacity is expanded the system's distribution algorithm can migrate data blocks and upgrade capacity automatically, with no system downtime or manual maintenance. HDFS's data self-replication strategy and its automatic data consistency checking meet the high security requirements for the data. HDFS's optimized resource allocation and multi-replica access mechanism greatly increase the system's data read rate, and its access performance for a single data block is several times that of conventional storage schemes.
The HDFS data storage model of the platform is divided, from top to bottom, into three layers: DaaS, PaaS, and SaaS. (1) DaaS (Data as a Service) is mainly used for data storage and retrieval; it exploits the flexibility, low latency, and distributed nature of HDFS to provide data services externally once the data of the data server has been normalized. (2) PaaS (Platform as a Service) is mainly used for access to data and files and supports secondary development; unified authentication is handled by an LDAP server, and the platform uses the JDBC data access interface to shield the business server from the differences between heterogeneous DBMSs. (3) SaaS (Software as a Service) uses client-layer virtualization technology to implement a centralized transaction log storage and analysis system, a historical quote data management and retrieval platform, and similar systems, providing multi-tenant, extensible software services externally.
The volume of financial data integrated by the retrieval system is very large, and much of it consists of the same data processed from different angles by different information publishers, research entities, and industry media. The platform therefore faces the challenges of low retrieval efficiency and large amounts of duplicate, redundant information in retrieval results. To make information more efficient and convenient to use and to improve the user experience, big data processing techniques such as full-text retrieval and data duplicate checking are used to provide users with comprehensive and accurate information retrieval services.
At present, unstructured information accounts for more than 80% of all information in financial data, and the field-based search of traditional relational databases is inherently ill-suited to unstructured information, especially at massive scale. Full-text search technology is used to handle unstructured information: based on the open-source Lucene framework, a text retrieval system is built through customized development of the Lucene core layer and the related interfaces.
The retrieval system has the Lucene full-text query method at its core and can be divided functionally into three parts: indexing, querying, and maintenance. The indexing part processes the data stored in the database and builds the index structure; the query part receives the retrieval requests submitted by the front-end system and searches the index; the maintenance part handles index maintenance such as additions, modifications, and deletions. The whole retrieval system is implemented as follows: documents are preprocessed; the text is segmented into words and a document index is created (for Chinese word segmentation, Lucene uses bigram segmentation); and a query function is provided, which queries the index built with Lucene.
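The indexing and query steps described above can be illustrated with a minimal sketch in plain Python. It does not use Lucene; the function names `bigram_tokens`, `build_inverted_index`, and `search`, as well as the sample documents, are assumptions introduced only for illustration. The sketch splits text into overlapping two-character tokens (bigram segmentation) and keeps an inverted index from token to document ids.

```python
from collections import defaultdict

def bigram_tokens(text):
    """Split text into overlapping two-character tokens (bigram segmentation)."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

def build_inverted_index(docs):
    """Map each token to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in bigram_tokens(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

def search(index, query):
    """Return the ids of documents that contain every bigram of the query."""
    tokens = bigram_tokens(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)

# Hypothetical sample documents (not from the patent).
docs = {1: "金融数据检索", 2: "数据查重", 3: "金融市场"}
index = build_inverted_index(docs)
print(search(index, "数据"))
```

A production system would delegate tokenization, scoring, and index storage to Lucene itself; the sketch only shows why an inverted index makes keyword lookup a dictionary access rather than a scan of every document.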
The Lucene framework adopted by the system comprises two parts: the Lucene core module and the customized development module. The Lucene core module comprises the index/search layer, the storage layer, and the inverted index file layer. The inverted index stores, for full-text queries, the mapping from a word to its positions in a document or a group of documents, and is the core technique by which Lucene achieves fast queries. The customized development module built on the core layer comprises the lexical analysis layer, the text parsing layer, and the application layer. The text parsing layer parses documents of different formats with the appropriate document parsers to obtain text content that is easy to process. The lexical analysis layer divides the text into words and selects suitable words for indexing; a corresponding Chinese analyzer is needed for Chinese retrieval.
To obtain better retrieval results, the system also performs duplicate checking on the various kinds of financial data loaded each day. Because more efficient duplicate checking matters greatly for the performance of the retrieval system and for the user experience, the present invention adopts a new duplicate-checking framework and proposes a text duplicate-checking mechanism based on paragraph topics: the topic information of the text data is used to compare similarity, so that financial data with the same topic and the same content can be grouped and duplicate checking is further improved. The mechanism takes full account of the structure of the text and the distribution of its features. One feature value is produced from each paragraph, so a text can be represented as a set of feature values based on paragraph topics. For the same text, such a set carries more information than a single feature value, and this information amplifies the differences between texts when the Hamming distances between feature value sets are calculated, which improves the accuracy of text similarity judgments. The overall steps of the duplicate-checking method are: extract a paragraph feature value for each paragraph according to the paragraph topics of the text; calculate the similarity of two texts by comparing their paragraph feature values; and, when the final similarity exceeds a set threshold, treat the texts as duplicates and deduplicate them.
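The following sketch illustrates one possible reading of this paragraph-level comparison. The text does not specify how feature values are generated or how paragraph comparisons are combined into a text similarity, so several choices here are assumptions: a SimHash-style 64-bit fingerprint per paragraph, whitespace tokenization in place of Chinese word segmentation, a paragraph match defined as a Hamming distance of at most 3, and similarity defined as the fraction of matched paragraphs. All function names are hypothetical.

```python
import hashlib

def simhash64(tokens):
    """Compute a 64-bit SimHash-style fingerprint from a list of tokens."""
    weights = [0] * 64
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    value = 0
    for bit in range(64):
        if weights[bit] > 0:
            value |= 1 << bit
    return value

def hamming(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")

def paragraph_fingerprints(text):
    """One fingerprint per non-empty paragraph (paragraphs split on newlines)."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    return [simhash64(p.split()) for p in paragraphs]

def text_similarity(text_a, text_b, max_distance=3):
    """Fraction of paragraphs in text_a that have a near match in text_b (assumed measure)."""
    fa = paragraph_fingerprints(text_a)
    fb = paragraph_fingerprints(text_b)
    if not fa or not fb:
        return 0.0
    matched = sum(1 for x in fa if any(hamming(x, y) <= max_distance for y in fb))
    return matched / len(fa)

def is_duplicate(text_a, text_b, threshold=0.8):
    """Treat two texts as duplicates when their similarity reaches the threshold."""
    return text_similarity(text_a, text_b) >= threshold
```

Both the threshold of 0.8 and the distance bound are placeholder values; in practice they would be tuned on labeled duplicate pairs.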
The overall structure of the data duplicate-checking framework comprises three parts: the duplicate-checking component, duplicate-checking management, and duplicate-checking result analysis. In the duplicate-checking component, a semantic analysis engine performs word segmentation on the data content, and a feature value generator generates the feature values of the data from the segmentation result. Each 64-bit feature value is divided equally into 4 groups by the same rule for index storage. During feature value comparison, dimensionality reduction is first applied to the calculation according to the pigeonhole principle, and the comparison results are computed for which the Hamming distance between the data's feature value and the feature values in the feature value library is greater than or equal to 3. Duplicate-checking management records logs of the duplicate-checking results and allows the results to be inspected.
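A minimal sketch of the four-group index follows. It rests on one assumption that should be stated plainly: the sketch treats two feature values as near-duplicates when their Hamming distance is at most 3, because that is the case the pigeonhole principle supports (if two 64-bit values differ in no more than 3 bits, at least one of their four 16-bit groups must be identical). The class and function names are hypothetical.

```python
from collections import defaultdict

BLOCKS = 4
BLOCK_BITS = 16
BLOCK_MASK = (1 << BLOCK_BITS) - 1

def split_blocks(fp):
    """Split a 64-bit feature value into four 16-bit groups, low bits first."""
    return [(fp >> (i * BLOCK_BITS)) & BLOCK_MASK for i in range(BLOCKS)]

def hamming(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")

class FingerprintIndex:
    """One lookup table per group; candidates share at least one identical group."""

    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(BLOCKS)]

    def add(self, fp):
        for i, block in enumerate(split_blocks(fp)):
            self.tables[i][block].add(fp)

    def near_duplicates(self, fp, max_distance=3):
        # Only values sharing a group are compared, instead of the whole library.
        candidates = set()
        for i, block in enumerate(split_blocks(fp)):
            candidates |= self.tables[i].get(block, set())
        return sorted(c for c in candidates if hamming(c, fp) <= max_distance)

library = FingerprintIndex()
stored = 0xFFFF0000FFFF0000
library.add(stored)
print(library.near_duplicates(stored ^ 0b111))
```

The candidate lookup is what reduces the comparison work: the exact Hamming distance is computed only for values that already agree on one whole group.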
To further improve the user experience, the retrieval system of the present invention also includes a data push system, which uses a push algorithm based on user behavior clustering to provide personalized data push services. Personalized push establishes a bipartite relation between users and data, uses the similarity of user behavior to discover the objects each user is potentially interested in, and then pushes them in a personalized way; in essence it is also a form of information filtering.
The data push system consists of three parts: a user behavior logging module, a user preference model analysis module, and a push algorithm module. The behavior logging module records the user's behavior at each business contact point, including page dwell time, click sequences, content browsing records, the user's personal information and transaction history (from the centralized trading system), and market-quote browsing history (from the market data system); this information is the data basis for subsequent analysis and data push. The user preference model analysis module analyzes the user behavior logs, computes and scores user attributes from multiple angles, builds a multi-attribute description for each user, and uses business knowledge and data mining tools to cluster users by their attribute ratings, that is, to group together users with similar behavior patterns. Based on attributes such as the user's risk preference, asset status, position distribution, trading activity, profitability, investment instrument preference, life cycle, data usage preference, and data usage history, the system builds a corresponding classified and graded model of user data usage; building this model effectively is the difficult part of the whole push system. The push algorithm module uses a combined algorithm to compute, in real time and from the data server, each user's degree of interest in each data item according to the classified and graded user model, and returns the top N items to the business front end for consolidated display. The push algorithm module is the core of the whole push system.
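The text does not specify the combined algorithm, so the sketch below substitutes a simple, well-known stand-in: similarity-weighted scoring over users with similar attribute ratings (a basic form of user-based collaborative filtering). The attribute vectors, item names, and the functions `cosine` and `recommend` are all hypothetical and serve only to show how behavior similarity can be turned into a top-N push list.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length attribute rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(user, ratings, viewed, n=2):
    """Score items the user has not seen by the similarity of the users who viewed them."""
    scores = {}
    for other, vector in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], vector)
        if similarity <= 0:
            continue
        for item in viewed[other] - viewed[user]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest interest first; ties broken by item name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]

# Hypothetical attribute ratings (for example risk preference, trading activity, asset level).
ratings = {"u1": [5, 1, 0], "u2": [4, 1, 0], "u3": [0, 1, 5]}
viewed = {"u1": {"report_a"}, "u2": {"report_a", "report_b"}, "u3": {"report_c"}}
print(recommend("u1", ratings, viewed, n=1))
```

A clustering step (for example k-means over the rating vectors) could replace the pairwise similarity so that interest is computed once per user group rather than once per user pair.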
Based on the retrieval system described above, the present invention proposes the following data acquisition method. It makes full use of the advantages of cloud computing and, through an independently designed scheme combined with the traditional keyword query approach, fuses structured and unstructured information. It hides the complex underlying structured data schema and overcomes the shortcomings of traditional structured data query in flexibility and practicality, so that the method effectively lowers the technical threshold for non-specialists to query business databases and makes better use of the value of business data.
Financial applications need structured data and unstructured text at the same time, so fusing the two kinds of information becomes a core problem. The key to solving it is an efficient information query method that allows both kinds of information to be queried freely. The query method proposed by the present invention uses a metasearch engine architecture that integrates multiple query methods in a distributed environment into a whole and provides the user with a unified query interface. The user's query request is converted into formats that the multiple member query methods can recognize; the query method's scheduler dispatches the normalized query to the member query methods, and the query results are finally returned to the user in a given format. In the overall architecture, unstructured and structured data queries both serve as execution units of the query method.
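The unified interface described above can be sketched as a small dispatcher. This is an illustration under stated assumptions, not the patent's implementation: the class `UnifiedQuery`, the `Hit` record, and the two member factories are hypothetical names, and each member query method is modeled as a plain function that takes normalized keywords and returns matching content.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Hit:
    source: str   # name of the member query method that produced the result
    content: str  # result rendered in a common format

class UnifiedQuery:
    """Single entry point that dispatches one request to every member query method."""

    def __init__(self):
        self._members: Dict[str, Callable[[List[str]], List[str]]] = {}

    def register(self, name, member):
        self._members[name] = member

    def query(self, text):
        keywords = text.lower().split()  # normalize the request once
        hits = []
        for name, member in self._members.items():
            for content in member(keywords):
                hits.append(Hit(source=name, content=content))
        return hits

def make_text_member(documents):
    """Unstructured member: return documents containing every keyword."""
    def member(keywords):
        return [d for d in documents if all(k in d.lower() for k in keywords)]
    return member

def make_table_member(rows):
    """Structured member: return rows in which every keyword matches some field."""
    def member(keywords):
        return [str(r) for r in rows
                if all(any(k in str(v).lower() for v in r.values()) for k in keywords)]
    return member

engine = UnifiedQuery()
engine.register("unstructured", make_text_member(["Annual report of Acme Bank", "Market outlook"]))
engine.register("structured", make_table_member([{"name": "Acme Bank", "rating": "AA"}]))
results = engine.query("acme")
```

Registering members behind one interface means a new query method can be added without changing the caller, which is the property the metasearch architecture relies on.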
The present invention uses keyword query to reduce the complexity of querying business data, so that users can obtain the query results they need quickly and easily. Two classes of technology are used, cloud computing and vertical search: on the one hand, the cloud-computing-based retrieval system described above provides functions such as resource management, data integration, and index storage; on the other hand, an unstructured data query service system is built from the basic modules of a traditional query method. In terms of implementation, the retrieval system adopts the open-source Hadoop framework and relies on the ZooKeeper mechanism for distributed coordination and for maintaining cluster metadata and configuration, which improves the performance and scalability of the system. The overall layout of the query method comprises three levels: the retrieval layer, the data collection layer, and the inter-layer interface. The retrieval layer provides modules such as index update, index deletion, query, word segmentation, index library, and external interface; the data collection layer provides management modules for infrastructure and data resources; the inter-layer interface coordinates data interaction and service delivery between the two layers. Because financial data has strong business characteristics and a unified format standard, the query method proposed by the present invention designs the index library according to the business format standard. In the implementation, document content is divided by manual preprocessing to generate the text segments corresponding to different key terms, which serve as the original input for building the index library. On this basis, interface functions provided by open-source Servlet technology are used to create, add to, update, delete from, and query the index, forming an inverted index from user-input keyword to key term to document, and an HTTP calling interface is provided externally through customized secondary development.
In view of the ease of use of keyword query in unstructured data retrieval, the present invention proposes applying keyword query technology to relational databases, realizing a keyword-based method for querying financial databases.
The method models the database structure and uses a graph to characterize the topology of the database, forming a structured data schema graph, so that the query problem is converted into a graph query problem. The structured data schema graph is an undirected graph G=(V, E), where V is the set of vertices, each vertex corresponding to a relation table in the database, and E is the set of edges, each edge corresponding to a foreign key relationship between data tables. The concrete query method comprises the following steps.
Step 1: Create the node index table. The node index table characterizes the index structure of the keywords contained in each vertex of the structured data schema graph. It is created as follows: the fields of each row of a data table (relation table) are concatenated into a document, keywords are extracted from the document, and an inverted index from keyword to table name and column name is formed.
Step 2: Locate relation tables by keyword. For each keyword entered by the user, the node index table is queried to locate the vertices of the schema graph that contain the keyword.
Step 3: Perform the data query centered on the keywords. Taking the vertices located in step 2 as centers, expand them to generate candidate data query patterns; each query pattern is a subgraph of the structured data schema graph and contains all the keywords. Query patterns are expanded by breadth-first traversal, as follows.
1) Define queues Q and V (V here records the patterns already generated and is distinct from the vertex set V of G), and add all generated center nodes to Q and V as initial patterns.
2) Take a pattern P from Q and add its associated patterns {P_1, P_2, ..., P_n} to Q and V, where each associated pattern P_i (i=1, 2, ..., n) satisfies the following conditions: 1. |P_i| = |P| + 1, where |P_i| is the number of vertices contained in P_i; 2. P_i is a connected graph and is not already present in V.
3) Traverse all patterns in Q in turn until Q is empty, and select as output the query patterns that satisfy the following conditions:
1. the output pattern must contain all keywords;
2. every leaf vertex contains at least one keyword;
3. the number of vertices in the output pattern must be less than a predetermined maximum S_max (generally set to 5).
4) Assemble Structured Query Language (SQL) statements from the query patterns. An SQL query statement is assembled for each candidate query pattern: the node index table is queried with the user's keywords to obtain the table names and column names, which are written into the SQL statement. The database is then queried with the SQL statements and the results are returned.
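The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the four-table schema (`account`, `customer`, `trade`, `security`), the sample rows, whitespace tokenization in place of a real keyword extractor, and the helper names `build_node_index`, `expand_patterns`, `is_output`, and `build_sql` are all hypothetical. A pattern is represented as a set of tables whose induced subgraph of the schema graph is connected.

```python
from collections import deque

# Hypothetical schema graph G = (V, E): tables are vertices, foreign keys are edges.
EDGES = {
    ("account", "customer"): "account.customer_id = customer.id",
    ("trade", "account"): "trade.account_id = account.id",
    ("trade", "security"): "trade.security_id = security.id",
}
# Hypothetical sample rows, used only to build the node index of Step 1.
ROWS = {
    "customer": [{"id": 1, "name": "zhang wei"}],
    "security": [{"id": 7, "type": "bond"}],
}
S_MAX = 5  # output patterns must have fewer than S_MAX vertices


def build_node_index(rows):
    """Step 1: inverted index from keyword to (table name, column name)."""
    index = {}
    for table, records in rows.items():
        for record in records:
            for column, value in record.items():
                # Assumption: whitespace splitting stands in for keyword extraction.
                for word in str(value).lower().split():
                    index.setdefault(word, set()).add((table, column))
    return index


def neighbors(table):
    """Tables joined to `table` by a foreign key edge."""
    return {b if a == table else a for a, b in EDGES if table in (a, b)}


def is_output(pattern, hits):
    """Output conditions 1 to 3 of sub-step 3)."""
    if len(pattern) >= S_MAX:
        return False
    if any(not (tables & pattern) for tables in hits.values()):
        return False  # some keyword is not covered by the pattern
    keyword_tables = set().union(*hits.values())
    leaves = [t for t in pattern if len(neighbors(t) & pattern) <= 1]
    return all(leaf in keyword_tables for leaf in leaves)


def expand_patterns(keywords, index):
    """Steps 2 and 3: locate center vertices, then expand breadth-first."""
    hits = {kw: {t for t, _ in index.get(kw, ())} for kw in keywords}
    seeds = {frozenset([t]) for tables in hits.values() for t in tables}
    queue, visited, results = deque(seeds), set(seeds), []
    while queue:
        pattern = queue.popleft()
        if is_output(pattern, hits):
            results.append(pattern)
        if len(pattern) + 1 >= S_MAX:
            continue  # any larger pattern could never be output
        for table in pattern:
            for nxt in neighbors(table) - pattern:
                child = pattern | {nxt}  # |child| = |pattern| + 1, still connected
                if child not in visited:
                    visited.add(child)
                    queue.append(child)
    return results


def build_sql(pattern, keywords, index):
    """Sub-step 4): assemble one SQL statement with bound parameters."""
    joins = [cond for (a, b), cond in EDGES.items() if a in pattern and b in pattern]
    filters, params = [], []
    for kw in keywords:
        for table, column in sorted(index.get(kw, ())):
            if table in pattern:
                filters.append(f"{table}.{column} LIKE ?")
                params.append(f"%{kw}%")
    tables = ", ".join(sorted(pattern))
    return f"SELECT * FROM {tables} WHERE " + " AND ".join(joins + filters), params


index = build_node_index(ROWS)
keywords = ["zhang", "bond"]
patterns = expand_patterns(keywords, index)
queries = [build_sql(p, keywords, index) for p in patterns]
```

For the keywords `zhang` and `bond`, the only pattern that passes the three output conditions is the chain customer, account, trade, security, because both of its leaf tables contain a keyword. The keywords are passed as bound parameters rather than concatenated into the SQL text, which is a design choice of this sketch rather than something the description specifies.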
In text query processing, word segmentation is a preprocessing step that precedes index building. Implementing it as a separate MapReduce job would achieve the goal, but the extra MapReduce task would lengthen the processing cycle of the whole job and add a large number of I/O operations, so efficiency would be low. The invention therefore implements Chinese word segmentation preprocessing as an auxiliary Map step and merges it with the core Map and Reduce steps of index building into a single chained MapReduce task that completes the whole job.
Building the text index is the core step of text processing. A good solution is to perform distributed search on a cluster, which requires building a distributed index. Index building is well suited to the MapReduce programming model, and the distributed index built by MapReduce is stored in the distributed system, which facilitates subsequent distributed search.
Building the inverted index table is the key problem of indexing. The output of the auxiliary text preprocessing Map task is used as the Map input of the index-building MapReduce task. The output of the index-building Map task is, for each Chinese term, the term string, the relevance of the index term to the document, and its position in the document. The index term string and its position in the document can be obtained from a word segmentation software package. The Reduce task of index building integrates the output of the Map tasks to form the inverted index file.
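The chained flow described above (auxiliary segmentation Map, core index Map, Reduce) can be imitated in a single Python process to show the data passed between stages. This is only a sketch and does not use Hadoop: the function names `segment_map`, `index_map`, `index_reduce`, and `run_chain` are hypothetical, whitespace splitting stands in for a Chinese word segmentation package, and the relevance score mentioned in the text is omitted (term frequency can be read off as the length of each position list).

```python
from collections import defaultdict
from itertools import chain


def segment_map(doc_id, text):
    """Auxiliary Map: word segmentation (assumption: whitespace tokenizer)."""
    yield doc_id, text.lower().split()


def index_map(doc_id, tokens):
    """Core Map: emit (term, (doc_id, position)) for every token."""
    for position, term in enumerate(tokens):
        yield term, (doc_id, position)


def index_reduce(term, postings):
    """Reduce: merge all postings of one term into a posting list per document."""
    by_doc = defaultdict(list)
    for doc_id, position in postings:
        by_doc[doc_id].append(position)
    return term, {doc: sorted(pos) for doc, pos in sorted(by_doc.items())}


def run_chain(docs):
    """Run the two Map stages back to back, group by term, then reduce."""
    segmented = chain.from_iterable(segment_map(d, t) for d, t in docs.items())
    emitted = chain.from_iterable(index_map(d, toks) for d, toks in segmented)
    groups = defaultdict(list)  # stands in for the shuffle phase
    for term, posting in emitted:
        groups[term].append(posting)
    return dict(index_reduce(t, p) for t, p in sorted(groups.items()))


docs = {"d1": "bond yield rises", "d2": "bond fund buys bond"}
inverted = run_chain(docs)
```

Because the segmentation output is consumed directly by the index Map, no intermediate result is written out between the two stages, which is the saving the text attributes to the chained task.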
In HBase, data files are opened at startup and remain open during processing, which makes HBase better suited to real-time retrieval. The index file is loaded into the HBase distributed database by a MapReduce job. In the experiments, the original documents were imported directly into HBase: the system does not acquire data from the network continuously but adds to the text data set only at intervals, so retrieval is performed against the index file, which can further improve retrieval efficiency.
In summary, the invention proposes a cloud platform financial data acquisition method that overcomes the shortcomings of traditional structured data queries in flexibility and practicality, lowers the technical threshold for non-specialists to query databases, and makes better use of the value of business data.
Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with a general-purpose computing system. They can be concentrated on a single computing system or distributed across a network of multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The invention is therefore not limited to any specific combination of hardware and software.
It should be understood that the above embodiments of the invention are only intended to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or equivalents of such scope and boundaries.

Claims (3)

1. A cloud platform data acquisition method for data retrieval and querying in a cloud-computing-based financial data retrieval system, characterized by comprising:
integrating multiple query methods in a distributed environment, with unstructured search and structured data query both serving as execution units, to provide users with a unified query interface; and converting a user's query request into a form that each member query method can recognize, with the query results finally returned to the user in a given format.
2. The method according to claim 1, characterized in that, in the unstructured search, the cloud-computing-based retrieval system provides resource management, data integration, and index storage, and an unstructured data query service system is built; the open-source Hadoop framework is adopted, relying on ZooKeeper for distributed coordination, cluster metadata, and configuration maintenance; the retrieval layer provides index updating, index deletion, query, word segmentation, index library, and external interface modules; the data collection layer provides the modules that manage infrastructure and data resources; the inter-layer interface coordinates data exchange and service delivery between the two layers; the index library is designed according to the business format standard; document content is divided by manual preprocessing into text segments corresponding to different key terms, which serve as the raw input for building the index library; the interface functions provided by open-source Servlet technology are used to implement index creation, addition, updating, deletion, and querying, forming an inverted index that maps user-input keyword to key term to document; and an HTTP calling interface is exposed externally through customized secondary development;
in the structured query, keyword queries are applied to a relational database; the database structure is modeled and a graph is used to represent the topology of the database, forming a structured data schema graph, so that the query problem is converted into a graph query problem; the structured data schema graph is an undirected graph G=(V, E), where V is the set of vertices, each vertex corresponding to a relation table in the database, and E is the set of edges, each edge corresponding to a foreign key relationship between data tables; and the query process comprises:
Step 1: creating a node index table, the node index table describing, for each vertex of the structured data schema graph, the index structure of the keywords that the vertex contains, and being created as follows: the fields of each row in each data table (relation table) are concatenated into a document, keywords are extracted from that document, and an inverted index from keyword to table name and column name is formed;
Step 2: locating relation tables by keyword, wherein, for each keyword entered by the user, the node index table is queried to locate the vertices of the schema graph that contain the keyword;
Step 3: querying the data around the keywords, wherein, starting from the vertices located in Step 2 as centers, candidate data query patterns are generated by expansion, each query pattern being a subgraph of the structured data schema graph and containing all keywords; query patterns are expanded by breadth-first traversal, as follows:
1) define queues Q and V, and add all generated center nodes to Q and V as initial patterns;
2) take a pattern P from Q and add its associated patterns {P_1, P_2, ..., P_n} to Q and V, where each associated pattern P_i (i=1, 2, ..., n) satisfies the following conditions: 1. |P_i| = |P| + 1, where |P_i| is the number of vertices contained in P_i; 2. P_i is a connected graph and is not already present in V;
3) traverse all patterns in Q in turn until Q is empty, and select as output the query patterns that satisfy the following conditions:
1. the output pattern must contain all keywords;
2. every leaf vertex contains at least one keyword;
3. the number of vertices in the output pattern must be less than a predetermined maximum S_max;
4) assemble Structured Query Language (SQL) statements from the query patterns, wherein an SQL query statement is assembled for each candidate query pattern, the node index table is queried with the user's keywords to obtain the table names and column names, which are written into the SQL statement, and the database is queried with the SQL statements and the results are returned.
3. The method according to claim 2, characterized in that the financial data retrieval system comprises a business server, an application server, a data server, an integration server, and the individual databases; wherein the business server performs information retrieval by calling the application server and uses the data to provide push services; the application server performs unified indexing and maintenance of the data; the integration server integrates structured and unstructured data, uses a duplicate-detection mechanism and data push technology to classify, aggregate, and normalize the data, and provides information services to users and to the business server through protocol interfaces and front-end page display;
the integration server integrates the financial data scattered across the database systems, file systems, and the internet, collects and cleans the data, and, using a data integration strategy based on business domain division, integrates data from different source entities to form the data server; the main flow of the data integration service is as follows: the query request is first passed to the data extraction module in XML Schema format; the data extraction module converts the XML into an SQL query statement and extracts data according to the query result; the extracted result set is then converted into XML and passed to the integration processing module; unstructured data must likewise be converted into XML; the integration processing module then integrates the XML documents and finally produces the unified data server;
a text duplicate-detection mechanism based on paragraph topics is used, in which the topic information of text data is used to compare similarity, so that financial data with the same topic and the same content is classified together; a feature value is produced for each paragraph of a text, so that a text is represented as a set of paragraph-topic feature values; the similarity of two texts is calculated by comparing their paragraph feature values, and when the final similarity exceeds a set threshold the texts are considered duplicates and duplicate handling is performed; the overall structure of this data duplicate-detection framework comprises three parts: a duplicate-detection component, duplicate-detection management, and duplicate-detection result analysis; in the duplicate-detection component, a semantic analysis engine performs word segmentation on the data content, and a feature value generator generates the feature value of the data from the segmentation result, the 64-bit feature value being divided by a uniform rule into 4 equal groups for index storage; in feature value comparison, dimensionality reduction of the data is performed first, and then the comparison results for which the Hamming distance between the data's feature value and a feature value in the feature value library is greater than or equal to 3 are calculated; duplicate-detection management logs the data duplicate-detection results and is used to view them;
in addition, the data push system in the retrieval system uses a push algorithm based on user behavior clustering to provide a personalized data push service; by establishing a binary relation of mutual correspondence between users and data, the similarity of user behavior is used to discover objects of potential interest to each user, and personalized pushing is then performed; the data push system consists of three parts: a behavior log recording module for user information, a model analysis module for user preferences, and a push algorithm module; the behavior log recording module records the user's behavior at each business touch point, including page dwell time, click sequence, content browsing records, the user's personal information, transaction history (from the centralized trading system), and market data browsing history (from the market data system); the model analysis module for user preferences analyzes the user behavior logs, computes and scores each user's attributes from multiple angles, builds a multi-attribute description for each user, and uses business knowledge and data mining tools to cluster users by their attribute scores, so that users with similar behavior patterns are grouped together; the push algorithm module uses a combined algorithm to compute in real time, from the data server and according to the classified and scored user models, each user's degree of interest in each item of data, and returns the results to the business front end for centralized display.
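The storage scheme in claim 3, where a 64-bit feature value is divided into 4 equal groups for indexing, can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the class `FingerprintStore`, the helpers `split_blocks` and `hamming`, and the rule that two feature values count as near matches when their Hamming distance does not exceed a small `max_distance`. The claim does not say how feature values are generated, so they are taken as given integers, and the direction and size of the distance threshold used below are this sketch's choice rather than the claim's wording.

```python
from collections import defaultdict


def split_blocks(fingerprint):
    """Split a 64-bit value into four 16-bit groups, highest bits first."""
    return [(fingerprint >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]


def hamming(a, b):
    """Number of bit positions in which two feature values differ."""
    return bin(a ^ b).count("1")


class FingerprintStore:
    """Hypothetical feature value library indexed by each 16-bit group."""

    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(4)]

    def add(self, fingerprint):
        for table, block in zip(self.tables, split_blocks(fingerprint)):
            table[block].add(fingerprint)

    def candidates(self, fingerprint):
        """Stored values that share at least one whole group with the query."""
        found = set()
        for table, block in zip(self.tables, split_blocks(fingerprint)):
            found |= table.get(block, set())
        return found

    def near(self, fingerprint, max_distance=3):
        """Assumption: a near match is a candidate within max_distance bits."""
        return sorted(c for c in self.candidates(fingerprint)
                      if hamming(c, fingerprint) <= max_distance)


a = 0x123456789ABCDEF0
b = a ^ 0b111                 # differs from a in exactly 3 bits
c = a ^ 0xFFFFFFFFFFFFFFFF    # differs from a in all 64 bits
store = FingerprintStore()
store.add(a)
store.add(c)
matches = store.near(b)
```

The group index narrows the comparison to a few candidates: two 64-bit values that differ in at most 3 bits must agree exactly on at least one of the four groups, so a lookup by group never misses such a pair.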
CN201510531172.6A 2015-08-26 2015-08-26 Cloud platform data acquisition method Pending CN105205104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510531172.6A CN105205104A (en) 2015-08-26 2015-08-26 Cloud platform data acquisition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510531172.6A CN105205104A (en) 2015-08-26 2015-08-26 Cloud platform data acquisition method

Publications (1)

Publication Number Publication Date
CN105205104A true CN105205104A (en) 2015-12-30

Family

ID=54952788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510531172.6A Pending CN105205104A (en) 2015-08-26 2015-08-26 Cloud platform data acquisition method

Country Status (1)

Country Link
CN (1) CN105205104A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354759A (en) * 2016-08-18 2017-01-25 北京百迈客云科技有限公司 Retrieving and automatically downloading system of articles and data based on biological cloud platform
CN107402912A (en) * 2016-05-19 2017-11-28 北京京东尚科信息技术有限公司 Parse semantic method and apparatus
CN107679060A (en) * 2017-07-25 2018-02-09 平安科技(深圳)有限公司 Method for inquiring status, device, user terminal and the storage medium of electronic insurance policy
CN107741971A (en) * 2017-10-10 2018-02-27 国网浙江省电力公司电力科学研究院 A kind of method of the online visual analysis PPT based on self-defined dynamic data
CN108170731A (en) * 2017-12-13 2018-06-15 腾讯科技(深圳)有限公司 Data processing method, device, computer storage media and server
CN108268600A (en) * 2017-12-20 2018-07-10 北京邮电大学 Unstructured Data Management and device based on AI
CN108984718A (en) * 2018-07-10 2018-12-11 四川汇源吉迅数码科技有限公司 A kind of digital content interactive system and exchange method based on big data technology
CN109697066A (en) * 2018-12-28 2019-04-30 第四范式(北京)技术有限公司 Realize the method and system of tables of data splicing and automatic training machine learning model
CN109947788A (en) * 2017-10-30 2019-06-28 北京京东尚科信息技术有限公司 Data query method and apparatus
CN110096478A (en) * 2019-05-09 2019-08-06 中国联合网络通信集团有限公司 Document index generation method and equipment
TWI686704B (en) * 2015-12-31 2020-03-01 大陸商中國銀聯股份有限公司 Graph-based data processing method and system
CN111353762A (en) * 2020-03-30 2020-06-30 中国建设银行股份有限公司 Method and system for managing regulations and regulations
CN111522950A (en) * 2020-04-26 2020-08-11 成都思维世纪科技有限责任公司 Rapid identification system for unstructured massive text sensitive data
CN112231380A (en) * 2020-10-20 2021-01-15 长城计算机软件与系统有限公司 Method and system for comprehensively processing acquired data, storage medium and electronic equipment
CN112347469A (en) * 2020-11-10 2021-02-09 浙江百应科技有限公司 Low-intrusion data authority processing method and system and electronic equipment thereof
CN112836093A (en) * 2021-02-03 2021-05-25 北京字跳网络技术有限公司 Data query method and device, electronic equipment and storage medium
CN115660615A (en) * 2022-08-15 2023-01-31 江苏北辰知识产权事务所有限公司 Project automatic auxiliary system and visualization method
CN115689503A (en) * 2022-08-15 2023-02-03 江苏北辰知识产权事务所有限公司 Multi-end project cooperation system and information co-construction method thereof
CN116628129A (en) * 2023-07-21 2023-08-22 南京爱福路汽车科技有限公司 Auto part searching method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1741017A (en) * 2004-05-14 2006-03-01 微软公司 Method and system for indexing and searching databases
CN101021857A (en) * 2006-10-20 2007-08-22 鲍东山 Video searching system based on content analysis
CN101149752A (en) * 2007-11-10 2008-03-26 邹昌陆 Transversely combined query computer system and method based on SQL
CN101789021A (en) * 2010-02-24 2010-07-28 浪潮通信信息系统有限公司 Universal configurable database data migration method
CN102156689A (en) * 2011-03-31 2011-08-17 百度在线网络技术(北京)有限公司 Method and device for detecting document
CN102411596A (en) * 2010-09-21 2012-04-11 阿里巴巴集团控股有限公司 Information recommendation method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
丁杰,朱力鹏,胡斌,韩海韵,孙大雁: "面向多级调度管理的融合型搜索引擎", 《电力系统自动化》 *
吕凝: "基于内容的视频数据库多模式检索方法研究", 《中国博士学位论文全文数据库》 *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI686704B (en) * 2015-12-31 2020-03-01 大陸商中國銀聯股份有限公司 Graph-based data processing method and system
CN107402912A (en) * 2016-05-19 2017-11-28 北京京东尚科信息技术有限公司 Parse semantic method and apparatus
CN107402912B (en) * 2016-05-19 2019-12-31 北京京东尚科信息技术有限公司 Method and device for analyzing semantics
CN106354759A (en) * 2016-08-18 2017-01-25 北京百迈客云科技有限公司 Retrieving and automatically downloading system of articles and data based on biological cloud platform
CN106354759B (en) * 2016-08-18 2019-07-12 北京百迈客云科技有限公司 The retrieval of article and data based on biological cloud platform and automatic download system
CN107679060A (en) * 2017-07-25 2018-02-09 平安科技(深圳)有限公司 Method for inquiring status, device, user terminal and the storage medium of electronic insurance policy
CN107679060B (en) * 2017-07-25 2019-02-05 平安科技(深圳)有限公司 Method for inquiring status, device, user terminal and the storage medium of electronic insurance policy
CN107741971A (en) * 2017-10-10 2018-02-27 国网浙江省电力公司电力科学研究院 A kind of method of the online visual analysis PPT based on self-defined dynamic data
CN107741971B (en) * 2017-10-10 2021-03-19 国网浙江省电力有限公司营销服务中心 Online visual PPT analysis method based on user-defined dynamic data
CN109947788B (en) * 2017-10-30 2021-10-15 北京京东尚科信息技术有限公司 Data query method and device
CN109947788A (en) * 2017-10-30 2019-06-28 北京京东尚科信息技术有限公司 Data query method and apparatus
CN108170731A (en) * 2017-12-13 2018-06-15 腾讯科技(深圳)有限公司 Data processing method, device, computer storage media and server
CN108268600A (en) * 2017-12-20 2018-07-10 北京邮电大学 Unstructured Data Management and device based on AI
CN108268600B (en) * 2017-12-20 2020-09-08 北京邮电大学 AI-based unstructured data management method and device
CN108984718A (en) * 2018-07-10 2018-12-11 四川汇源吉迅数码科技有限公司 A kind of digital content interactive system and exchange method based on big data technology
CN109697066A (en) * 2018-12-28 2019-04-30 第四范式(北京)技术有限公司 Realize the method and system of tables of data splicing and automatic training machine learning model
CN110096478B (en) * 2019-05-09 2021-06-29 中国联合网络通信集团有限公司 Document index generation method and device
CN110096478A (en) * 2019-05-09 2019-08-06 中国联合网络通信集团有限公司 Document index generation method and equipment
CN111353762A (en) * 2020-03-30 2020-06-30 中国建设银行股份有限公司 Method and system for managing regulations and regulations
CN111522950A (en) * 2020-04-26 2020-08-11 成都思维世纪科技有限责任公司 Rapid identification system for unstructured massive text sensitive data
CN111522950B (en) * 2020-04-26 2023-06-27 成都思维世纪科技有限责任公司 Rapid identification system for unstructured massive text sensitive data
CN112231380A (en) * 2020-10-20 2021-01-15 长城计算机软件与系统有限公司 Method and system for comprehensively processing acquired data, storage medium and electronic equipment
CN112347469A (en) * 2020-11-10 2021-02-09 浙江百应科技有限公司 Low-intrusion data authority processing method and system and electronic equipment thereof
CN112836093A (en) * 2021-02-03 2021-05-25 北京字跳网络技术有限公司 Data query method and device, electronic equipment and storage medium
CN112836093B (en) * 2021-02-03 2023-11-21 北京字跳网络技术有限公司 Data query method, device, electronic equipment and storage medium
CN115660615A (en) * 2022-08-15 2023-01-31 江苏北辰知识产权事务所有限公司 Project automatic auxiliary system and visualization method
CN115689503A (en) * 2022-08-15 2023-02-03 江苏北辰知识产权事务所有限公司 Multi-end project cooperation system and information co-construction method thereof
CN116628129A (en) * 2023-07-21 2023-08-22 南京爱福路汽车科技有限公司 Auto part searching method and system
CN116628129B (en) * 2023-07-21 2024-02-27 南京爱福路汽车科技有限公司 Auto part searching method and system

Similar Documents

Publication Publication Date Title
CN105205104A (en) Cloud platform data acquisition method
US11663254B2 (en) System and engine for seeded clustering of news events
CN105159971B (en) A kind of cloud platform data retrieval method
CN105183809A (en) Cloud platform data query method
CN105139281A (en) Method and system for processing big data of electric power marketing
US20140006369A1 (en) Processing structured and unstructured data
CN103154996A (en) Providing information management
CN106066895A (en) A kind of intelligent inquiry system
Caldarola et al. Big data: A survey-the new paradigms, methodologies and tools
Truică et al. TextBenDS: a generic textual data benchmark for distributed systems
Hui et al. Integration of big data: a survey
Tang et al. Forecasting SQL query cost at Twitter
Ravichandran Big Data processing with Hadoop: a review
Benny et al. Hadoop framework for entity resolution within high velocity streams
Li et al. Automatic classification algorithm for multisearch data association rules in wireless networks
Wang et al. A scalable parallel chinese online encyclopedia knowledge denoising method based on entry tags and spark cluster
Ma et al. Api prober–a tool for analyzing web api features and clustering web apis
CN113127650A (en) Technical map construction method and system based on map database
Ben Ahmed et al. A new semi-supervised hierarchical active clustering based on ranking constraints for analysts groupization
KR20210037488A (en) Big Data Analytics-Based Advertising Marketing System
Zhang et al. A learning-based framework for improving querying on web interfaces of curated knowledge bases
Abdallah et al. Towards a gml-enabled knowledge graph platform
Dehghanzadeh et al. Optimizing SPARQL query processing on dynamic and static data based on query time/freshness requirements using materialization
Singh NoSQL: A new horizon in big data
Zhou et al. BDMCA: a big data management system for Chinese auditing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151230
