CN112783874A - Data analysis method, device and system - Google Patents



Publication number
CN112783874A
Authority
CN
China
Prior art keywords
data
data source
service
calculator
logic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911088890.5A
Other languages
Chinese (zh)
Inventor
李伟进
侯会满
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201911088890.5A priority Critical patent/CN112783874A/en
Publication of CN112783874A publication Critical patent/CN112783874A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
              • G06F16/21 Design, administration or maintenance of databases
                • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
              • G06F16/24 Querying
                • G06F16/245 Query processing
                  • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
                    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
              • G06F16/25 Integrating or interfacing systems involving database management systems
                • G06F16/258 Data format conversion from or to a database
            • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
              • G06F16/33 Querying
                • G06F16/332 Query formulation
                  • G06F16/3329 Natural language query formulation or dialogue systems
              • G06F16/35 Clustering; Classification

Abstract

The invention discloses a data analysis method, device and system, and relates to the field of computer technology. One embodiment of the method comprises: acquiring a business logic script written in advance for a data analysis task, the business logic script comprising data source parameter information and business logic code; executing the data extraction logic of the business logic code to acquire business data from the data source specified by the data source parameter information; and executing the data processing logic of the business logic code to call the calculator set specified in the data processing logic and process the business data. By acquiring a business logic script that contains the business logic code, the method acquires business data from the specified data source and then calls the calculator set to process it, so that system code can be reused and development efficiency is high.

Description

Data analysis method, device and system
Technical Field
The present invention relates to the field of computers, and in particular, to a data analysis method, apparatus, and system.
Background
With the rapid development of information technologies such as the mobile internet, cloud computing and the internet of things, ever more data is being produced, and the information industry as a whole is rapidly entering the big data era. At the same time, enterprises are paying increasing attention to the value of data, which is becoming an asset for more and more of them. In a typical business scenario, after data is generated by an OLTP (Online Transaction Processing) system, it is further analyzed by an OLAP (Online Analytical Processing) system to mine its value.
The OLAP system performs data analysis as follows: it reads data of various structures from different data sources (Extract), processes and summarizes the data (Transform), stores the processing results in different databases (Load), and provides a query interface. In this process, the OLAP system requires custom development of each functional module, such as data source reading, data processing and result storage, for each different analysis task, and it processes incremental data by periodic iteration.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
custom development means the system code cannot be reused and work is repeated, so development efficiency is low and human resources are wasted; processing incremental data by periodic iteration is inflexible and has high latency, so it cannot serve business scenarios with real-time requirements.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data analysis method, device and system that acquire business data from a specified data source by acquiring a business logic script containing business logic code, and then call a calculator set to process the business data, so that system code can be reused and development efficiency is high.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a data analysis method.
The data analysis method of the embodiment of the invention comprises the following steps: acquiring a business logic script written in advance for a data analysis task, the business logic script comprising data source parameter information and business logic code; executing the data extraction logic of the business logic code to acquire business data from the data source specified by the data source parameter information; and executing the data processing logic of the business logic code to call the calculator set specified in the data processing logic and process the business data.
Optionally, the data source parameter information includes a data source parameter tuple, and the data source parameter tuple includes data source basic information and a data read instruction; acquiring business data from the data source specified by the data source parameter information comprises: establishing a connection with the corresponding data source according to the data source basic information; and executing the data read instruction to acquire the business data from the data source.
Optionally, the calculator set includes calculators that each have a single data processing responsibility; a calculator takes the business data, or a processing result of the business data, as its input parameter and returns the result of processing that input.
Optionally, the data analysis task is a clustering task for item question-and-answer data, the business data is the item question-and-answer data, and the calculator set includes a word segmentation calculator, a similarity calculator and a clustering calculator; calling the calculator set in the data processing logic to process the business data comprises: calling the word segmentation calculator to segment the item question-and-answer data to obtain word segmentation results; calling the similarity calculator to calculate the similarity between the word segmentation records of the same item in the word segmentation results to obtain similarity calculation results; and calling the clustering calculator to cluster the item question-and-answer data based on the similarity calculation results.
Optionally, the data source parameter information includes a parameter identifier; acquiring business data from the data source specified by the data source parameter information comprises: obtaining the corresponding data source parameter tuple according to the parameter identifier, wherein the data source parameter tuple includes data source basic information and a data read instruction; establishing a connection with the corresponding data source according to the data source basic information; and executing the data read instruction to acquire the business data from the data source.
Optionally, the method further comprises: receiving data source configuration information, and generating corresponding parameter identification for a data source parameter tuple of the data source configuration information; the data source configuration information comprises a data source type and a data source parameter tuple corresponding to the data source type.
Optionally, the business data includes stock data and/or incremental data, and the incremental data is stored in the data source component by subscribing to a message queue of the business system.
To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided a data analysis apparatus.
A data analysis device of an embodiment of the present invention includes: a script acquisition module for acquiring a business logic script written in advance for a data analysis task, the business logic script comprising data source parameter information and business logic code; a data acquisition module for executing the data extraction logic of the business logic code to acquire business data from the data source specified by the data source parameter information; and a data processing module for executing the data processing logic of the business logic code to call the calculator set specified in the data processing logic and process the business data.
Optionally, the data source parameter information includes a data source parameter tuple, and the data source parameter tuple includes data source basic information and a data read instruction; the data acquisition module is further configured to: establish a connection with the corresponding data source according to the data source basic information; and execute the data read instruction to acquire the business data from the data source.
Optionally, the calculator set includes calculators that each have a single data processing responsibility; a calculator takes the business data, or a processing result of the business data, as its input parameter and returns the result of processing that input.
Optionally, the data analysis task is a clustering task for item question-and-answer data, the business data is the item question-and-answer data, and the calculator set includes a word segmentation calculator, a similarity calculator and a clustering calculator; the data processing module is further configured to: call the word segmentation calculator to segment the item question-and-answer data to obtain word segmentation results; call the similarity calculator to calculate the similarity between the word segmentation records of the same item in the word segmentation results to obtain similarity calculation results; and call the clustering calculator to cluster the item question-and-answer data based on the similarity calculation results.
Optionally, the data source parameter information includes a parameter identifier; the data acquisition module is further configured to: obtain the corresponding data source parameter tuple according to the parameter identifier, wherein the data source parameter tuple includes data source basic information and a data read instruction; establish a connection with the corresponding data source according to the data source basic information; and execute the data read instruction to acquire the business data from the data source.
Optionally, the apparatus further comprises: the receiving generation module is used for receiving data source configuration information and generating corresponding parameter identification for a data source parameter tuple of the data source configuration information; the data source configuration information comprises a data source type and a data source parameter tuple corresponding to the data source type.
Optionally, the business data includes stock data and/or incremental data, and the incremental data is stored in the data source component by subscribing to a message queue of the business system.
To achieve the above object, according to another aspect of the embodiments of the present invention, a data analysis system is provided.
The data analysis system of the embodiment of the invention comprises: a management background for sending a business logic script written for a data analysis task to an assembler component, the business logic script comprising data source parameter information and business logic code; the assembler component, for executing the data extraction logic of the business logic code to send a data acquisition request to a data source component, and further for executing the data processing logic of the business logic code to send a calculator call request to a calculator component; the data source component, for receiving the data acquisition request, acquiring business data from the data source specified by the data acquisition request, and sending the business data to the assembler component; and the calculator component, for receiving the calculator call request and calling the calculator set specified in the calculator call request to process the business data.
Optionally, the calculator component is further configured to send the processing result to the assembler component, so that the assembler component stores the processing result in the data source component; the system further comprises a retrieval system for retrieving the processing result from the data source component according to a data query condition.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided an electronic apparatus.
An electronic device of an embodiment of the present invention includes: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a data analysis method of an embodiment of the present invention.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a computer-readable medium.
A computer-readable medium of an embodiment of the present invention has stored thereon a computer program that, when executed by a processor, implements a data analysis method of an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: business data is acquired from a specified data source by acquiring a business logic script containing business logic code, and a calculator set is then called to process the business data, so that system code can be reused and development efficiency is high; the parameters required for data acquisition are organized in advance to implement the data extraction logic, making that logic reusable; calculators with a single data processing responsibility are constructed in advance to implement the data processing logic, making that logic reusable; the clustering task for item question-and-answer data is implemented with the pre-constructed word segmentation calculator, similarity calculator and clustering calculator, which gives high development efficiency and saves human resources; a parameter identifier is generated for each data source parameter tuple, so that business data can subsequently be acquired by specifying only the parameter identifier; incremental data is processed in real time by subscribing to the message queue of the business system; system reuse is achieved by abstracting the assembler component, the data source component and the calculator component, and zero development effort on the platform side is achieved by submitting business logic scripts in the management background; data users query data through the retrieval system rather than accessing the database directly, which makes the system more friendly and easier to use for them.
Further effects of the above optional implementations will be described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram illustrating the data analysis implemented by the OLAP system according to the prior art;
FIG. 2 is a schematic flow chart of a prior art OLAP system for data analysis;
FIG. 3 is a schematic diagram of the main steps of a data analysis method according to an embodiment of the invention;
FIG. 4 is a system diagram of a data analysis method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating the implementation principle of clustering task of item question answering data by using the data analysis method according to the embodiment of the present invention;
FIG. 6 is a schematic diagram of a main flow of a data analysis method according to an embodiment of the present invention;
FIG. 7 is a timing diagram of a configuration data source of a data analysis method according to an embodiment of the invention;
FIG. 8 is a timing diagram of processing stock data for a data analysis method according to an embodiment of the invention;
FIG. 9 is a diagram of an implementation architecture for an OLAP system according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of the main blocks of a data analysis apparatus according to an embodiment of the present invention;
FIG. 11 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 12 is a schematic diagram of a computer apparatus suitable for use in an electronic device to implement an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram illustrating the implementation principle of data analysis performed by an OLAP system in the prior art. As shown in fig. 1, different business systems in an enterprise generate data of various structures, such as item data, user data and User Generated Content (UGC), and store it in different databases.
The OLAP system reads data of various structures from different data sources (Extract), processes and summarizes the data (Transform), and stores the processing results in different databases (Load). The data stored in the databases can then be used for data mining (DATA MINING), data visualization (DATA VISUALIZATION), report generation (REPORTS), dashboard generation (DASHBOARDS), alerting (ALERTS), and so on.
FIG. 2 is a schematic flow chart of a prior art OLAP system for data analysis. As shown in fig. 2, the OLAP system reads data from databases, cloud services, application systems and AIRFLOW, performs data quality detection, data conversion, data cleaning and the like on the data from the different data sources, and then stores the processing results in a message queue, a data warehouse, a database, and so on. In fig. 2, the application system may be an Android system or an Apple system. AIRFLOW is a workflow scheduler. KAFKA is a high-throughput distributed publish-subscribe messaging system. Data backup may be implemented with S3 (Simple Storage Service).
In the above process, the script reading, data quality detection, data conversion and data cleaning, routing and mode detection (route & SCHEMA DETECTER), script writing and other stages all require custom development for the specific analysis task, and incremental data is processed by periodically restarting the analysis task. This "customization + incremental iteration" approach has two drawbacks: on the one hand, system code cannot be reused, and repeated development leads to inefficient delivery and wasted labor; on the other hand, processing incremental data by periodic iteration is inflexible and has high latency, so it cannot serve business scenarios with real-time requirements.
To solve the above problems in the prior art, the embodiments of the present invention optimize the design of the existing OLAP system: by abstracting three single-responsibility components and a management background, they maximize reuse of the OLAP system and enable real-time processing of incremental data.
Fig. 3 is a schematic diagram of the main steps of a data analysis method according to an embodiment of the present invention. As shown in fig. 3, the data analysis method according to the embodiment of the present invention mainly includes the following steps:
Step S301: acquire a business logic script written in advance for the data analysis task. The business logic script includes data source parameter information and business logic code. The data source parameter information may be a data source parameter tuple, or a parameter identifier corresponding to a data source parameter tuple. The business logic code includes data extraction logic and data processing logic. The data extraction logic is used to acquire business data from the corresponding data source according to the data source parameter information. The data processing logic is used to specify the calculator set required by the data analysis task, so that the calculator set can be called to process the business data.
Step S302: execute the data extraction logic of the business logic code to acquire business data from the data source specified by the data source parameter information. If the data source parameter information is a data source parameter tuple, which includes data source basic information and a data read instruction, a connection is established with the corresponding data source according to the data source basic information, and the data read instruction is then executed to acquire the business data from the data source. If the data source parameter information is a parameter identifier, the corresponding data source parameter tuple is first obtained according to the parameter identifier, and the business data is then acquired in the manner described above.
Step S303: execute the data processing logic of the business logic code to call the calculator set specified in the data processing logic and process the business data. The calculator set includes calculators that each have a single data processing responsibility; a calculator takes the business data, or a processing result of the business data, as its input parameter and returns the result of processing that input. Executing the data processing logic calls the calculators in the calculator set to perform data processing such as data cleaning, data conversion and data verification on the business data.
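Steps S301 to S303 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation described in the patent: the script layout (a dict with `data_source_params` and `calculators`), the in-memory `FAKE_SOURCES` store and all function names are hypothetical.

```python
from typing import Any, Callable, Dict, List

# Hypothetical in-memory stand-in for real data sources (assumption for illustration).
FAKE_SOURCES: Dict[str, List[str]] = {
    "reviews": ["  Good value ", "", "FAST delivery"],
}


def extract(params: Dict[str, Any]) -> List[str]:
    """Data extraction logic: read business data from the source named in params."""
    return list(FAKE_SOURCES[params["source"]])


# Three single-responsibility calculators (hypothetical examples of data cleaning).
def strip_text(rows: List[str]) -> List[str]:
    return [row.strip() for row in rows]


def drop_empty(rows: List[str]) -> List[str]:
    return [row for row in rows if row]


def lowercase(rows: List[str]) -> List[str]:
    return [row.lower() for row in rows]


# S301: the business logic script bundles data source parameter information
# with the calculator set that the data processing logic should call.
script: Dict[str, Any] = {
    "data_source_params": {"source": "reviews"},
    "calculators": [strip_text, drop_empty, lowercase],
}


def run_script(task_script: Dict[str, Any]) -> List[str]:
    data = extract(task_script["data_source_params"])  # S302: data extraction
    calculators: List[Callable[[List[str]], List[str]]] = task_script["calculators"]
    for calculator in calculators:                      # S303: data processing
        data = calculator(data)
    return data


result = run_script(script)
print(result)  # ['good value', 'fast delivery']
```

Only the `script` dict changes from task to task in this sketch; `extract` and the calculators are shared, which is the kind of reuse the method aims for.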
Fig. 4 is a system architecture diagram for the data analysis method according to an embodiment of the present invention. As shown in fig. 4, the data analysis system according to the embodiment of the present invention mainly includes: a data source component (DataSource), a calculator component (Calculator), an assembler component (Assembler), and a management background (Admin). DataSource and Calculator are reusable components. In fig. 4, bdp is a big data storage and query platform.
DataSource: reusable across multiple data analysis tasks. It encapsulates access to the databases and provides a standard interface through which the Assembler obtains business data. The business data of different business systems is stored in a variety of databases, such as mysql, redis, hbase, elasticsearch, mongodb and MQ. Although there are many kinds of databases, the access mode for each specific database is fixed. A data source parameter tuple can therefore be defined in advance for each data source type and used to read the business data. For example, to access mysql, specifying the data source parameter tuple (ip, user, password, database, sql) is enough to read the business data. Likewise, to access redis, specifying the data source parameter tuple (ip, port, user, password, key, command) is enough to read the business data.
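The two parameter tuples named above can be written down as typed records. The sketch below is an assumption-laden illustration: the field names follow the tuples in the text, but the class names, the `READ_INSTRUCTION_FIELDS` mapping and the choice of which fields count as the "data read instruction" are hypothetical, and no real database is contacted.

```python
from typing import Any, Dict, NamedTuple, Tuple


class MysqlParams(NamedTuple):
    ip: str
    user: str
    password: str
    database: str
    sql: str


class RedisParams(NamedTuple):
    ip: str
    port: int
    user: str
    password: str
    key: str
    command: str


# Assumption: these fields form the data read instruction; the remaining
# fields are treated as data source basic information (connection details).
READ_INSTRUCTION_FIELDS = {
    MysqlParams: ("sql",),
    RedisParams: ("key", "command"),
}


def split_params(params: NamedTuple) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a parameter tuple into (basic information, read instruction)."""
    read_fields = READ_INSTRUCTION_FIELDS[type(params)]
    fields = params._asdict()
    basic = {k: v for k, v in fields.items() if k not in read_fields}
    read = {k: v for k, v in fields.items() if k in read_fields}
    return basic, read


mysql_params = MysqlParams("10.0.0.1", "reader", "secret", "shop", "SELECT 1")
basic_info, read_instruction = split_params(mysql_params)
print(read_instruction)  # {'sql': 'SELECT 1'}
```

A DataSource built along these lines would use `basic_info` to open the connection and `read_instruction` to fetch the data, matching the two-step description in step S302.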
In an embodiment, the business data includes stock data and/or incremental data. The two are divided by a reference time: data generated before the reference time is stock data, and data generated after it is incremental data. For example, if the OLAP system goes online at time T, the data that a business system (such as a commodity evaluation system) generated before T is stock data, and the new data generated after T is incremental data.
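The division by reference time can be sketched in a few lines of Python. The record layout (a `created_at` field) and the function name are hypothetical; the text does not say how a record generated exactly at the reference time is classified, so treating it as incremental here is an assumption.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def split_by_reference_time(
    records: List[Record], reference_time: datetime
) -> Tuple[List[Record], List[Record]]:
    """Return (stock, incremental) relative to the reference time T."""
    stock = [r for r in records if r["created_at"] < reference_time]
    # Assumption: a record created exactly at T counts as incremental.
    incremental = [r for r in records if r["created_at"] >= reference_time]
    return stock, incremental


T = datetime(2019, 11, 8)
records = [
    {"id": 1, "created_at": datetime(2019, 11, 7)},
    {"id": 2, "created_at": datetime(2019, 11, 9)},
]
stock, incremental = split_by_reference_time(records, T)
print([r["id"] for r in stock], [r["id"] for r in incremental])  # [1] [2]
```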
In the embodiment, DataSource supports two ways of passing the data source parameter tuple: in one, the tuple is supplied as a method parameter when the Assembler calls DataSource; in the other, the tuple is recorded in advance in the Admin background, and the Assembler specifies only the parameter identifier when it calls DataSource. In addition, DataSource subscribes to a message queue (MQ) of the business system, so that the business system can send incremental data to the OLAP system in real time through the MQ, enabling real-time processing of incremental data.
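A hedged sketch of both ideas follows: resolving a parameter tuple either directly or through a pre-registered identifier, and draining incremental messages from a queue. The registry, the `ds-N` identifier format and the use of Python's standard `queue.Queue` as a stand-in for the business system's MQ are all assumptions made for illustration.

```python
import itertools
import queue
from typing import Dict, List, Tuple, Union

ParamTuple = Tuple[str, ...]

# Hypothetical registry playing the role of tuples pre-recorded in Admin.
_registry: Dict[str, ParamTuple] = {}
_counter = itertools.count(1)


def register_params(params: ParamTuple) -> str:
    """Record a parameter tuple and return a generated parameter identifier."""
    param_id = f"ds-{next(_counter)}"
    _registry[param_id] = params
    return param_id


def resolve_params(params_or_id: Union[ParamTuple, str]) -> ParamTuple:
    """Accept either the tuple itself (mode 1) or its identifier (mode 2)."""
    if isinstance(params_or_id, str):
        return _registry[params_or_id]
    return params_or_id


def drain_incremental(mq: queue.Queue, store: List[dict]) -> int:
    """Move every message currently in the queue into the store."""
    moved = 0
    while True:
        try:
            message = mq.get_nowait()
        except queue.Empty:
            return moved
        store.append(message)
        moved += 1
```

In use, an Assembler script would carry only the identifier returned by `register_params`, and a subscriber loop would call something like `drain_incremental` whenever the business system publishes new records.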
Calculator: reusable across multiple data analysis tasks. It corresponds to the data cleaning and processing functions of the existing OLAP system, is called by the Assembler, and is completely independent of the business. For a new data analysis task, if the existing Calculators cannot meet the requirements, a reusable Calculator that is not coupled to the business must be developed in advance according to the task requirements. Each Calculator instance is a fully reusable data processing unit with a single responsibility, such as word segmentation, sentiment analysis, face recognition, ASR (Automatic Speech Recognition) or TTS (Text To Speech).
A Calculator method accepts data as its input parameter (independent of the business) and returns a result after processing. For example, the input of the word segmentation Calculator is text and its output is the resulting entries (terms). Likewise, the input of the ASR Calculator is audio and its output is a text string. Both the word segmentation Calculator and the ASR Calculator can be used in multiple data analysis tasks; the word segmentation Calculator, for example, can be used in both an evaluation text processing task and a question-and-answer task. The Assembler calls multiple Calculators to complete the data processing.
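The Calculator contract (data in, result out, no business knowledge) can be sketched as a small abstract interface. The class and method names below are hypothetical, and whitespace splitting is only a placeholder for a real word segmentation algorithm.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Sequence


class Calculator(ABC):
    """A single-responsibility processing unit: one input, one result."""

    @abstractmethod
    def calculate(self, data: Any) -> Any:
        ...


class WhitespaceSegmenter(Calculator):
    """Placeholder word segmentation: lowercases and splits on whitespace."""

    def calculate(self, data: str) -> List[str]:
        return data.lower().split()


class TermCounter(Calculator):
    """Counts the terms produced by a segmenter."""

    def calculate(self, data: List[str]) -> int:
        return len(data)


def chain(calculators: Sequence[Calculator], data: Any) -> Any:
    """Feed each calculator's result into the next, as an Assembler might."""
    for calculator in calculators:
        data = calculator.calculate(data)
    return data


terms = WhitespaceSegmenter().calculate("Sound is LOUD")
print(terms)  # ['sound', 'is', 'loud']
print(chain([WhitespaceSegmenter(), TermCounter()], "Sound is LOUD"))  # 3
```

Because neither class refers to items, reviews or any other business concept, the same `WhitespaceSegmenter` could serve both a review-processing task and a question-and-answer task.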
Assembler: the dispatching center of the whole data analysis system. It reads business data through DataSource, calls Calculators to clean and process the data, and finally calls DataSource to store the result data. The Assembler can be understood as the business logic layer of a data analysis task, so Assembler processing logic must be custom-developed for each task.
For a new data analysis task, the Assembler is the only part that needs to be developed (unless the existing Calculators do not meet the requirements, in which case a new Calculator must also be developed). The business logic of a data analysis task can be implemented in two ways: one is to custom-develop the task's business logic script in the Assembler project; the other is to submit the task's Assembler script in the Admin background.
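The Assembler's read, process and store cycle might look like the following sketch. `InMemoryDataSource`, the table names and the calculator registry keyed by name are hypothetical simplifications; a real DataSource would wrap actual databases.

```python
from typing import Any, Callable, Dict, List

Rows = List[Any]


class InMemoryDataSource:
    """Hypothetical DataSource stand-in backed by a dict of tables."""

    def __init__(self) -> None:
        self._tables: Dict[str, Rows] = {}

    def read(self, table: str) -> Rows:
        return list(self._tables.get(table, []))

    def write(self, table: str, rows: Rows) -> None:
        self._tables[table] = list(rows)


class Assembler:
    """Dispatching center: read via DataSource, call Calculators, store results."""

    def __init__(
        self, data_source: InMemoryDataSource, calculators: Dict[str, Callable[[Rows], Rows]]
    ) -> None:
        self.data_source = data_source
        self.calculators = calculators

    def run(self, source_table: str, calculator_names: List[str], target_table: str) -> Rows:
        rows = self.data_source.read(source_table)
        for name in calculator_names:
            rows = self.calculators[name](rows)
        self.data_source.write(target_table, rows)
        return rows


source = InMemoryDataSource()
source.write("raw", ["b", "a", "b"])
assembler = Assembler(
    source,
    {"dedupe": lambda rows: list(dict.fromkeys(rows)), "sort": sorted},
)
print(assembler.run("raw", ["dedupe", "sort"], "clean"))  # ['a', 'b']
```

In this sketch the task-specific part is just the argument list passed to `run`, which is roughly what a script submitted through Admin would need to express.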
Fig. 5 is a schematic diagram illustrating the implementation principle of a clustering task for item question-and-answer data using the data analysis method according to the embodiment of the present invention. As shown in FIG. 5, the Assembler calls the DataSource to read the item question-and-answer data of the question-and-answer system, and then combines several Calculators for word segmentation, similarity calculation, and clustering to complete the clustering task. The concrete implementation is as follows:
(1) Word segmentation: the Assembler calls the DataSource to acquire all item question-and-answer data from a data mart (such as a bdp database), then calls a word segmentation Calculator to segment the text, and stores the resulting entry records into the hbase database through the DataSource. The record structure is as follows: { "sku", "original text", "entry result" }.
(2) Similarity calculation: the Assembler calls the DataSource to obtain the entry data from hbase, then calls a similarity Calculator to pair up the entry records under each item (sku) and compute a similarity score for every pair, so that each pair of entries obtains a similarity result (such as a score). Finally, it calls the DataSource to store the similarity results in the ES database.
(3) Clustering: the Assembler calls the DataSource to read the similarity results and calls a clustering Calculator to cluster the question-and-answer data under each sku. For example, suppose sku 123 is a washing machine with the following questions: [ "this washing machine is loud", "clean washing", "sound is loud", "washing effect is good" ], with ids 1, 2, 3, and 4 respectively; the clustering result is then [ {1,3}, {2,4} ].
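The three steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the whitespace tokenizer, the Jaccard similarity, the threshold-based grouping, and the sample questions (reworded in English so that simple token overlap is meaningful) are all assumptions, and the document does not specify which algorithms the actual Calculators use.

```python
from itertools import combinations


def segment(text):
    """Word segmentation Calculator (stand-in): lowercase whitespace split."""
    return text.lower().split()


def jaccard(tokens_a, tokens_b):
    """Similarity Calculator (stand-in): Jaccard overlap of two token lists."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pair_scores(records):
    """Score every pair of records under one sku; records maps id -> tokens."""
    return {(i, j): jaccard(records[i], records[j])
            for i, j in combinations(sorted(records), 2)}


def cluster(ids, scores, threshold):
    """Clustering Calculator (stand-in): group ids linked by score >= threshold."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in scores.items():
        if score >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return sorted(groups.values(), key=min)


# Hypothetical questions under one sku (not the document's original wording).
questions = {
    1: "the machine is very loud",
    2: "washes clothes clean",
    3: "very loud machine",
    4: "clothes come out clean",
}
entries = {qid: segment(text) for qid, text in questions.items()}
scores = pair_scores(entries)
clusters = cluster(list(questions), scores, threshold=0.3)
print(clusters)
```

In the described system each stage would read its input from, and write its output to, the DataSource; the sketch keeps everything in memory to show only how the Calculators compose.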
In a preferred embodiment, before the word segmentation Calculator is used to process the item question-and-answer data, an embedding Calculator may be used to process the item question-and-answer data. The concrete implementation is as follows: the Assembler acquires all item question-and-answer data from the data mart through the DataSource, calls the embedding Calculator to perform model training, which generates an embedding model and background data, and then calls the DataSource to store them in a text database.
Admin: the key to achieving zero development effort on the platform side of the data analysis system. Admin serves as the visual interface for interaction between a user and the data analysis system, and the Admin background provides rich functions. Through Admin, users can submit data source parameter tuples, view the currently supported database types, view the existing calculators, submit the Assembler program script for a task, start and stop tasks, view task status (such as progress), and so on.
In a preferred embodiment of the present invention, the data analysis system further includes a search system (Search). Relying on the DataSource, it provides the query entry point for result data. The Search system includes two modules: an SOA (Service-Oriented Architecture) module and Admin. The SOA module is a server-side application that encapsulates the query interface and exposes it externally over HTTP or similar means, mainly for the users' developers to call. Admin serves as the data query background interface and provides simple query conditions, through which data users can retrieve the data processing results of the data analysis system. All data users query data through Search instead of accessing the database directly, so Search stands in for the database while making the system friendlier and easier to use.
Fig. 6 is a schematic diagram of a main flow of a data analysis method according to an embodiment of the present invention. As shown in fig. 6, the data analysis method according to the embodiment of the present invention mainly includes the following steps:
Step S601: the management background receives the data source parameter information submitted by the user and configures the data source component. To configure a data source, the user specifies the database type, provides a data source parameter tuple, and submits it to the management background for storage. The management background generates a parameter identifier (Id) for the data source parameter tuple, and the user can view both the parameter identifier and the data source parameter tuple in the management background.
Step S602: the management background receives the service logic script that the user has written for the data analysis task and submitted. The service logic script can be custom-developed in the Assembler project or submitted through the management background.
Step S603: the management background starts the data analysis task.
Step S604: the assembler component acquires the service data from the data source specified by the data source parameter information. It establishes a connection with the corresponding data source according to the data source basic information, and then executes the data reading instruction to acquire the service data from the data source.
Step S605: the assembler component calls the calculator set specified in the data processing logic to process the service data. Calculators in the calculator set are called to perform data processing such as data cleaning, data conversion, and data verification on the service data.
Step S606: the assembler component calls the data source component to store the data processing result in it.
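Steps S604 to S606 amount to a read, process, store loop driven by the assembler component. The following is a minimal sketch under stated assumptions: `InMemoryDataSource`, `run_task`, and the two sample calculators are hypothetical names introduced only for illustration, and an in-memory list stands in for a real database connection.

```python
class InMemoryDataSource:
    """Hypothetical stand-in for the data source component (no real database)."""

    def __init__(self, rows=()):
        self.rows = list(rows)
        self.saved = []

    def read(self):
        # S604: return the service data that the read instruction would fetch.
        return list(self.rows)

    def save(self, results):
        # S606: persist the data processing result.
        self.saved.extend(results)


def strip_whitespace(rows):
    """Data cleaning calculator: trim surrounding whitespace."""
    return [row.strip() for row in rows]


def drop_empty(rows):
    """Data verification calculator: discard empty records."""
    return [row for row in rows if row]


def run_task(source, sink, calculators):
    """Assembler logic: read, apply each calculator in order, then store."""
    data = source.read()              # S604
    for calculate in calculators:     # S605
        data = calculate(data)
    sink.save(data)                   # S606
    return data


source = InMemoryDataSource(["  good ", "", " loud"])
sink = InMemoryDataSource()
result = run_task(source, sink, [strip_whitespace, drop_empty])
print(result)
```

Because the assembler only sees `read`, `save`, and a list of callables, the same loop can be reused for a different task by swapping the data source parameters and the calculator list.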
FIG. 7 is a timing diagram of configuring a data source in a data analysis method according to an embodiment of the invention. As shown in fig. 7, the user configures the data source and submits it to the management background; the management background generates a data source instance and sends it to the data source component; the data source component verifies whether the parameters of the configured data source allow a normal connection to the database and whether data can be queried normally. The dashed lines in the timing diagram indicate returns after processing completes. An example follows:
The data source parameter tuple is as follows:
{"type":"mysql","ip":"192.168.100.49","port":3358,"user":"root","password":"pw123","database":"db_comment","sql":"select content from user_comment where created between '2019-10-01' and '2019-10-02'"}
After submission and storage, the following record is obtained:
{"dataSourceId":1024,"type":"mysql","ip":"192.168.100.49","port":3358,"user":"root","password":"pw123","database":"db_comment","sql":"select content from user_comment where created between '2019-10-01' and '2019-10-02'"}
The Assembler specifies the data source parameter tuple using Id 1024, which means MySQL is called with the above parameters.
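The relationship between a data source parameter tuple and its parameter identifier can be sketched as a small registry. This is a hedged illustration only: `DataSourceRegistry`, `register`, and `resolve` are hypothetical names, the sequential Id scheme is an assumption (the document does not say how Ids are generated), and no database connection is made.

```python
import itertools


class DataSourceRegistry:
    """Hypothetical stand-in for the management background's saved records."""

    def __init__(self, first_id=1024):
        # first_id=1024 only mirrors the example above; the real Id scheme
        # is not specified in the document.
        self._next_id = itertools.count(first_id)
        self._records = {}

    def register(self, params):
        """Store a data source parameter tuple and return the saved record."""
        ds_id = next(self._next_id)
        record = {"dataSourceId": ds_id, **params}
        self._records[ds_id] = record
        return record

    def resolve(self, ref):
        """Accept either a parameter identifier or a full parameter tuple."""
        if isinstance(ref, dict):
            return ref
        return self._records[ref]


registry = DataSourceRegistry()
saved = registry.register({
    "type": "mysql",
    "ip": "192.168.100.49",
    "port": 3358,
    "database": "db_comment",
})
params = registry.resolve(saved["dataSourceId"])
print(params["type"], params["port"])
```

Letting `resolve` accept both forms matches the two options described for the Assembler script: it may carry the parameter identifier or the data source parameter tuple itself.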
In addition, regarding the processing of incremental data from an MQ (message queue): if a data analysis task needs to subscribe to the incremental data MQ of the service system, the user must additionally specify the MQ's data source parameter tuple (ip, user, topic, and the like) when configuring the data source in the management background. After receiving incremental data, the DataSource calls back the Assembler, which processes the incremental data according to the logic of the service script.
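The callback relationship between the DataSource and the Assembler for incremental data can be sketched as follows. This sketch makes several assumptions: `QueueDataSource`, `subscribe`, `deliver`, and `IncrementAssembler` are hypothetical names, message arrival is simulated by calling `deliver` directly, and no real message queue client is involved.

```python
class QueueDataSource:
    """Hypothetical MQ-backed data source; arrival is simulated by deliver()."""

    def __init__(self, topic):
        self.topic = topic
        self._callbacks = []

    def subscribe(self, callback):
        # Register the Assembler callback for incremental data on this topic.
        self._callbacks.append(callback)

    def deliver(self, message):
        # Simulate one incremental message arriving from the service system.
        for callback in self._callbacks:
            callback(message)


class IncrementAssembler:
    """Applies the task's calculators to each incremental record as it arrives."""

    def __init__(self, calculators, sink):
        self.calculators = calculators
        self.sink = sink

    def on_increment(self, message):
        data = message
        for calculate in self.calculators:
            data = calculate(data)
        self.sink.append(data)


processed = []
assembler = IncrementAssembler([str.strip, str.lower], processed)
source = QueueDataSource(topic="user_comment")  # topic name is illustrative
source.subscribe(assembler.on_increment)
source.deliver("  Great Product ")
print(processed)
```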
Fig. 8 is a timing diagram of processing stock (inventory) data in a data analysis method according to an embodiment of the invention. As shown in fig. 8, the user configures the stock data task and sends it to the management background; the management background checks the calculators and forwards the task to the assembler component; the assembler component verifies the data source instance. After verification passes, the management background starts the stock data task. The assembler component pulls data from the data source component, calls the calculator component to process the data, saves the processing result, and returns the task status.
Configuring the stock data task corresponds to steps S601 and S602. The concrete implementation is as follows: in the management background, the user can view the data source parameter tuples they have saved and the corresponding parameter identifiers (Id), as well as the currently supported calculators and descriptions of their input and output parameters. The user then writes their own Assembler program script, which is the core of the data analysis task: it combines the DataSource and the Calculators and adds the task's service processing logic.
The Assembler program script contains the parameter identifier Id (or the data source parameter tuple itself), the task logic code, and similar information. Once the script is submitted and saved, the task can be started. If the Assembler program script specifies a parameter identifier Id, the task looks up the data source parameter tuple in the saved records at run time and then reads the data source. In addition, checking a calculator consists of supplying input parameters and observing the output parameters.
Fig. 9 is an implementation architecture diagram of the data analysis method for the OLAP system according to the embodiment of the present invention. As shown in FIG. 9, the OLAP system includes OLAP-DataSource, OLAP-Assembler, OLAP-Calculator, and OLAP-Admin. OLAP-DataSource acquires the stock data and the data processing results from the database. OLAP-Calculator is where new Calculators are developed. OLAP-Admin starts data analysis tasks and configures data sources. OLAP-Assembler is where the task logic script is developed; it pulls stock data or incremental data from OLAP-DataSource and calls Calculators in OLAP-Calculator to process the pulled data.
In addition, the Search system includes a Search-SOA that receives a data query request from a Client and queries data from an OLAP-DataSource, and a Search-Admin.
According to the data analysis method, a business logic script containing business logic code is obtained, the business data is obtained from the specified data source, and the calculator set is then called to process the business data, so that system code can be reused and development efficiency is high; the parameters required for data acquisition are sorted out in advance, so that the data extraction logic is implemented once and can be reused; calculators that each have a single data processing responsibility are constructed in advance, so that the data processing logic is implemented once and can be reused; with the pre-constructed word segmentation calculator, similarity calculator, and clustering calculator, the clustering task for item question-and-answer data is implemented with high development efficiency and reduced labor.
A parameter identifier is generated for each data source parameter tuple, so that service data can subsequently be acquired by specifying only the parameter identifier; real-time processing of incremental data is achieved by subscribing to the message queue of the service system; system reuse is achieved by abstracting the assembler component, the data source component, and the calculator component, and zero development effort on the platform side is achieved by submitting the service logic script in the management background; data users query data through the retrieval system rather than accessing the database directly, which makes the system friendlier and easier to use for data users.
Fig. 10 is a schematic diagram of main blocks of a data analysis apparatus according to an embodiment of the present invention. As shown in fig. 10, a data analysis apparatus 1000 according to an embodiment of the present invention mainly includes:
The script obtaining module 1001 is configured to obtain a service logic script written in advance for a data analysis task. The service logic script comprises data source parameter information and service logic code. The data source parameter information may be a data source parameter tuple or a parameter identifier corresponding to a data source parameter tuple. The service logic code includes data extraction logic and data processing logic. The data extraction logic is used for acquiring the service data from the corresponding data source according to the data source parameter information. The data processing logic is used for specifying the calculator set required by the data analysis task, so that the calculator set can be called to perform data processing on the service data.
The data obtaining module 1002 is configured to execute the data extraction logic of the service logic code, so as to obtain service data from the data source specified by the data source parameter information. If the data source parameter information is a data source parameter tuple, which comprises data source basic information and a data reading instruction, a connection is established with the corresponding data source according to the data source basic information, and the data reading instruction is then executed to acquire the service data from the data source. If the data source parameter information is a parameter identifier, the corresponding data source parameter tuple is first acquired according to the parameter identifier, and the service data is then acquired in the manner described above.
The data processing module 1003 is configured to execute the data processing logic of the service logic code, so as to call the calculator set specified in the data processing logic and perform data processing on the service data. The calculator set comprises calculators that each have a single data processing responsibility; a calculator takes the service data or a processing result of the service data as its input parameter, processes the input parameter, and returns the processing result. Executing the data processing logic calls the calculators in the calculator set to perform data processing such as data cleaning, data conversion, and data verification on the service data.
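One way to picture a calculator set made of single-responsibility calculators is a name-to-function registry from which the data processing logic selects calculators by name. The sketch below is an assumption-laden illustration: the `calculator` decorator, the `CALCULATORS` table, `call_calculator_set`, and the rating example are all hypothetical and are not taken from the document.

```python
CALCULATORS = {}


def calculator(name):
    """Register a single-responsibility calculator under a name (hypothetical)."""
    def register(func):
        CALCULATORS[name] = func
        return func
    return register


@calculator("clean")
def clean(rows):
    # Data cleaning: trim whitespace and drop blank records.
    return [row.strip() for row in rows if row.strip()]


@calculator("convert")
def convert(rows):
    # Data conversion: turn rating strings into integers.
    return [int(row) for row in rows]


@calculator("validate")
def validate(rows):
    # Data verification: keep only ratings in the range 1..5.
    return [row for row in rows if 1 <= row <= 5]


def call_calculator_set(names, data):
    """Feed each calculator's result to the next, in the order given."""
    for name in names:
        data = CALCULATORS[name](data)
    return data


ratings = call_calculator_set(["clean", "convert", "validate"],
                              [" 5", "", "3 ", "9"])
print(ratings)
```

Each calculator takes either the raw service data or the previous calculator's result as its input parameter and returns its own result, which is what allows them to be recombined across tasks.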
In addition, the data analysis apparatus 1000 according to the embodiment of the present invention may further include a receiving and generating module (not shown in fig. 10). The module is used for receiving data source configuration information and generating a corresponding parameter identifier for the data source parameter tuple in the data source configuration information; the data source configuration information comprises a data source type and a data source parameter tuple corresponding to the data source type.
It can be seen from the above description that, by receiving a service logic script containing a service logic code, service data is acquired from a specified data source, and then a calculator set is called to implement processing of the service data, so that the system code can be reused, and the development efficiency is high.
Fig. 11 shows an exemplary system architecture 1100 to which the data analysis method or data analysis apparatus of an embodiment of the present invention may be applied.
As shown in fig. 11, the system architecture 1100 may include terminal devices 1101, 1102, 1103, a network 1104, and a server 1105. The network 1104 is a medium to provide communication links between the terminal devices 1101, 1102, 1103 and the server 1105. Network 1104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 1101, 1102, 1103 to interact with a server 1105 over a network 1104 to receive or send messages or the like. Various communication client applications, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 1101, 1102, 1103.
The terminal devices 1101, 1102, 1103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 1105 may be a server that provides various services, such as a background management server that processes business logic scripts transmitted by an administrator using the terminal devices 1101, 1102, 1103. The background management server can execute the service logic script and feed back a processing result (such as a data processing result) to the terminal equipment.
It should be noted that the data analysis method provided in the embodiment of the present application is generally executed by the server 1105, and accordingly, the data analysis apparatus is generally provided in the server 1105.
It should be understood that the number of terminal devices, networks, and servers in fig. 11 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The invention also provides an electronic device and a computer readable medium according to the embodiment of the invention.
The electronic device of the present invention includes: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a data analysis method of an embodiment of the present invention.
The computer-readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements a data analysis method of an embodiment of the present invention.
Referring now to FIG. 12, shown is a block diagram of a computer system 1200 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 12 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 12, the computer system 1200 includes a Central Processing Unit (CPU)1201, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data necessary for the operation of the computer system 1200 are also stored. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a LAN card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1210 as necessary, so that a computer program read out therefrom is installed into the storage section 1208 as necessary.
In particular, the processes described above with respect to the main step diagrams may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program containing program code for performing the method illustrated in the main step diagram. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 1201.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a script acquisition module, a data acquisition module, and a data processing module. The names of these modules do not form a limitation on the module itself in some cases, and for example, the script acquisition module may also be described as a "module that acquires a service logic script written in advance for a data analysis task".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform: acquiring a service logic script written in advance for a data analysis task, the service logic script comprising data source parameter information and service logic code; executing the data extraction logic of the service logic code to acquire service data from the data source specified by the data source parameter information; and executing the data processing logic of the service logic code to call the calculator set specified in the data processing logic and perform data processing on the service data.
From the above description, it can be seen that by acquiring a service logic script containing service logic code, acquiring the service data from the specified data source, and then calling the calculator set to process the service data, system code can be reused and development efficiency is high; the parameters required for data acquisition are sorted out in advance, so that the data extraction logic is implemented once and can be reused.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method of data analysis, comprising:
acquiring a service logic script written for a data analysis task in advance; the service logic script comprises data source parameter information and a service logic code;
executing data extraction logic of the service logic code to acquire service data from a data source specified by the data source parameter information;
and executing the data processing logic of the business logic codes to call a calculator set in the data processing logic to process the data of the business data.
2. The method of claim 1, wherein the data source parameter information comprises a data source parameter tuple comprising data source base information and a data read instruction;
wherein acquiring service data from a data source specified by the data source parameter information comprises:
establishing connection with a corresponding data source according to the basic information of the data source;
and executing the data reading instruction to acquire service data from the data source.
3. The method of claim 1, wherein the calculator set includes a calculator with a single responsibility for data processing, and the calculator takes the business data or a processing result of the business data as an input parameter, and returns the processing result of the input parameter after processing the input parameter.
4. The method according to claim 2, wherein the data analysis task is a clustering task of item question and answer data, the business data is the item question and answer data, and the calculator set comprises a word segmentation calculator, a similarity calculator and a clustering calculator;
wherein calling a calculator set in the data processing logic to perform data processing on the service data comprises:
calling the word segmentation calculator to segment words of the item question and answer data to obtain word segmentation results;
calling the similarity calculator to calculate the similarity of the word segmentation records of the same item in the word segmentation results to obtain similarity calculation results;
and calling the clustering calculator to cluster the item question and answer data based on the similarity calculation results.
5. The method of claim 1, wherein the data source parameter information comprises a parameter identification;
wherein acquiring service data from a data source specified by the data source parameter information comprises:
acquiring a corresponding data source parameter tuple according to the parameter identifier; wherein the data source parameter tuple comprises data source basic information and a data reading instruction;
establishing connection with a corresponding data source according to the basic information of the data source;
and executing the data reading instruction to acquire service data from the data source.
6. The method of claim 5, further comprising:
receiving data source configuration information, and generating corresponding parameter identification for a data source parameter tuple of the data source configuration information; the data source configuration information comprises a data source type and a data source parameter tuple corresponding to the data source type.
7. The method of claim 1, wherein the business data comprises stock (inventory) data and/or incremental data, and wherein the incremental data is stored to a data source component by subscribing to a message queue of the business system.
8. A data analysis apparatus, comprising:
the script acquisition module is used for acquiring a service logic script written for the data analysis task in advance; the service logic script comprises data source parameter information and a service logic code;
the data acquisition module is used for executing the data extraction logic of the service logic code so as to acquire service data from a data source specified by the data source parameter information;
and the data processing module is used for executing the data processing logic of the service logic codes so as to call a calculator set in the data processing logic and process the data of the service data.
9. A data analysis system, comprising:
the management background is used for sending the service logic script written for the data analysis task to the assembler component; the service logic script comprises data source parameter information and a service logic code;
the assembler component is used for executing the data extraction logic of the business logic codes so as to send a data acquisition request to the data source component; data processing logic further for executing the business logic code to send a calculator call request to a calculator component;
the data source component is used for receiving the data acquisition request, acquiring service data from a data source specified by the data acquisition request and sending the service data to the assembler component;
and the calculator component is used for receiving the calculator calling request, calling a calculator set specified in the calculator calling request and processing the data of the service data.
10. The system of claim 9, wherein the calculator component is further configured to: send the processing result to the assembler component, so that the assembler component stores the processing result to the data source component;
the system further comprises: and the retrieval system is used for retrieving the processing result from the data source component according to the data query condition.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN201911088890.5A 2019-11-08 2019-11-08 Data analysis method, device and system Pending CN112783874A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911088890.5A CN112783874A (en) 2019-11-08 2019-11-08 Data analysis method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911088890.5A CN112783874A (en) 2019-11-08 2019-11-08 Data analysis method, device and system

Publications (1)

Publication Number Publication Date
CN112783874A true CN112783874A (en) 2021-05-11

Family

ID=75748449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911088890.5A Pending CN112783874A (en) 2019-11-08 2019-11-08 Data analysis method, device and system

Country Status (1)

Country Link
CN (1) CN112783874A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377752A (en) * 2021-06-04 2021-09-10 深圳力维智联技术有限公司 Data cleaning method, device and system and computer readable storage medium
CN113485763A (en) * 2021-07-02 2021-10-08 中国建设银行股份有限公司 Data processing method and device, electronic equipment and computer readable medium
CN113590445A (en) * 2021-07-30 2021-11-02 杭州玳数科技有限公司 Method for detecting running of data processing task, computer equipment and storage medium
CN115202851A (en) * 2022-09-13 2022-10-18 创新奇智(浙江)科技有限公司 Data task execution system and data task execution method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105468626A (en) * 2014-09-04 2016-04-06 腾讯科技(北京)有限公司 Data processing method and system
CN108417222A (en) * 2017-02-10 2018-08-17 三星电子株式会社 Weighted finite state transducer decoding system and speech recognition system
US20190129996A1 (en) * 2017-11-01 2019-05-02 Tahlent Inc DBA SwoopTalent Systems and methods for disparate data source aggregation, self-adjusting data model and api
CN110097412A (en) * 2018-01-31 2019-08-06 阿里巴巴集团控股有限公司 Item recommendation method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
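Heng Youyue; Li Jianghua; Zhong Hong; Liu Feng: "Application of Design Patterns in the Business Logic Layer" (设计模式在业务逻辑层中的应用), Electronic Design Engineering (电子设计工程), no. 07 *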

Similar Documents

Publication Publication Date Title
US10547747B1 (en) Configurable natural language contact flow
CN112783874A (en) Data analysis method, device and system
US20120254221A1 (en) Systems and methods for performing record actions in a multi-tenant database and application system
CN110262807B (en) Cluster creation progress log acquisition system, method and device
CN109446274B (en) Method and device for managing BI metadata of big data platform
CN110321544B (en) Method and device for generating information
CN112765166A (en) Data processing method, device and computer readable storage medium
CN112861529A (en) Method and device for managing error codes
CN111125064A (en) Method and device for generating database mode definition statement
CN112860744A (en) Business process processing method and device
CN112000734A (en) Big data processing method and device
CN112818026A (en) Data integration method and device
CN113010542B (en) Service data processing method, device, computer equipment and storage medium
CN113326305A (en) Method and device for processing data
CN109213743B (en) Data query method and device
CN110764769B (en) Method and device for processing user request
CN113836235B (en) Data processing method based on data center and related equipment thereof
CN114661289A (en) Knowledge and data driving-based micro application development system and method
CN113672671B (en) Method and device for realizing data processing
CN113434572A (en) Data query method, query system, device, storage medium and program product
CN110472055B (en) Method and device for marking data
CN113760240A (en) Method and device for generating data model
CN112131287A (en) Method and device for reading data
CN112015790A (en) Data processing method and device
CN112965946B (en) Method, device, equipment and computer readable medium for processing document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination