CN117406960A - A low-code social data computing platform and device for agile analysis scenarios - Google Patents
- Publication number: CN117406960A (application CN202311296667.6A)
- Authority: CN (China)
- Prior art keywords: data, analysis, module, data source, social
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/20—Software design
- G06F8/24—Object-oriented
Description
Technical Field
The present invention relates to the field of computer technology, and in particular to a low-code social data computing platform and device for agile analysis scenarios.
Background Art
With the rapid development of social platforms, the massive volume of data generated by large numbers of users poses challenges for data analysis. In addition, the heterogeneity of data sources and the constant change of data characteristics further complicate analysis. For each specific application scenario, business code often has to be rewritten from scratch, even though the data processing in these scenarios is largely similar.
Most existing social platform data analysis systems and methods use traditional data analysis architectures and technologies, which typically require extensive programming and debugging work and demand highly skilled developers. A low-code social platform data analysis method and system is therefore needed to improve the efficiency and quality of data analysis while reducing development cost and time.
Summary of the Invention
To this end, the present invention first proposes a low-code social data computing platform and device for agile analysis scenarios, applied to a social data analysis system, characterized in that the system includes a data source definition process and a data source processing process.
The data source definition process defines the data of a social platform data source and comprises four modules:
A data format definition module, which defines the data formats of heterogeneous social platforms and binds fields for the relationships and attributes unique to social platforms;
A preprocessing flow definition module, in which the user defines the stream processing pipeline of the data preprocessing module for this data source by arranging several predefined data preprocessing functions or adding custom preprocessing functions;
An analysis task orchestration module, in which the user selects different analysis services provided by the system's analysis service middle platform according to different analysis requirements and combines and orchestrates the results;
An output result definition module, which defines the format and content of the output results according to user requirements.
The data source processing process runs the defined processing flow and comprises five modules:
A data access module, which uses a registration callback mechanism to flexibly connect data sources in the form of microservice functions, feeds the data defined by the data format definition module into the system, performs preliminary cleaning and formatting on the received data, and sends the data to the data preprocessing module for further processing;
A data preprocessing module, which, based on the functions defined in the preprocessing flow definition module, preprocesses the continuously arriving and unordered social platform data in a streaming computing organization mode;
A social association graph construction module, which receives the preprocessed data from the data preprocessing module through a message subscription framework, decouples the data storage process from the processing process, constructs the data into a social association graph, and stores it in a graph database;
A data analysis module, which takes the social association graph as input based on the selections made in the analysis task orchestration module, and produces customized data analysis results;
A customized output module, which, based on the definitions in the output result definition module and according to the user's analysis requirements, registers the corresponding data analysis modules, periodically collects the corresponding analysis results, and constructs concrete business outputs.
The data source is edited and configured in a low-code manner through the drag-and-drop editing page provided by the control panel. For each newly created data source, the control panel creates a corresponding data source controller that uniformly manages the creation, query, update, and deletion of the data source. The data source controller periodically sends heartbeat signals to the control panel to report its own status, allowing the data source manager to react promptly when a controller loses contact.
After being created, the data source controller notifies the resource coordinator to allocate resources to the specific process modules. After notifying the resource coordinator to create functions, the controller of data source A creates preprocessing function 1 and preprocessing function m on physical node 1 and physical node n, respectively; multiple processing functions can coexist on a single physical node. Each created function periodically sends heartbeat signals to its managing component to report the running status of the function and its container, and capacity is scaled up or down according to the load.
The technical effects achieved by the present invention are as follows:
The method for constructing a low-code data computing platform for social network agile analysis scenarios provided by the present invention can agilely adapt to rapidly changing analysis business requirements for social networks. Based on massive, heterogeneous social platform data, business logic is designed in a low-code form, the data is processed and analyzed efficiently and rapidly, and customized analysis results are finally output to serve the downstream businesses of various social platform analyses. The method includes a multi-source data access module, a data preprocessing module, a social association graph construction module, a data analysis module, and a customized analysis result output module. Through the high degree of decoupling of these modules and the provision of low-code operators, users can process and analyze data quickly and flexibly. At the same time, the method uses emerging technologies such as microservice functions and streaming computing to achieve flexible access to data sources and rapid adaptation to data characteristics, thereby improving the efficiency and quality of data analysis.
Brief Description of the Drawings
Figure 1: Architecture of the low-code data computing platform construction method for agile analysis scenarios;
Figure 2: The data source creation process;
Figure 3: Data source controller architecture;
Figure 4: The overall analysis and computation flow of the data;
Figure 5: The overall analysis and computation flow of the data.
Detailed Description
The following preferred embodiment, in conjunction with the accompanying drawings, further describes the technical solution of the present invention; however, the present invention is not limited to this embodiment.
The present invention proposes a method for constructing a low-code data computing platform for agile analysis scenarios.
Specifically, the present invention discloses a system for constructing a low-code data computing platform for social network agile analysis scenarios, which, for each data source, includes a data source definition process and a data source processing process.
Specifically, the data source definition process defines the data of a social platform data source and comprises four modules:
The data format definition module defines the data formats of heterogeneous social platforms, including but not limited to text, images, and video. Besides binding fields for several relationships unique to social platforms (likes, reposts, and comments) and several attributes (account ID, user nickname), the data types and meanings of all other fields are defined by the user.
An example format is as follows:
Here, the system-defined fields are the minimal set of basic attributes that social platform data possesses, containing the minimum information needed to build an association graph:
text_fields: defines the text fields. The example contains only one content field, of type "text".
image_fields: defines the image fields. The example contains only one image_url field, of type "image".
video_fields: defines the video fields. The example contains only one video_url field, of type "video".
relation_fields: defines the relationship fields, covering likes, comments, and shares. The example defines the like, comment, and share fields, all of type "relation". These relationship fields can be bound to other related fields (such as account_id and user_name).
User-defined fields can serve as additional information for data analysis, enhancing the authenticity and reliability of the analysis results:
user_defined_fields: defines user-defined fields. The example defines two custom fields, custom_field1 and custom_field2, of types "text" and "number" respectively. Users can define additional custom fields as needed.
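The example format referred to above does not survive in this text. A minimal sketch of such a definition, reconstructed from the field descriptions (the exact schema shape and the bind key are assumptions), might look like:

```python
import json

# Hypothetical data source definition covering the system-defined fields
# (text_fields, image_fields, video_fields, relation_fields) and the
# user-defined fields described above. The exact schema used by the
# platform is not given in the text; this is an assumed reconstruction.
data_source_definition = {
    "text_fields": [{"name": "content", "type": "text"}],
    "image_fields": [{"name": "image_url", "type": "image"}],
    "video_fields": [{"name": "video_url", "type": "video"}],
    "relation_fields": [
        {"name": "like", "type": "relation", "bind": ["account_id", "user_name"]},
        {"name": "comment", "type": "relation", "bind": ["account_id", "user_name"]},
        {"name": "share", "type": "relation", "bind": ["account_id", "user_name"]},
    ],
    "user_defined_fields": [
        {"name": "custom_field1", "type": "text"},
        {"name": "custom_field2", "type": "number"},
    ],
}

# A definition like this is serializable, so the control panel could
# store and exchange it as JSON.
serialized = json.dumps(data_source_definition, indent=2)
```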
With the above data definition in place, the platform can uniformly ingest data from multiple sources, fully decoupling data sources from downstream business and enabling complex analysis workloads.
In the preprocessing flow definition module, the user defines the stream processing pipeline of the data preprocessing module for this data source by arranging several predefined data preprocessing functions or adding custom preprocessing functions. Specific steps include but are not limited to data cleaning, deduplication, filtering, and normalization, ensuring the accuracy and efficiency of subsequent data processing and analysis. A configuration example is as follows:
pipeline is an array describing the preprocessing flow, defining a sequence of preprocessing functions and their parameters.
Each preprocessing function is represented by an object containing two fields, name and params. The name field gives the name of the preprocessing function, for example "dataCleaning", "dataDeduplication", or "dataFiltering". The params field is an object that sets the parameters of the preprocessing function; the specific parameters depend on each function.
In the example, the "dataCleaning" function has two parameters, removeStopwords and removePunctuation, which determine whether to remove stop words and punctuation.
The "dataDeduplication" function has one parameter, keyField, which specifies the key field used for deduplication.
The "dataFiltering" function has one parameter, filterCondition, which sets the filtering condition, for example selecting only data with more than 1,000 likes.
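The configuration example itself is not reproduced in this text; a sketch consistent with the description above (the exact serialization is an assumption) could be:

```python
# Hypothetical preprocessing-flow configuration reconstructed from the
# description above; the function and parameter names follow the text,
# everything else is an assumption.
preprocessing_config = {
    "pipeline": [
        {"name": "dataCleaning",
         "params": {"removeStopwords": True, "removePunctuation": True}},
        {"name": "dataDeduplication",
         "params": {"keyField": "content"}},
        {"name": "dataFiltering",
         "params": {"filterCondition": "likes > 1000"}},
    ]
}

# The platform would dispatch each entry to the preprocessing function
# registered under "name"; here we only derive the stage order.
stage_order = [step["name"] for step in preprocessing_config["pipeline"]]
```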
In the analysis task orchestration module, the user selects different analysis services provided by the system's analysis service middle platform according to different analysis requirements and combines and orchestrates the results. The module supports orchestrating multiple analysis tasks and runs them in a serverless manner. The analysis service algorithms can be grouped into categories including but not limited to machine learning, deep learning, and user-defined algorithms.
For machine learning, supported algorithms include Support Vector Machine, Random Forest, Logistic Regression, K-Nearest Neighbors, Principal Component Analysis, AdaBoost, and XGBoost.
For deep learning, supported algorithms include Convolutional Neural Network, Recurrent Neural Network, Long Short-Term Memory, Bidirectional Recurrent Neural Network, Generative Adversarial Network, Transformers, and deep reinforcement learning algorithms.
For user-defined algorithms, users can develop their own analysis algorithms based on their needs and data characteristics. These algorithms can be based on traditional statistical methods, domain-specific knowledge models, or other custom analysis methods.
The analysis task orchestration module lets users select different algorithms and configure their parameters to meet specific analysis requirements. Users can flexibly choose suitable algorithms according to the data characteristics, problem type, and business scenario, and combine and orchestrate the results of multiple algorithms to obtain the desired analysis output. This enables multi-angle, multi-dimensional analysis of the data and provides users with more comprehensive and accurate analysis services. A configuration example is as follows:
analysisTasks is an array of analysis tasks, defining a sequence of tasks together with their algorithms and parameters. Each analysis task is represented by an object containing three fields: name, algorithm, and params.
The name field is the name of the analysis task, for example "sentimentAnalysis" or "topicClassification".
The algorithm field specifies the algorithm category to use, such as "machineLearning", "deepLearning", or another user-defined algorithm.
The params field is an object that sets the parameters of the analysis task; the specific parameters depend on each algorithm.
In the example, the "sentimentAnalysis" task uses a machine learning algorithm for sentiment analysis; it analyzes the "text" field and stores the result in the "sentiment" field.
The "topicClassification" task uses a deep learning algorithm for topic classification; it analyzes the "text" field and stores the result in the "topic" field.
The "entityExtraction" task uses a natural language processing algorithm for entity extraction; it analyzes the "text" field and stores the result in the "entities" field.
The "networkAnalysis" task uses a graph processing algorithm for social network analysis; it analyzes fields such as "accountID", "likes", "comments", and "shares" and stores the result in the "networkMetrics" field.
Based on actual needs, you can customize the analysis tasks, algorithms, and parameters following this example, and set the fields to be analyzed as required.
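The original configuration example is not reproduced in this text; a sketch matching the four tasks just described (the params key names, such as inputField and outputField, are assumptions) could be:

```python
# Hypothetical analysis-task configuration for the four example tasks;
# the task names and field names come from the text, the params layout
# is an assumed reconstruction.
analysis_config = {
    "analysisTasks": [
        {"name": "sentimentAnalysis", "algorithm": "machineLearning",
         "params": {"inputField": "text", "outputField": "sentiment"}},
        {"name": "topicClassification", "algorithm": "deepLearning",
         "params": {"inputField": "text", "outputField": "topic"}},
        {"name": "entityExtraction", "algorithm": "nlp",
         "params": {"inputField": "text", "outputField": "entities"}},
        {"name": "networkAnalysis", "algorithm": "graphProcessing",
         "params": {"inputFields": ["accountID", "likes", "comments", "shares"],
                    "outputField": "networkMetrics"}},
    ]
}

task_names = [t["name"] for t in analysis_config["analysisTasks"]]
```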
Output result definition module: defines the format and content of the output results according to user requirements, including but not limited to statistical reports, charts, and social association graphs. The module supports defining multiple output results, multiple data presentation styles, and multiple ways of exposing data, to meet different business needs. A configuration example is as follows:
outputResults is an array of output results, defining a series of outputs with their types and fields. Each output result is represented by an object containing a type field and the fields appropriate to that type.
The type field specifies the type of the output, such as "statistics", "chart", or "socialGraph".
In the example, the first output is of type "statistics"; it generates statistics computed over the specified fields, here "sentiment" and "topic".
The second output is of type "chart"; it generates a bar chart using the "topic" field as the X axis and the "sentiment" field as the Y axis.
The third output is of type "socialGraph"; it generates a social association graph using the "entities" field as node information and the "networkMetrics" field as link information.
Based on actual needs, you can customize the output results, their types, and the related fields following this example.
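The configuration example itself is missing from this text; a sketch of the three outputs just described (all key names other than "type" are assumptions) could be:

```python
# Hypothetical output-result configuration for the three examples in
# the text: statistics, a bar chart, and a social association graph.
output_config = {
    "outputResults": [
        {"type": "statistics", "fields": ["sentiment", "topic"]},
        {"type": "chart", "chartType": "bar",
         "xAxis": "topic", "yAxis": "sentiment"},
        {"type": "socialGraph",
         "nodeField": "entities", "linkField": "networkMetrics"},
    ]
}

output_types = [o["type"] for o in output_config["outputResults"]]
```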
Specifically, the data source processing process runs the defined processing flow and comprises five modules:
Data access module: this module uses a registration callback mechanism to flexibly connect data sources in the form of microservice functions, feeding social platform data from different sources into the system. The access module performs preliminary cleaning and formatting on the received data and sends it to the data preprocessing module for further processing.
Data preprocessing module: this module uses streaming computation as its main organizational mode to preprocess the continuously arriving and unordered social platform data. Newly added data features can be accommodated quickly by adding streaming processing stages. Its pseudocode is as follows:
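The pseudocode itself does not survive in this text. The following Python sketch illustrates the streaming organization described above, where each stage is a function and a new data feature is handled by appending another stage; the stage logic is an illustrative placeholder, not the patent's actual functions:

```python
def clean(record):
    # illustrative stage: normalize whitespace in the text field
    record["content"] = record["content"].strip()
    return record

def make_dedup_stage():
    seen = set()
    def dedup(record):
        # illustrative stage: drop records whose content was seen before
        if record["content"] in seen:
            return None
        seen.add(record["content"])
        return record
    return dedup

def run_stream(records, stages):
    """Push each arriving record through the ordered streaming stages;
    a stage returning None drops the record."""
    results = []
    for record in records:
        for stage in stages:
            record = stage(record)
            if record is None:
                break
        else:
            results.append(record)
    return results

# Adapting to a new data feature means appending one more stage here.
stages = [clean, make_dedup_stage()]
out = run_stream([{"content": " hello "}, {"content": "hello"}], stages)
```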
Social association graph construction module: this module decouples the data storage process from the processing process via a message subscription framework, constructs the data into a social association graph, and stores it in a graph database. This converts social platform data into graph-structured data, facilitating subsequent analysis and customized output.
Data analysis module: this module is organized as a serverless service middle platform, achieving high scalability of analysis capabilities. Various data analysis algorithms can be selected flexibly, and preset low-code operators and visualization components are provided so that users can easily customize the output of data analysis results.
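The module-combination pattern discussed in the following paragraph can be sketched in Python as follows (the original pseudocode does not survive in this text; the per-module analysis logic is an illustrative placeholder):

```python
class AnalysisModule:
    """Interface that every data analysis module implements."""
    def analyze(self, data):
        raise NotImplementedError

class SentimentAnalysis(AnalysisModule):
    def analyze(self, data):
        # placeholder sentiment logic, standing in for a real ML model
        for record in data:
            record["sentiment"] = "positive" if "good" in record["text"] else "neutral"
        return data

class TopicClassification(AnalysisModule):
    def analyze(self, data):
        # placeholder topic logic, standing in for a real classifier
        for record in data:
            record["topic"] = "sports" if "match" in record["text"] else "general"
        return data

def combine_analysis_modules(data, modules):
    """Run each module's analyze method in sequence over the data."""
    for module in modules:
        data = module.analyze(data)
    return data

result = combine_analysis_modules(
    [{"text": "good match"}],
    [SentimentAnalysis(), TopicClassification()],
)
```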
The above pseudocode defines the AnalysisModule interface; every data analysis module must implement this interface and provide an Analyze method that executes the module's specific analysis logic. Two data analysis services are then invoked, SentimentAnalysis and TopicClassification, which perform sentiment analysis and topic classification respectively. Finally, the CombineAnalysisModules function combines the different analysis modules and calls their Analyze methods in order to complete the analysis. Different analysis modules can be created according to actual needs and passed, together with the raw data, into CombineAnalysisModules for analysis. The system can then process and output the analysis results.
Customized output module: this module registers the corresponding data analysis modules according to the user's analysis requirements, periodically collects the corresponding analysis results, and builds concrete business outputs such as profiles, enabling downstream business. The output module also supports custom output formats and channels to meet the output needs of different users.
The data source definition process and the data source processing process correspond to each other: the data format definition module configures the multi-source data access module, the preprocessing flow definition module configures the data preprocessing module, the analysis task orchestration module configures the data analysis module, and the output result definition module configures the customized output module while also configuring how downstream tasks receive the results.
The data source is edited and configured in a low-code manner through the drag-and-drop editing page provided by the control panel; users need not write detailed code for data flow control.
For each newly created data source, the control panel creates a corresponding data source controller that uniformly manages the creation, query, update, and deletion of the data source. The data source controller periodically sends heartbeat signals to the control panel to report its own status, so that the data source manager can react promptly if a controller loses contact.
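The heartbeat mechanism just described can be sketched as follows; the interval, timeout, message shape, and class names are all assumptions for illustration:

```python
import time

class DataSourceController:
    """Sends periodic heartbeats reporting its own status."""
    def __init__(self, name):
        self.name = name

    def send_heartbeat(self, now=None):
        at = now if now is not None else time.monotonic()
        return {"controller": self.name, "status": "alive", "at": at}

class ControlPanel:
    """Tracks controller heartbeats and flags controllers whose last
    heartbeat is older than the timeout, so that corrective action
    can be taken for lost controllers."""
    def __init__(self, timeout=15.0):
        self.timeout = timeout
        self.heartbeats = {}

    def receive(self, heartbeat):
        self.heartbeats[heartbeat["controller"]] = heartbeat["at"]

    def lost_controllers(self, now):
        return [name for name, at in self.heartbeats.items()
                if now - at > self.timeout]

# Simulated timeline using explicit timestamps instead of real clocks.
panel = ControlPanel(timeout=15.0)
ctrl = DataSourceController("datasource-A")
panel.receive(ctrl.send_heartbeat(now=100.0))
```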
After the data source controller is created, it notifies the resource coordinator to allocate resources to the specific process modules. Taking the preprocessing manager as an example, the resource coordinator searches the cluster for suitable idle physical nodes according to the default number of functions set in the preprocessing manager, and creates preprocessing functions on them.
As shown in Figure 2, after notifying the resource coordinator to create functions, the controller of data source A creates preprocessing function 1 and preprocessing function m on physical node 1 and physical node n, respectively. Multiple processing functions can coexist on a single physical node.
The scaling function is designed to adjust resources dynamically according to the load, so that the system can adapt to different workloads and improve resource utilization.
When resource usage exceeds the threshold (resource_usage > threshold), scaling is triggered; the threshold can be set based on actual conditions or determined through tuning. If usage is below the threshold, no expansion is needed.
The number of instances to scale to is computed as num_instances = int(resource_usage / threshold) + 1, i.e., from the ratio of resource usage to the threshold.
The computed instance count is capped at the maximum number of instances: num_instances = min(num_instances, max_instances).
The function returns the computed instance count, indicating how many instances are needed. A code example is as follows:
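The code example itself is not preserved in this text; a direct Python transcription of the rule just described (returning 0 in the below-threshold, no-expansion case is an assumption) would be:

```python
def scale_instances(resource_usage, threshold, max_instances):
    """Compute the number of function instances to scale to, following
    the rule above: no scale-out at or below the threshold; otherwise
    int(resource_usage / threshold) + 1, capped at max_instances."""
    if resource_usage <= threshold:
        return 0  # below threshold: no expansion needed (assumed return value)
    num_instances = int(resource_usage / threshold) + 1
    return min(num_instances, max_instances)
```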
The scaling function can be adjusted and modified according to the actual situation. Parameters such as the resource usage threshold and the maximum number of instances can be set according to system requirements and actual conditions, to meet demand and improve system performance.
Each created function periodically sends heartbeat signals to its managing component, reporting the running status of the function and its container, and capacity is scaled up or down according to the load.
In particular, the data analysis middle platform can be understood as a set of standardized algorithm libraries organized in serverless form. Because its computation may depend on dedicated hardware such as graphics computing components, it can be deployed separately on another set of hardware. The data processing pipeline and the service middle platform coordinate through communication protocols, so the components are fully decoupled, improving overall robustness and availability.
Figure 3 shows the overall analysis and computation flow of the data after a data source is defined.
Step 1: Data source A pushes acquired messages to the message queue, either through the API provided by the software or with a general-purpose message queue SDK. Messages can be pushed as single text records or in batches.
Step 2: Function instances under the data source A manager periodically pull new data from the message queue; after modifying a message's topic, they leave it for subsequent functions to process. In this way the entire preprocessing pipeline is executed in sequence.
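This pull-transform-republish chain can be sketched in memory as follows. The queue layout, topic names, and transforms are assumptions for illustration; a real deployment would pull from and publish to the actual message queue via its SDK.

```python
from collections import deque

# In-memory stand-in for the message queue: topic -> pending messages.
queues = {"raw": deque(), "stage1": deque(), "stage2": deque()}

def preprocess_stage(src_topic, dst_topic, transform):
    """Pull pending messages from src_topic, apply one preprocessing step,
    and republish them under dst_topic for the next function to pick up."""
    while queues[src_topic]:
        msg = queues[src_topic].popleft()
        queues[dst_topic].append(transform(msg))

# Two chained preprocessing functions executed in sequence.
queues["raw"].extend(["  Hello ", " WORLD  "])
preprocess_stage("raw", "stage1", str.strip)     # cleaning: trim whitespace
preprocess_stage("stage1", "stage2", str.lower)  # normalization: lowercase
```

Changing the topic between stages is what orders the stages: each function only consumes the topic the previous one produced.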
Step 3: The processed data waits in the message queue to be captured by function instances of the social association graph builder. The attribute vector embedding function normalizes the attribute fields in the data according to their predefined types and converts them into vector form; once stored in the database, these vectors are available to later analysis modules. The data is then organized as a graph and stored in the graph database. Together, the two provide the underlying data support for subsequent analysis tasks.
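As a sketch of the attribute vector embedding step, the following normalizes numeric fields by a predefined range and one-hot encodes categorical fields. The schema representation here is an assumption; the patent does not specify the actual type-definition format.

```python
def embed_attributes(record, numeric_ranges, categorical_values):
    """Convert a record's attribute fields into a flat numeric vector.

    numeric_ranges:     field -> (lo, hi), min-max normalized into [0, 1].
    categorical_values: field -> list of known categories, one-hot encoded.
    """
    vec = []
    for field, (lo, hi) in numeric_ranges.items():
        vec.append((record[field] - lo) / (hi - lo))  # scale into [0, 1]
    for field, categories in categorical_values.items():
        vec.extend(1.0 if record[field] == c else 0.0 for c in categories)
    return vec
```

For instance, a record with age 25 (range 0 to 100) and gender "f" over categories ["m", "f"] embeds as [0.25, 0.0, 1.0].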
Step 4: The data analysis manager periodically invokes the analysis algorithms provided by the data analysis middle platform according to the configured indicators. Analysis tasks fall into two types:
If the analysis indicator is independent, i.e. it does not depend on other data but only on statistics such as max, the corresponding node's indicator in the graph database is updated directly after computation.
If the indicator depends on other data (for example, it requires walking the whole graph, and the arrival of a new node updates the indicators of other nodes), the computation usually takes longer. Such tasks are triggered when the overall cluster load is low, on a timer, or when the count of accumulated identical tasks reaches a threshold. The database is locked during the update, and service resumes once the update completes.
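The two dispatch paths can be sketched as follows. The indicator representation, the load threshold, and the in-memory graph stand-in are illustrative assumptions rather than the patent's actual data structures.

```python
def dispatch_analysis(indicator, graph_db, cluster_load, pending, load_threshold=0.3):
    """Route one analysis task by indicator type.

    Independent indicators depend only on a node's own values (e.g. max)
    and update the node immediately; dependent indicators need a
    whole-graph computation and are deferred unless cluster load is low.
    """
    if indicator["kind"] == "independent":
        node = graph_db[indicator["node"]]
        node[indicator["name"]] = indicator["fn"](node["values"])
        return "updated"
    if cluster_load <= load_threshold:
        # Here the real system would lock the graph database, run the
        # whole-graph computation, then resume serving after the update.
        return "run-graph-task"
    pending.append(indicator)  # accumulate until a timer/counter triggers
    return "deferred"
```

Deferring the heavy tasks until load is low (or a counter threshold is reached) is what keeps the cheap per-node updates responsive.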
Step 5: The output manager periodically reorganizes the analysis indicator data into the configured form and delivers it to downstream applications, in one of two ways:
If the user configures passive receiving, the output manager periodically collates the results and either sends them to the message queue or calls a user-configured hook function to deliver them to the corresponding address.
If the user configures active receiving, the user simply calls the API the system exposes to obtain the analysis results in their latest state.
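The two delivery modes can be sketched as a single dispatch function; the parameter names and the in-memory queue are assumptions for illustration.

```python
def deliver_results(results, mode, message_queue=None, hook=None):
    """Deliver collated analysis results downstream.

    Passive mode: the platform pushes, either to a message queue or to a
    user-configured hook (e.g. a webhook-style callback).
    Active mode: results are simply returned, modeling the snapshot the
    system's query API would serve when the user calls it.
    """
    if mode == "passive":
        if hook is not None:
            hook(results)               # user-configured callback
        else:
            message_queue.append(results)
        return None
    return results                      # active: caller pulls via the API
```

In passive mode the platform controls delivery timing; in active mode the downstream application controls it, which suits consumers that only need the latest snapshot.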
Steps 1 through 5 are then repeated.
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311296667.6A CN117406960A (en) | 2023-10-09 | 2023-10-09 | A low-code social data computing platform and device for agile analysis scenarios |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117406960A true CN117406960A (en) | 2024-01-16 |
Family
ID=89497151
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||