CN116860720A - Multi-source heterogeneous data model modeling system oriented to big data analysis

Info

Publication number: CN116860720A
Application number: CN202310859780.4A
Authority: CN
Inventors: 徐俊山; 孔小强; 马廷; 吕太轩; 宋磊; 姬廷; 董临治; 徐生明; 常河; 周超; 王璐
Original assignee: Yulin High Tech Zone Xinhui New Energy Co ltd
Current assignee: Yulin High Tech Zone Xinhui New Energy Co ltd
Priority date: 2023-07-13
Filing date: 2023-07-13
Publication date: 2023-10-10

Abstract

The invention discloses a multi-source heterogeneous data model modeling system oriented to big data analysis, which belongs to the technical field of multi-source data processing and comprises an information module, an analysis module and a modeling module. The information module is used by users to organize and upload enterprise demand information and to determine the corresponding target classes based on that information. The analysis module analyzes each target class and determines a target multi-source heterogeneous data initial processing model. The modeling module establishes the multi-source heterogeneous data processing model required by the user: it acquires the target multi-source heterogeneous data initial processing model and adjusts it to obtain the corresponding multi-source heterogeneous data processing model. Through the cooperation of the information module, the analysis module and the modeling module, a multi-source heterogeneous data processing model that meets the user's requirements is established in a personalized manner, and the information module allows the actual modeling requirements of the enterprise user to be analyzed accurately.

Description

Multi-source heterogeneous data model modeling system oriented to big data analysis

Technical Field

The invention belongs to the technical field of multi-source data processing, and particularly relates to a multi-source heterogeneous data model modeling system for big data analysis.

Background

With the advent of the big data era, massive amounts of data are produced at every moment. People need to extract useful information from this data in order to understand, and even guide, daily life and work. Big data analysis has therefore emerged and is becoming an increasingly popular field.

For a big data analysis task, however, how to acquire the data set the task needs is a critical issue. In many data analysis algorithms, especially most machine learning algorithms, the data largely determines the quality of the analysis results. It is nevertheless often assumed that the data set is already available. In practice, the data sets for most data analysis tasks are still collected manually by experts or institutions in the field. Manual collection can ensure data quality and is feasible when the amount of data is small, but once the amount of data grows, relying on domain experts or institutions becomes impractical and expensive, consuming large amounts of manpower, material and financial resources.

This is especially true in the new energy power generation industry. Each department in an enterprise often uses its own auxiliary software, so a large amount of multi-source heterogeneous data is generated within the enterprise. The coding systems of individual power stations may be inconsistent, so the data cannot be combined effectively to view the complete information about a piece of equipment. In addition, business and operational needs may call for analysis based on big data, which requires related external data as well. Together these create a large demand for multi-source heterogeneous data processing. At present, however, small, medium and micro enterprises in the industry cannot make full use of their data because of factors such as technology and cost. Therefore, to meet each such enterprise's need for a multi-source heterogeneous data model, the invention provides a multi-source heterogeneous data model modeling system oriented to big data analysis.

Disclosure of Invention

In order to solve the above problems, the invention provides a multi-source heterogeneous data model modeling system oriented to big data analysis.

The aim of the invention can be achieved by the following technical scheme:

a multi-source heterogeneous data model modeling system oriented to big data analysis comprises an information module, an analysis module and a modeling module;

the information module is used for users to arrange and upload enterprise demand information and determine corresponding target classes based on the enterprise demand information.

Further, the working method of the information module comprises the following steps:

identifying enterprise demand information uploaded by a user, and acquiring a corresponding target end and modeling demand; and determining corresponding data classes according to the target end, and screening each data class to obtain the corresponding target class.

Further, when the user fills in the enterprise demand information, a corresponding demand information template is preset, and the user fills in corresponding data according to the demand information template.

Further, the method for determining the data class according to the target end comprises the following steps:

gradually establishing and refining a target end information base, wherein the target end information base stores the data classes corresponding to each target end;

matching the corresponding data classes from the target end information base according to the identified target end;

identifying any target end for which no data class is matched in the target end information base, and marking it as an end to be supplemented; searching for the data types corresponding to the end to be supplemented, and organizing them into corresponding data classes;

and supplementing the end to be supplemented and its corresponding data classes into the target end information base for storage.
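The matching and supplementing steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the dictionary-based information base, the function name `determine_data_classes`, the `lookup_missing` callback (standing in for whatever search is used for an end to be supplemented) and the example names `SCADA` and `ERP` are all assumptions introduced here.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical in-memory stand-in for the target end information base:
# it maps a target end (a software or system name) to its data classes.
InfoBase = Dict[str, Set[str]]


def determine_data_classes(
    target_ends: Iterable[str],
    info_base: InfoBase,
    lookup_missing: Callable[[str], Set[str]],
) -> Dict[str, Set[str]]:
    """Match each target end against the information base, supplementing misses."""
    matched: Dict[str, Set[str]] = {}
    for end in target_ends:
        if end not in info_base:
            # "End to be supplemented": obtain its data classes through the
            # assumed external lookup and store them for later reuse.
            info_base[end] = set(lookup_missing(end))
        matched[end] = set(info_base[end])
    return matched


# Usage with made-up target ends and data classes.
base: InfoBase = {"SCADA": {"telemetry", "alarms"}}
result = determine_data_classes(
    ["SCADA", "ERP"], base, lambda end: {"invoices", "work_orders"}
)
```

After the call, the previously unknown target end is stored in the information base, so later requests for it are matched directly.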

Further, the method for screening each data class comprises the following steps:

a demand analysis model is established, the data class and the enterprise demand information are analyzed through the demand analysis model, basic scores and correction scores corresponding to the data classes are obtained, corresponding evaluation scores are calculated according to the obtained basic scores and correction scores, and the data class with the evaluation score larger than a threshold value X1 is marked as a target class.

Further, the evaluation score calculating method includes:

marking the obtained basic score and correction score as JF and XF respectively, and calculating the corresponding evaluation score PGL according to the evaluation formula PGL = b1 × JF + b2 × XF, wherein b1 and b2 are proportionality coefficients with value ranges 0 < b1 ≤ 1 and 0 < b2 ≤ 1.
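As a minimal sketch of the evaluation formula and the threshold test, assuming the basic and correction scores are already available as plain numbers (the function names, the example data classes and all numeric values below are hypothetical):

```python
from typing import Dict, List, Tuple


def evaluation_score(jf: float, xf: float, b1: float = 1.0, b2: float = 1.0) -> float:
    """PGL = b1 * JF + b2 * XF, with 0 < b1 <= 1 and 0 < b2 <= 1."""
    if not (0 < b1 <= 1 and 0 < b2 <= 1):
        raise ValueError("b1 and b2 must lie in the interval (0, 1]")
    return b1 * jf + b2 * xf


def select_target_classes(
    scores: Dict[str, Tuple[float, float]],
    x1: float,
    b1: float = 1.0,
    b2: float = 1.0,
) -> List[str]:
    """Return the data classes whose evaluation score is greater than X1."""
    return [
        name
        for name, (jf, xf) in scores.items()
        if evaluation_score(jf, xf, b1, b2) > x1
    ]


# Made-up (JF, XF) pairs for two data classes.
example_scores = {"telemetry": (0.8, 0.1), "invoices": (0.2, 0.0)}
targets = select_target_classes(example_scores, x1=0.5, b1=1.0, b2=0.5)
```

With these illustrative numbers only `telemetry` exceeds the threshold and is marked as a target class.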

And the analysis module is used for analyzing each target class and determining an initial processing model of the target multi-source heterogeneous data.

Further, the working method of the analysis module comprises the following steps:

establishing a model library, wherein each multi-source heterogeneous data initial processing model and a corresponding data processing range are stored in the model library;

and identifying each target class to form a corresponding target class set; matching the target class set against the model library to obtain the corresponding to-be-selected multi-source heterogeneous data initial processing models and their similarities; and screening each to-be-selected multi-source heterogeneous data initial processing model to obtain the corresponding target multi-source heterogeneous data initial processing model.

Further, the method for screening the initial processing model of each multi-source heterogeneous data to be selected comprises the following steps:

identifying the redundant data classes corresponding to each to-be-selected multi-source heterogeneous data initial processing model, and carrying out similarity correction according to the identified redundant data classes and the enterprise demand data to obtain corresponding similarity values and foreground values; removing each to-be-selected multi-source heterogeneous data initial processing model whose similarity value is lower than the threshold X2; identifying the cost value corresponding to each to-be-selected multi-source heterogeneous data initial processing model, marking the obtained cost value, foreground value and similarity value as CBZ, QJZ and XSZ respectively, and calculating the corresponding priority value KPL according to the priority formula KPL = QJZ + XSZ - c × CBZ, wherein c is a cost value adjustment coefficient; and selecting the to-be-selected multi-source heterogeneous data initial processing model with the highest priority value as the target multi-source heterogeneous data initial processing model.

The modeling module is used for establishing a multi-source heterogeneous data processing model required by a user, acquiring a target multi-source heterogeneous data initial processing model, and adjusting the target multi-source heterogeneous data initial processing model to acquire a corresponding multi-source heterogeneous data processing model.

Compared with the prior art, the invention has the beneficial effects that:

Through the cooperation of the information module, the analysis module and the modeling module, a multi-source heterogeneous data processing model that meets the user's requirements is established in a personalized manner. The information module allows the actual modeling requirements of enterprise users to be analyzed, so that the types of multi-source heterogeneous data processing they need are determined accurately, which facilitates accurate processing. At the same time, the personalized service helps enterprise users reduce the cost of establishing a multi-source heterogeneous data processing model as far as possible, which helps the system spread and improves its competitiveness among small, medium and micro enterprises; it avoids the situation in which enterprises keep using their previous processing methods for reasons such as cost and therefore cannot make full use of a large amount of enterprise data. By correcting the similarity, some of the to-be-selected multi-source heterogeneous data initial processing models are removed in advance, which reduces the amount of data to be analyzed subsequently; the screening also takes into account the enterprise's possible future development and the weight it places on cost.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.

Fig. 1 is a functional block diagram of the present invention.

Detailed Description

The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As shown in fig. 1, a multi-source heterogeneous data model modeling system for big data analysis comprises an information module, an analysis module and a modeling module;

The information module is used by users to organize and upload enterprise demand information, which includes the software and systems used by each department in the enterprise, the modeling demands, enterprise information and other data. A modeling demand is a demand for a processing model that handles data for a particular purpose, such as a cost analysis demand or a business progress demand; an enterprise may set several demands according to its actual situation, and the specific demands are set by the enterprise according to its own needs. Because an enterprise has to consider factors such as cost and actual processing needs, some enterprises may have only one demand while others have several. Subsequent models therefore cannot all be built to the same requirement, since doing so would waste resources and impose higher costs on users. Allowing enterprise users to set their demands individually reduces the cost of use, which is an important selection factor, especially for small, medium and micro enterprises. The enterprise demand information uploaded by the user is then processed to obtain target demand data. The specific process is as follows:

setting a requirement information template, filling relevant data of an enterprise by a user according to the requirement information template, acquiring enterprise requirement information after filling, and uploading the enterprise requirement information;

identifying the uploaded enterprise demand information to obtain the corresponding target ends and modeling demands, wherein a target end is a piece of software, a system or the like used by a department in the enterprise. The corresponding data classes are determined according to the target ends, by the method described below. A data class is a type of data corresponding to a target end and carries the label of that target end to indicate which target end it belongs to; one data class may carry several target end labels, because different target ends may produce the same type of data. Each data class is then analyzed against the corresponding modeling demand, and every data class that is relevant to the modeling demand is marked as a target class; that is, the analysis determines which data classes are needed to fulfil the modeling demand. This analysis mainly refers to the modeling demand, the enterprise information and the data of the enterprise's related historical projects, with the enterprise information and historical project data used to further qualify and confirm the modeling demand. Even when the modeling demand and the data classes are the same, the target classes an enterprise needs may differ because enterprises differ in management style, scale and so on. The characteristics of the enterprise therefore need to be analyzed in combination with the enterprise information and the data of its related historical projects.
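One possible way to represent a filled-in demand information template, from which the target ends and modeling demands are then read, is a small data class. The description does not specify the template's fields, so the class name `DemandInfo`, every field name and the example values below are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DemandInfo:
    """Hypothetical shape of the enterprise demand information."""

    enterprise_info: Dict[str, str]   # e.g. scale, management style
    target_ends: List[str]            # software/systems used by departments
    modeling_demands: List[str]       # e.g. "cost analysis"
    history_projects: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Reject a template that lacks the two items identified later."""
        if not self.target_ends:
            raise ValueError("at least one target end is required")
        if not self.modeling_demands:
            raise ValueError("at least one modeling demand is required")


# Usage with made-up values.
info = DemandInfo(
    enterprise_info={"scale": "small"},
    target_ends=["SCADA", "ERP"],
    modeling_demands=["cost analysis"],
)
info.validate()
```

Validating at upload time is a design choice of this sketch: it ensures the later steps always have at least one target end and one modeling demand to work with.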

Specifically, a corresponding demand analysis model can be established based on a neural network, such as a CNN or a DNN, and trained on an established training set. The training set comprises enterprise demand information, data classes, and the basic score and correction score set for each data class. The basic score is set without reference to the enterprise information, the data of the enterprise's related historical projects and the like; that is, it is set only according to the modeling demand and the data class. The correction score is a correcting score derived from the actual situation of the enterprise, such as the enterprise information and the data of its related historical projects. The successfully trained demand analysis model is used to obtain the basic score and the correction score of each data class, which are marked as JF and XF respectively, and the corresponding evaluation score PGL is calculated according to the evaluation formula PGL = b1 × JF + b2 × XF, wherein b1 and b2 are proportionality coefficients that enterprise users can adjust according to their own needs, with value ranges 0 < b1 ≤ 1 and 0 < b2 ≤ 1. Each data class whose evaluation score is greater than the threshold X1 is marked as a target class.

The method for determining the data class according to the target terminal comprises the following steps:

identifying any target end for which no data class is matched in the target end information base, and marking it as an end to be supplemented. Because a wide variety of related software and systems is available on the market, the target end information base, when first established, can essentially cover only the more popular software and systems. For each end to be supplemented, the data types it may generate are acquired from the Internet or other existing channels and organized into corresponding data classes.

The information module allows the actual modeling requirements of enterprise users to be analyzed, so that the types of multi-source heterogeneous data processing they need are determined accurately, which facilitates accurate processing. At the same time, the personalized service helps enterprise users reduce the cost of establishing a multi-source heterogeneous data processing model as far as possible, which helps the system spread and improves its competitiveness among small, medium and micro enterprises; it avoids the situation in which enterprises keep using their previous processing methods for reasons such as cost and therefore cannot make full use of a large amount of enterprise data.

The analysis module is used for analyzing each target class and determining a multi-source heterogeneous data initial processing model closest to the requirements, namely determining the multi-source heterogeneous data initial processing model which is most in line with the target class set in the existing model library according to the target class set of each enterprise, and the specific process comprises the following steps:

manually establishing a plurality of multi-source heterogeneous data initial processing models according to the service scope and market demand, setting for each multi-source heterogeneous data initial processing model the range of multi-source heterogeneous data types it processes, and establishing the corresponding model library once this is complete;

identifying each target class to form a corresponding target class set, wherein the target class set is the set formed by the target classes. The target class set is matched against the range of multi-source heterogeneous data types processed by each multi-source heterogeneous data initial processing model in the model library, yielding the to-be-selected multi-source heterogeneous data initial processing models and their similarities. A to-be-selected multi-source heterogeneous data initial processing model is one whose data type range covers the target class set, that is, whose range is equal to or larger than the target class set; a model whose range cannot fully include the target class set fails to match and is not taken as a to-be-selected model. The similarity is calculated as the ratio of the number of target classes to the number of data classes in the model's range. Each to-be-selected multi-source heterogeneous data initial processing model is then screened to obtain the target multi-source heterogeneous data initial processing model, namely the multi-source heterogeneous data initial processing model closest to the user's demand.
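The coverage test and similarity ratio described above map naturally onto set operations. The following is a rough sketch under the assumption that the model library can be represented as a mapping from a model name to the set of data classes in its processing range; the function name `match_candidates`, the model names and the data class names are hypothetical.

```python
from typing import Dict, Set


def match_candidates(
    target_set: Set[str], model_library: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Return each to-be-selected model together with its similarity.

    A model qualifies only if its processing range fully covers the target
    class set; its similarity is the number of target classes divided by
    the number of data classes in the model's range.
    """
    candidates: Dict[str, float] = {}
    if not target_set:
        return candidates
    for name, scope in model_library.items():
        if target_set <= scope:  # subset test: the range covers every target class
            candidates[name] = len(target_set) / len(scope)
    return candidates


# Made-up model library and target class set.
library = {
    "model_a": {"telemetry", "alarms"},
    "model_b": {"telemetry", "alarms", "invoices", "work_orders"},
    "model_c": {"telemetry"},
}
similarities = match_candidates({"telemetry", "alarms"}, library)
```

Here `model_a` matches exactly (similarity 1.0), `model_b` covers the target set with two redundant classes (similarity 0.5), and `model_c` is rejected because it does not cover `alarms`.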

The method for screening the initial processing model of each multi-source heterogeneous data to be selected comprises the following steps:

identifying the redundant data classes corresponding to each to-be-selected multi-source heterogeneous data initial processing model, and carrying out similarity correction according to the identified redundant data classes and the enterprise demand data to obtain corresponding similarity values and foreground values; removing each to-be-selected multi-source heterogeneous data initial processing model whose similarity value is lower than the threshold X2. A cost value is then identified for each to-be-selected multi-source heterogeneous data initial processing model. The cost value is converted from the estimated cost of establishing the model for the user enterprise, with units converted so that it can be used in the calculation. It is set on the basis of the preset cost of each multi-source heterogeneous data initial processing model and then matched to the model, and it can be adjusted when labor prices and the like change. The cost covers all costs, including subsequent manual adjustment, that is, everything the enterprise user is estimated to pay. The obtained cost value, foreground value and similarity value are marked as CBZ, QJZ and XSZ respectively, and the corresponding priority value KPL is calculated according to the priority formula KPL = QJZ + XSZ - c × CBZ, wherein c is a cost value adjustment coefficient set by the enterprise user as needed: c is set greater than 1 if the enterprise places more weight on cost and less than 1 otherwise, and defaults to 1 if it is not set. The to-be-selected multi-source heterogeneous data initial processing model with the highest priority value is selected as the target multi-source heterogeneous data initial processing model.
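The threshold filter and the priority formula can be illustrated as follows. This sketch assumes the corrected similarity value, foreground value and cost value of each model have already been produced upstream; the `Candidate` class, the function names and all numbers are hypothetical and chosen only to show how the coefficient c shifts the choice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    xsz: float  # corrected similarity value (XSZ)
    qjz: float  # foreground value (QJZ)
    cbz: float  # cost value (CBZ), already unit-converted


def priority(cand: Candidate, c: float = 1.0) -> float:
    """KPL = QJZ + XSZ - c * CBZ; c defaults to 1 when the user sets nothing."""
    return cand.qjz + cand.xsz - c * cand.cbz


def select_target_model(
    candidates: List[Candidate], x2: float, c: float = 1.0
) -> Optional[Candidate]:
    """Drop models with similarity value below X2, then pick the highest KPL."""
    kept = [cand for cand in candidates if cand.xsz >= x2]
    if not kept:
        return None
    return max(kept, key=lambda cand: priority(cand, c))


# Made-up candidates: "a" fits best but is expensive, "b" is cheaper,
# "c" has a similarity value below the threshold and is removed first.
pool = [
    Candidate("a", xsz=0.9, qjz=0.8, cbz=1.0),
    Candidate("b", xsz=0.7, qjz=0.5, cbz=0.2),
    Candidate("c", xsz=0.3, qjz=5.0, cbz=0.0),
]
cost_tolerant = select_target_model(pool, x2=0.5, c=0.5)
cost_sensitive = select_target_model(pool, x2=0.5, c=2.0)
```

With c = 0.5 the better-fitting but costlier model "a" is chosen; with c = 2 the cheaper model "b" is chosen, which mirrors the described role of c as the enterprise's emphasis on cost.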

By correcting the similarity, some of the to-be-selected multi-source heterogeneous data initial processing models are removed in advance, which reduces the amount of data to be analyzed subsequently; the screening also takes into account the enterprise's possible future development and the weight it places on cost.

Similarity correction according to the identified redundant data classes and the enterprise demand data means correcting according to how relevant the redundant data classes are to the enterprise's development and future needs: the enterprise's next stage of development and likely changes in its model demands are considered, and the correction accounts for target classes that may be added later, yielding the corresponding similarity value and foreground value. Specifically, a corresponding correction model is established based on a CNN (convolutional neural network) or a DNN (deep neural network), and a corresponding training set is built manually for training. The training set comprises redundant data classes, enterprise demand data, similarities, and the correspondingly set corrected similarity values and foreground values. The successfully trained correction model is then used to obtain the corresponding similarity value and foreground value.

The modeling module is used for establishing the multi-source heterogeneous data processing model the user requires: it obtains the target multi-source heterogeneous data initial processing model, and that model is manually adjusted according to the user enterprise's demands to obtain the corresponding multi-source heterogeneous data processing model.

The above formulas are all dimensionless formulas that operate on numerical values. They were obtained by collecting a large amount of data and performing software simulation to approximate the actual situation as closely as possible; the preset parameters and thresholds in the formulas are set by a person skilled in the art according to the actual situation or obtained by simulation on a large amount of data.

The above embodiments are only for illustrating the technical method of the present invention and not for limiting the same, and it should be understood by those skilled in the art that the technical method of the present invention may be modified or substituted without departing from the spirit and scope of the technical method of the present invention.

Claims

1. The multi-source heterogeneous data model modeling system for big data analysis is characterized by comprising an information module, an analysis module and a modeling module;

the information module is used for users to arrange and upload enterprise demand information and determine corresponding target classes based on the enterprise demand information;

the analysis module is used for analyzing each target class and determining an initial processing model of the target multi-source heterogeneous data;

2. The multi-source heterogeneous data model modeling system for big data analysis according to claim 1, wherein the working method of the information module comprises:

3. The modeling system of a multi-source heterogeneous data model for big data analysis according to claim 2, wherein when the user fills in the enterprise demand information, a corresponding demand information template is preset, and the user fills in the corresponding data according to the demand information template.

4. A multi-source heterogeneous data model modeling system for big data analysis according to claim 3, wherein the method for determining the data class according to the target end comprises:

5. The multi-source heterogeneous data model modeling system for big data analysis of claim 4, wherein the method for screening each data class comprises:

6. The multi-source heterogeneous data model modeling system for big data analysis according to claim 5, wherein the evaluation score calculating method comprises:

7. The multi-source heterogeneous data model modeling system for big data analysis according to claim 1, wherein the working method of the analysis module comprises:

8. The modeling system of multi-source heterogeneous data model for big data analysis according to claim 7, wherein the method for screening each of the candidate multi-source heterogeneous data initial processing models comprises: