CN113779215A

CN113779215A - Data processing platform

Info

Publication number: CN113779215A
Application number: CN202110983226.8A
Authority: CN
Inventors: 王培凯; 周召安
Original assignee: Hainan Hard Shell Technology Co ltd
Current assignee: Hainan Hard Shell Technology Co ltd
Priority date: 2021-08-25
Filing date: 2021-08-25
Publication date: 2021-12-10

Abstract

The invention discloses a data processing platform comprising a data middle platform and a unified resource management platform based on a hyper-converged architecture. The data middle platform comprises a data management unit, a first data processing unit, a second data processing unit, a database unit and a data retrieval unit. The data management unit acquires and manages data; the first data processing unit processes unstructured data to obtain a corresponding first data tag; the second data processing unit processes structured data to obtain a corresponding second data tag; the database unit stores and manages the data, the first data tag and the second data tag; and the data retrieval unit retrieves the data in the database unit. The data processing platform provided by the invention is simple to operate and achieves high efficiency and a high matching degree when searching data.

Description

Data processing platform

Technical Field

The invention relates to the technical field of data retrieval, in particular to a data processing platform.

Background

In the prior art, after a user inputs a query sentence, the system searches the database according to keywords in the query sentence and then screens the results. The matching degree is therefore low, further screening is cumbersome, and the accuracy is poor.

Disclosure of Invention

The invention aims to provide a data processing platform which is simple to operate and high in efficiency and matching degree when used for searching data.

The technical scheme adopted by the data processing platform disclosed by the invention is as follows:

A data processing platform comprises a data middle platform based on a hyper-converged architecture and a unified resource management platform, wherein the data middle platform comprises a data management unit, a first data processing unit, a second data processing unit, a database unit and a data retrieval unit. The data management unit is used for acquiring and managing data, the first data processing unit is used for processing unstructured data to obtain a corresponding first data tag, the second data processing unit is used for processing structured data to obtain a corresponding second data tag, the database unit is used for storing and managing the data, the first data tag and the second data tag, and the data retrieval unit is used for retrieving the data in the database unit.

Preferably, the first data processing unit includes a data acquisition module and a text processing module, the data acquisition module is configured to obtain unstructured data from the data management unit, and the text processing module is configured to perform word frequency calculation on the unstructured data and obtain a corresponding first data tag.

Preferably, the data retrieval unit includes a visualization module, and the visualization module is configured to retrieve data in the database unit according to the first data tag and the second data tag.

Preferably, the visualization module includes at least one of a report chart, an instrument chart and a cockpit.

Preferably, the data retrieval unit includes a query module, and the query module is configured to retrieve data in the database unit according to a keyword.

Preferably, the data processing platform further comprises an NLP speech recognition unit and an AI learning unit, wherein the NLP speech recognition unit is used for recognizing speech and extracting keywords from the speech, and the AI learning unit is used for performing matching training between the keywords and the data.

Preferably, the NLP speech recognition unit includes a basic speech recognition module for recognizing speech and extracting keywords, and a continuous speech recognition module for recognizing continuous speech and extracting successive keywords.

Preferably, the database unit comprises an ETL module, a data fusion module and an entity library which are sequentially connected, the ETL module is connected with the first data processing unit and the second data processing unit respectively, and the entity library is connected with the data retrieval unit.

The data processing platform disclosed by the invention has the following beneficial effects. The data management unit manages all the acquired data in a unified way, the first data processing unit processes the unstructured data to obtain corresponding first data tags, and the second data processing unit processes the structured data to obtain corresponding second data tags, so that all the data are tagged. The data, the first data tags and the second data tags are then stored in the database unit, and a user can retrieve the data in the database unit through the data retrieval unit. In this scheme, the hyper-converged architecture solves a series of problems brought by a traditional virtualization architecture and features high service availability, data security and integrated automated operation and maintenance management. The hyper-converged architecture simplifies the construction of the infrastructure, reduces operation and maintenance costs, and lets users put more energy into business innovation. In addition, tagging the data with first data tags and second data tags allows the data to be retrieved quickly, so the operation is simple, the efficiency is high, and the precision of data retrieval is improved.

Drawings

FIG. 1 is a block diagram of a data processing platform according to the present invention.

FIG. 2 is a schematic diagram of a data center structure of the data processing platform of the present invention.

Detailed Description

The invention is further described below with reference to the embodiments and the drawings of the specification.

Referring to FIG. 1 and FIG. 2, the data processing platform includes a data middle platform and a unified resource management platform based on a hyper-converged architecture. The data middle platform comprises a data management unit, a first data processing unit, a second data processing unit, a database unit and a data retrieval unit. The data management unit is used for acquiring and managing data, the first data processing unit is used for processing unstructured data to obtain a corresponding first data tag, and the second data processing unit is used for processing structured data to obtain a corresponding second data tag. The database unit is used for storing and managing the data, the first data tags and the second data tags, and the data retrieval unit is used for retrieving the data in the database unit.

The data management unit manages all the acquired data in a unified way, the first data processing unit processes the unstructured data to obtain corresponding first data tags, and the second data processing unit processes the structured data to obtain corresponding second data tags, so that all the data are tagged. The data, the first data tags and the second data tags are then stored in the database unit, and a user can retrieve the data in the database unit through the data retrieval unit. In this scheme, the hyper-converged architecture solves a series of problems brought by a traditional virtualization architecture and features high service availability, data security and integrated automated operation and maintenance management. The hyper-converged architecture simplifies the construction of the infrastructure, reduces operation and maintenance costs, and lets users put more energy into business innovation. In addition, tagging the data with first data tags and second data tags allows the data to be retrieved quickly, so the operation is simple, the efficiency is high, and the precision of data retrieval is improved.

In this embodiment, all data is divided into structured data and unstructured data. For structured data, the data tag is obtained through a mapping relationship. For unstructured data, the data tag may be obtained through statistics after text extraction, or may be added manually.
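The mapping-based tagging of structured data can be illustrated with a minimal sketch. The source does not specify the mapping relationship, so the `FIELD_TAG_MAP` table, its field names and the `second_data_tags` helper below are all hypothetical.

```python
# Hypothetical mapping from structured field names to data tags.
# The patent only states that tags come from "a mapping relationship".
FIELD_TAG_MAP = {
    "cust_name": "customer",
    "order_amt": "sales",
    "order_date": "sales",
    "water_level": "hydrology",
}


def second_data_tags(record: dict) -> list[str]:
    """Return the sorted, de-duplicated tags for the non-empty fields of a record."""
    tags = {
        FIELD_TAG_MAP[field]
        for field, value in record.items()
        if field in FIELD_TAG_MAP and value not in (None, "")
    }
    return sorted(tags)


if __name__ == "__main__":
    row = {"cust_name": "Acme", "order_amt": 120.5, "note": ""}
    print(second_data_tags(row))
```

A lookup table keeps the tagging deterministic, which fits structured data whose schema is known in advance.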

In this embodiment, the data sources of the data management unit include various documents such as Excel and PDF files, data in various databases such as MySQL and Oracle, and various data from terminals such as PCs or apps. Depending on the project scenario, data sources can be divided into internal data acquisition and external data acquisition. Internal data includes, for example, data generated by the service object itself (enterprise ERP systems, CRM systems, etc.), RDBMS, NoSQL, data warehouses and system log collection systems; external data includes, for example, data collected from the internet (by web collection programs), data collected by Internet of Things devices, purchased third-party data (via data interfaces) and public data (government public data, etc.).

In this embodiment, the first data processing unit includes a data acquisition module and a text processing module, where the data acquisition module is configured to obtain the unstructured data from the data management unit, and the text processing module is configured to perform word frequency calculation on the unstructured data and obtain the corresponding first data tag. Specifically, the text processing module extracts a plurality of phrases from the data through text extraction, then performs word frequency statistics, and uses the resulting phrases as data tags.
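The word frequency step can be sketched as follows. This is a simplified illustration, not the patented text processing module: it assumes English text split on non-alphanumeric characters, and the stop-word list, the `top_k` parameter and the function name `first_data_tags` are assumptions. Chinese text would need a word segmentation step first.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real deployment would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def first_data_tags(text: str, top_k: int = 3) -> list[str]:
    """Return the top_k most frequent non-stop-word terms in the text."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]


if __name__ == "__main__":
    doc = "Flood warning: the flood level of the river is rising. River flood data."
    print(first_data_tags(doc))
```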

In this embodiment, the data retrieval unit includes a visualization module, and the visualization module is configured to retrieve data in the database unit according to the first data tag and the second data tag. Specifically, when the user searches by the first data tag and the second data tag, the visualization module aggregates the related data and may present it in the form of a report chart, an instrument chart or a cockpit.

In this embodiment, the data retrieval unit includes a query module, and the query module is configured to retrieve data in the database unit according to a keyword. The data processing platform further comprises an NLP speech recognition unit and an AI learning unit, wherein the NLP speech recognition unit recognizes speech and extracts keywords from it, and the AI learning unit performs matching training between the keywords and the data. The NLP speech recognition unit comprises a basic speech recognition module for recognizing speech and extracting keywords, and a continuous speech recognition module for recognizing continuous speech and extracting successive keywords. Specifically, when the continuous speech recognition module performs speech recognition, it determines after each speech segment is acquired whether the speech input is complete; if not, it continues to acquire the next segment until the input is complete. The continuous speech recognition module extracts the keywords in each segment, and the AI learning unit then combines the keywords with an AND relation or an OR relation and matches them against the data tags. In this embodiment, the AI learning unit includes a matching model for matching keywords with data tags, and the AI learning unit further corrects the matching model through machine learning, which improves the accuracy of matching keywords with data tags.
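The segment-by-segment keyword collection and the AND/OR matching against data tags can be sketched as below. Speech recognition and the learned matching model are outside the scope of this sketch: the `extract` and `is_complete` callbacks, and the exact-match comparison of keywords with tags, are simplifying assumptions rather than the method of the source.

```python
def collect_keywords(segments, extract, is_complete):
    """Gather keywords segment by segment until the input is judged complete."""
    keywords = []
    for segment in segments:
        keywords.extend(extract(segment))
        if is_complete(segment):
            break  # speech input finished; ignore anything after it
    return keywords


def match_records(keywords, tagged_records, mode="and"):
    """Return ids of records whose tags satisfy the keyword relation.

    mode="and": every keyword must be among the record's tags.
    mode="or":  at least one keyword must be among the record's tags.
    """
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    wanted = {k.lower() for k in keywords}
    if not wanted:
        return []
    matched = []
    for record_id, tags in tagged_records.items():
        tag_set = {t.lower() for t in tags}
        hit = wanted <= tag_set if mode == "and" else bool(wanted & tag_set)
        if hit:
            matched.append(record_id)
    return matched


if __name__ == "__main__":
    vocabulary = {"flood", "river", "sales"}
    spoken = ["flood in", "river done", "sales ignored"]
    kws = collect_keywords(
        spoken,
        extract=lambda s: [w for w in s.split() if w in vocabulary],
        is_complete=lambda s: s.endswith("done"),
    )
    records = {"r1": ["flood", "river"], "r2": ["river", "sales"], "r3": ["sales"]}
    print(kws, match_records(kws, records, "and"), match_records(kws, records, "or"))
```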

In this embodiment, a hyper-converged infrastructure (HCI) can solve a series of problems caused by a conventional virtualization architecture. It is an integrated solution that virtualizes storage and computing resources through software definition on high-performance standard x86 servers equipped with a large number of hard disks and high-speed network interfaces, and it features high service availability, data security and automated operation and maintenance management. Through software-defined technology, the hyper-converged architecture turns an integrated architecture that originally had to be built (Build) into one that can be purchased directly (Buy), and modularizes the various horizontally integrated resources into a standard architecture that is vertically integrated and scales out horizontally. This simplifies the construction of the infrastructure, reduces operation and maintenance costs, and lets users put more energy into business innovation.

First, the current resource utilization problem needs to be solved through compute virtualization. Put simply, virtualization encapsulates hardware, operating system and applications together into a migratable virtual machine file. In a conventional server, software is bound to the hardware, and each machine can run only a single operating system, each operating system carrying one or more application loads.

Software-defined computing technology adds a virtualization layer on the physical server to form a bare-metal architecture. Computing resources such as the CPUs and memory of a plurality of physical servers are integrated into a unified computing resource pool, and a plurality of operating systems and application loads run on each physical server, which raises the utilization of server hardware resources. After resource virtualization is realized, the hyper-converged management platform, the data middle platform and the unified resource management platform can manage the data in a unified way, achieving unified management of computing resources and storage resources. A brief summary is as follows:

Unplanned single points of failure of services can be handled through the HA (high availability) function of the virtualization platform.

Online host migration and storage data migration of applications can be realized through virtualized online migration (vMotion/Storage vMotion).

Virtualized distributed resource scheduling (DRS) lets different application loads migrate dynamically and automatically among different hosts, allocates resources automatically on demand, and solves the problem of unbalanced resource load on a single host.

All computing and storage resources can be managed centrally through the virtualization management center, which simplifies management work.

Legacy applications can be migrated to the virtualization platform through P2V and V2V technologies without being redeployed.

The secure application delivery and load balancing system simplifies the management of the service system, flexibly controls service data flows, eliminates service interruption, improves application performance and improves the security of the service system. Meanwhile, the security of the client's services and data running on the hyper-converged platform is ensured by various security technologies such as a trusted hardware platform, virtualization security hardening and virtualization backup.

In this embodiment, building the data middle platform together with a powerful middle platform management system realizes the fusion and sharing of data across the service systems (platforms). Various data can be retrieved at will, which meets the platform's need for modeling and analysis data, enables one-key switching and real-time retrieval of the application data of each service system, and gives a better display of big data retrieved by voice.

The data center is the underlying big data processing cluster of the shared exchange platform. It is where mass data is collected and organized, and it holds the data resource directory, the various service data of each service system, and the process data for maintaining the directories and applying for interfaces. Using Hadoop ecosystem components, it can store both massive structured data and unstructured data. The specific stack is as follows.

HDFS (Hadoop Distributed File System): the underlying file system of the big data cluster environment. Both data warehouse data and unstructured data can be stored in HDFS. The result is a cluster architecture with distributed storage and computation, in which storage and computing resources are easy to scale horizontally.

HIVE: the data warehouse management tool of the big data cluster. It executes HQL statements, which are data warehouse operation statements similar to SQL statements. Each HQL statement is converted into MapReduce jobs that run on the Hadoop cluster under YARN, so high-throughput computation over mass data can be completed quickly.

HBASE: a NoSQL column-oriented database that can store data on the scale of hundreds of millions of rows and millions of columns in a cluster. It is convenient for client systems to call and query interactively, and it can summarize data and produce report data.

SPARK: an in-memory computing engine built on Hadoop. The data modeling algorithms in its built-in MLlib support distributed computing, so modeling, analysis and mining can be carried out on mass data, and the data processing efficiency is more than one hundred times that of reading and processing on a traditional single hard disk.

In this embodiment, a unified resource management platform is established to manage all the service data and to make the management of all data transparent, especially unstructured data such as video, pictures and audio, so that responsibility for the data can be assigned to individual people and departments for continuous optimization. The distributed file module builds a distributed file storage access service, which mainly solves the problems caused by running several storage systems in parallel: complex operation and maintenance, and directory permissions that cannot meet service requirements. A virtual file system is then established, the various storage systems are mounted on it and a mapping relation is established, which enables functions such as retrieving files by voice, online preview, and video and audio playback.
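The idea of mounting several storage systems on one virtual file system and keeping a mapping relation can be sketched as a longest-prefix lookup. The class name, the backend labels and the path layout below are hypothetical; the source does not describe how the mapping is implemented.

```python
class VirtualFileSystem:
    """Maps virtual path prefixes to storage backends (illustrative only)."""

    def __init__(self):
        self._mounts = {}  # virtual prefix -> backend label

    def mount(self, prefix: str, backend: str) -> None:
        """Attach a storage backend under a virtual path prefix."""
        self._mounts[prefix.rstrip("/")] = backend

    def resolve(self, path: str):
        """Return (backend, path relative to the mount) using the longest matching prefix."""
        best = None
        for prefix in self._mounts:
            if path == prefix or path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise LookupError(f"no storage mounted for {path}")
        return self._mounts[best], path[len(best):].lstrip("/")


if __name__ == "__main__":
    vfs = VirtualFileSystem()
    vfs.mount("/media", "object-store")   # hypothetical backend names
    vfs.mount("/media/video", "nas")
    print(vfs.resolve("/media/video/a.mp4"))
    print(vfs.resolve("/media/img/b.png"))
```

Choosing the longest matching prefix lets a more specific mount override a general one, so callers only ever see the single virtual namespace.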

The unified resource management platform mainly does two things, as follows.

(1) Summarizing the data resource directories of all the service systems through a data resource directory module: each service system truthfully reports its existing data as required, according to its own data resources, to form data resource catalog metadata, including data identifiers, fields and field types, sharing modes, update frequency and the like. Other systems can later learn through the platform which data other service system departments share, which makes it convenient to apply for, use and cross-compare the data.

Catalog upload: each business system uploads its own data catalog.

Catalog audit: the administrator of the shared exchange platform audits the uploaded catalogs.

Catalog publication: the catalog is published on the platform for other systems to view and apply for.

(2) Storing the mass entity files of each system through a distributed file module: files of 4K-500M produced in daily work can be stored in the file server cluster, which supports horizontal expansion. Upload, query, download and delete operations are supported according to the permission division. This part does not support online preview or video playback of files; it supports only downloading, after which files can be opened with the media software bundled with Windows.
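The catalog steps under item (1), namely upload, audit and publication, form a small workflow that can be sketched as a state machine. The class and state names are hypothetical, and the `REJECTED` state is an assumption: the source does not say what happens when an audit fails.

```python
from enum import Enum


class CatalogState(Enum):
    UPLOADED = "uploaded"
    APPROVED = "approved"
    REJECTED = "rejected"    # assumption: not described in the source
    PUBLISHED = "published"


class CatalogEntry:
    """One business system's data catalog moving through the workflow."""

    def __init__(self, system: str, fields: dict):
        self.system = system
        self.fields = fields  # field name -> field type
        self.state = CatalogState.UPLOADED

    def audit(self, approved: bool) -> None:
        """The platform administrator reviews an uploaded catalog."""
        if self.state is not CatalogState.UPLOADED:
            raise ValueError("only an uploaded catalog can be audited")
        self.state = CatalogState.APPROVED if approved else CatalogState.REJECTED

    def publish(self) -> None:
        """Make an approved catalog visible to other systems."""
        if self.state is not CatalogState.APPROVED:
            raise ValueError("only an approved catalog can be published")
        self.state = CatalogState.PUBLISHED


if __name__ == "__main__":
    entry = CatalogEntry("crm", {"cust_name": "string"})
    entry.audit(approved=True)
    entry.publish()
    print(entry.state.value)
```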

In this embodiment, the database unit forms part of the data warehouse. The database unit includes an ETL module, a data fusion module and an entity library connected in sequence; the ETL module is connected to the first data processing unit and the second data processing unit respectively, and the entity library is connected to the data retrieval unit. Specifically, the ETL module uses Sqoop to extract, transform and load the collected data; the various data are then integrated through data fusion and stored in the entity library. The Sqoop component is suited to data transfer between relational databases and the Hadoop system. It supports connections to all major relational database management systems (RDBMS), such as Oracle, SQL Server and MySQL, and data may be transferred from an RDBMS to Hive or HBase and vice versa. During this process the data is cleaned (for example, null values are handled by loading them as-is or replacing them with other meaningful data, and rows can be routed to different target libraries according to null fields); the data format is normalized by defining field format constraints and customizing the loading format of times, numerical values, characters and the like from the data source; the data is split by decomposing fields according to business requirements, separating the several pieces of information contained in one field; and primary and foreign key constraints are established so that only unique primary key records are loaded.
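The cleaning rules listed above (null handling, format normalization, field splitting and primary key uniqueness) can be illustrated in plain Python. This is not Sqoop and not the patented ETL module; the field names `id`, `date` and `contact`, the accepted date formats and the `defaults` parameter are assumptions made for the example.

```python
from datetime import datetime


def clean_rows(rows, key="id", defaults=None):
    """Clean extracted rows before loading.

    - replaces None/empty values with per-field defaults (null handling)
    - normalizes 'date' values given as YYYY/MM/DD or YYYY-MM-DD to ISO format
    - splits a combined 'contact' field ("name,phone") into two fields
    - keeps only the first row seen for each primary key
    """
    defaults = defaults or {}
    seen = set()
    cleaned = []
    for row in rows:
        if row.get(key) in (None, "") or row[key] in seen:
            continue  # skip rows without a key and rows with a duplicate key
        seen.add(row[key])
        out = {}
        for field, value in row.items():
            if value in (None, ""):
                value = defaults.get(field)
            out[field] = value
        if out.get("date"):
            text = str(out["date"]).replace("/", "-")
            out["date"] = datetime.strptime(text, "%Y-%m-%d").date().isoformat()
        if out.get("contact"):
            name, _, phone = out.pop("contact").partition(",")
            out["contact_name"], out["contact_phone"] = name.strip(), phone.strip()
        cleaned.append(out)
    return cleaned


if __name__ == "__main__":
    raw = [
        {"id": 1, "date": "2021/08/25", "contact": "Wang, 123", "amount": None},
        {"id": 1, "date": "2021-08-26", "contact": "dup", "amount": 5},
    ]
    print(clean_rows(raw, defaults={"amount": 0}))
```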

In this embodiment, the data warehouse allows organizations to integrate data from different systems into a common data model to support operational functionality, compliance requirements and business intelligence. Data warehouses can improve data consistency by reducing data redundancy, enabling organizations to use data more efficiently. A data warehouse is made up of multiple parts among which data can be migrated. During migration, the structure and format of the data can be changed so that the data is collected into general tables that data consumers can access. The general tables can be used directly for reporting or as input to downstream applications.

Basic principles of data warehouse management: focus on business objectives; start with the end in mind; think and design globally while acting and building locally; summarize and optimize at the final stage rather than at the beginning; promote transparency and self-service; build metadata together with the warehouse; and recognize that no single approach fits all cases.

In this embodiment, data mining follows a general data analysis logic, and the data mining modeling process is as follows. First, the problem the business needs to solve is identified; once the business problem is fully understood, the relevant data sources or data sets are collected and integrated to determine whether they can support the business implementation. Massive data contains structured, semi-structured and unstructured data, and may contain missing or abnormal values. Semi-structured and unstructured data need to be converted into structured data through data cleaning techniques. If the data is complete and of high quality, it can be visualized directly and explored further; if it contains missing or abnormal values, data preprocessing is required. This stage is generally called the data preparation stage. After data preparation is finished, statistical exploration and visual display (correlation analysis and the like) are carried out on the data, a regression or classification prediction model is established according to the actual business problem, several models are evaluated using model performance evaluation indexes, the optimal model is selected, and it is finally deployed and applied.

Further, data preprocessing obtains the data required for analysis from the data warehouse. If data needs to be supplemented, data acquired from multiple channels must be integrated and checked for varying degrees of incompleteness; if missing values exist, how to fill them becomes a problem that must be solved. In general, there are two ways of handling missing values: deletion and filling. When the data volume is huge and only individual values are missing, simple deletion has little influence on the data as a whole; if many values are missing across multiple data dimensions, deletion may affect later modeling and analysis. Filling methods include mode imputation (for non-interval data), mean imputation (for interval data), similar-mean imputation (using a hierarchical clustering model) and regression imputation, and some machine learning methods (artificial neural networks and the like) can also be used to fill missing values. Any filling method influences the later prediction or classification accuracy of the model to some extent, so data cleaning is a very important part of the whole model building process.
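The two simplest strategies named above, deletion and mean or mode filling, can be sketched with the Python standard library. Missing values are represented as `None` here, which is an assumption of the example; the function names are illustrative.

```python
from statistics import mean, mode


def fill_missing(values, numeric=True):
    """Fill None entries with the mean (interval data) or the mode (non-interval data)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


def drop_incomplete(rows):
    """Deletion strategy: keep only rows without any missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


if __name__ == "__main__":
    print(fill_missing([1.0, None, 3.0]))
    print(fill_missing(["a", None, "b", "a"], numeric=False))
    print(drop_incomplete([{"x": 1, "y": None}, {"x": 2, "y": 3}]))
```

Similar-mean and regression imputation would replace the single global `fill` value with one computed per cluster or per fitted model, at the cost of an extra modeling step.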

After the incomplete data set has been filled by a rigorous method, preliminary data exploration is carried out before data modeling, and correlation analysis is performed on the data dimensions needed for modeling, to obtain the strength of the correlation between the dependent and independent variables and the linear relationships in the data.
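One common way to quantify the strength of a linear relationship between two variables is the Pearson correlation coefficient. The source does not name a specific coefficient, so its use here is an assumption; the sketch below computes it directly from its definition.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    rainfall = [10, 20, 30, 40]      # illustrative independent variable
    water_level = [2, 4, 6, 8]       # illustrative dependent variable
    print(pearson_r(rainfall, water_level))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.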

In general, when a business problem is solved, the number of data features of the relevant dimensions ranges from dozens to hundreds. If all of them are added to the model, the amount of computation increases, the model becomes complicated, and the prediction accuracy may suffer. In data mining, a dimension reduction method, namely principal component analysis, can extract representative components from the independent variables; the several components obtained through principal component analysis can describe most of the information in all dimensions. When project data has many features, dimension reduction should therefore be considered as a way to reduce the number of features, which accelerates model training and reduces the computation cost.
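To make the idea of principal component analysis concrete, the sketch below finds only the first principal component using power iteration on the sample covariance matrix, in pure Python. A production system would more likely use a numerical library; the function name and the fixed iteration count are assumptions of this example.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (unit direction, explained variance ratio) of the first component."""
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two rows")
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_variance = sum(cov[i][i] for i in range(d))
    if total_variance == 0:
        raise ValueError("data has no variance")
    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # provided the start vector is not orthogonal to it.
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue / total_variance


if __name__ == "__main__":
    data = [[1, 2], [2, 4], [3, 6], [4, 8]]  # second feature is twice the first
    direction, ratio = first_principal_component(data)
    print(direction, ratio)
```

When the explained variance ratio of the leading components is high, the remaining dimensions can be dropped with little loss of information.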

Data modeling and model optimization follow. The preconditions of modeling are a clearly defined data source and modeling purpose, assured data quality, and appropriate filling or deletion of incomplete data, so the data itself is what guarantees a sound model. Based on the results of data preprocessing and on the business requirements, a prediction model (multiple regression, penalized regression, regression tree, artificial neural network such as a BP neural network, etc.) or a classification model (logistic regression, decision tree, neural network model, etc.) is constructed. Model construction starts from the most basic model; several different models are built, trained repeatedly and tuned, and their performance evaluation indexes are checked and compared. For classification, the commonly used indexes are the AUC value and the confusion matrix: the closer the AUC value is to 1, the better the classification, and the classification model with a high AUC value and high accuracy is selected for deployment. For regression prediction, the commonly used indexes are the AIC (Akaike information criterion), mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE) and R^2, which evaluate how well the model fits the data; the model with small errors, a smaller AIC and a larger R^2 is selected for deployment and prediction.
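Several of the evaluation indexes named above can be computed directly from their standard definitions. The sketch below covers MSE, RMSE, MAE and R^2 for regression and a pairwise AUC for binary classification; AIC is omitted because it depends on the model's likelihood and parameter count. The function names are illustrative.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(regression_metrics([1, 2, 3, 4], [1, 2, 3, 5]))
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for illustration but not for very large evaluation sets.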

After the prediction model or classification model is built, it is deployed into a system, or its results are written into a database; the software reads the results from the database and displays them visually (as charts) at the front end. When new data arrives, the model reads the relevant features from the database and generates a prediction result, which is combined with domain knowledge of the business and applied to early warning and reminders. For example, in a water conservancy project, several factors that influence flooding are used to judge whether a flood will occur and with what probability, so that early warnings are generated in advance.

Furthermore, the model script can be deployed in a server system, and the result is stored in a database after execution, or the model is developed into a model API to provide computing service for the outside.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the protection scope of the present invention, although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A data processing platform, characterized by comprising a data middle platform and a unified resource management platform based on a hyper-converged architecture, wherein the data middle platform comprises a data management unit, a first data processing unit, a second data processing unit, a database unit and a data retrieval unit, the data management unit is used for acquiring data and managing the data, the first data processing unit is used for processing unstructured data to obtain a corresponding first data tag, the second data processing unit is used for processing structured data to obtain a corresponding second data tag, the database unit is used for storing and managing the data, the first data tag and the second data tag, and the data retrieval unit is used for retrieving the data in the database unit.

2. The data processing platform of claim 1, wherein the first data processing unit comprises a data acquisition module and a text processing module, the data acquisition module is configured to acquire unstructured data from the data management unit, and the text processing module is configured to perform word frequency calculation on the unstructured data and obtain a corresponding first data tag.

3. The data processing platform of claim 1, wherein the data retrieval unit includes a visualization module for retrieving data in the database unit based on the first data tag and the second data tag.

4. The data processing platform of claim 3, wherein the visualization module comprises at least one of a report chart, an instrument chart, and a cockpit.

5. The data processing platform of claim 1, wherein the data retrieval unit comprises a query module for retrieving data in the database unit based on a keyword.

6. The data processing platform of claim 5, further comprising an NLP speech recognition unit for recognizing speech and extracting keywords in the speech, and an AI learning unit for performing match training of the keywords with data.

7. The data processing platform of claim 6, wherein the NLP speech recognition unit comprises a basic speech recognition module for recognizing speech and extracting keywords, and a continuous speech recognition module for recognizing continuous speech and extracting successive keywords.

8. The data processing platform of claim 1, wherein the database unit comprises an ETL module, a data fusion module and an entity library connected in sequence, the ETL module is connected with the first data processing unit and the second data processing unit respectively, and the entity library is connected with the data retrieval unit.