CN115145904A - Big data cleaning method and big data acquisition system for AI cloud computing training - Google Patents

Big data cleaning method and big data acquisition system for AI cloud computing training

Info

Publication number
CN115145904A
Authority
CN
China
Prior art keywords
noise
big data
acquisition
target
cleaning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210786105.9A
Other languages
Chinese (zh)
Other versions
CN115145904B (en)
Inventor
杨焕荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhengyuanda Technology Co ltd
Original Assignee
Zaozhuang Hongyu Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zaozhuang Hongyu Digital Technology Co ltd
Priority to CN202210786105.9A
Publication of CN115145904A
Application granted
Publication of CN115145904B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

An embodiment of the application provides a big data cleaning method and a big data acquisition system for AI cloud computing training. Noise prediction is performed on big data sample acquisition event data, and the noise acquisition feature point distribution of that data is output. Big data acquisition cleaning decision information for the big data sample acquisition event data is then determined based on the noise acquisition feature point distribution, and the corresponding big data acquisition cleaning configuration is performed on an AI cloud computing training node based on the big data acquisition cleaning decision information. Compared with the prior-art approach of noise field feature matching and screening after big data acquisition is completed, the big data cleaning process can be performed during big data acquisition, which can improve big data cleaning efficiency and accuracy.

Description

Big data cleaning method and big data acquisition system for AI cloud computing training
Technical Field
The application relates to the technical field of big data acquisition and cleaning, in particular to a big data cleaning method and a big data acquisition system for AI cloud computing training.
Background
The basic principle of big data acquisition and cleaning is that, after manual preprocessing is completed, dirty data is converted into data that meets the data quality requirement by using techniques such as mathematical statistics, data mining, or predefined cleaning rules. For example, an AI training process requires collecting a large number of big data samples, and a lot of noise data may be present during collection; this noise data needs to be cleaned to ensure the reliability of subsequent AI training. In the related art, noise field feature matching and screening is usually performed only after big data acquisition is completed. Cleaning therefore cannot take place during the big data acquisition process, which reduces big data cleaning efficiency and makes it difficult to ensure the accuracy of big data cleaning.
Disclosure of Invention
In a first aspect, the present application provides a big data cleaning method for AI cloud computing training, which is applied to a big data acquisition system, where the big data acquisition system is in communication connection with a plurality of AI cloud computing training nodes, and the method includes:
acquiring big data sample acquisition event data of a target AI training initiating task after receiving a training noise indication output by the target AI training initiating task of the AI cloud computing training node;
carrying out noise prediction on the big data sample acquisition event data, and outputting noise acquisition characteristic point distribution of the big data sample acquisition event data, wherein the noise acquisition characteristic point distribution comprises noise positioning element information of a sample acquisition target of a target sample acquisition example in the big data sample acquisition event data;
determining big data collection cleaning decision information of the big data sample collection event data based on the noise collection characteristic point distribution, wherein the big data collection cleaning decision information comprises collection cleaning field distribution of a sample collection target of the target sample collection example in the big data sample collection event data;
and performing corresponding big data acquisition cleaning configuration on the AI cloud computing training node based on the big data acquisition cleaning decision information.
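The four steps of the first aspect can be sketched as a small pipeline. This is only an illustrative sketch: the function names, the dictionary layout of the event data, and the rule that an empty or missing field counts as noise are all assumptions made for the example, since the source does not define a concrete interface or predictor.

```python
from typing import Dict, List, Sequence


def predict_noise(event_data: Sequence[dict]) -> Dict[str, Dict[str, float]]:
    """Hypothetical stand-in for the noise prediction step.

    Assumption: a field is treated as noisy when its value is None or "".
    A real system would use the trained decision model instead.
    """
    scores: Dict[str, Dict[str, float]] = {}
    for unit in event_data:
        scores[unit["target"]] = {
            name: 1.0 if value in (None, "") else 0.0
            for name, value in unit["fields"].items()
        }
    return scores


def decide_cleaning(noise_scores: Dict[str, Dict[str, float]],
                    threshold: float = 0.5) -> Dict[str, List[str]]:
    """Turn per-field noise scores into a cleaning field list per target."""
    return {
        target: [name for name, score in fields.items() if score >= threshold]
        for target, fields in noise_scores.items()
    }


def configure_node(node_config: dict,
                   cleaning_fields: Dict[str, List[str]]) -> dict:
    """Return a copy of the node configuration with the cleaning fields set."""
    updated = dict(node_config)
    updated["cleaning_fields"] = cleaning_fields
    return updated


def on_training_noise_indication(event_data: Sequence[dict],
                                 node_config: dict) -> dict:
    """Run the steps in order once a training noise indication arrives."""
    noise_scores = predict_noise(event_data)          # noise prediction
    cleaning_fields = decide_cleaning(noise_scores)   # cleaning decision
    return configure_node(node_config, cleaning_fields)  # node configuration


events = [{"target": "user_clicks",
           "fields": {"url": "", "timestamp": 1700000000}}]
print(on_training_noise_indication(events, {"node_id": "node-1"}))
```

The point of the sketch is the ordering: the decision is made and pushed to the node while collection is still in progress, rather than after the data set is complete.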
In a second aspect, an embodiment of the present application further provides a big data cleaning system for AI cloud computing training, where the big data cleaning system for AI cloud computing training includes a big data acquisition system and multiple AI cloud computing training nodes in communication connection with the big data acquisition system;
the big data acquisition system is used for:
acquiring big data sample acquisition event data of a target AI training initiating task after receiving a training noise indication output by the target AI training initiating task of the AI cloud computing training node;
carrying out noise prediction on the big data sample acquisition event data, and outputting noise acquisition characteristic point distribution of the big data sample acquisition event data, wherein the noise acquisition characteristic point distribution comprises noise positioning element information of a sample acquisition target of a target sample acquisition example in the big data sample acquisition event data;
determining big data acquisition cleaning decision information of the big data sample acquisition event data based on the noise acquisition feature point distribution, wherein the big data acquisition cleaning decision information comprises acquisition cleaning field distribution of a sample acquisition target of the target sample acquisition instance in the big data sample acquisition event data;
and performing corresponding big data acquisition cleaning configuration on the AI cloud computing training node based on the big data acquisition cleaning decision information.
By adopting the technical scheme of any of the above aspects, the embodiment of the application can perform noise prediction on the big data sample acquisition event data and output the noise acquisition characteristic point distribution of the big data sample acquisition event data. The big data acquisition cleaning decision information of the big data sample acquisition event data is then determined based on that noise acquisition characteristic point distribution, and the corresponding big data acquisition cleaning configuration is performed on the AI cloud computing training node based on the big data acquisition cleaning decision information. Compared with the prior-art approach of noise field characteristic matching and screening after big data acquisition is completed, the big data cleaning process can be performed during big data acquisition, and big data cleaning efficiency and accuracy can be improved.
Drawings
Fig. 1 is a schematic flow chart of a big data cleaning method for AI cloud computing training according to an embodiment of the present invention.
Detailed Description
The architecture of the big data cleaning system 10 for AI cloud computing training according to an embodiment of the present invention is described below. The big data cleaning system 10 for AI cloud computing training may include a big data acquisition system 100 and an AI cloud computing training node 200 communicatively connected to the big data acquisition system 100. The big data acquisition system 100 and the AI cloud computing training node 200 may cooperatively perform the big data cleaning method for AI cloud computing training described in the following method embodiments; for the steps performed by the big data acquisition system 100 and the AI cloud computing training node 200, refer to the detailed description of the method embodiments below.
Before describing the embodiments of the present application, a first exemplary scenario of big data sample acquisition event data under an initial AI training initiation task is described below. In the present application, a batch of example big data sample acquisition event data may be selected for noise decision capability learning. That is, an example big data sample acquisition event data set is selected, which includes first example big data sample acquisition event data under an initial AI training initiation task and second example big data sample acquisition event data under a target AI training initiation task. The first example big data sample acquisition event data under the initial AI training initiation task means that the acquisition training data of the initial AI training initiation task includes both example big data sample acquisition event data and prior noise information (which may be denoted as example noise localization element information). In addition, the example big data sample acquisition event data under the initial AI training initiation task is generated by gray-scale online node simulation, that is, it can be generated automatically by simulation on a gray-scale online node. The second example big data sample acquisition event data under the target AI training initiation task refers to the updated acquisition training data of the target AI training initiation task in the training node, namely the big data sample acquisition event data under the actual training node.
The example big data sample acquisition event data under the target AI training initiation task contains only big data sample acquisition event data and no prior noise information; that is, the target AI training initiation task is the AI training initiation task that the optimization indication concerns.
For some exemplary design considerations, the modulus parameter layer of the noise decision initialization model may be adjusted and selected based on the first example big data sample acquisition event data of the initial AI training initiation task, the example noise localization element information of the sample acquisition target of the target sample acquisition instance in the first example big data sample acquisition event data, and the second example big data sample acquisition event data of the target AI training initiation task.
One example is as follows. The data of the initial AI training initiation task is big data sample acquisition event data generated by gray-scale online node simulation, together with example noise localization element information of its sample acquisition targets, where the example noise localization element information includes at least one noise localization element. The data of the target AI training initiation task is big data sample acquisition event data in the actual training nodes. The specific number of sample acquisition targets in the example noise localization element information can be set flexibly, provided that the number of example noise localization element information items in the first example big data sample acquisition event data is not less than the number of sample acquisition targets included in the final big data sample acquisition event data.
For some exemplary design considerations, the big data acquisition system 100 adjusts and selects the modulus parameter layer of the noise decision initialization model based on the first example big data sample acquisition event data of the initial AI training initiation task, the example noise localization element information of the sample acquisition target of the target sample acquisition instance in the first example big data sample acquisition event data, and the second example big data sample acquisition event data of the target AI training initiation task. For example, the noise decision initialization model includes a noise acquisition feature point parsing branch and a noise acquisition feature point aggregation branch. In each noise decision learning phase, the first example big data sample acquisition event data of the initial AI training initiation task and the second example big data sample acquisition event data of the target AI training initiation task are loaded together into the noise decision initialization model. The first example big data sample acquisition event data includes example noise localization element information of a sample acquisition target of a target sample acquisition instance, where the target sample acquisition instance is at least one noise localization element; the second example big data sample acquisition event data has no example noise localization element information. Feature analysis is performed separately on the first and second example big data sample acquisition event data based on the noise acquisition feature point parsing branch, yielding the noise acquisition feature point distribution corresponding to the first example big data sample acquisition event data and the noise acquisition feature point distribution corresponding to the second example big data sample acquisition event data.
Then, the modulus parameter layer of the noise decision initialization model is optimized and selected based on the noise acquisition feature point distribution corresponding to the first example big data sample acquisition event data, the real noise field distribution of the noise acquisition feature point distribution corresponding to the second example big data sample acquisition event data, and the example noise localization element information of the sample acquisition target of the target sample acquisition instance of the first example big data sample acquisition event data.
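One way to picture a learning phase that mixes labeled first example data with unlabeled second example data is a combined objective: a supervised term computed only where example noise localization element information exists, plus a term that compares the feature statistics of the two data sets. The source does not state the loss function, so everything below is an assumption for illustration: the squared-error supervised term, the mean-difference alignment term, the `weight` parameter, and all function names are hypothetical.

```python
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def supervised_loss(predicted: Sequence[float],
                    labels: Sequence[float]) -> float:
    """Mean squared error on the first (labeled) example data."""
    return mean([(p - y) ** 2 for p, y in zip(predicted, labels)])


def alignment_loss(first_features: Sequence[float],
                   second_features: Sequence[float]) -> float:
    """Squared gap between the mean feature of the labeled and unlabeled sets.

    Assumption: a simple stand-in for using the distribution of the second
    (unlabeled) example data during training.
    """
    return (mean(first_features) - mean(second_features)) ** 2


def noise_decision_loss(first_predicted: Sequence[float],
                        first_labels: Sequence[float],
                        first_features: Sequence[float],
                        second_features: Sequence[float],
                        weight: float = 0.1) -> float:
    """Combined objective for one hypothetical learning phase."""
    return (supervised_loss(first_predicted, first_labels)
            + weight * alignment_loss(first_features, second_features))


# Perfect predictions and identical feature means give a loss of zero.
print(noise_decision_loss([1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.5, 0.5]))
```

Only the first example data contributes labels; the second example data influences the objective purely through its features, which matches the description that it carries no example noise localization element information.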
When the noise decision initialization model satisfies the model deployment conditions, it is output as a noise collection feature point decision model, which can be used to decide the noise collection feature points of the sample collection target of the target sample collection instance in the big data sample collection event data of the target AI training initiation task. For example, the big data acquisition system 100 determines the noise collection feature point distribution of the big data sample collection event data, determines the big data collection cleaning decision information of the big data sample collection event data from it, and determines the collection cleaning field distribution of the sample collection target of the target sample collection instance in the big data sample collection event data based on the noise collection feature point distribution.
Therefore, without adding prior knowledge to the big data sample acquisition event data of the target AI training initiation task, a noise decision model is obtained from the example big data sample acquisition event data of the initial AI training initiation task (which has example noise localization element information) and from the real noise field distribution of the example big data sample acquisition event data of the target AI training initiation task (which has no example noise localization element information). This model is migrated to candidate big data sample acquisition event data of the target AI training initiation task, thereby reducing the amount of prior knowledge required for the big data sample acquisition event data of the target AI training initiation task.
The big data cleaning method for AI cloud computing training provided by this embodiment may be executed by the big data acquisition system 100, and is described in detail below with reference to fig. 1.
Process110: obtain big data sample acquisition event data of a target AI training initiation task.
For some exemplary design considerations, the target AI training initiation task may refer to any training task for subsequent AI applications, such as, but not limited to, a training task for user interest analysis, a training task for security vulnerability mining, and so on. The big data sample collection event data may refer to recorded event data of a big data sample collection task indicated by the target AI training initiation task, for example, for a training task for user interest analysis, may refer to recorded event data of a big data sample collection task for user attention behavior data.
Process120: perform noise prediction on the big data sample acquisition event data, and output the noise acquisition characteristic point distribution of the big data sample acquisition event data.
The noise collection characteristic point distribution comprises noise positioning element information of a sample collection target of a target sample collection instance in the big data sample collection event data. Specifically, the noise collection characteristic point distribution comprises a forward noise collection characteristic point and a backward noise collection characteristic point. The forward noise collection characteristic point comprises a plurality of interpretation characteristic members, and each interpretation characteristic member represents the decision support degree with which the sample collection event unit data corresponding to that member in the big data sample collection event data is a multi-party coupled noise item of a sample collection target of the target sample collection instance. The noise trigger field intervals of the noise collection characteristic points for the noise field link range attribute and of those for the noise field penetration path attribute are consistent. In addition, the noise trigger field interval of the forward noise collection characteristic point is the same as that of the backward noise collection characteristic point. For example, both are (z1, z2, ..., zn).
For example, the embodiment of the present application represents noise collection feature points in a forward + backward form. The noise collection feature point distribution is divided into two classes, namely the forward noise collection feature point and the backward noise collection feature point. The dimension of the forward noise collection feature point is (z1, z2, ..., zn) × R, and the dimension of the backward noise collection feature point is (z1, z2, ..., zn) × 2, where R is the specific number of target sample collection instances to be decided. Each interpretation feature member on the noise collection feature points expresses, respectively, the decision support degree of a multi-party coupled noise item of a sample collection target of a target sample collection instance, and the noise field link range and noise field penetration path at that collection feature point.
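The forward + backward layout can be made concrete with two nested lists. This is a minimal sketch under stated assumptions: the noise trigger field interval (z1, ..., zn) is flattened to a single list of n positions, the forward structure holds R support values per position, and the backward structure holds two values per position (read here as link range and penetration path). The function names are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]


def empty_feature_points(n_positions: int, n_instances: int) -> Tuple[Grid, Grid]:
    """Allocate forward (n x R) and backward (n x 2) feature point grids."""
    # forward[i][r]: decision support that position i belongs to instance r
    forward = [[0.0] * n_instances for _ in range(n_positions)]
    # backward[i]: [link range, penetration path] at position i (assumed order)
    backward = [[0.0, 0.0] for _ in range(n_positions)]
    return forward, backward


def shape(grid: Grid) -> Tuple[int, int]:
    return (len(grid), len(grid[0]))


forward, backward = empty_feature_points(n_positions=4, n_instances=3)
print(shape(forward), shape(backward))
```

Both grids share the same first dimension, which reflects the statement that the forward and backward feature points use the same noise trigger field interval.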
Process130: determine big data collection cleaning decision information of the big data sample collection event data based on the noise collection characteristic point distribution, where the big data collection cleaning decision information includes the collection cleaning field distribution of a sample collection target of the target sample collection instance in the big data sample collection event data.
For some exemplary design considerations, the collection cleaning field distribution may include a noise localization element. For example, it may include the decision support degree of the multi-party coupled noise item of the sample acquisition target of the target sample acquisition instance in the big data sample acquisition event data, and the noise field link range and noise field penetration path of the noise localization element corresponding to that sample acquisition target.
For some exemplary design considerations, the noise collection feature point distribution includes forward noise collection feature points and backward noise collection feature points. The forward noise collection feature point comprises the decision support degree with which each sample acquisition event unit data in the big data sample acquisition event data is a multi-party coupled noise item of a sample acquisition target of the target sample acquisition instance. The backward noise collection feature point comprises the noise field link range and noise field penetration path data corresponding to each sample acquisition event unit data in the big data sample acquisition event data.
For some exemplary design considerations, first, the big data acquisition system 100 determines a multi-party coupled noise term for a sample acquisition target of a target sample acquisition instance in the big data sample acquisition event data based on forward noise acquisition feature points. Then, the big data acquisition system 100 outputs a noise localization element of the sample acquisition target of the target sample acquisition instance in the big data sample acquisition event data based on the multi-party coupled noise item and the noise field link range and the noise field penetration path data corresponding to the sample acquisition event unit data at the multi-party coupled noise item. Finally, the big data collection system 100 outputs the noise localization element of the sample collection target of the target sample collection instance as the collection cleaning field distribution of the sample collection target of the target sample collection instance.
For example, each sample acquisition event unit data in the big data sample acquisition event data corresponds to one interpretation feature member in the noise acquisition feature point distribution. Therefore, big data collection cleaning decision information of the big data sample collection event data can be determined, and the big data collection cleaning decision information comprises collection cleaning field distribution of sample collection targets of target sample collection instances in the big data sample collection event data.
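The three decoding steps described above (pick the multi-party coupled noise item from the forward feature points, read the link range and penetration path at that item from the backward feature points, output the result as the cleaning field distribution) can be sketched as follows. The sketch assumes one best position per target sample acquisition instance and a hypothetical `min_support` cutoff; neither detail is given in the source.

```python
from typing import Dict, List


def decode_cleaning_fields(forward: List[List[float]],
                           backward: List[List[float]],
                           min_support: float = 0.5) -> List[Dict[str, float]]:
    """Decode noise localization elements from forward/backward feature points.

    forward[i][r]  : decision support of position i for instance r
    backward[i]    : [link range, penetration path] at position i (assumed)
    """
    results = []
    n_instances = len(forward[0])
    for r in range(n_instances):
        # Step 1: position with the highest decision support for instance r.
        best = max(range(len(forward)), key=lambda i: forward[i][r])
        support = forward[best][r]
        if support < min_support:
            continue  # no sufficiently supported noise item for this instance
        # Step 2: read link range and penetration path at that position.
        link_range, penetration_path = backward[best]
        # Step 3: emit the noise localization element as a cleaning field entry.
        results.append({
            "instance": r,
            "position": best,
            "support": support,
            "link_range": link_range,
            "penetration_path": penetration_path,
        })
    return results


forward = [[0.1, 0.2], [0.9, 0.3], [0.2, 0.4]]
backward = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(decode_cleaning_fields(forward, backward))
```

In this run only instance 0 clears the cutoff (support 0.9 at position 1), so a single entry is produced; instance 1 peaks at 0.4 and is dropped.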
Process140: perform the corresponding big data acquisition cleaning configuration on the AI cloud computing training node based on the big data acquisition cleaning decision information.
In this embodiment, after the corresponding big data acquisition cleaning configuration is performed on the AI cloud computing training node based on the big data acquisition cleaning decision information, the big data acquisition operation may be performed on the AI cloud computing training node after the big data acquisition cleaning configuration.
Based on the above steps, noise prediction can be performed on the big data sample acquisition event data and the noise acquisition feature point distribution of the big data sample acquisition event data is output. Big data acquisition cleaning decision information of the big data sample acquisition event data can then be determined based on that noise acquisition feature point distribution, and the corresponding big data acquisition cleaning configuration is performed on the AI cloud computing training node based on the big data acquisition cleaning decision information. Compared with the prior-art approach of noise field feature matching and screening after big data acquisition is completed, the big data cleaning process can therefore be performed during big data acquisition, and big data cleaning efficiency and accuracy can be improved.
For some exemplary design considerations, the big data acquisition system 100 may perform noise prediction on the big data sample collection event data based on the noise collection feature point decision model, and output the noise collection feature point distribution of the big data sample collection event data. The noise collection feature point decision model is obtained through noise prediction training on example big data sample collection event data of the target AI training initiation task. For example, the noise collection feature point decision model is trained based on the first example big data sample collection event data of the initial AI training initiation task, the example noise localization element information of the sample collection target of the target sample collection instance in the first example big data sample collection event data, and the second example big data sample collection event data of the target AI training initiation task.
In the above scheme, the noise acquisition feature point decision model can be trained without adding prior knowledge to the second example big data sample acquisition event data of the target AI training initiation task: noise decision capability learning is performed using the real noise field distribution of the example big data sample acquisition event data of the target AI training initiation task. The trained noise acquisition feature point decision model can then directly perform noise prediction on the big data sample acquisition event data of the target AI training initiation task to obtain big data acquisition cleaning decision information, which reduces the work of adding prior knowledge to the big data sample acquisition event data of the target AI training initiation task.
For some exemplary design considerations, the noise collection feature point decision model includes a noise collection feature point parsing branch and a noise collection feature point aggregation branch. An example design in which the big data acquisition system 100 performs noise prediction on big data sample acquisition event data based on the noise collection feature point decision model and outputs the noise collection feature point distribution of the big data sample acquisition event data may include the following: the big data acquisition system 100 performs feature analysis on the big data sample acquisition event data based on the noise collection feature point parsing branch and outputs fuzzy noise collection feature points of the big data sample acquisition event data; it then aggregates the big data sample acquisition event data and the fuzzy noise collection feature points based on the noise collection feature point aggregation branch and outputs the noise collection feature point distribution of the big data sample acquisition event data. The noise collection feature point parsing branch may be composed of convolution layers, batch normalization, nonlinear activation, pooling layers, and the like. It can effectively extract the noise field penetration path features (namely, the fuzzy noise collection feature points) of the input big data sample acquisition event data.
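The layer sequence named for the parsing branch can be illustrated with a tiny one-dimensional example in plain Python. This is a sketch only: the normalization step is implemented as standard batch normalization without learned scale and shift, which is an assumption about what the source means, and the input signal, kernel, pool size, and function names are all hypothetical.

```python
import math
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Valid 1-D convolution (cross-correlation form, as in most DL libraries)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def batch_norm(values: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return [(v - m) / math.sqrt(var + eps) for v in values]


def relu(values: Sequence[float]) -> List[float]:
    """Nonlinear activation: clamp negatives to zero."""
    return [max(0.0, v) for v in values]


def max_pool(values: Sequence[float], size: int = 2) -> List[float]:
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


def parse_branch(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Convolution -> normalization -> activation -> pooling."""
    return max_pool(relu(batch_norm(conv1d(signal, kernel))))


# A difference kernel responds where the signal changes.
features = parse_branch([0, 0, 1, 1, 0, 0], [1, -1])
print(features)
```

With the difference kernel above, the first pooled value is zero and the second is positive, because only the second window contains a positive response after activation. A real branch would stack several such units with learned kernels.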
For some exemplary design ideas, first, the big data acquisition system 100 performs penalty-term-based feature selection on the big data sample acquisition event data and the fuzzy noise acquisition feature points based on the noise acquisition feature point aggregation branch, and outputs first noise acquisition feature points. Then, the big data acquisition system 100 performs embedding processing on the big data sample acquisition event data and the fuzzy noise acquisition feature points based on the noise acquisition feature point aggregation branch, outputs cost evaluation indexes corresponding to the fuzzy noise acquisition feature points, performs feature relationship communication on the fuzzy noise acquisition feature points based on the cost evaluation indexes, and outputs second noise acquisition feature points. Finally, the big data acquisition system 100 aggregates the first noise acquisition feature points and the second noise acquisition feature points, and outputs the noise acquisition feature point distribution of the big data sample acquisition event data.
For example, the noise acquisition feature point aggregation branch may include a first function processing layer and a second function processing layer. The first function processing layer may be, for example, an FPN. The FPN combines the low-level fuzzy noise acquisition feature points with the noise acquisition feature point representation at the noise field penetration path level of the big data sample acquisition event data, so that the first noise acquisition feature points are obtained. The basic operational units of the FPN are likewise convolution layers, batch normalization, nonlinear activation and pooling layers. The second function processing layer may be an embedding processing layer; for example, the fuzzy noise acquisition feature points may be globally pooled and processed in an excitation-based manner, and the cost evaluation indexes are output. Finally, the big data acquisition system 100 combines the cost evaluation indexes with the fuzzy noise acquisition feature points, and outputs the second noise acquisition feature points.
Finally, the big data acquisition system 100 aggregates the first noise acquisition feature points and the second noise acquisition feature points, thereby obtaining the noise acquisition feature point distribution of the big data sample acquisition event data. Alternatively, after the fuzzy noise acquisition feature points of the big data sample acquisition event data are processed based on the FPN, the output first noise acquisition feature points may be input to the embedding processing layer to obtain the second noise acquisition feature points. In that case, the second noise acquisition feature points obtained by the embedding processing layer from the first noise acquisition feature points are output as the noise acquisition feature point distribution of the big data sample acquisition event data.
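The embedding processing layer described above (global pooling followed by excitation-based processing) can be illustrated with a minimal sketch. This is only an illustration under stated assumptions: the function names are hypothetical, the learned layers that a real excitation block would contain are omitted, each channel is a plain list of values, and element-wise addition is assumed as the aggregation of the first and second noise acquisition feature points.

```python
import math


def global_average_pool(feature_maps):
    """Collapse each channel (a list of values) to its mean."""
    return [sum(channel) / len(channel) for channel in feature_maps]


def excitation_weights(pooled):
    """Map pooled channel statistics to weights in (0, 1) with a sigmoid.

    Assumption: a trained model would apply learned layers before the
    sigmoid; they are left out to keep the sketch self-contained.
    """
    return [1.0 / (1.0 + math.exp(-value)) for value in pooled]


def second_feature_points(fuzzy_points):
    """Reweight every channel by its weight (the 'cost evaluation index')."""
    weights = excitation_weights(global_average_pool(fuzzy_points))
    return [[w * v for v in channel] for w, channel in zip(weights, fuzzy_points)]


def aggregate(first_points, second_points):
    """Assumed aggregation: element-wise sum of the two feature sets."""
    return [
        [a + b for a, b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(first_points, second_points)
    ]


fuzzy = [[0.0, 0.0], [1.0, 3.0]]   # two channels, two positions each
second = second_feature_points(fuzzy)
distribution = aggregate(fuzzy, second)
```

Here `fuzzy` also stands in for the first noise acquisition feature points, purely to keep the example short.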
For some exemplary design ideas, a big data cleaning method for AI cloud computing training provided by the examples of the present application includes the following steps.
Process210, obtaining an example to-be-learned noise point feature data sequence, which includes first example big data sample acquisition event data of an initial AI training initiation task, example noise localization element information of a sample acquisition target of a target sample acquisition instance in the first example big data sample acquisition event data, and second example big data sample acquisition event data of a target AI training initiation task.
For some exemplary design ideas, the initial AI training initiation task refers to an actual training node, and the target AI training initiation task refers to a training node generated by gray-scale online node simulation. The first example big data sample acquisition event data is then big data sample acquisition event data under the actual training node.
The first example big data sample acquisition event data refers to big data sample acquisition event data generated by gray-scale online node simulation, and may be automatically generated, for example, by a gray-scale online node simulation generation application. The example noise localization element information may be, for example, a noise localization element. The noise localization element of the sample acquisition target of the target sample acquisition instance in the first example big data sample acquisition event data is likewise automatically labeled by the gray-scale online node simulation generation application. With this design, both the first example big data sample acquisition event data and the example noise localization element information of the sample acquisition target of the target sample acquisition instance in it are generated by gray-scale online node simulation, which improves the efficiency of processing big data sample acquisition event data compared with collecting and labeling it manually.
In addition, the second example big data sample acquisition event data is big data sample acquisition event data under an actual training node. The second example big data sample acquisition event data may be selected arbitrarily by the big data acquisition system 100 from a big data sample acquisition event database, in which all big data sample acquisition event data are collected under the actual training node.
Process220, performing model parameter layer tuning and selection on a noise decision initialization model based on noise decision training with the second example big data sample acquisition event data and noise decision capability learning with the first example big data sample acquisition event data.
For example, noise decision training refers to optimizing the model parameter layer information of the noise decision initialization model based on the true noise field distribution of the second example big data sample acquisition event data. Noise decision capability learning means that a second target noise learning cost value is calculated based on the first example big data sample acquisition event data, and model parameter layer tuning and selection are performed on the noise decision initialization model based on the second target noise learning cost value.
For some exemplary design considerations, the big data acquisition system 100 determines a first target noise learning cost value of the noise decision initialization model based on the true noise field distribution of the second example big data sample acquisition event data. The big data acquisition system 100 outputs a second target noise learning cost value of the noise decision initialization model based on the first example big data sample acquisition event data and the example noise localization element information of the sample acquisition target of the target sample acquisition instance in the first example big data sample acquisition event data. The big data acquisition system 100 then performs model parameter layer tuning and selection on the noise decision initialization model based on the first target noise learning cost value and the second target noise learning cost value.
For some exemplary design considerations, the big data acquisition system 100 performs feature analysis on the first example big data sample acquisition event data based on a noise decision initialization model, and outputs a first noise interpretation feature of the first example big data sample acquisition event data. The big data acquisition system 100 then determines a second target noise learning cost value for the noise decision initialization model based on the first noise interpretation features and the example noise localization element information.
For some exemplary design considerations, the noise decision initialization model may be, for example, an initialization network model that enables noise decision, thereby outputting a distribution of all noise acquisition feature points in the large data sample acquisition event data that are likely to have noise features.
For some exemplary design considerations, the noise decision initialization model may include a basic noise acquisition feature point analysis branch and a basic noise acquisition feature point aggregation branch. The step in which the big data acquisition system 100 performs feature analysis on the first example big data sample acquisition event data of the initial AI training initiation task based on the noise decision initialization model, and outputs the first noise interpretation feature of the first example big data sample acquisition event data, may include, for example: the big data acquisition system 100 performs feature analysis on the first example big data sample acquisition event data based on the basic noise acquisition feature point analysis branch, and outputs fuzzy noise acquisition feature points of the first example big data sample acquisition event data; the big data acquisition system 100 aggregates the fuzzy noise acquisition feature points of the first example big data sample acquisition event data based on the basic noise acquisition feature point aggregation branch, and outputs the first noise interpretation feature of the first example big data sample acquisition event data. For the architectures of the basic noise acquisition feature point analysis branch and the basic noise acquisition feature point aggregation branch, reference may be made to the architectures of the noise acquisition feature point analysis branch and the noise acquisition feature point aggregation branch described above, respectively.
The basic noise collection feature point aggregation branch may further include a first basic function processing layer and a second basic function processing layer, where the structure of the first basic function processing layer may be specifically referred to the first function processing layer, and the structure of the second basic function processing layer may be specifically referred to the second function processing layer.
For some exemplary design considerations, the big data acquisition system 100 performs feature analysis on the second example big data sample acquisition event data based on a noise decision initialization model, and outputs a second noise interpretation feature of the second example big data sample acquisition event data. The big data acquisition system 100 then determines a first target noise learning cost value for the noise decision initialization model based on the true noise field distribution of the second noise interpretation features.
For some exemplary design considerations, the execution step of the big data acquisition system 100 "performing feature analysis on the second example big data sample acquisition event data of the target AI training initiation task based on the noise decision initialization model, and outputting the second noise interpretation feature of the second example big data sample acquisition event data" may specifically refer to the execution step of the big data acquisition system 100 "performing feature analysis on the first example big data sample acquisition event data of the initial AI training initiation task based on the noise decision initialization model, and outputting the first noise interpretation feature of the first example big data sample acquisition event data" in the Process220, which is not described herein again in this embodiment of the present application.
Wherein, in a noise decision learning phase of a noise decision initialization model based on first example big data sample acquisition event data of an initial AI training initiation task and second example big data sample acquisition event data of a target AI training initiation task, the first example big data sample acquisition event data and the second example big data sample acquisition event data are input into the noise decision initialization model simultaneously. For some exemplary design considerations, the data of one noise decision learning stage includes a plurality of first example big data sample acquisition event data and an equal number of second example big data sample acquisition event data, and of course, the number of the first example big data sample acquisition event data and the second example big data sample acquisition event data input into the noise decision initialization model in each noise decision learning stage may not be consistent, which is not specifically limited in this application.
Process230, when the tuned and selected noise decision initialization model matches the model deployment condition, outputting that noise decision initialization model as the noise acquisition feature point decision model, performing noise prediction on input big data sample acquisition event data based on the noise acquisition feature point decision model, and outputting the noise acquisition feature point distribution of the big data sample acquisition event data.
For some exemplary design considerations, the first target noise learning cost value comprises a third noise learning cost value and a fourth noise learning cost value, and the second target noise learning cost value comprises a first noise learning cost value and a second noise learning cost value. The big data acquisition system 100 acquires a first cost evaluation index corresponding to the first noise learning cost value, a second cost evaluation index corresponding to the second noise learning cost value, a third cost evaluation index corresponding to the third noise learning cost value, and a fourth cost evaluation index corresponding to the fourth noise learning cost value. Then, the big data acquisition system 100 performs comprehensive judgment on the learning cost values of the second target noise learning cost value and the first target noise learning cost value based on the first cost evaluation index, the second cost evaluation index, the third cost evaluation index and the fourth cost evaluation index, and outputs a target noise learning cost value. Finally, the big data acquisition system 100 performs model parameter layer tuning and selection on the noise decision initialization model based on the target noise learning cost value. Subsequently, when the noise decision initialization model after model parameter layer tuning and selection matches the model deployment condition, it is output as the noise acquisition feature point decision model.
The model deployment condition may be, for example: when the number of rounds of model parameter layer tuning and selection of the noise decision initialization model reaches a threshold number, for example 300, the noise decision initialization model matches the model deployment condition; when the difference metric value between the predicted big data acquisition cleaning decision information and the actual big data acquisition cleaning decision information corresponding to each example big data sample acquisition event data is smaller than a difference metric value threshold, the noise decision initialization model matches the model deployment condition; and when the difference between the predicted big data acquisition cleaning decision information that the noise decision initialization model outputs for the selected big data sample acquisition event data after the two most recent rounds of model parameter layer tuning and selection is smaller than a preset difference, the noise decision initialization model matches the model deployment condition. The example big data sample acquisition event data may be the first example big data sample acquisition event data or the second example big data sample acquisition event data.
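The three example deployment conditions can be checked together as in the following sketch. It assumes that satisfying any one condition is sufficient, that predictions are lists of numbers, and that the difference metric is a mean absolute difference; the function names and the two numeric thresholds other than 300 are hypothetical.

```python
def mean_abs_diff(a, b):
    """Assumed difference metric: mean absolute difference of two lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def matches_deployment_condition(
    tuning_rounds,
    predicted,
    actual,
    previous_predicted=None,
    max_rounds=300,           # threshold number of tuning rounds from the text
    error_threshold=0.05,     # hypothetical difference metric threshold
    change_threshold=1e-3,    # hypothetical preset difference
):
    # Condition 1: the number of tuning and selection rounds reached the threshold.
    if tuning_rounds >= max_rounds:
        return True
    # Condition 2: predictions are close enough to the actual decision information.
    if mean_abs_diff(predicted, actual) < error_threshold:
        return True
    # Condition 3: predictions barely changed between the two most recent rounds.
    if previous_predicted is not None:
        return mean_abs_diff(predicted, previous_predicted) < change_threshold
    return False
```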
For some exemplary design considerations, the present application provides a flow of a method for determining a first target noise learning cost value, which is applied to the big data acquisition system 100, and corresponds to a specific embodiment of the Process220, and includes the following steps.
Process310, performing recursive feature elimination on the second forward interpretation feature or the second backward interpretation feature included in the second noise interpretation feature, and outputting the second forward interpretation feature or the second backward interpretation feature after recursive feature elimination.
For some exemplary design considerations, the big data acquisition system 100 performs noise prediction on the second example big data sample acquisition event data based on the noise decision initialization model, and outputs a second noise interpretation feature of the second example big data sample acquisition event data. The second noise interpretation feature comprises a second forward interpretation feature and a second backward interpretation feature. The big data acquisition system 100 may perform recursive feature elimination on the second forward interpretation feature, and output the second forward interpretation feature after recursive feature elimination. Correspondingly, the big data acquisition system 100 may also perform recursive feature elimination on the second backward interpretation feature, and output the second backward interpretation feature after recursive feature elimination.
Process320, calculating a true noise field distribution for each of a plurality of interpretation feature members, and determining a third noise learning cost value based on the true noise field distributions of all the interpretation feature members, the noise field link range of the second noise interpretation feature and the noise field penetration path.
For some exemplary design considerations, the second forward interpretation feature after recursive feature elimination comprises a plurality of interpretation feature members, each interpretation feature member corresponding to one sample acquisition event unit data in the second example big data sample acquisition event data. The big data acquisition system 100 calculates the true noise field distribution (information entropy) for each of the plurality of interpretation feature members.
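A minimal sketch of the entropy-based third noise learning cost value follows. It assumes that each interpretation feature member is already a probability vector, and that the noise field link range and the noise field penetration path act as the two normalizing dimensions; both points are assumptions, as are the function names.

```python
import math


def member_entropy(probs):
    """Information entropy of one interpretation feature member."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def third_noise_learning_cost(members, link_range, penetration_path):
    """Sum of member entropies, normalized by the two assumed dimensions."""
    total = sum(member_entropy(member) for member in members)
    return total / (link_range * penetration_path)


members = [[0.5, 0.5], [1.0, 0.0]]   # an uncertain member and a certain one
cost = third_noise_learning_cost(members, link_range=1, penetration_path=2)
```

Minimizing such a cost pushes the predictions on unlabeled data toward confident (low-entropy) outputs.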
Process330, calculating a maximum square value of the noise learning cost value for each of the plurality of interpretation feature members, and determining a fourth noise learning cost value based on the maximum square values of all the interpretation feature members, the noise field link range of the second noise interpretation feature, and the noise field penetration path.
For some exemplary design considerations, the second forward interpretation feature after recursive feature elimination comprises a plurality of interpretation feature members, each interpretation feature member corresponding to one sample acquisition event unit data in the second example big data sample acquisition event data. The big data acquisition system 100 calculates a maximum square value of the noise learning cost value for each of the plurality of interpretation feature members.
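The text does not define the maximum square value precisely. One possible reading, shown here purely as an assumption, is that it is the negative half of the sum of squared probabilities of a member, with the noise field link range and the noise field penetration path again used as normalizing dimensions. All names are hypothetical.

```python
def member_max_square(probs):
    """Assumed form: negative half of the sum of squared probabilities."""
    return -0.5 * sum(p * p for p in probs)


def fourth_noise_learning_cost(members, link_range, penetration_path):
    """Sum of per-member values, normalized by the two assumed dimensions."""
    total = sum(member_max_square(member) for member in members)
    return total / (link_range * penetration_path)


members = [[0.5, 0.5], [1.0, 0.0]]
cost = fourth_noise_learning_cost(members, link_range=1, penetration_path=2)
```

Under this reading the value is lowest for a fully confident member, so minimizing it has a similar effect to minimizing entropy.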
Process340, outputting a first target noise learning cost value of the noise decision initialization model based on the third noise learning cost value and the fourth noise learning cost value.
For some exemplary design ideas, the big data acquisition system 100 obtains a third cost evaluation index corresponding to a third noise learning cost value, and obtains a fourth cost evaluation index corresponding to a fourth noise learning cost value. Then, the big data acquisition system 100 performs comprehensive judgment on the learning cost value of the third noise learning cost value and the fourth noise learning cost value based on the third cost evaluation index and the fourth cost evaluation index, and outputs the first target noise learning cost value of the noise decision initialization model.
Finally, the big data acquisition system 100 performs weighted calculation on the second target noise learning cost value and the first target noise learning cost value, and outputs a target noise learning cost value.
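The weighted calculation can be sketched as follows, assuming that each cost evaluation index acts as a scalar weight on its noise learning cost value and that the two target cost values are added. The function names are hypothetical.

```python
def weighted_cost(costs, indexes):
    """Weighted sum of cost values; each cost evaluation index is a weight."""
    if len(costs) != len(indexes):
        raise ValueError("each cost value needs exactly one cost evaluation index")
    return sum(w * c for w, c in zip(indexes, costs))


def target_noise_learning_cost(c1, c2, c3, c4, w1, w2, w3, w4):
    # Second target cost value: first and second noise learning cost values.
    second_target = weighted_cost([c1, c2], [w1, w2])
    # First target cost value: third and fourth noise learning cost values.
    first_target = weighted_cost([c3, c4], [w3, w4])
    return second_target + first_target


total = target_noise_learning_cost(1.0, 2.0, 3.0, 4.0, 0.5, 0.5, 0.25, 0.25)
```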
For some exemplary design considerations, the present application provides a method for determining a second target noise learning cost value, where the method is applied to the big data collecting system 100, and the method includes the following steps, corresponding to a specific embodiment of the Process 220.
Process410, outputting a first noise learning cost value based on the first forward interpretation feature, the multi-party coupled noise term of the sample acquisition target of the target sample acquisition instance, and the quantity of the first example big data sample acquisition event data.
The example noise localization element information of the sample acquisition target of the target sample acquisition instance in the first example big data sample acquisition event data specifically includes the noise field link range, the noise field penetration path, and the multi-party coupled noise term of the sample acquisition target of the target sample acquisition instance in the first example big data sample acquisition event data.
For some exemplary design considerations, the big data acquisition system 100 performs noise prediction on the first example big data sample acquisition event data based on the noise decision initialization model, and outputs a first noise interpretation feature of the first example big data sample acquisition event data. The first noise interpretation feature comprises a first forward interpretation feature, which comprises, for each sample acquisition event unit data in the first example big data sample acquisition event data, the decision support for the multi-party coupled noise term of the sample acquisition target of the target sample acquisition instance.
Process420, outputting a second noise learning cost value based on the first backward interpretation feature, the quantity of the first example big data sample acquisition event data, the noise field link range of the noise localization element, and the noise field penetration path.
For some exemplary design considerations, the big data acquisition system 100 performs noise prediction on the first example big data sample acquisition event data based on the noise decision initialization model, and outputs a first noise interpretation feature of the first example big data sample acquisition event data. The first noise interpretation feature comprises a first backward interpretation feature, which comprises the noise field link range and noise field penetration path data corresponding to each sample acquisition event unit data in the first example big data sample acquisition event data. The example noise localization element information of the sample acquisition target of the target sample acquisition instance in the first example big data sample acquisition event data includes the noise field link range, the noise field penetration path, and the multi-party coupled noise term of the noise localization element of that sample acquisition target.
Process430, outputting a second target noise learning cost value of the noise decision initialization model based on the first noise learning cost value and the second noise learning cost value.
For some exemplary design considerations, the big data acquisition system 100 obtains a first cost evaluation index corresponding to a first noise learning cost value and obtains a second cost evaluation index corresponding to a second noise learning cost value. Then, the big data acquisition system 100 comprehensively determines the learning cost value of the first noise learning cost value and the second noise learning cost value based on the first cost evaluation index and the second cost evaluation index, and outputs a second target noise learning cost value of the noise decision initialization model.
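The flow of Process410 to Process430 can be illustrated with the sketch below. It rests on several assumptions that the text does not confirm: the first forward interpretation feature is treated as a list of class probability vectors, the multi-party coupled noise term as a class index, the first noise learning cost value as an averaged cross-entropy, and the second noise learning cost value as an averaged absolute error on (link range, penetration path) pairs. All names are hypothetical.

```python
import math


def first_noise_learning_cost(forward_features, noise_terms):
    """Assumed cross-entropy, averaged over the quantity of samples."""
    n = len(forward_features)
    return -sum(
        math.log(max(probs[label], 1e-12))   # clamp to avoid log(0)
        for probs, label in zip(forward_features, noise_terms)
    ) / n


def second_noise_learning_cost(backward_features, labeled_elements):
    """Assumed absolute error on (link range, penetration path), averaged."""
    n = len(backward_features)
    return sum(
        abs(pred_range - true_range) + abs(pred_path - true_path)
        for (pred_range, pred_path), (true_range, true_path)
        in zip(backward_features, labeled_elements)
    ) / n


def second_target_cost(cost1, cost2, index1=1.0, index2=1.0):
    """Weighted combination using the first and second cost evaluation indexes."""
    return index1 * cost1 + index2 * cost2


c1 = first_noise_learning_cost([[0.5, 0.5]], [0])
c2 = second_noise_learning_cost([(1.0, 2.0)], [(1.5, 2.0)])
combined = second_target_cost(c1, c2, index1=1.0, index2=2.0)
```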
For some exemplary design ideas, the embodiment of the present application further provides a big data collection and cleaning method based on artificial intelligence, which includes the following steps.
STEP110, acquiring the cleaning service node information distributed in each relevant collecting and cleaning field in the corresponding target big data collecting and cleaning control model based on the big data collecting and cleaning decision information.
For some exemplary design ideas, the cleaning service node information corresponding to the distribution of each relevant collection cleaning field may be determined based on a preset cleaning service node mapping relationship library. The cleaning service node information may include a sequence formed by each service node that needs to be configured for big data collection and cleaning.
STEP120, based on the information of the cleaning service node distributed in each relevant acquisition cleaning field, determining the cleaning control path distributed in each relevant acquisition cleaning field.
For some exemplary design ideas, the cleaning control path may be obtained by processing cleaning service node information based on a knowledge graph algorithm, and the cleaning control path may represent a cleaning control association relationship between each cleaning service node.
STEP130, performing related node communication on the cleaning control paths distributed in the related acquisition cleaning fields, and outputting the target cleaning control path of the target big data acquisition cleaning control model.
For some exemplary design ideas, performing related node communication on the cleaning control paths distributed in the related acquisition cleaning fields yields the target cleaning control path of the whole target big data acquisition cleaning control model. The target cleaning control path can reflect the relationship information, within the big data acquisition cleaning control model, of the cleaning control instances of the cleaning service node information distributed in the related acquisition cleaning fields, and can therefore accurately express the big data acquisition cleaning logic information of the target big data acquisition cleaning control model.
STEP140, performing model control instruction allocation on the target big data acquisition cleaning control model based on the target cleaning control path, and outputting at least one model control instruction of the target big data acquisition cleaning control model.
STEP150, based on at least one model control instruction of the target big data acquisition and cleaning control model, performs corresponding big data acquisition and cleaning configuration for the AI cloud computing training node 200.
For some exemplary design ideas, since the target cleaning control path is obtained by aggregating cleaning control paths distributed based on the respective related collection cleaning fields, the model control instruction allocation of the target big data collection cleaning control model executed based on the target cleaning control path may be more accurate.
For some exemplary design considerations, a specific implementation embodiment of STEP120 can be seen in the following description.
STEP210, for the cleaning service node information distributed in each relevant acquisition cleaning field, performing cleaning control instance information generation on the cleaning service node information distributed in the relevant acquisition cleaning field, and outputting the cleaning control instance information corresponding to the cleaning service node information distributed in the relevant acquisition cleaning field.
For some exemplary design ideas, the cleaning service node information distributed in each relevant collection cleaning field is respectively input into a cleaning control decision model meeting the requirement of model convergence, one or more times of cleaning control feature point output is performed based on the cleaning control decision model so as to perform feature analysis on the cleaning service node information, and cleaning control instance information corresponding to the cleaning service node information distributed in the relevant collection cleaning field is output.
STEP220, performing big data acquisition cleaning instance relation variable analysis on the cleaning control instance information of the cleaning service node information distributed in the relevant acquisition cleaning field, and outputting at least one big data acquisition cleaning instance relation variable of the cleaning control instance information of the cleaning service node information distributed in the relevant acquisition cleaning field.
STEP230, extracting a logic characteristic diagram of each big data acquisition cleaning instance relation variable in the cleaning control instance information of the cleaning service node information distributed in the relevant acquisition cleaning field, and outputting the logic characteristic diagram of each big data acquisition cleaning instance relation variable of the cleaning service node information distributed in the relevant acquisition cleaning field.
For some exemplary design ideas, after identifying the big data collection cleaning instance relation variables, logical feature diagram generation may be performed on instance variables corresponding to the big data collection cleaning instance relation variables in the cleaning control instance information, and logical feature diagrams corresponding to the big data collection cleaning instance relation variables are output.
STEP240, based on the mapping information of each big data acquisition cleaning instance relation variable of the cleaning service node information distributed in the relevant acquisition cleaning field to the target big data acquisition cleaning control model, aggregating the cleaning control instance information of the cleaning service node information distributed in the relevant acquisition cleaning field and the logic characteristic graph of each big data acquisition cleaning instance relation variable, and outputting the cleaning control path distributed in the relevant acquisition cleaning field.
For some exemplary design considerations, a specific implementation embodiment of STEP240 can be seen in the following description.
STEP241, outputting a feature relationship corresponding to each big data acquisition cleaning instance relationship variable of the cleaning service node information distributed in the relevant acquisition cleaning field, based on mapping information of each big data acquisition cleaning instance relationship variable of the cleaning service node information distributed in the relevant acquisition cleaning field to the target big data acquisition cleaning control model.
STEP242, based on the characteristic relationship, performing characteristic relationship communication on the cleaning control instance information of the cleaning service node information distributed in the relevant acquisition cleaning field and the logic characteristic graph of the relationship variable of each big data acquisition cleaning instance, and outputting the cleaning control path distributed in the relevant acquisition cleaning field.
For some exemplary design considerations, a specific implementation embodiment of STEP130 can be seen in the following description.
STEP131, clustering the cleaning control paths distributed in the relevant acquisition cleaning fields, outputting at least one cluster, and determining the cluster center path variable serving as the cluster center of each cluster.
STEP132, for each cluster, calculating a cleaning control path node variable between each non-cluster-center path variable and the cluster center path variable in the cluster, and outputting a cleaning control path node variable set of the cluster.
STEP133, aggregating the cleaning control path node variable sets of the clusters, and outputting the target cleaning control path of the target big data acquisition cleaning control model.
For some exemplary design considerations, a specific implementation embodiment of STEP131 can be found in the following description.
STEP1311, determining the number N of clusters, where N is a positive integer greater than or equal to 2.
STEP1312, selecting N cleaning control paths from the cleaning control paths distributed in the relevant collecting cleaning fields, and outputting the N cleaning control paths as cluster center path variables of N clusters respectively.
STEP1313, calculating the association degree between the cleaning control path distributed by each relevant acquisition cleaning field and each cluster center path variable.
For some exemplary design considerations, the association degree between a cleaning control path and a cluster center path variable may represent the degree of matching between the two. The greater the association degree, the greater the degree of matching. The association degree between the cleaning control path and the cluster center path variable may be calculated based on a cosine distance, a Euclidean distance, or the like.
STEP1314, assigning each cleaning control path to the cluster whose cluster center path variable has the greatest association degree with that cleaning control path, and outputting N clusters.
STEP1315, for each cluster, selecting from the cluster a cleaning control path that meets a cluster center condition and outputting it as the new cluster center path variable; returning to the step of calculating the association degree between the cleaning control path distributed by each relevant acquisition cleaning field and each cluster center path variable until the cluster center path variable of each cluster matches a clustering end condition; and determining N clusters and obtaining the cluster center path variable output as the cluster center in each cluster.
For some exemplary design ideas, for each cluster, it is checked whether the latest cluster center path variable of the cluster is consistent with the cluster center path variable adopted in the previous iteration of the clustering process, that is, whether the distance between the latest cluster center path variable and the previous cluster center path variable is 0. If the two are consistent, the cluster center of the cluster is considered to have converged. If the cluster centers of all clusters have converged, the clustering process is complete, N clusters are output, and the cluster center path variable output as the cluster center in each cluster is obtained. If the cluster center of any cluster has not converged, the process returns to STEP1313 until the cluster center of every cluster converges.
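The iterative procedure of STEP1311 to STEP1315 can be illustrated with a minimal sketch. It is not the implementation of the embodiment: it assumes that each cleaning control path has already been encoded as a numeric vector, that the association degree is the cosine similarity, that the initial centers are simply the first N paths, and that the "cluster center condition" selects the member with the highest total similarity to the other members of its cluster (a medoid). The names `cosine_similarity` and `cluster_paths` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Association degree between two path vectors (assumed: cosine similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_paths(paths, n_clusters, max_iter=100):
    """Cluster path vectors around N center paths, iterating until centers stop changing."""
    # STEP1311: N must be an integer >= 2 (and no larger than the number of paths).
    if n_clusters < 2 or n_clusters > len(paths):
        raise ValueError("n_clusters must be >= 2 and <= len(paths)")
    # STEP1312: pick N paths as initial centers (assumed rule: the first N paths).
    centers = [list(p) for p in paths[:n_clusters]]
    clusters = []
    for _ in range(max_iter):
        # STEP1313 and STEP1314: assign each path to its most associated center.
        clusters = [[] for _ in range(n_clusters)]
        for p in paths:
            best = max(range(n_clusters),
                       key=lambda k: cosine_similarity(p, centers[k]))
            clusters[best].append(p)
        # STEP1315: choose a new center per cluster (assumed condition: medoid).
        new_centers = []
        for k, members in enumerate(clusters):
            if not members:
                new_centers.append(centers[k])  # keep the old center for an empty cluster
                continue
            medoid = max(members,
                         key=lambda m: sum(cosine_similarity(m, o) for o in members))
            new_centers.append(list(medoid))
        # Convergence check: every new center equals the previous one.
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers
```

Choosing a member path as the center, rather than an averaged vector, keeps every cluster center an actual cleaning control path, which matches the wording of STEP1315; an averaged centroid would be an equally plausible reading.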
For some exemplary design ideas, in STEP220, the cleaning control instance information corresponding to the cleaning service node information distributed in the relevant acquisition cleaning field may be generated and output based on a target neural network model.
Likewise, in STEP220, the big data acquisition cleaning instance relation variable analysis performed on the cleaning control instance information of the cleaning service node information distributed in the relevant acquisition cleaning field, which outputs at least one big data acquisition cleaning instance relation variable of that cleaning control instance information, may be performed based on the target neural network model.
In STEP140, the model control instruction distribution performed on the target big data acquisition cleaning control model based on the target cleaning control path, which outputs at least one model control instruction of the target big data acquisition cleaning control model, may also be performed based on the target neural network model.
For some exemplary design considerations, the target neural network model may be a residual network, a densely connected convolutional network, and so on.
For some exemplary design ideas, the embodiment of the present application further provides a big data acquisition cleaning training method based on artificial intelligence, which includes the following steps.
STEP401, obtaining example big data sample acquisition event data, where the example big data sample acquisition event data includes cleaning service node information of a target big data acquisition cleaning control model and an actual model control instruction corresponding to the target big data acquisition cleaning control model.
STEP402, based on the target neural network model, performing cleaning control instance information generation on cleaning service node information of the target big data acquisition cleaning control model, outputting cleaning control instance information corresponding to the cleaning service node information of the target big data acquisition cleaning control model, performing big data acquisition cleaning instance relation variable analysis on the cleaning control instance information of the cleaning service node information of the target big data acquisition cleaning control model, and outputting at least one analyzed big data acquisition cleaning instance relation variable of the cleaning control instance information of the cleaning service node information of the target big data acquisition cleaning control model.
STEP403, performing feature analysis on each analytic big data acquisition cleaning instance relation variable in the cleaning control instance information of the cleaning service node information of the target big data acquisition cleaning control model, outputting a logic feature graph of each analytic big data acquisition cleaning instance relation variable of the cleaning service node information of the target big data acquisition cleaning control model, aggregating the cleaning control instance information of the cleaning service node information of the target big data acquisition cleaning control model and the logic feature graph of each analytic big data acquisition cleaning instance relation variable to the target big data acquisition cleaning control model based on each analytic big data acquisition cleaning instance relation variable of the cleaning service node information of the target big data acquisition cleaning control model, and outputting a cleaning control path of the cleaning service node information of the target big data acquisition cleaning control model.
STEP404, performing relevant node communication on the cleaning control path of the cleaning service node information of each target big data acquisition cleaning control model, and outputting the target cleaning control path of the target big data acquisition cleaning control model.
STEP405, outputting execution performance values of the target big data acquisition cleaning control model on each preset model control instruction based on the target cleaning control path.
STEP406, calculating a first noise learning cost value between the execution performance value and the execution performance value of the actual model control instruction of the target big data acquisition cleaning control model.
STEP407, calculating a gradient value of the first noise learning cost value with respect to the target cleaning control path of the target big data acquisition cleaning control model, and calculating a confidence sequence corresponding to cleaning control instance information of the cleaning service node information of the target big data acquisition cleaning control model based on the gradient value.
STEP408, outputting model control instruction information of the target big data acquisition and cleaning control model based on the execution performance value of the target big data acquisition and cleaning control model.
STEP409, when the model control instruction information of the target big data acquisition cleaning control model is consistent with the actual model control instruction, based on the confidence sequence, obtaining a big data acquisition cleaning instance relation variable of cleaning control instance information of cleaning service node information of the target big data acquisition cleaning control model, and setting the obtained big data acquisition cleaning instance relation variable as the actual big data acquisition cleaning instance relation variable of the cleaning service node information of the target big data acquisition cleaning control model.
STEP410, when the model control instruction information of the target big data acquisition cleaning control model is inconsistent with the actual model control instruction, based on the confidence sequence, acquiring a non-big data acquisition cleaning instance relation variable of the cleaning control instance information of the cleaning service node information of the target big data acquisition cleaning control model, and setting the acquired non-big data acquisition cleaning instance relation variable as the non-actual big data acquisition cleaning instance relation variable of the cleaning service node information of the target big data acquisition cleaning control model.
STEP411, based on the actual big data acquisition cleaning instance relation variable and the non-actual big data acquisition cleaning instance relation variable, calculating a second noise learning cost value of the analytic big data acquisition cleaning instance relation variable of the cleaning service node information of the target big data acquisition cleaning control model.
STEP412, based on the first noise learning cost value and the second noise learning cost value, performing model parameter layer tuning and selection on the target neural network model, and outputting the target neural network model matched with a preset training termination condition.
For some exemplary design ideas, a back propagation algorithm may be adopted to perform model parameter layer tuning and selection on a target neural network model, so that a first noise learning cost value between an execution performance value obtained based on the target neural network model and an execution performance value of an actual model control instruction is smaller than a target cost value, and the target cost value may be set as small as possible to improve the performance of the target neural network model.
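The stopping rule described here, tuning parameters until the cost value falls below a target cost value, can be sketched as follows. This is only an illustration under stated assumptions: a two-parameter linear model stands in for the target neural network model, a mean squared error stands in for the first noise learning cost value, and the update is plain gradient descent, which is the single-layer special case of back propagation. The names `mean_squared_cost` and `train_until_target` and all numeric defaults are hypothetical.

```python
def mean_squared_cost(w, b, xs, ys):
    """Stand-in cost: mean squared gap between predicted and actual values."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_until_target(xs, ys, target_cost=1e-4, lr=0.05, max_steps=5000):
    """Update (w, b) by gradient descent until the cost drops below target_cost."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(max_steps):
        # Training termination condition: cost smaller than the target cost value.
        if mean_squared_cost(w, b, xs, ys) < target_cost:
            break
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared cost with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, mean_squared_cost(w, b, xs, ys)
```

A smaller `target_cost` gives a closer fit at the price of more update steps, which is the trade-off behind setting the target cost value "as small as possible".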
Generally, if the execution performance value output by the target neural network model for the target big data acquisition cleaning control model on a certain preset model control instruction exceeds a threshold, that preset model control instruction may be regarded as a model control instruction of the target big data acquisition cleaning control model. In the noise decision learning stage of the target neural network model, if the model control instruction information decided by the target neural network model is consistent with the actual model control instruction, that is, the distribution is correct, a confidence sequence can be obtained through analysis based on the parameters involved in the distribution process, big data acquisition cleaning instance relation variable analysis can be performed based on the confidence sequence, and the actual big data acquisition cleaning instance relation variable of the cleaning service node information of the target big data acquisition cleaning control model is output.
In the noise decision learning stage of the target neural network model, if the model control instruction information decided by the target neural network model is not matched with the actual model control instruction, that is, the model control instruction of the target big data acquisition and cleaning control model is wrongly distributed based on the target neural network model, a confidence sequence can be obtained through analysis based on the parameters involved in the distribution process, and the non-actual big data acquisition and cleaning instance relation variable of the cleaning service node information of the target big data acquisition and cleaning control model is obtained based on the confidence sequence.
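The threshold decision and the two labelling branches (STEP408 to STEP410) can be sketched as below. The sketch is an assumption-laden illustration, not the embodiment's method: execution performance values are taken to be a mapping from instruction name to score, the confidence sequence is taken to be a mapping from relation variable to confidence, and "obtaining relation variables based on the confidence sequence" is read as keeping the `top_k` most confident ones. The function names, the threshold of 0.5, and `top_k` are all hypothetical.

```python
def decide_instructions(performance, threshold=0.5):
    """Keep every preset instruction whose execution performance value exceeds the threshold."""
    return {name for name, value in performance.items() if value > threshold}


def label_relation_variables(predicted, actual, confidence, top_k=2):
    """Label the most confident relation variables as actual or non-actual.

    If the decided instructions match the actual ones, the selected variables
    are treated as actual relation variables; otherwise as non-actual ones.
    """
    # Assumed selection rule: the top_k variables by confidence.
    ranked = sorted(confidence, key=lambda v: confidence[v], reverse=True)[:top_k]
    if predicted == actual:
        return {"actual": ranked, "non_actual": []}
    return {"actual": [], "non_actual": ranked}
```

The two returned lists correspond to the inputs from which the second noise learning cost value of STEP411 would then be computed.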
In some embodiments, big data acquisition system 100 may include a processor 110, a machine-readable storage medium 120, a bus 130, and a communication unit 140.
The processor 110 may perform various suitable actions and processes in accordance with a program stored in the machine-readable storage medium 120, such as program instructions related to the big data cleansing method for AI cloud computing training described in the foregoing embodiments. The processor 110, the machine-readable storage medium 120, and the communication unit 140 perform signal transmission through the bus 130.
In particular, the processes described in the above exemplary flow diagrams may be implemented as computer software programs, according to embodiments of the present invention. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication unit 140, and when executed by the processor 110, performs the above-described functions defined in the methods of the embodiments of the present invention.
The invention further provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when the computer-executable instructions are executed by a processor, the computer-executable instructions are used for implementing the big data cleansing method for AI cloud computing training according to any one of the above embodiments.
Yet another embodiment of the present invention further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the big data cleansing method for AI cloud computing training according to any of the above embodiments.
It should be understood that, although the various operation steps are indicated by arrows in the flow chart of the embodiment of the present invention, the implementation order of the steps is not limited to the order indicated by the arrows. In some implementation scenarios of embodiments of the present invention, the implementation steps in the flowcharts may be performed in other sequences as needed, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include several sub-steps or several stages according to an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, and individual ones of these sub-steps or stages may also be performed at different times. In a scenario where the execution time is different, the execution sequence of the sub-steps or phases may be flexibly configured according to requirements, which is not limited in the embodiment of the present invention.
The foregoing describes only optional embodiments of some implementation scenarios of the present invention. It should be noted that other similar implementation means adopted by those skilled in the art based on the technical idea of the present invention, without departing from that technical idea, also fall within the protection scope of the embodiments of the present invention.

Claims (10)

1. A big data cleaning method for AI cloud computing training is applied to a big data acquisition system, and comprises the following steps:
acquiring big data sample acquisition event data of a target AI training initiating task after receiving a training noise indication output by the target AI training initiating task of the AI cloud computing training node;
carrying out noise prediction on the big data sample acquisition event data, and outputting noise acquisition characteristic point distribution of the big data sample acquisition event data, wherein the noise acquisition characteristic point distribution comprises noise positioning element information of a sample acquisition target of a target sample acquisition example in the big data sample acquisition event data;
determining big data collection cleaning decision information of the big data sample collection event data based on the noise collection characteristic point distribution, wherein the big data collection cleaning decision information comprises collection cleaning field distribution of a sample collection target of the target sample collection example in the big data sample collection event data;
and carrying out corresponding big data acquisition cleaning configuration on the AI cloud computing training node based on the big data acquisition cleaning decision information.
2. The big data cleaning method for AI cloud computing training according to claim 1, wherein the step of performing noise prediction on the big data sample acquisition event data and outputting the noise acquisition feature point distribution of the big data sample acquisition event data specifically includes:
and performing noise prediction on the big data sample acquisition event data based on a noise acquisition characteristic point decision model, and outputting noise acquisition characteristic point distribution of the big data sample acquisition event data, wherein the noise acquisition characteristic point decision model is output by performing noise point prediction training on the example big data sample acquisition event data of the target AI training initiating task.
3. The big data cleansing method for AI cloud computing training of claim 2, wherein prior to the noise prediction of the big data sample acquisition event data based on a noise acquisition feature point decision model, the method further comprises:
acquiring a noise point characteristic data sequence to be learned of an example, wherein the noise point characteristic data sequence to be learned of the example comprises first example big data sample acquisition event data of an initial AI training initiating task and second example big data sample acquisition event data of a target AI training initiating task;
and when the noise decision initialization model matches a model deployment condition, outputting the noise decision initialization model as the noise acquisition feature point decision model, wherein the noise acquisition feature point decision model is used for deciding the noise positioning element information of the sample acquisition target of the target sample acquisition instance in the big data sample acquisition event data of the target AI training initiating task.
4. The big data cleaning method for AI cloud computing training according to claim 3, wherein the example to-be-learned noise point feature data sequence further includes example noise localization element information of a sample acquisition target of a target sample acquisition instance in the first example big data sample acquisition event data;
the step of performing model parameter layer tuning and selection on a noise decision initialization model based on training a noise decision for the second example big data sample acquisition event data and learning a noise decision capability for the first example big data sample acquisition event data specifically comprises:
performing feature analysis on the second example big-data sample acquisition event data based on the noise decision initialization model, and outputting a second noise interpretation feature of the second example big-data sample acquisition event data;
performing recursive feature elimination on a second forward interpretation feature or a second backward interpretation feature included in the second noise interpretation feature, and outputting the second forward interpretation feature or the second backward interpretation feature after the recursive feature elimination, wherein the second forward interpretation feature or the second backward interpretation feature after the recursive feature elimination includes a plurality of interpretation feature members, and each interpretation feature member corresponds to one sample acquisition event unit data in the second example big data sample acquisition event data;
respectively calculating real noise field distribution for each interpretation feature member in the plurality of interpretation feature members, and determining a third noise learning cost value based on the real noise field distribution of all interpretation feature members, the noise field link range of the second noise interpretation feature and the noise field penetration path;
calculating the maximum square value of the noise learning cost value of each interpretation feature member in the plurality of interpretation feature members, and determining a fourth noise learning cost value based on the maximum square values of the noise learning cost values of all the interpretation feature members, the noise field link range of the second noise interpretation feature and the noise field penetration path;
determining a first target noise learning cost value of the noise decision initialization model based on the third noise learning cost value and the fourth noise learning cost value;
performing feature analysis on the first example big-data sample acquisition event data based on the noise decision initialization model, outputting a first noise interpretation feature of the first example big-data sample acquisition event data;
determining a second target noise learning cost value for the noise decision initialization model based on the first noise interpretation features and the example noise localization element information;
and performing model parameter layer tuning and selection on the noise decision initialization model based on the first target noise learning cost value and the second target noise learning cost value.
5. The big data cleansing method for AI cloud computing training of claim 4, wherein the first noise interpretation features comprise a first forward interpretation feature and a first backward interpretation feature, the example noise localization element information comprises a noise field link range, a noise field penetration path, and a multi-party coupled noise term of a noise localization element of a sample acquisition target of the target sample acquisition instance in the first example big data sample acquisition event data;
the step of determining a second target noise learning cost value of the noise decision initialization model based on the first noise interpretation features and the example noise localization element information includes:
outputting a first noise learning cost value based on the first forward interpretation feature, a multi-party coupled noise term of a sample acquisition target of the target sample acquisition instance, and a quantity of the first example big data sample acquisition event data;
outputting a second noise learning cost value based on the first backward interpretation features, the number of the first example big data sample acquisition event data, the noise field link range of the noise localization element, and a noise field penetration path;
determining a second target noise learning cost value of the noise decision initialization model based on the first noise learning cost value and the second noise learning cost value.
6. The big data cleaning method for AI cloud computing training of claim 4, wherein the first target noise learning cost value comprises a third noise learning cost value and a fourth noise learning cost value and the second target noise learning cost value comprises a first noise learning cost value and a second noise learning cost value;
the step of performing model parameter layer tuning and selection on the noise decision initialization model based on the first target noise learning cost value and the second target noise learning cost value specifically includes:
acquiring a first cost evaluation index corresponding to the first noise learning cost value, a second cost evaluation index corresponding to the second noise learning cost value, a third cost evaluation index corresponding to the third noise learning cost value, and a fourth cost evaluation index corresponding to the fourth noise learning cost value;
performing comprehensive judgment on the learning cost value of the second target noise learning cost value and the first target noise learning cost value based on the first cost evaluation index, the second cost evaluation index, the third cost evaluation index and the fourth cost evaluation index, and outputting a target noise learning cost value;
and performing model parameter layer tuning and selection on the noise decision initialization model based on the target noise learning cost value.
7. The big data cleaning method for AI cloud computing training according to claim 2, wherein the noise acquisition feature point decision model includes a noise acquisition feature point parsing branch and a noise acquisition feature point aggregation branch;
the step of performing noise prediction on the big data sample acquisition event data based on the noise acquisition feature point decision model and outputting the noise acquisition feature point distribution of the big data sample acquisition event data specifically includes:
performing characteristic analysis on the big data sample acquisition event data based on the noise acquisition characteristic point analysis branch, and outputting a fuzzy noise acquisition characteristic point of the big data sample acquisition event data;
carrying out feature selection based on a penalty term on the fuzzy noise acquisition feature points based on the noise acquisition feature point aggregation branch, and outputting first noise acquisition feature points;
embedding the fuzzy noise acquisition feature points based on the noise acquisition feature point aggregation branch, outputting cost evaluation indexes corresponding to the fuzzy noise acquisition feature points, communicating feature relations of the fuzzy noise acquisition feature points based on the cost evaluation indexes, and outputting second noise acquisition feature points;
and aggregating the first noise acquisition characteristic points and the second noise acquisition characteristic points, and outputting the noise acquisition characteristic point distribution of the big data sample acquisition event data.
8. The big data cleaning method for AI cloud computing training according to any one of claims 1 to 7, wherein the noise acquisition feature point distribution includes a forward noise acquisition feature point and a backward noise acquisition feature point, the forward noise acquisition feature point includes a decision support degree that each sample acquisition event unit data in the big data sample acquisition event data is a multi-party coupled noise term of the sample acquisition target of the target sample acquisition instance, and the backward noise acquisition feature point includes a noise field link range and noise field penetration path data corresponding to each sample acquisition event unit data in the big data sample acquisition event data;
the step of determining big data collection cleaning decision information of the big data sample collection event data based on the noise collection characteristic point distribution specifically includes:
determining a multi-party coupled noise term for a sample acquisition target of the target sample acquisition instance in the big data sample acquisition event data based on the forward noise acquisition feature points;
outputting a noise positioning element of a sample acquisition target of the target sample acquisition instance in the big data sample acquisition event data based on the multi-party coupled noise term and on the noise field link range and the noise field penetration path data corresponding to the sample acquisition event unit data at the multi-party coupled noise term;
and outputting the noise positioning element of the sample acquisition target of the target sample acquisition example as the acquisition cleaning field distribution of the sample acquisition target of the target sample acquisition example.
9. The big data cleaning method for AI cloud computing training according to any one of claims 1 to 8, wherein the step of performing corresponding big data collection cleaning configuration for the AI cloud computing training node based on the big data collection cleaning decision information specifically includes:
acquiring cleaning service node information distributed in each relevant acquisition cleaning field in a corresponding target big data acquisition cleaning control model based on the big data acquisition cleaning decision information;
determining a cleaning control path distributed by each relevant acquisition cleaning field based on cleaning service node information distributed by each relevant acquisition cleaning field;
performing relevant node communication on cleaning control paths distributed in each relevant acquisition cleaning field, and outputting a target cleaning control path of the target big data acquisition cleaning control model;
performing model control instruction distribution on the target big data acquisition and cleaning control model based on the target cleaning control path, and outputting at least one model control instruction of the target big data acquisition and cleaning control model;
and performing corresponding big data acquisition cleaning configuration on the AI cloud computing training node based on at least one model control instruction of the target big data acquisition cleaning control model.
10. A big data acquisition system, comprising at least one storage medium and at least one processor, the at least one storage medium for storing computer instructions; the at least one processor is configured to execute the computer instructions to perform the big data cleaning method for AI cloud computing training of any of claims 1-9.
CN202210786105.9A 2022-07-06 2022-07-06 Big data cleaning method and big data acquisition system for AI cloud computing training Active CN115145904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210786105.9A CN115145904B (en) 2022-07-06 2022-07-06 Big data cleaning method and big data acquisition system for AI cloud computing training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210786105.9A CN115145904B (en) 2022-07-06 2022-07-06 Big data cleaning method and big data acquisition system for AI cloud computing training

Publications (2)

Publication Number Publication Date
CN115145904A true CN115145904A (en) 2022-10-04
CN115145904B CN115145904B (en) 2023-04-07

Family

ID=83411354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210786105.9A Active CN115145904B (en) 2022-07-06 2022-07-06 Big data cleaning method and big data acquisition system for AI cloud computing training

Country Status (1)

Country Link
CN (1) CN115145904B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016101690A1 (en) * 2014-12-22 2016-06-30 国家电网公司 Time sequence analysis-based state monitoring data cleaning method for power transmission and transformation device
CN109657800A (en) * 2018-11-30 2019-04-19 清华大学深圳研究生院 Reinforcement learning model optimization method and device based on parametric noise
CN112711578A (en) * 2020-12-30 2021-04-27 陈静 Big data denoising method for cloud computing service and cloud computing financial server
WO2021164232A1 (en) * 2020-02-17 2021-08-26 平安科技(深圳)有限公司 User identification method and apparatus, and device and storage medium
WO2021180062A1 (en) * 2020-03-09 2021-09-16 华为技术有限公司 Intention identification method and electronic device
CN113505120A (en) * 2021-09-10 2021-10-15 西南交通大学 Double-stage noise cleaning method for large-scale face data set
US20210357776A1 (en) * 2020-05-13 2021-11-18 International Business Machines Corporation Data-analysis-based, noisy labeled and unlabeled datapoint detection and rectification for machine-learning
US20210397895A1 (en) * 2020-06-23 2021-12-23 International Business Machines Corporation Intelligent learning system with noisy label data
US20220188645A1 (en) * 2020-12-16 2022-06-16 Oracle International Corporation Using generative adversarial networks to construct realistic counterfactual explanations for machine learning models
CN114691665A (en) * 2022-04-13 2022-07-01 辽源市讯展网络科技有限公司 Big data analysis-based acquisition noise point mining method and big data acquisition system
CN114691664A (en) * 2022-04-13 2022-07-01 宁夏沸蓝科技发展有限公司 AI prediction-based intelligent scene big data cleaning method and intelligent scene system
CN114697128A (en) * 2022-04-13 2022-07-01 石家庄汇勤网络科技有限公司 Big data denoising method and big data acquisition system through artificial intelligence decision

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
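Randal E. Bryant et al.: "Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science, and Society", A White Paper Prepared for the Computing Community Consortium Committee of the Computing Research Association *
Salvador Garcia et al.: "Big data preprocessing: methods and prospects", Big Data Analytics *
Zhou Baojian (周宝建): "Simulation study of a personal credit data analysis model based on cloud computing", Bulletin of Science and Technology (科技通报) *
Zhang Qin (张琴) et al.: "Application of support vector learning machines in point cloud denoising", Computer Technology and Development (计算机技术与发展) *
Li Xingnan (李星南) et al.: "Cleaning method for power operation and maintenance data based on the isolation forest algorithm and the BP neural network algorithm", Electrotechnical Application (电气应用) *
Li Ying (李英) et al.: "Differential-privacy-preserving stochastic gradient descent algorithm for deep neural network training data", Computer Applications and Software (计算机应用与软件) *
Wang Haitao (汪海涛) et al.: "Resampling method based on imbalanced big data sample sets and its application", Modern Computer (Professional Edition) (现代计算机(专业版)) *
Cheng Yuanqi (程元启) et al.: "Software defect prediction technique based on fuzzy support vector machines", Computer Engineering and Design (计算机工程与设计) *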

Also Published As

Publication number Publication date
CN115145904B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111124840A (en) Method and device for predicting alarm in business operation and maintenance and electronic equipment
CN113704771B (en) Service vulnerability mining method based on artificial intelligence analysis and big data mining system
CN115048370B (en) Artificial intelligence processing method for big data cleaning and big data cleaning system
CN111340233B (en) Training method and device of machine learning model, and sample processing method and device
CN114880314B (en) Big data cleaning decision-making method applying artificial intelligence strategy and AI processing system
CN111260082B (en) Spatial object motion trail prediction model construction method based on neural network
CN112860676B (en) Data cleaning method applied to big data mining and business analysis and cloud server
CN113687821A (en) Intelligent code splicing method based on graphic visualization
KR20200076323A (en) Apparatus and method for multi-model parallel execution automation and verification on digital twin
CN116244647A (en) Unmanned aerial vehicle cluster running state estimation method
CN114647790A (en) Big data mining method and cloud AI (Artificial Intelligence) service system applied to behavior intention analysis
CN112115996B (en) Image data processing method, device, equipment and storage medium
CN115145904B (en) Big data cleaning method and big data acquisition system for AI cloud computing training
CN113726558A (en) Network equipment flow prediction system based on random forest algorithm
CN113407837A (en) Intelligent medical big data processing method based on artificial intelligence and intelligent medical system
CN114978765B (en) Big data processing method for information attack defense and AI attack defense system
KR101827124B1 (en) System and Method for recognizing driving pattern of driver
CN115906927B (en) Data access analysis method and system based on artificial intelligence and cloud platform
CN115712843B (en) Data matching detection processing method and system based on artificial intelligence
CN116186016A (en) Training data cleaning method and system for AI training task
CN115422486B (en) Cloud service online page optimization method based on artificial intelligence and big data system
CN113704751B (en) Vulnerability repairing method based on artificial intelligence decision and big data mining system
CN114780967A (en) Mining evaluation method based on big data vulnerability mining and AI vulnerability mining system
CN113098884A (en) Network security monitoring method based on big data, cloud platform system and medium
CN116306574B (en) Big data mining method and server applied to intelligent wind control task analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 2023-01-05
Address after: 2042 Shaoguang Street, Chenggong District, Kunming, Yunnan 650000
Applicant after: Yang Huanrong
Address before: No. 4607 Canal Avenue, Taierzhuang, Zaozhuang City, Shandong Province, 277400
Applicant before: Zaozhuang Hongyu Digital Technology Co.,Ltd.

TA01 Transfer of patent application right

Effective date of registration: 2023-03-13
Address after: 100094 Unit -105, Floor 1, Building 10, Xinruijia Garden, Shangzhuang, Haidian District, Beijing
Applicant after: Beijing Zhengyuanda Technology Co.,Ltd.
Address before: 2042 Shaoguang Street, Chenggong District, Kunming, Yunnan 650000
Applicant before: Yang Huanrong

GR01 Patent grant