CN114691665A - Big data analysis-based acquisition noise point mining method and big data acquisition system - Google Patents

Big data analysis-based acquisition noise point mining method and big data acquisition system

Info

Publication number
CN114691665A
Authority
CN
China
Prior art keywords
acquisition
collection
routing
field
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210381584.6A
Other languages
Chinese (zh)
Other versions
CN114691665B (en)
Inventor
徐信福
苏健明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongkun Beijing Aviation Equipment Co ltd
Original Assignee
Liaoyuan Xunzhan Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoyuan Xunzhan Network Technology Co ltd
Priority to CN202210381584.6A
Publication of CN114691665A
Application granted
Publication of CN114691665B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2272 Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of the application provide a big data analysis-based acquisition noise point mining method and a big data acquisition system. The method mines the redundant acquisition field of each of a plurality of training redundant feedback data nodes and the sample acquisition routing space covered by a training source data acquisition trajectory, and then mines, one by one, the sample acquisition routing field of each sample acquisition routing node in that space. The redundant acquisition fields of the training redundant feedback data nodes and the sample acquisition routing fields of the sample acquisition routing space are then combined to determine an acquisition noise point related to the training source data acquisition trajectory and a plurality of connected acquisition noise points related to that acquisition noise point. Because the redundant acquisition feedback flow and the sample acquisition routing flow are combined, the acquisition noise points can be traced to optimize the subsequent big data acquisition flow, which improves the reliability of that optimization.

Description

Big data analysis-based acquisition noise point mining method and big data acquisition system
Technical Field
The application relates to the technical field of big data acquisition, in particular to a method for mining an acquisition noise point based on big data analysis and a big data acquisition system.
Background
With the development of artificial intelligence technology, artificial intelligence models learn from training sample data so that they can be applied to online business requirements. This requires collecting a large amount of training sample data from extensive big data sources, and the reliability of the training sample data directly affects the reliability of training. In the related art, acquisition noise points can cause some training redundant feedback data nodes to be generated. Existing noise point mining methods screen mainly on simple preset field rules, so it is difficult to mine acquisition noise points effectively and reasonably for big data acquisition flow optimization. This reduces the reliability of that optimization and ultimately degrades sample learning and training.
Disclosure of Invention
In order to overcome at least the above disadvantages in the prior art, the present application aims to provide a method for mining a collection noise point based on big data analysis and a big data collection system.
In a first aspect, the present application provides a method for mining an acquisition noise point based on big data analysis, which is applied to a big data acquisition system, and the method includes:
determining a plurality of training redundant feedback data nodes from a training redundant feedback flow by combining a training source data acquisition track of a big data acquisition flow requested by an AI training task issued by an AI training server, and then analyzing redundant acquisition fields of each training redundant feedback data node in the plurality of training redundant feedback data nodes;
mining the sample acquisition routing space covered by the training source data acquisition trajectory, and then mining, one by one, the sample acquisition routing field of each sample acquisition routing node in the sample acquisition routing space;
and determining, by combining the redundant acquisition fields of the training redundant feedback data nodes and the sample acquisition routing fields of the sample acquisition routing space, an acquisition noise point related to the training source data acquisition trajectory and a plurality of connected acquisition noise points related to the acquisition noise point, and optimizing the big data acquisition flow of the AI training server by combining the acquisition noise point and the plurality of connected acquisition noise points.
In a second aspect, an embodiment of the present application further provides a big data analysis-based acquisition noise point mining system, which includes a big data acquisition system and a plurality of AI training servers communicatively connected to the big data acquisition system;
the big data acquisition system is used for:
determining a plurality of training redundant feedback data nodes from a training redundant feedback flow by combining a training source data acquisition track of a big data acquisition flow requested by an AI training task issued by an AI training server, and then analyzing redundant acquisition fields of each training redundant feedback data node in the plurality of training redundant feedback data nodes;
mining the sample acquisition routing space covered by the training source data acquisition trajectory, and then mining, one by one, the sample acquisition routing field of each sample acquisition routing node in the sample acquisition routing space;
and determining, by combining the redundant acquisition fields of the training redundant feedback data nodes and the sample acquisition routing fields of the sample acquisition routing space, an acquisition noise point related to the training source data acquisition trajectory and a plurality of connected acquisition noise points related to the acquisition noise point, and optimizing the big data acquisition flow of the AI training server by combining the acquisition noise point and the plurality of connected acquisition noise points.
In a third aspect, an embodiment of the present application further provides a big data acquisition system, which includes a processor and a machine-readable storage medium. A computer program is stored in the machine-readable storage medium, and the computer program is loaded and executed by the processor to implement the big data analysis-based acquisition noise point mining method according to any of the above aspects.
In any of the above aspects, the embodiment mines the redundant acquisition field of each of the training redundant feedback data nodes and the sample acquisition routing space covered by the training source data acquisition trajectory, and then mines, one by one, the sample acquisition routing field of each sample acquisition routing node in the sample acquisition routing space. The redundant acquisition fields of the training redundant feedback data nodes and the sample acquisition routing fields of the sample acquisition routing space are then combined to determine an acquisition noise point related to the training source data acquisition trajectory and a plurality of connected acquisition noise points related to that acquisition noise point. Because the redundant acquisition feedback flow and the sample acquisition routing flow are combined, the acquisition noise points can be traced for subsequent big data acquisition flow optimization, which improves the reliability of that optimization.
Drawings
Fig. 1 is a schematic flowchart of a big data analysis-based acquisition noise point mining method according to an embodiment of the present disclosure;
Fig. 2 is a block diagram of a big data acquisition system for implementing the above big data analysis-based acquisition noise point mining method according to an embodiment of the present disclosure.
Detailed Description
The architecture of the big data analysis-based acquisition noise point mining system 10 provided by an embodiment of the present application is described below. The system 10 may include a big data acquisition system 100 and an AI training server 200 communicatively connected to the big data acquisition system 100. The big data acquisition system 100 and the AI training server 200 may cooperatively perform the big data analysis-based acquisition noise point mining method described in the following method embodiments. For the specific steps performed by the big data acquisition system 100 and the AI training server 200, refer to the detailed description of the method embodiments below.
The big data analysis-based acquisition noise point mining method provided in this embodiment may be executed by the big data acquisition system 100 and is described in detail below with reference to Fig. 1.
Process110: Determine a plurality of training redundant feedback data nodes from the training redundant feedback flow, in combination with the training source data acquisition trajectory of the big data acquisition flow requested by an AI training task issued by an AI training server, and then analyze the redundant acquisition field of each of the training redundant feedback data nodes.
In some embodiments, when the AI training server needs to collect training sample data, it may issue a corresponding AI training task. The AI training task requests execution of a corresponding big data acquisition flow, so that training source data is acquired based on that flow and a corresponding training source data acquisition trajectory is obtained. A training redundancy feedback flow may be added in the subsequent training process. A plurality of training redundant feedback data nodes, that is, data nodes that may generate redundant, useless training data, may be determined in this flow to guide subsequent big data acquisition optimization.
In some embodiments, assuming that K training redundant feedback data nodes are obtained from the training source data acquisition trajectory, the redundant acquisition fields of the training redundant feedback data nodes may be extracted as L1, L2, …, LK. That is, each training redundant feedback data node corresponds to one redundant acquisition field.
The redundant acquisition field may represent the specific redundant acquisition object of each training redundant feedback data node. For example, if the specific redundant acquisition object of a training redundant feedback data node is a subscription field of an online subscription process, the redundant acquisition field may be a subscription noise field feature related to that subscription field.
Process120: Mine the sample acquisition routing space covered by the training source data acquisition trajectory, and then mine, one by one, the sample acquisition routing field of each sample acquisition routing node in the sample acquisition routing space.
For example, each training source data acquisition trajectory may generate, in real time, an associated sample acquisition routing space, which may be used to characterize information about the sample acquisition routing nodes associated with the trajectory itself. The trajectory and its routing space both serve as configurations for generating acquisition noise points. Each sample acquisition routing node in the sample acquisition routing space is mapped, according to a routing field index, to a sample acquisition routing field in a separate routing field index space. Assuming that the sample acquisition routing space includes P sample acquisition routing nodes, the routing field index yields the sample acquisition routing fields R1, R2, …, RP. Each sample acquisition routing field and each redundant acquisition field may have field variables with a field association relationship. Thus, through Process110 and Process120, the configuration vectors of the two different modalities are mapped into the same routing field index space.
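As a minimal sketch of the mapping just described, the routing field index can be pictured as a lookup table from node identifiers to fixed-length vectors. The names `ROUTING_FIELD_INDEX` and `map_nodes_to_fields`, the node identifiers, and the three-dimensional vectors are illustrative assumptions and do not come from the application.

```python
# Hypothetical routing field index: node identifier -> routing field vector.
# The identifiers and values are made up for illustration only.
ROUTING_FIELD_INDEX = {
    "node_a": [0.5, 0.0, 1.0],
    "node_b": [0.0, 1.0, 0.5],
    "node_c": [1.0, 0.5, 0.0],
}


def map_nodes_to_fields(node_ids, routing_field_index):
    """Map P sample acquisition routing nodes to P routing fields R1..RP."""
    missing = [n for n in node_ids if n not in routing_field_index]
    if missing:
        raise KeyError(f"no routing field for nodes: {missing}")
    # Copy each vector so callers cannot mutate the index by accident.
    return [list(routing_field_index[n]) for n in node_ids]
```

Calling `map_nodes_to_fields(["node_a", "node_c"], ROUTING_FIELD_INDEX)` returns one vector per node, in the order of the nodes in the routing space.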
Process130: Connect the redundant acquisition fields of the training redundant feedback data nodes with the sample acquisition routing fields of the sample acquisition routing space to generate a first field connectivity matrix.
The embodiment may further include a step of obtaining a first redundant field connectivity matrix and a first redundant routing path feature related to a first redundant routing path, and obtaining a second redundant field connectivity matrix and a second redundant routing path feature related to a second redundant routing path. The first redundant routing path may be an initial redundant routing path used to express the first field connectivity matrix, and the second redundant routing path may be a redundant routing path used to express the separation between the first field connectivity matrices of different modalities.
Further, in Process130, the first redundant field connectivity matrix, the redundant acquisition fields of the training redundant feedback data nodes, the second redundant field connectivity matrix, and the sample acquisition routing fields of the sample acquisition routing space are connected in that order to generate the first field connectivity matrix.
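The ordering in Process130 can be sketched as stacking rows into one matrix. This is only an illustration under stated assumptions: each of the two redundant field connectivity matrices is treated as a single row vector, all rows share one dimension, and the function name `build_first_connectivity_matrix` is hypothetical.

```python
def build_first_connectivity_matrix(first_redundant, redundant_fields,
                                    second_redundant, routing_fields):
    """Stack rows in the order given for Process130.

    Order: first redundant row, redundant acquisition fields L1..LK,
    second redundant row, sample acquisition routing fields R1..RP.
    """
    rows = [first_redundant, *redundant_fields, second_redundant, *routing_fields]
    dim = len(first_redundant)
    if any(len(row) != dim for row in rows):
        raise ValueError("all fields must share the same dimension")
    # Return copies so the inputs stay untouched.
    return [list(row) for row in rows]
```

With K redundant acquisition fields and P routing fields, the result has K + P + 2 rows under these assumptions.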
Process140: Determine a first routing path of the redundant acquisition trigger point associated with each redundant acquisition field in the training source data acquisition trajectory, and determine a second routing path of the sample acquisition routing node associated with each sample acquisition routing field in the sample acquisition routing space.
Further, for each redundant acquisition field, the training redundant feedback data node in the training source data acquisition trajectory to which it corresponds is determined, and that node may be regarded as the first routing path. For each sample acquisition routing field, the sample acquisition routing nodes in the sample acquisition routing space to which it corresponds are determined, and those nodes may be regarded as the second routing path.
Process150: Perform secondary mapping on each matrix unit of the first field connectivity matrix, in combination with the first routing path related to each redundant acquisition field and the second routing path related to each sample acquisition routing field, to generate a second field connectivity matrix.
Further, features are embedded into each matrix unit of the first field connectivity matrix according to the first redundant routing path feature, the first routing path related to each redundant acquisition field, the second redundant routing path feature, and the second routing path related to each sample acquisition routing field, to generate the second field connectivity matrix.
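One way to picture the secondary mapping of Process150 is to add a routing path feature vector to each row of the first field connectivity matrix. The application does not state how the features are embedded, so the element-wise addition below, the function name `secondary_mapping`, and the shape of `path_features` are all assumptions made for illustration.

```python
def secondary_mapping(first_matrix, path_features):
    """Embed one routing path feature vector into each matrix row.

    Assumption: embedding is element-wise addition, and path_features
    holds exactly one vector per row of first_matrix.
    """
    if len(first_matrix) != len(path_features):
        raise ValueError("need one routing path feature per matrix row")
    second_matrix = []
    for row, feature in zip(first_matrix, path_features):
        if len(row) != len(feature):
            raise ValueError("row and feature dimensions differ")
        second_matrix.append([a + b for a, b in zip(row, feature)])
    return second_matrix
```

Under this assumption the second field connectivity matrix keeps the shape of the first one, and each row also carries information about where its field sits in the trajectory or routing space.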
Process160: Determine, in combination with the second field connectivity matrix, an acquisition noise point related to the training source data acquisition trajectory and a plurality of connected acquisition noise points related to the acquisition noise point, and optimize the big data acquisition flow of the AI training server in combination with the acquisition noise point and the plurality of connected acquisition noise points.
Further, in Process160, the acquisition noise points associated with the training source data acquisition trajectory may first be generated in combination with the second field connectivity matrix. The acquisition noise points obtained in this way may be regarded as direct acquisition noise points. The related connected acquisition noise points can then be determined by combining the direct acquisition noise points with a preset past noise connectivity record library. In the process, described in the following embodiments, of generating the acquisition noise points associated with the training source data acquisition trajectory, the acquisition noise points involved are direct acquisition noise points.
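A minimal sketch of the second half of this step, assuming the past noise connectivity record library behaves like a dictionary from a direct acquisition noise point to the noise points historically connected to it. The dictionary layout, the identifiers, and the function name `connected_noise_points` are hypothetical.

```python
def connected_noise_points(direct_points, record_library):
    """Look up connected acquisition noise points for each direct one.

    record_library is assumed to map a noise point identifier to a list
    of identifiers that were connected to it in past records.
    """
    connected = {}
    for point in direct_points:
        # A direct noise point with no history simply has no connections.
        connected[point] = list(record_library.get(point, []))
    return connected
```

For example, with `record_library = {"np1": ["np7", "np9"]}`, the direct point `"np1"` yields `["np7", "np9"]`, and a direct point that is absent from the library yields an empty list.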
Based on the above steps, the embodiment of the application mines the redundant acquisition field of each of the training redundant feedback data nodes and the sample acquisition routing space covered by the training source data acquisition trajectory, and then mines, one by one, the sample acquisition routing field of each sample acquisition routing node in the sample acquisition routing space. The redundant acquisition fields of the training redundant feedback data nodes and the sample acquisition routing fields of the sample acquisition routing space are then combined to determine the acquisition noise point related to the training source data acquisition trajectory and the plurality of connected acquisition noise points related to that acquisition noise point. Because the redundant acquisition feedback flow and the sample acquisition routing flow are combined, the acquisition noise points can be traced for subsequent big data acquisition flow optimization, which improves the reliability of that optimization.
In some examples, the acquisition noise points associated with the training source data acquisition trajectory may be generated individually. For the process of determining, in combination with the second field connectivity matrix, an acquisition noise point related to the training source data acquisition trajectory, an embodiment of the present application provides an acquisition noise point output method, implemented as follows.
Process210: Load the second field connectivity matrix into an acquisition noise point analysis AI unit, and output the current sample acquisition routing node of the acquisition noise point. When Process210 is executed for the first time, the current sample acquisition routing node obtained is the trigger routing node.
In some examples, the current sample acquisition routing node of the acquisition noise point may be obtained through the following steps. The acquisition noise point analysis AI unit may include a self-coding branch and a noise point analysis branch. First, the second field connectivity matrix is loaded into the self-coding branch, which outputs the self-coding distribution of the current sample acquisition routing node of the acquisition noise point. Then, the noise point analysis branch converts this self-coding distribution into a first noise decision thermodynamic diagram (a heat map), which contains the trigger heat value of every sample acquisition routing node in the sample acquisition routing node sequence. Finally, the current sample acquisition routing node of the acquisition noise point is output in combination with the first noise decision thermodynamic diagram.
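The conversion performed by the noise point analysis branch can be illustrated with a dot-product scoring followed by a softmax. The application does not specify this computation, so the linear scoring, the softmax normalisation, the per-node weight vectors, and the name `noise_decision_heat_map` are all assumptions; the sketch only shows how an encoding could become one heat value per candidate node.

```python
import math


def noise_decision_heat_map(encoding, node_weights):
    """Turn an encoding vector into one heat value per candidate node.

    Assumption: each node has a weight vector, its score is the dot
    product with the encoding, and scores are normalised with softmax.
    """
    if not node_weights:
        raise ValueError("at least one candidate node is required")
    scores = {
        node: sum(e * w for e, w in zip(encoding, weights))
        for node, weights in node_weights.items()
    }
    peak = max(scores.values())  # subtract the peak for numerical stability
    exps = {node: math.exp(score - peak) for node, score in scores.items()}
    total = sum(exps.values())
    return {node: value / total for node, value in exps.items()}
```

Under this assumption the heat values are positive and sum to 1, so they can be compared across nodes and multiplied along a sequence of nodes.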
Further, the thermodynamic characteristic region with the largest heat value in the first noise decision thermodynamic diagram is selected, and the sample acquisition routing node related to that region is used as the current sample acquisition routing node of the acquisition noise point. In this way, one acquisition noise point associated with the training source data acquisition trajectory is ultimately generated.
Alternatively, a plurality of thermodynamic characteristic regions with larger heat values may be selected in the first noise decision thermodynamic diagram, and the plurality of sample acquisition routing nodes related to those regions may all be used as the associated routing node set of the acquisition noise point. In this way, a plurality of acquisition noise points associated with the training source data acquisition trajectory are ultimately generated.
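The two selection strategies above (the single hottest node, or several of the hottest nodes) can be sketched as follows, assuming the thermodynamic diagram is available as a dictionary from node identifier to heat value. The function names and the dictionary representation are illustrative assumptions.

```python
def hottest_node(heat_map):
    """Return the node with the largest heat value (single noise point case)."""
    return max(heat_map, key=heat_map.get)


def top_nodes(heat_map, r):
    """Return the r nodes with the largest heat values, hottest first."""
    if r < 1:
        raise ValueError("r must be at least 1")
    return sorted(heat_map, key=heat_map.get, reverse=True)[:r]
```

`hottest_node` corresponds to generating one acquisition noise point, while `top_nodes` yields the associated routing node set used when several acquisition noise points are generated.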
Further, outputting the trigger routing node of the acquisition noise point in combination with the first noise decision thermodynamic diagram may include: sorting the heat values in the first noise decision thermodynamic diagram; and determining the top R heat values in the sorted result, and taking the top R sample acquisition routing nodes as the associated routing node set of the trigger routing node of the acquisition noise point. Assume that an associated routing node set of R trigger routing nodes is selected. Loading the adjusted second field connectivity matrix into the self-coding branch and repeating the above operations until the AI output information of the acquisition noise point is obtained may then include: generating, in sequence, the associated routing node sets of the other sample acquisition routing nodes in combination with the associated routing node set of the trigger routing node; and determining a specified number of acquisition noise points in combination with the associated routing node sets of the sample acquisition routing nodes in the acquisition noise points. That is, each trigger routing node in the associated routing node set is loaded into the acquisition noise point analysis AI unit, which outputs R associated routing nodes for the second sample acquisition routing node. In total, R × R combinations of a trigger routing node and a second sample acquisition routing node are therefore obtained.
For example, the R combinations of a trigger routing node and a second sample acquisition routing node with the largest combined heat values may be extracted from the R × R associated routing node sets, where the combined heat value is derived from the heat value of the trigger routing node and the trigger heat value of the second sample acquisition routing node (for example, the product of the two values). These combinations are loaded into the acquisition noise point analysis AI unit as the sample acquisition routing nodes already obtained, to generate the associated routing node set of each subsequent sample acquisition routing node. Finally, R acquisition noise points associated with the training source data acquisition trajectory may be determined.
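The expand-and-prune step described above resembles one step of a beam search with width R. The sketch below assumes that each partial noise point is a tuple of node identifiers with a combined heat value, that `next_heat` is a hypothetical callable standing in for the acquisition noise point analysis AI unit, and that combined heat values are products, as in the example given in the text.

```python
def expand_candidates(partials, next_heat, r):
    """Extend each partial node sequence and keep the r best combinations.

    partials:  list of (node_sequence_tuple, combined_heat_value)
    next_heat: callable(node_sequence_tuple) -> {next_node: heat_value}
               (hypothetical stand-in for the analysis AI unit)
    r:         number of combinations to keep
    """
    candidates = []
    for sequence, heat in partials:
        for node, node_heat in next_heat(sequence).items():
            # Combined heat value: product of the two heat values.
            candidates.append((sequence + (node,), heat * node_heat))
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:r]
```

Applying this function repeatedly, starting from the R trigger routing nodes, keeps R candidate sequences at every step and ends with R acquisition noise points.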
Process220: Determine whether the current sample acquisition routing node of the acquisition noise point obtained in Process210 is the AI output information. If the determination in Process220 is negative, processing proceeds to Process230. If the determination is affirmative, the process ends.
Process230: Determine the sample acquisition routing field of the current sample acquisition routing node of the acquisition noise point. Here, as in Process120 described in the foregoing embodiment, the generated current sample acquisition routing node is mapped, according to the routing field index, to a sample acquisition routing field in a separate vector space.
Process240: Optimize the second field connectivity matrix based on field aggregation, in combination with the sample acquisition routing field of the current sample acquisition routing node of the acquisition noise point and the routing path of that node. The following description takes the case where the current sample acquisition routing node is the trigger routing node as an example; the same processing applies when the current node is any other sample acquisition routing node. Optimizing the second field connectivity matrix based on field aggregation, in combination with the sample acquisition routing field of the trigger routing node of the acquisition noise point and its routing path in the acquisition noise point, may include the following. First, the redundant acquisition fields of the training redundant feedback data nodes, the sample acquisition routing fields of the sample acquisition routing space, and the sample acquisition routing field of the trigger routing node of the acquisition noise point are connected to optimize the first field connectivity matrix. Second, secondary mapping is performed on each matrix unit of the optimized first field connectivity matrix, in combination with the first routing path related to each redundant acquisition field, the second routing path related to each sample acquisition routing field in the sample acquisition routing space, and the second routing path of the trigger routing node of the acquisition noise point, to optimize the second field connectivity matrix.
In some examples, optimizing the second field connectivity matrix based on field aggregation, in combination with the sample acquisition routing field of the trigger routing node of the acquisition noise point and its routing path in the acquisition noise point, may include the following. The redundant acquisition fields of the training redundant feedback data nodes, the sample acquisition routing fields of the sample acquisition routing space, and the sample acquisition routing field of the trigger routing node of the acquisition noise point are connected to optimize the first field connectivity matrix. Then, online service application feature labeling is performed on each matrix unit of the first field connectivity matrix, in combination with the online service application related to each redundant acquisition field, the online service application in which the sample acquisition routing node related to each sample acquisition routing field of the sample acquisition routing space is located, and the online service application in which the sample acquisition routing node related to the trigger routing node of the acquisition noise point is located, to optimize the first field connectivity matrix. Alternatively, the same online service application feature labeling is performed on each matrix unit of the second field connectivity matrix to optimize the second field connectivity matrix.
Processing then returns to Process210, where the optimized second field connectivity matrix is loaded into the acquisition noise point analysis AI unit. The second field connectivity matrix continues to be optimized with the sample acquisition routing field of each newly determined sample acquisition routing node until the AI output information of the acquisition noise point is obtained.
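The loop formed by Process210 to Process240 can be summarised in a short sketch. It assumes that "AI output information" acts as a terminal marker that ends generation; the three callables (`analyze`, `encode_field`, `update_matrix`), the `end_marker` value, and the `max_steps` guard are hypothetical stand-ins, not interfaces defined by the application.

```python
def mine_acquisition_noise_point(matrix, analyze, encode_field, update_matrix,
                                 end_marker="<end>", max_steps=50):
    """Generate the routing nodes of one acquisition noise point step by step.

    analyze(matrix)                   -> next node or end_marker   (Process210)
    encode_field(node)                -> routing field of the node (Process230)
    update_matrix(matrix, field, pos) -> optimized matrix          (Process240)
    """
    nodes = []
    for _ in range(max_steps):  # guard against a model that never terminates
        node = analyze(matrix)
        if node == end_marker:  # Process220: terminal output reached
            break
        nodes.append(node)
        field = encode_field(node)
        matrix = update_matrix(matrix, field, len(nodes) - 1)
    return nodes
```

The first node returned by `analyze` plays the role of the trigger routing node, and every later call sees a matrix that already includes the fields of the nodes determined so far.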
The foregoing embodiments describe in detail the big data analysis-based acquisition noise point mining method according to the embodiments of the present application. As can be seen, in this method the acquisition noise points related to the training source data acquisition trajectory are generated by fusing the vectors of the different configured modalities.
Further, the embodiment of the present application may also include the following. After the AI output information of the acquisition noise point is output, one self-encoding distribution is selected as a third self-encoding distribution and another as a first self-encoding distribution from among the plurality of self-encoding distributions generated by the self-encoding branch. The third self-encoding distribution may be the self-encoding distribution associated with a first redundant field connectivity matrix that precedes the redundant acquisition fields of the training redundant feedback data nodes, and the first self-encoding distribution may be the self-encoding distribution associated with a second redundant field connectivity matrix located between the redundant acquisition fields of the training redundant feedback data nodes and the sample acquisition routing fields of the sample acquisition routing space. Then, a degree of match between the third self-encoding distribution and the first self-encoding distribution is determined, for example by using a feature distance. The degree of match is thus a value between -1 and 1, and the closer the value is to 1, the higher the degree of match. Finally, whether the training source data acquisition track matches the sample acquisition routing space is determined in combination with the degree of match.
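As a minimal sketch of how such a degree of match might be computed, the following Python snippet assumes that each self-encoding distribution is available as a fixed-length numeric vector and that the feature distance is cosine similarity, which is one common measure whose range is -1 to 1. The description does not name the measure, so the function name matching_degree and the choice of cosine similarity are illustrative assumptions only.

```python
import math
from typing import Sequence


def matching_degree(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed measure).

    The result lies in [-1, 1]; values closer to 1 mean a higher match.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector carries no direction, so report "no match".
        return 0.0
    return dot / (norm_a * norm_b)
```

Under these assumptions, the vector standing for the third self-encoding distribution and the vector standing for the first self-encoding distribution would be passed as a and b, and the result compared against whatever threshold the embodiment sets.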
In addition, the matching degree between the collection noise points and the sample collection routing space can be further analyzed. When a plurality of acquisition noise points related to the training source data acquisition track are generated, the generated acquisition noise points can be labeled and sorted according to their matching degree with the sample acquisition routing space, so that acquisition noise points with a low matching degree are labeled.
Further, this embodiment may also include performing the following steps for each of a specified number of acquisition noise points. After the AI output information of the acquisition noise point is output, one self-encoding distribution is selected as a first self-encoding distribution and another as a second self-encoding distribution from among the plurality of self-encoding distributions generated by the self-encoding branch. The first self-encoding distribution may be the self-encoding distribution related to a second redundancy field connection matrix located between the redundancy acquisition fields of the training redundancy feedback data nodes and the sample acquisition routing fields of the sample acquisition routing space, and the second self-encoding distribution may be the self-encoding distribution related to a second redundancy field connection matrix located between the sample acquisition routing fields of the sample acquisition routing space and the sample acquisition routing fields of the acquisition noise point. Then, a degree of match between the first self-encoding distribution and the second self-encoding distribution is determined, for example by using a feature distance. The matching degree is a numerical value between -1 and 1, and the closer the value is to 1, the higher the matching degree. If the maximum matching degree is not greater than a set matching degree threshold, it is determined that no acquisition noise point information related to the training source data acquisition track exists. In addition, the higher the matching degree, the higher the accuracy of the acquisition noise point.
The matching degree between the sample collection routing space and the collection noise points indicates that the template collection noise points and the target collection noise points can be distinguished to a certain extent. By filtering the generated and labeled collection noise points according to this matching degree, collection noise points with a low matching degree are filtered out and labeled, which improves the accuracy of the collection noise points.
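The sorting, labeling and filtering described above can be illustrated with a short Python sketch. It assumes that a matching degree has already been computed for each generated noise point and that a single threshold separates kept points from low-match points; the description does not state how the low-match cutoff is chosen, so the reuse of one threshold, the function name filter_noise_points and the identifiers in the example are hypothetical.

```python
from typing import Dict, List, Tuple


def filter_noise_points(
    scores: Dict[str, float], threshold: float
) -> Tuple[List[str], List[str]]:
    """Sort noise points by matching degree and split them at a threshold.

    Returns (kept, low_match). If even the best score does not exceed the
    threshold, nothing is kept, mirroring the rule that no noise point
    information exists in that case.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] <= threshold:
        return [], [name for name, _ in ranked]
    kept = [name for name, score in ranked if score > threshold]
    low_match = [name for name, score in ranked if score <= threshold]
    return kept, low_match


if __name__ == "__main__":
    # Hypothetical identifiers and scores, for illustration only.
    kept, low = filter_noise_points({"n1": 0.9, "n2": 0.2, "n3": 0.6}, 0.5)
    print(kept, low)
```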
In addition, this embodiment may further use the matching degree between the training source data acquisition track and the acquisition noise point, together with the matching degree between the sample acquisition routing space and the acquisition noise point, to find acquisition noise points generated only in combination with the training source data acquisition track.
Further, this embodiment may also include performing the following steps for each of a specified number of acquisition noise points. After the AI output information of the acquisition noise point is output, one self-coding distribution is selected as a third self-coding distribution, one as a first self-coding distribution, and one as a second self-coding distribution from among the plurality of self-coding distributions generated by the self-coding branch. The third self-coding distribution may be the self-coding distribution related to a first redundancy field connection matrix that precedes the redundancy acquisition fields of the training redundancy feedback data nodes; the first self-coding distribution may be the self-coding distribution related to a second redundancy field connection matrix located between the redundancy acquisition fields of the training redundancy feedback data nodes and the sample acquisition routing fields of the sample acquisition routing space; and the second self-coding distribution may be the self-coding distribution related to a second redundancy field connection matrix located between the sample acquisition routing fields of the sample acquisition routing space and the sample acquisition routing fields of the acquisition noise point. Then, the matching degree between the third self-coding distribution and the first self-coding distribution is determined, and the matching degree between the third self-coding distribution and the second self-coding distribution is determined.
Further, feature distances may be used to calculate the degree of matching between the training source data acquisition trajectory and the acquisition noise points and the degree of matching between the first self-encoding distribution and the second self-encoding distribution. The matching degree is thus a numerical value between -1 and 1, and the closer the value is to 1, the higher the matching degree. When the matching degree between the third self-coding distribution and the first self-coding distribution is smaller than a first set matching degree threshold and the matching degree between the third self-coding distribution and the second self-coding distribution is larger than a second set matching degree threshold, the acquisition noise point is determined to be an acquisition noise point generated only in combination with the training source data acquisition track.
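The two-threshold decision just described can be written compactly. The sketch below assumes the two matching degrees have already been computed as numbers in the range -1 to 1; the function and parameter names are hypothetical, and the threshold values in the example are placeholders rather than values taken from the description.

```python
def generated_only_from_trajectory(
    match_third_first: float,
    match_third_second: float,
    first_threshold: float,
    second_threshold: float,
) -> bool:
    """Return True when the noise point is attributed to the trajectory alone.

    match_third_first: matching degree between the third and first
        self-coding distributions (trajectory vs. routing space).
    match_third_second: matching degree between the third and second
        self-coding distributions (trajectory vs. noise point).
    """
    return (
        match_third_first < first_threshold
        and match_third_second > second_threshold
    )


if __name__ == "__main__":
    # Placeholder thresholds, chosen only to exercise the rule.
    print(generated_only_from_trajectory(0.1, 0.8, 0.3, 0.6))
```

The strict inequalities follow the wording "smaller than" and "larger than"; a matching degree exactly equal to a threshold does not satisfy the rule.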
Next, the AI training steps involved in the above big data analysis-based acquisition noise point mining method are introduced.
The weight parameters of the deep learning model, the self-coding branch and the noise point analysis branch are updated according to the big data of first template training source data acquisition tracks, wherein the big data comprises a plurality of first template training source data acquisition tracks, each of which comprises a first template training source data acquisition track, a first template sample acquisition routing space related to the first template training source data acquisition track, and a template acquisition noise point related to the first template training source data acquisition track and the first template sample acquisition routing space.
Further, the following steps are carried out for each first template training source data acquisition track in the big data of first template training source data acquisition tracks.
Process310 obtains a plurality of first template redundant feedback data nodes from the first template training source data acquisition trajectory, and then analyzes each redundant acquisition field in the plurality of first template redundant feedback data nodes.
Process320 obtains a first template sample collection routing space related to the first template training source data collection track, and then mines the sample collection routing fields of each sample collection routing node in the first template sample collection routing space one by one, wherein each sample collection routing field and each redundant collection field may have a field variable having a field association relationship.
Process330 changes a plurality of sample collection routing nodes in the template collection noise points into connected sample collection routing nodes to generate connected collection noise points, and then mines the sample collection routing fields of all the sample collection routing nodes in the connected collection noise points one by one, wherein the sample collection routing fields of all the sample collection routing nodes in the connected collection noise points and each redundant collection field may have field variables with field association relationships.
Process340 performs field communication on the redundant acquisition fields of the plurality of first template redundant feedback data nodes, the sample acquisition routing field of the first template sample acquisition routing space, and the sample acquisition routing field of each sample acquisition routing node in the connected acquisition noise point, and generates a first training field connection matrix.
Process350 determines a first routing path of the redundant acquisition trigger point related to the redundant acquisition field of each first template redundant feedback data node in the first template training source data acquisition trajectory, determines a second routing path of the sample acquisition routing node related to each sample acquisition routing field in the first template sample acquisition routing space, and determines a second routing path of the sample acquisition routing node related to each sample acquisition routing field in the connected acquisition noise points.
Process360 performs secondary mapping on each first field communication matrix unit in the first training field communication matrix by combining the first routing path related to each redundant acquisition field and the second routing path related to each sample acquisition routing field, and generates a second training field communication matrix.
Process370 determines a plurality of connected sample collection routing nodes in the connected collection noise points by combining the second training field communication matrix.
Process380 calculates first cost information between the plurality of connected sample collection routing nodes and the template sample collection routing node.
Process390 updates the weight parameter information of the deep learning model, the self-coding branch and the noise point analysis branch at least in combination with the first cost information, and performs acquisition noise point prediction in combination with the deep learning model, the self-coding branch and the noise point analysis branch.
In addition, in terms of cost information selection, besides the first cost information between the connected sample collection routing nodes and the template sample collection routing node, advanced cost information can additionally be introduced between the training source data collection track and the sample collection routing space, between the training source data collection track and the generated collection noise point, and between the sample collection routing space and the generated collection noise point. The training source data acquisition track, the sample acquisition routing space and the generated acquisition noise point all correspond to first field connectivity matrices obtained with the same acquisition noise point analysis AI unit, so the purpose of adding the advanced cost information is to bring the acquisition noise point generated by the acquisition noise point analysis AI unit as close as possible to the training source data acquisition track or the sample acquisition routing space during learning.
The above-mentioned advanced cost information can be calculated, for example, in the following manner. First, one self-coding distribution as a third self-coding distribution, one self-coding distribution as a first self-coding distribution, and one self-coding distribution as a second self-coding distribution are selected among a plurality of self-coding distributions generated by the self-coding branches.
Then, second cost information is calculated as the advanced cost information between the training source data acquisition track and the sample acquisition routing space, by combining the matching degree between the third self-coding distribution and the related first self-coding distribution, the matching degree between the third self-coding distribution and the unrelated first self-coding distribution, the matching degree between the first self-coding distribution and the related third self-coding distribution, and the matching degree between the first self-coding distribution and the unrelated third self-coding distribution.
Third cost information is calculated as the advanced cost information between the training source data acquisition track and the acquisition noise point, by combining the matching degree between the third self-coding distribution and the related second self-coding distribution, the matching degree between the third self-coding distribution and the unrelated second self-coding distribution, the matching degree between the second self-coding distribution and the related third self-coding distribution, and the matching degree between the second self-coding distribution and the unrelated third self-coding distribution.
Fourth cost information is calculated as the advanced cost information between the sample acquisition routing space and the acquisition noise point, by combining the matching degree between the first self-coding distribution and the related second self-coding distribution, the matching degree between the first self-coding distribution and the unrelated second self-coding distribution, the matching degree between the second self-coding distribution and the related first self-coding distribution, and the matching degree between the second self-coding distribution and the unrelated first self-coding distribution.
In this case, updating the weight parameter information of the deep learning model, the self-coding branch and the noise point analysis branch at least in combination with the first cost information may include: training the deep learning model, the self-coding branch and the noise point analysis branch in combination with the sum of the first cost information, the second cost information, the third cost information and the fourth cost information.
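The description does not give a formula for the second, third and fourth cost information. One way to read "related" versus "unrelated" matching degrees in both directions is as a symmetric contrastive cost, and the sketch below illustrates that reading only. It assumes a square matrix sim in which sim[i][j] is the matching degree between item i of one kind (for example a third self-coding distribution) and item j of the other kind (for example a first self-coding distribution), with related pairs on the diagonal. The softmax form, the temperature parameter and all names are assumptions and not the method of the embodiment.

```python
import math
from typing import List


def symmetric_contrastive_cost(
    sim: List[List[float]], temperature: float = 0.1
) -> float:
    """Assumed contrastive cost over a square matching-degree matrix.

    Related pairs sit on the diagonal; off-diagonal entries are unrelated
    pairs. The cost is low when related pairs score higher than unrelated
    pairs, in both the row direction and the column direction.
    """
    n = len(sim)
    if n == 0 or any(len(row) != n for row in sim):
        raise ValueError("sim must be a non-empty square matrix")

    def mean_nll(rows: List[List[float]]) -> float:
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in logits))
            total += log_z - logits[i]
        return total / n

    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    return 0.5 * (mean_nll(sim) + mean_nll(columns))


def total_cost(first: float, second: float, third: float, fourth: float) -> float:
    """Sum of the four cost terms, as the text describes for training."""
    return first + second + third + fourth
```

Under this reading, the same function would be applied to three matching-degree matrices (third versus first, third versus second, first versus second) to obtain the second, third and fourth cost information, and total_cost would give the quantity used for the weight update.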
In some examples, a pre-training step is further included before the weight parameters of the deep learning model, the self-coding branch and the noise point analysis branch are updated according to the big data of first template training source data acquisition tracks. Specifically, the above method may further comprise: updating the weight parameter information of the deep learning model and the self-coding branch according to big data of second template training source data acquisition tracks, wherein the big data comprises a plurality of second template training source data acquisition tracks, each of which comprises a second template training source data acquisition track and a second template sample acquisition routing space related to that track. It can be seen that the big data of second template training source data acquisition tracks differs from the big data of first template training source data acquisition tracks in that it contains no acquisition noise points.
Updating the weight parameter information of the deep learning model and the self-coding branch according to the big data of the second template training source data acquisition track may include the following steps.
The following steps are carried out for each second template training source data acquisition track in the big data of second template training source data acquisition tracks.
Firstly, a plurality of second template redundant feedback data nodes are obtained from the template training source data acquisition track of the second template training source data acquisition track, and then each redundant acquisition field in the plurality of second template redundant feedback data nodes is analyzed.
Then, a second template sample collection routing space related to the second template training source data collection track is obtained, and then sample collection routing fields of all sample collection routing nodes in the second template sample collection routing space are mined one by one, wherein each sample collection routing field and each redundant collection field can have a field variable with a field contact relation.
Then, field communication is performed on the redundant acquisition fields of the plurality of second template redundant feedback data nodes and the sample acquisition routing field of the second template sample acquisition routing space to generate a third training field communication matrix.
Then, a first routing path of the redundant acquisition trigger point related to the redundant acquisition field of each second template redundant feedback data node in the second template training source data acquisition track is determined, and a second routing path of the sample acquisition routing node related to each sample acquisition routing field in the second template sample acquisition routing space is determined.
Next, secondary mapping is performed on each first field communication matrix unit in the third training field communication matrix by combining the first routing path related to each redundant acquisition field and the second routing path related to each sample acquisition routing field, to generate a fourth training field communication matrix.
It can be seen that this pre-training process is broadly similar to the big data analysis-based acquisition noise point mining method described in the previous embodiments, except that the pre-training process does not generate or output acquisition noise points. By pre-training on the big data of second template training source data acquisition tracks, which carry no acquisition noise point labels, the method can learn the field features of the redundant acquisition fields and the sample acquisition routing fields of the training source data acquisition track, so that the third self-coding distribution is brought close to the sample acquisition routing field.
Next, the fourth training field connectivity matrix is loaded to the self-encoding branch, and one self-encoding distribution that is the third self-encoding distribution and one self-encoding distribution that is the first self-encoding distribution are selected among a plurality of self-encoding distributions generated by the self-encoding branch.
Then, fifth cost information is calculated by combining the matching degree between the third self-coding distribution and the related first self-coding distribution and the matching degree between the third self-coding distribution and the unrelated first self-coding distribution, and the matching degree between the first self-coding distribution and the related third self-coding distribution and the matching degree between the first self-coding distribution and the unrelated third self-coding distribution.
Then, the weight parameter information of the deep learning model and the self-coding branch is updated in combination with the fifth cost information.
In some examples, performing big data acquisition procedure optimization on the AI training server in combination with the acquisition noise point and a plurality of connected acquisition noise points related to the acquisition noise point may include the following steps.
Process161 takes the acquisition noise point and the plurality of connected acquisition noise points related to the acquisition noise point as a connected noise point cluster, and determines a plurality of big data acquisition control data associated with the connected noise point cluster from a big data acquisition template database;
Process162 generates frequent item field distribution data corresponding to the connected noise point cluster by combining the collection item ranges of the plurality of big data acquisition control data with the big data acquisition frequent items extracted from the plurality of big data acquisition control data;
Process163 generates frequent item noise field distribution data by combining the label field distribution of the connected noise point cluster and the distribution frequency of the big data acquisition frequent items in the connected noise point cluster;
Process164 generates big data acquisition process optimization information corresponding to the connected noise point cluster based on the frequent item field distribution data and the frequent item noise field distribution data, and outputs the big data acquisition process optimization information to a developer as an optimization basis for an optimization prompt.
Generating the frequent item field distribution data corresponding to the connected noise point cluster, by combining the collection item ranges of the plurality of big data acquisition control data with the big data acquisition frequent items extracted from them, may include: performing frequent item extraction on each big data acquisition control data to obtain the big data acquisition frequent items within a historical acquisition control period; labeling the big data acquisition frequent items extracted from the plurality of big data acquisition control data with frequent positioning points, and merging the big data acquisition frequent items that share the same frequent positioning point but were extracted from different big data acquisition control data; generating the frequent item distribution features of each big data acquisition frequent item by combining the distribution features, in each big data acquisition control data, of the big data acquisition frequent items sharing the same frequent positioning point; generating the current field distribution data of each big data acquisition frequent item by combining the collection item range of each big data acquisition control data with the frequent item distribution features of the big data acquisition frequent items; and generating the frequent item field distribution data corresponding to the connected noise point cluster based on the current field distribution data of each big data acquisition frequent item.
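The frequent item extraction step is described only at a high level. As a rough illustration, the Python sketch below treats each big data acquisition control data record as a list of collection item names and calls an item "frequent" when it appears in at least a given fraction of the records. That support-based criterion, the function name frequent_items, the min_support parameter and the item names in the example are all assumptions introduced here; the description does not specify how frequency is measured.

```python
from collections import Counter
from typing import Dict, Iterable, List


def frequent_items(
    control_data: List[Iterable[str]], min_support: float = 0.5
) -> Dict[str, float]:
    """Return items whose share of records is at least min_support.

    Each record is counted once per item, so repeated items inside one
    record do not inflate the frequency.
    """
    n = len(control_data)
    if n == 0:
        return {}
    counts: Counter = Counter()
    for record in control_data:
        counts.update(set(record))
    return {
        item: count / n
        for item, count in counts.items()
        if count / n >= min_support
    }


if __name__ == "__main__":
    # Hypothetical control data records, for illustration only.
    records = [["url", "title"], ["url", "price"], ["url", "title", "date"]]
    print(frequent_items(records, min_support=0.5))
```

The resulting mapping from item to share of records is one possible stand-in for the frequent item distribution features mentioned in the text.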
Generating the frequent item noise field distribution data by combining the label field distribution of the connected noise point cluster and the distribution frequency of the big data acquisition frequent items in the connected noise point cluster may include: generating the distribution frequency of the big data acquisition frequent items in the connected noise point cluster by combining the distribution feature information of the connected noise point cluster in the big data acquisition control data with the big data acquisition frequent items extracted from the big data acquisition control data within the historical acquisition control period;
generating the label field distribution of the connected noise point cluster by combining the distribution feature information of the connected noise point cluster in the big data acquisition control data with the crawling field category attribute of the crawling script end from which the big data acquisition control data is extracted; and generating the frequent item noise field distribution data by combining the label field distribution of the connected noise point cluster and the distribution frequency of the big data acquisition frequent items in the connected noise point cluster.
Fig. 2 illustrates a hardware structure of the big data collecting system 100 for implementing the big data analysis-based collecting noise point mining method according to the embodiment of the present disclosure. As shown in Fig. 2, the big data collecting system 100 may include a processor 110, a machine-readable storage medium 120, a bus 130, and a communication unit 140.
The processor 110 may perform various suitable actions and processes according to a program stored in the machine-readable storage medium 120, such as program instructions corresponding to the big data analysis based acquisition noise point mining method described in the foregoing embodiments. The processor 110, the machine-readable storage medium 120, and the communication unit 140 perform signal transmission through the bus 130.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication unit 140, and when executed by the processor 110, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
Still another embodiment of the present disclosure further provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the method for mining the collected noise points based on big data analysis according to any of the above embodiments is implemented.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method shown in the above embodiments.
Yet another embodiment of the present disclosure further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the method for mining the collection noise points based on the big data analysis as described in any of the above embodiments is implemented.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the remote computer may be connected to an external computer (e.g., through the Internet using an Internet service provider).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above; and the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, and magnetic or optical disks.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present disclosure, and not for limiting the same. While the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the embodiments of the present disclosure.

Claims (10)

1. A big data analysis-based acquisition noise point mining method, applied to a big data acquisition system, the method comprising the following steps:
determining a plurality of training redundant feedback data nodes from a training redundant feedback flow by combining a training source data acquisition track of a big data acquisition flow requested by an AI training task issued by an AI training server, and then analyzing redundant acquisition fields of each training redundant feedback data node in the plurality of training redundant feedback data nodes;
excavating a sample acquisition routing space covered by the training source data acquisition track, and then excavating sample acquisition routing fields of all sample acquisition routing nodes in the sample acquisition routing space one by one;
and determining a collection noise point related to the training source data collection track and a plurality of connected collection noise points related to the collection noise point by combining the redundant collection fields of the training redundant feedback data nodes and the sample collection routing field of the sample collection routing space, and optimizing a big data collection flow of the AI training server by combining the collection noise point and the plurality of connected collection noise points related to the collection noise point.
2. The big data analysis-based collection noise point mining method according to claim 1, wherein the step of determining a collection noise point associated with the training source data collection trajectory and a plurality of connected collection noise points associated with the collection noise point by combining the redundant collection fields of the training redundant feedback data nodes and the sample collection routing field of the sample collection routing space, and optimizing the big data collection process of the AI training server by combining the collection noise point and the plurality of connected collection noise points associated with the collection noise point comprises:
performing field communication on the redundant acquisition fields of the training redundant feedback data nodes and the sample acquisition routing fields of the sample acquisition routing space to generate a first field communication matrix;
determining a first routing path of a redundant acquisition trigger point related to each redundant acquisition field in the training source data acquisition track, and determining a second routing path of a sample acquisition routing node related to each sample acquisition routing field in the sample acquisition routing space;
performing secondary mapping on each first field communication matrix unit in the first field communication matrix by combining a first routing path related to each redundant acquisition field and a second routing path related to each sample acquisition routing field to generate a second field communication matrix;
and determining a collection noise point related to the training source data collection track and a plurality of connected collection noise points related to the collection noise point by combining the second field connected matrix, and performing big data collection flow optimization on the AI training server by combining the collection noise point and the plurality of connected collection noise points related to the collection noise point.
3. The big data analysis-based acquisition noise point mining method according to claim 2, wherein the determining, in combination with the second field connectivity matrix, an acquisition noise point associated with the training source data acquisition trajectory comprises:
loading the second field communication matrix to a collection noise point analysis AI unit, and outputting a trigger routing node of the collection noise point;
determining a sample collection routing field of the trigger routing node of the collection noise point;
optimizing the second field communication matrix based on field convergence by combining the sample acquisition routing field of the trigger routing node of the collection noise point and the routing path of the trigger routing node in the collection noise point;
and loading the optimized second field communication matrix to the collection noise point analysis AI unit, and repeatedly optimizing the second field communication matrix by traversing the determined sample acquisition routing fields of the sample acquisition routing nodes, until AI output information of the collection noise point is obtained.
4. The big data analysis-based collected noise point mining method according to claim 3, wherein the collected noise point analysis AI unit comprises a self-coding branch and a noise point analysis branch, wherein loading the second field connectivity matrix into the collected noise point analysis AI unit and outputting the trigger routing node of the collected noise point comprises:
loading the second field communication matrix to a self-coding branch, and outputting self-coding distribution of trigger routing nodes corresponding to the collected noise points;
according to the noise point analysis branch, converting self-coding distribution of trigger routing nodes corresponding to the collected noise points into a first noise decision thermodynamic diagram (that is, a heat map), wherein the first noise decision thermodynamic diagram comprises trigger thermal values corresponding to all sample collection routing nodes in a sample collection routing node sequence;
and outputting the trigger routing node of the collection noise point by combining the first noise decision thermodynamic diagram.
5. The big data analysis-based collected noise point mining method according to claim 2, wherein the optimization of the second field connectivity matrix based on field aggregation, in combination with the sample collection routing field of the trigger routing node of the collected noise point and the routing path thereof in the collected noise point, comprises:
performing field communication on redundant acquisition fields of the training redundant feedback data nodes, sample acquisition routing fields of the sample acquisition routing space and sample acquisition routing fields of trigger routing nodes for acquiring noise points, and optimizing the first field communication matrix;
and performing secondary mapping on each first field communication matrix unit in the optimized first field communication matrix by combining a first routing path related to each redundant acquisition field, a second routing path related to each sample acquisition routing field in the sample acquisition routing space and a second routing path of a trigger routing node of the acquired noise point, and optimizing the second field communication matrix.
6. The big data analysis-based collected noise point mining method according to claim 4, wherein outputting the trigger routing node of the collected noise point in combination with the first noise decision thermodynamic diagram comprises:
ranking the thermal values in the first noise decision thermodynamic diagram;
determining the top R thermal values in the resulting thermal value ranking, and determining the sample collection routing nodes corresponding to the top R thermal values as an associated routing node set of the trigger routing node of the collection noise point, wherein loading the optimized second field connectivity matrix to the self-encoding branch and repeatedly executing the above operations until AI output information of the collection noise point is obtained comprises:
generating associated routing node sets of other sample collection routing nodes in sequence by combining the associated routing node sets of the trigger routing nodes;
and determining the specified number of the collection noise points by combining the associated routing node sets of the sample collection routing nodes in the collection noise points.
7. The big data analysis-based acquisition noise point mining method of claim 6, further comprising:
respectively aiming at each acquisition noise point in the specified number of acquisition noise points, implementing the following steps:
after outputting the AI output information of the collection noise point, arbitrarily determining one self-encoding distribution as a first self-encoding distribution and one self-encoding distribution as a second self-encoding distribution among a plurality of self-encoding distributions generated by the self-encoding branches;
and determining a degree of match between the first self-encoding distribution and the second self-encoding distribution;
if the maximum matching degree is larger than a set matching degree threshold value, determining an acquisition noise point related to the matching degree as an acquisition noise point related to a training source data acquisition track, and if the maximum matching degree is not larger than the set matching degree threshold value, determining that information of the acquisition noise point related to the training source data acquisition track does not exist;
wherein, for each collection noise point, after outputting AI output information of the collection noise point, one self-encoding distribution is arbitrarily determined as a third self-encoding distribution, one self-encoding distribution is determined as a first self-encoding distribution, and one self-encoding distribution is determined as a second self-encoding distribution among a plurality of self-encoding distributions generated by the self-encoding branches;
determining the matching degree between the third self-coding distribution and the first self-coding distribution, and determining the matching degree between the third self-coding distribution and the second self-coding distribution;
and when the matching degree between the third self-coding distribution and the first self-coding distribution is smaller than a first set matching degree threshold value and the matching degree between the third self-coding distribution and the second self-coding distribution is larger than a second set matching degree threshold value, determining the acquisition noise point as an acquisition noise point generated only by combining with a training source data acquisition track.
8. The big data analysis-based collection noise point mining method of claim 4, wherein obtaining the redundant collection field and the sample collection routing field and the quadratic mapping are implemented according to a deep learning model, and the method further comprises:
performing weight parameter updating on the deep learning model, the self-coding branch and the noise point analysis branch according to big data of a first template training source data acquisition track, wherein the big data of the first template training source data acquisition track comprises a plurality of first template training source data acquisition tracks, each first template training source data acquisition track comprises a first template training source data acquisition track, a first template sample acquisition routing space related to the first template training source data acquisition track, and a template acquisition noise point related to the first template training source data acquisition track and the first template sample acquisition routing space, and wherein performing weight parameter updating on the deep learning model, the self-coding branch and the noise point analysis branch according to the big data of the first template training source data acquisition track comprises:
in any first template training source data acquisition track in the big data of the first template training source data acquisition track, combining each first template training source data acquisition track, obtaining a plurality of first template redundant feedback data nodes from the first template training source data acquisition track of the first template training source data acquisition track, and then analyzing each redundant acquisition field in the plurality of first template redundant feedback data nodes;
acquiring a first template sample acquisition routing space related to the first template training source data acquisition track, and then excavating sample acquisition routing fields of all sample acquisition routing nodes in the first template sample acquisition routing space one by one, wherein each sample acquisition routing field and each redundant acquisition field have field variables with field association relationship;
changing a plurality of sample acquisition routing nodes in the template acquisition noise points into connected sample acquisition routing nodes to generate connected acquisition noise points, and then excavating sample acquisition routing fields of all the sample acquisition routing nodes in the connected acquisition noise points one by one, wherein the sample acquisition routing fields of all the sample acquisition routing nodes in the connected acquisition noise points and each redundant acquisition field have field variables with field association relationship;
performing field communication on the redundant acquisition fields of the plurality of first template redundant feedback data nodes, the sample acquisition routing field of the first template sample acquisition routing space and the sample acquisition routing field of each sample acquisition routing node in the connected acquisition noise point to generate a first training field communication matrix;
determining a first routing path of a redundant acquisition trigger point related to a redundant acquisition field of each first template redundant feedback data node in the first template training source data acquisition track, determining a second routing path of a sample acquisition routing node related to each sample acquisition routing field in the first template sample acquisition routing space, and determining a second routing path of a sample acquisition routing node related to each sample acquisition routing field in the connected acquisition noise points;
carrying out secondary mapping on each first field communication matrix unit in the first training field communication matrix by combining a first routing path related to each redundant acquisition field and a second routing path related to each sample acquisition routing field to generate a second training field communication matrix;
determining a plurality of connected sample acquisition routing nodes in the connected acquisition noise points by combining the second training field connected matrix;
calculating first cost information between the plurality of connected sample collection routing nodes and the template sample collection routing node;
and updating the weight parameter information of the deep learning model, the self-coding branch and the noise point analysis branch at least in combination with the first cost information.
9. The big data analysis-based collected noise point mining method according to any one of claims 2 to 8, wherein the step of performing big data collection procedure optimization on the AI training server by combining the collected noise points and a plurality of connected collected noise points related to the collected noise points comprises:
taking the collection noise point and a plurality of connected collection noise points related to the collection noise point as a connected noise point cluster, and determining a plurality of big data collection control data related to the connected noise point cluster from a big data collection template database;
combining the collection item ranges of the large data collection control data and the extraction of large data collection frequent items of the large data collection control data to generate frequent item field distribution data corresponding to the connected noise point clusters;
combining the distribution of the label fields of the connected noise point clusters and the distribution frequency of the big data acquisition frequent items in the connected noise point clusters to generate frequent item noise field distribution data;
generating big data acquisition process optimization information corresponding to the connected noise point clusters based on the frequent item field distribution data and the frequent item noise field distribution data, and outputting the big data acquisition process optimization information as an optimization basis to developers for optimization prompt;
the generating of frequent item field distribution data corresponding to the connected noise point cluster by combining the collection item ranges of the large data collection control data and the extraction of the large data collection frequent items of the large data collection control data comprises:
performing frequent item extraction on the large data acquisition control data to obtain large data acquisition frequent items in a historical acquisition control period in the large data acquisition control data;
performing frequent positioning point labeling on the big data acquisition frequent items extracted from the plurality of big data acquisition control data, and combining the big data acquisition frequent items of the same frequent positioning point extracted from different big data acquisition control data;
combining the distribution characteristics of the large data acquisition frequent items of the same frequent positioning point in each large data acquisition control data to generate the frequent item distribution characteristics of the large data acquisition frequent items;
combining the collection item range of each big data collection control data and the frequent item distribution characteristics of the big data collection frequent items to generate the current field distribution data of each big data collection frequent item;
generating frequent item field distribution data corresponding to the connected noise point cluster based on the current field distribution data of each big data acquisition frequent item;
the generating of frequent item noise field distribution data by combining the label field distribution of the connected noise point cluster and the distribution frequency of the big data acquisition frequent items in the connected noise point cluster comprises:
combining the distribution characteristic information of the connected noise point clusters in the big data acquisition control data and the big data acquisition frequent items in the extracted big data acquisition control data in the historical acquisition control period to generate the distribution frequency of the big data acquisition frequent items in the connected noise point clusters;
combining the distribution characteristic information of the connected noise point clusters in the big data acquisition control data and the crawling field category attribute of the crawling script end extracting the big data acquisition control data to generate the label field distribution of the connected noise point clusters;
and combining the distribution of the label fields of the connected noise point clusters and the distribution frequency of the large data acquisition frequent items in the connected noise point clusters to generate frequent item noise field distribution data.
10. A big data collection system, comprising a processor and a machine-readable storage medium, wherein the machine-readable storage medium has a computer program stored therein, the computer program being loaded and executed by the processor to implement the big data analysis-based collection noise point mining method according to any one of claims 1 to 9.
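As an illustration only, the ranking step of claim 6 (keep the sample collection routing nodes with the top R thermal values) and the threshold test of claim 7 (accept a collection noise point only when the maximum matching degree exceeds a set threshold) can be sketched in Python as follows. The claims do not specify how the matching degree between two self-encoding distributions is computed; cosine similarity is used here purely as an assumption, and the function names, node names, and sample values are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Optional, Sequence, Tuple


def top_r_nodes(thermal_values: Dict[str, float], r: int) -> List[str]:
    """Rank nodes by thermal value (highest first) and keep the first r.

    Ties are broken by node name so the result is deterministic.
    """
    ranked = sorted(thermal_values.items(), key=lambda item: (-item[1], item[0]))
    return [node for node, _ in ranked[:r]]


def matching_degree(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed matching degree: cosine similarity of two distributions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector matches nothing
    return dot / (norm_a * norm_b)


def select_noise_point(
    candidates: Dict[str, Tuple[Sequence[float], Sequence[float]]],
    threshold: float,
) -> Optional[str]:
    """Return the candidate with the maximum matching degree, or None.

    Each candidate maps a noise point name to its (first, second)
    self-encoding distributions. None means no candidate exceeded
    the threshold, i.e. no related noise point is reported.
    """
    best_name: Optional[str] = None
    best_score = float("-inf")
    for name, (first, second) in candidates.items():
        score = matching_degree(first, second)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score > threshold else None


if __name__ == "__main__":
    # Hypothetical thermal values for three routing nodes.
    print(top_r_nodes({"n1": 0.2, "n2": 0.9, "n3": 0.5}, r=2))
    # Hypothetical self-encoding distributions for two candidate noise points.
    demo = {
        "p1": ([1.0, 0.0], [1.0, 0.0]),
        "p2": ([1.0, 0.0], [0.0, 1.0]),
    }
    print(select_noise_point(demo, threshold=0.8))
```

The sketch leaves out the self-encoding branch and the field connectivity matrices entirely; it only shows the selection logic applied to their outputs.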

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210381584.6A CN114691665B (en) 2022-04-13 2022-04-13 Big data analysis-based acquisition noise point mining method and big data acquisition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210381584.6A CN114691665B (en) 2022-04-13 2022-04-13 Big data analysis-based acquisition noise point mining method and big data acquisition system

Publications (2)

Publication Number Publication Date
CN114691665A (en) 2022-07-01
CN114691665B (en) 2023-11-14

Family

ID=82143435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210381584.6A Active CN114691665B (en) 2022-04-13 2022-04-13 Big data analysis-based acquisition noise point mining method and big data acquisition system

Country Status (1)

Country Link
CN (1) CN114691665B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115062722A (en) * 2022-07-06 2022-09-16 哈尔滨璧能科技有限公司 AI training method based on cloud service big data cleaning and artificial intelligence cloud system
CN115145904A (en) * 2022-07-06 2022-10-04 枣庄宏禹数字科技有限公司 Big data cleaning method and big data acquisition system for AI cloud computing training

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019144639A (en) * 2018-02-16 2019-08-29 株式会社日立製作所 Method for training model outputting vector indicating tag set corresponding to image
CN112199411A (en) * 2020-09-15 2021-01-08 刘明明 Big data analysis method and artificial intelligence platform applied to cloud computing communication architecture
CN112464065A (en) * 2020-06-06 2021-03-09 谢国柱 Big data acquisition method and system based on mobile internet
CN112506910A (en) * 2020-12-14 2021-03-16 招商局金融科技有限公司 Multi-source data acquisition method and device, electronic equipment and storage medium
US20210357776A1 (en) * 2020-05-13 2021-11-18 International Business Machines Corporation Data-analysis-based, noisy labeled and unlabeled datapoint detection and rectification for machine-learning

Also Published As

Publication number Publication date
CN114691665B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
CN114691665B (en) Big data analysis-based acquisition noise point mining method and big data acquisition system
CN111143226B (en) Automatic test method and device, computer readable storage medium and electronic equipment
CN114697128B (en) Big data denoising method and big data acquisition system through artificial intelligence decision
CN113609210A (en) Big data visualization processing method based on artificial intelligence and visualization service system
CN112508723B (en) Financial risk prediction method and device based on automatic preferential modeling and electronic equipment
CN115048370B (en) Artificial intelligence processing method for big data cleaning and big data cleaning system
CN112329874A (en) Data service decision method and device, electronic equipment and storage medium
Chen et al. Experience transfer for the configuration tuning in large-scale computing systems
CN114691664B (en) AI prediction-based intelligent scene big data cleaning method and intelligent scene system
Meilong et al. An approach to semantic and structural features learning for software defect prediction
CN113722719A (en) Information generation method and artificial intelligence system for security interception big data analysis
CN114826768A (en) Cloud vulnerability processing method applying big data and AI technology and AI analysis system
CN114185761A (en) Log collection method, device and equipment
CN113722711A (en) Data adding method based on big data security vulnerability mining and artificial intelligence system
CN112783508A (en) File compiling method, device, equipment and storage medium
CN114928493B (en) Threat information generation method and AI security system based on threat attack big data
CN114781624B (en) User behavior intention mining method based on big data analysis and big data system
Wen et al. A Cross-Project Defect Prediction Model Based on Deep Learning With Self-Attention
CN115345600B (en) RPA flow generation method and device
CN114780967A (en) Mining evaluation method based on big data vulnerability mining and AI vulnerability mining system
CN115775064A (en) Engineering decision calculation result evaluation method and cloud platform
CN114329454B (en) Threat analysis method and system based on application software big data
CN114168966B (en) Big data analysis-based security protection upgrade mining method and information security system
CN114978765A (en) Big data processing method serving information attack defense and AI attack defense system
CN114239406A (en) Financial process mining method based on reinforcement learning and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220915

Address after: No. 39, Xiangyang Street, Longshan District, Liaoyuan City, Jilin Province, 136300

Applicant after: Xu Xinfu

Address before: No. 420, Xinda Road, Dongfeng Town, Dongfeng County, Liaoyuan City, Jilin Province 136300

Applicant before: Liaoyuan Xunzhan Network Technology Co.,Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20231013

Address after: 101300 No. 5, houxiao street, baishuzhuang village, Zhang Town, Shunyi District, Beijing

Applicant after: Zhongkun (Beijing) aviation equipment Co.,Ltd.

Address before: No. 39, Xiangyang Street, Longshan District, Liaoyuan City, Jilin Province, 136300

Applicant before: Xu Xinfu

GR01 Patent grant