CN116010602B - Data optimization method and system based on big data - Google Patents


Info

Publication number: CN116010602B
Authority: CN (China)
Prior art keywords: word vector, to be classified, index, data, classification
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202310035223.0A
Other languages: Chinese (zh)
Other versions: CN116010602A (en)
Inventors: 孔祥山, 陈伍
Current and original assignee: Hubei Central China Technology Development Of Electric Power Co., Ltd. (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Hubei Central China Technology Development Of Electric Power Co., Ltd.; priority to CN202310035223.0A; CN116010602A published upon application, CN116010602B published upon grant

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data tuning method and system based on big data, relating to the field of model tuning and comprising the following steps: obtaining data to be classified of a model to be tuned; inputting the data to be classified into a word vector model for vectorization to generate a word vector to be classified; inputting the word vector to be classified into a sample space for cluster analysis to obtain a neighbor word vector set; traversing the neighbor word vector set to perform index classification on the data to be classified, obtaining a classification index determination result; evaluating the model to be tuned according to the classification index determination result to generate a classification index evaluation result; and performing tuning analysis on the model to be tuned according to the classification index evaluation result. The method solves the technical problem in the prior art that the data for each evaluation index of a model must be set manually, leaving the model tuning analysis process with a low degree of intelligence.

Description

Data optimization method and system based on big data
Technical Field
The application relates to the technical field of model tuning, in particular to a data tuning method and system based on big data.
Background
Artificial intelligence and machine learning have been development hot topics in recent years, enabling a variety of intelligent models such as decision tree models, neural network models, support vector machines, and expert systems. Training any of these models relies on iterative optimization over a large amount of training data to ensure stable processing performance.
After model training, a model evaluation and tuning stage is required so that the model's deficiencies can be found and optimized as early as possible. In the practical development of artificial intelligence applications, the accuracy, performance, interpretability, and other qualities of a model must be evaluated; at present, the data for each evaluation index must be set manually, so the process lacks intelligence.
That is, in the prior art, because the data for each evaluation index of a model must be set manually, the model tuning analysis process has a low degree of intelligence.
Disclosure of Invention
The application provides a data tuning method and system based on big data, used to solve the technical problem that the prior art, lacking a means of classifying model data by evaluation index, leaves the model tuning analysis process with a low degree of intelligence.
In view of the above problems, the present application provides a data tuning method and system based on big data.
In a first aspect of the present application, a data tuning method based on big data is provided, comprising: obtaining data to be classified of a model to be tuned; inputting the data to be classified into a word vector model for vectorization to generate a word vector to be classified; inputting the word vector to be classified into a sample space for cluster analysis to obtain a neighbor word vector set; traversing the neighbor word vector set to perform index classification on the data to be classified, obtaining a classification index determination result; evaluating the model to be tuned according to the classification index determination result to generate a classification index evaluation result; and performing tuning analysis on the model to be tuned according to the classification index evaluation result.
In a second aspect of the present application, a data tuning system based on big data is provided, comprising: a data acquisition module, used to obtain data to be classified of the model to be tuned; a vectorization processing module, used to input the data to be classified into a word vector model for vectorization to generate a word vector to be classified; a neighbor word vector matching module, used to input the word vector to be classified into a sample space for cluster analysis to obtain a neighbor word vector set; an index classification determination module, used to traverse the neighbor word vector set to perform index classification on the data to be classified and obtain a classification index determination result; a classification index evaluation module, used to evaluate the model to be tuned according to the classification index determination result and generate a classification index evaluation result; and a model tuning analysis module, used to perform tuning analysis on the model to be tuned according to the classification index evaluation result.
In a third aspect of the present application, a data tuning system based on big data is provided as a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the steps of the method of the first aspect.
In a fourth aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, performs the steps of the method of the first aspect.
One or more technical solutions provided by the application have at least the following technical effects or advantages:
The technical solution provided by the application comprises: loading the data to be classified of the model to be tuned, and performing vectorization and clustering on the data to be classified to obtain its neighbor word vector set; performing index classification on the data to be classified according to the neighbor word vector set to obtain a classification index determination result; evaluating the model to be tuned according to the classification index determination result to obtain a classification index evaluation result; and tuning the model to be tuned according to the classification index evaluation result. By vectorizing and clustering the data to be classified, the data to be classified are matched with their neighboring data, the category of the data to be classified is determined from the index category to which the neighboring data belong, the index evaluation data are accurately classified on that basis, and the model is then tuned and analyzed based on the classification result, achieving the technical effect of improving the intelligence of data processing in the model tuning process.
Drawings
FIG. 1 is a schematic flow chart of a data tuning method based on big data provided by the application;
FIG. 2 is a schematic flow chart of obtaining word vectors to be classified in a data optimizing method based on big data provided by the application;
FIG. 3 is a flow chart of a method for obtaining a classification index determination result in a data optimization method based on big data according to the present application;
FIG. 4 is a schematic diagram of a data optimization system based on big data;
fig. 5 is a schematic structural diagram of an exemplary computer device according to an embodiment of the present application.
Reference numerals: data acquisition module 41, vectorization processing module 42, neighbor word vector matching module 43, index classification determination module 44, classification index evaluation module 45, model tuning analysis module 46, computer device 5000, processor 51, input device 52, output device 53, memory 54.
Detailed Description
The application provides a data tuning method and system based on big data, which solve the technical problem that, in the prior art, the data for each evaluation index of a model must be set manually, leaving the model tuning analysis process with a low degree of intelligence. By vectorizing and clustering the data to be classified, matching the data to be classified with their neighboring data, determining the category of the data to be classified from the index category to which the neighboring data belong, accurately classifying the index evaluation data on that basis, and then tuning and analyzing the model based on the classification result, the method and system achieve the technical effect of improving the intelligence of data processing in the model tuning process.
Example 1
As shown in fig. 1, the present application provides a data tuning method based on big data, which is characterized by comprising the following steps:
s100: obtaining data to be classified of a model to be optimized;
specifically, the model to be tuned refers to a model which is already trained, and comprises intelligent models such as a neural network model, a decision tree model, a support vector machine, a random forest model and the like; the data to be classified refers to parameters that can be used to evaluate various metrics of the model to be tuned, including but not limited to: output data, input data, output identification data, model node parameters, model loss parameters, model training process parameters and other parameter types, the method can be used for evaluating various indexes such as the precision, the visualization degree, the accuracy, the error rate, the precision, the recall rate and the like of the model.
S200: inputting the data to be classified into a word vector model for vectorization processing to generate a word vector to be classified;
further, as shown in fig. 2, the step S200 of vectorizing the input word vector model of the data to be classified to generate the word vector to be classified includes the steps of:
s210: based on big data, acquiring text retrieval information and word vector labeling information, wherein the text retrieval information and the word vector labeling information are in one-to-one correspondence;
s220: training a neural network word vector model according to the text retrieval information and the word vector labeling information to generate the word vector model;
s230: and inputting the data to be classified into the word vector model for vectorization processing, and generating the word vector to be classified.
Specifically, different data to be classified correspond to different model tuning evaluation indexes. For example, the accuracy and error rate of the model can be determined from the deviation between the output identification data and the output data, while the degree of visualization of the model can be evaluated from the process parameters of model training.
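As an illustration of the first example above, the sketch below computes accuracy and error rate from the deviation between the output data and the output identification data (labels). It is a minimal stand-in, not the patent's implementation; all names are illustrative.

```python
def accuracy_and_error_rate(outputs, labels):
    """Accuracy = share of outputs matching the output identification data;
    error rate = 1 - accuracy."""
    if not outputs or len(outputs) != len(labels):
        raise ValueError("outputs and labels must be non-empty and equal-length")
    correct = sum(o == l for o, l in zip(outputs, labels))
    acc = correct / len(outputs)
    return acc, 1.0 - acc

acc, err = accuracy_and_error_rate([1, 0, 1, 1], [1, 1, 1, 1])
```

Here three of four outputs match, so accuracy is 0.75 and error rate 0.25.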
That is, the data required by different index evaluations differ, so the required data must be matched before the different indexes can be evaluated: the data to be classified must be classified according to the index dimension to be evaluated. The classification method adopted here is the k-nearest-neighbor classification algorithm. Before classification, distance evaluation between the data to be classified and other data must be possible, so the data to be classified must first be quantized. The preferred process is as follows:
based on big data, acquiring text retrieval information and word vector labeling information, wherein the text retrieval information comprises attribute information of each item of data of a model to be optimized, description information of data characteristics, each item of parameters and the like; the word vector labeling information refers to the result of vectorization labeling according to attribute information of each item of data, description information of data characteristics, each item of parameters and the like, and the vectorization labeling is preferably performed in the following manner: based on big data, attribute information, data characteristics and various parameters of the statistical model are characterized by using characters or character strings, so that the unique corresponding relation between the characters or character strings and the statistical attribute information, data characteristics and various parameters is ensured; thus, any text retrieval information has corresponding word vector labeling information.
Further, with the text retrieval information as input data and the word vector labeling information as output identification information, the neural network word vector model is trained to generate the word vector model. The neural network word vector model is an intelligent word-vector-labeling generation model of the neural-network-language-model family and is widely applied; the word vector model obtained by training on the text retrieval information and word vector labeling information can rapidly vectorize the data to be classified, yielding the word vector to be classified, which represents the data to be classified by a group of characters or character strings and facilitates the subsequent distance evaluation between data.
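The patent trains a neural-network word vector model for this step; as a minimal stand-in under that assumption, the sketch below builds a deterministic token-to-id mapping so that each attribute character string has a unique vector representation, matching the "unique correspondence" requirement described above. All names are illustrative.

```python
from collections import OrderedDict

class WordVectorModel:
    """Minimal stand-in for the patent's word vector model: assigns each
    distinct attribute token a unique integer id and represents one data
    item as the ordered list of its token ids (one id per dimension)."""

    def __init__(self):
        self.vocab = OrderedDict()  # token -> unique id (0 is reserved for padding)

    def _token_id(self, token):
        if token not in self.vocab:
            self.vocab[token] = len(self.vocab) + 1
        return self.vocab[token]

    def vectorize(self, tokens):
        # tokens: attribute/parameter descriptors of one item of model data
        return [self._token_id(t) for t in tokens]

model = WordVectorModel()
v1 = model.vectorize(["output_data", "loss", "epoch_count"])
v2 = model.vectorize(["output_data", "accuracy"])
```

Because the mapping is shared, the token "output_data" receives the same id in both vectors, which is what makes the later dimension alignment and distance evaluation meaningful.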
S300: inputting the word vectors to be classified into a sample space for cluster analysis to obtain a neighbor word vector set;
further, the step S300 includes the steps of:
s310: according to the sample space, a sample word vector set is obtained;
further, the step S310 includes the steps of:
S311: based on big data, acquiring a classified data set of the model to be tuned;
s312: inputting the classified data set into the word vector model for vectorization processing to generate a classified word vector set;
s313: and constructing the sample space according to the classified word vector set.
S320: traversing the sample word vector set to evaluate the similar distance based on the word vector to be classified, and generating a similar distance evaluation result;
further, the step S320 includes the steps of:
s321: constructing a similarity distance evaluation formula:
wherein a denotes the word vector to be classified, b denotes any sample word vector, a_i denotes the i-th dimension parameter of the word vector to be classified, b_i denotes the i-th dimension parameter of the sample word vector, n denotes the highest dimension, D(a, b) denotes the similarity distance between a and b, and α, β, and γ are user-defined correction factors;
s322: and traversing the sample word vector set to perform similarity distance evaluation based on the word vector to be classified according to the similarity distance evaluation formula, and generating a similarity distance evaluation result.
S330: sorting the sample word vector set from small to large according to the similarity distance evaluation result to obtain a sample word vector sorting result;
s340: and traversing the sample word vector sequencing result to screen a preset number of sample word vectors, and obtaining the neighbor word vector set.
Specifically, the sample space refers to the set of word vectors that have already been classified. The neighbor word vector set refers to a specific number of word vectors, screened by their distance to the word vector to be classified from among the already-classified word vectors in the sample space, which are used to evaluate the category of the word vector to be classified.
The preferred detailed procedure is as follows:
First, the sample word vector set is obtained, i.e., the result of extracting the word vector sets stored in the sample space; word vectors in the sample space are stored in groups, and any group of word vectors represents one item of model data;
further, a similarity distance evaluation formula is obtained:
constructing a similarity distance evaluation formula:
wherein a denotes the word vector to be classified, b denotes any sample word vector, a_i denotes the i-th dimension parameter of the word vector to be classified, b_i denotes the i-th dimension parameter of the sample word vector, n denotes the highest dimension, D(a, b) denotes the similarity distance between a and b, and α, β, and γ are user-defined correction factors;
any one sample word vector or word vector to be classified is the result of arrangement of a plurality of word vectors, and the word vectors are respectively first dimension parameters … ith dimension parameters … and the like according to the arrangement sequence from head to tail. When the word vector to be classified is compared with any one sample word vector, the same word vectors in the sample word vector and the word vector to be classified are arranged in the same dimension, different word vectors are arranged in different dimensions, and if the word vector in a certain dimension of the word vector to be classified does not exist, the word vector position of the same dimension of the sample word vector is filled with 0. After the sorting, the word vector to be classified and any sample word vector which can be subjected to distance evaluation by using a similar distance evaluation formula are obtained.
Further, according to the similarity distance evaluation formula, the sample word vector set is traversed and similarity distance evaluation is performed against the word vector to be classified, outputting a similarity distance evaluation result characterizing the similarity distance. In the similarity distance evaluation formula, the larger the similarity distance, the lower the similarity and the larger the evaluation result; the smaller the similarity distance, the higher the similarity.
Further, the sample word vector set is sorted from small to large according to the similarity distance evaluation result, obtaining a sample word vector sorting result that characterizes the order of the sample word vector set. The sorting result is then traversed from head to tail and a preset number of sample word vectors are screened out and set as the neighbor word vector set. The preset number is the number of neighbor samples used to judge the data to be classified and can be set by the user; screening the first preset number of sample word vectors ensures that the data used to classify the evaluation indexes are relatively similar to the data to be classified. In this way, intelligent classification of the model data is realized, providing accurate reference data for the evaluation of each index.
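The sort-and-screen step above is the standard neighbor-selection step of k-nearest-neighbor classification and can be sketched as below. The distance function is a stand-in (the patent's formula is an image), and the sample structure with a `"vector"` and an `"index"` field is an illustrative assumption.

```python
def nearest_neighbors(candidate, samples, k, distance):
    """Sort the classified sample word vectors by ascending similarity
    distance to the candidate and keep the first k as the neighbor set."""
    ranked = sorted(samples, key=lambda s: distance(candidate, s["vector"]))
    return ranked[:k]

def manhattan(a, b):  # illustrative stand-in for the similarity distance formula
    return sum(abs(x - y) for x, y in zip(a, b))

samples = [
    {"vector": [1, 2], "index": "accuracy"},
    {"vector": [9, 9], "index": "visualization"},
    {"vector": [1, 3], "index": "accuracy"},
]
neighbors = nearest_neighbors([1, 2], samples, k=2, distance=manhattan)
```

With k = 2, the two closest samples (distances 0 and 1) are kept and both carry the "accuracy" index category.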
S400: performing index classification on the data to be classified by traversing the neighbor word vector set to obtain a classification index judgment result;
further, as shown in fig. 3, the step S400 includes the steps of:
s410: traversing the neighbor word vector set to obtain neighbor word vector classification identification information;
s420: performing index classification on the neighbor word vector set according to the neighbor word vector classification identification information to obtain a plurality of index classification results;
s430: the index classification results are sorted in a descending order according to the neighbor word vector classification number, and index classification sorting results are generated;
s440: and setting the first index of the index classification and sorting result as the classification index judgment result.
Further, the step S440 includes the steps of:
s441: when the number of the first indexes is greater than or equal to two, a weight distribution formula is obtained:
wherein w_j denotes the classification weight between the word vector a to be classified and the j-th neighbor word vector b_j, and D(a, b_j) denotes the similarity distance between the word vector a to be classified and the j-th neighbor word vector b_j;
s442: traversing the first index to perform weight distribution according to the weight distribution formula to obtain a weight distribution result;
s443: and screening the classification index judgment result according to the weight distribution result.
Specifically, the classification index determination result refers to the result obtained after k-nearest-neighbor classification of the data to be classified according to the neighbor word vector set.
The preferred categorization procedure is as follows:
First, according to the neighbor word vector set, the neighbor word vector classification identification information, which stores the index category to which each neighbor word vector belongs, is loaded. The stored content of this identification information is then extracted and index classification is performed on the neighbor word vector set, obtaining a plurality of index classification results; these are the result of clustering the neighbor word vector set by the index types to which its members belong, and any one index classification result contains one or more neighbor word vectors. The plurality of index classification results are then sorted in descending order of their neighbor word vector classification numbers (the counts of neighbor word vectors they contain), generating the index classification sorting result, and the first index of this sorting result is set as the classification index determination result. That is, the index classification result with the largest number of neighbor word vectors is set as the classification index determination result. This determines an index classification result for the data to be classified and provides accurate reference data for the later index evaluation of the model to be tuned.
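The count-and-sort majority vote just described can be sketched as follows; the neighbor records with an `"index"` field are an illustrative assumption. Ties are returned together so that the weight-distribution step described next can break them.

```python
from collections import Counter

def classify_by_majority(neighbors):
    """Group neighbor word vectors by index category, sort categories in
    descending order of neighbor count, and return the first (largest)
    category as the classification index determination result. If several
    categories tie for first, all of them are returned."""
    counts = Counter(n["index"] for n in neighbors)
    ranking = counts.most_common()          # descending by count
    top_count = ranking[0][1]
    return [idx for idx, c in ranking if c == top_count]

neighbors = [{"index": "accuracy"}, {"index": "accuracy"}, {"index": "recall"}]
result = classify_by_majority(neighbors)
```

Here "accuracy" appears twice and "recall" once, so the determination result is the single category "accuracy".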
Further, when the number of the first indexes is greater than or equal to two, a weight distribution formula is obtained:
wherein w_j denotes the classification weight between the word vector a to be classified and the j-th neighbor word vector b_j, and D(a, b_j) denotes the similarity distance between the word vector a to be classified and the j-th neighbor word vector b_j;
Traversing the first indexes according to the weight distribution formula to perform weight distribution obtains a plurality of first-index weight distribution results, and the weight distribution results within each first index are added; a weighted calculation is then carried out according to the weight distribution results. Finally, the weighted results corresponding to the first indexes are compared, the index corresponding to the maximum weighted result is screened out, and it is taken as the classification index determination result; if indexes with identical weighted results still remain, the data to be classified can be used to evaluate each of the indexes sharing that weighted result.
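The tie-break step above can be sketched as below. The patent's weight distribution formula is given only as an image and is not reproduced in the text, so inverse similarity distance, a common k-nearest-neighbor weighting, is assumed here as a stand-in; the neighbor record layout is likewise illustrative.

```python
def weighted_tie_break(candidate, neighbors, tied_indexes, distance):
    """When two or more first indexes tie on neighbor count, sum a
    per-neighbor classification weight within each tied index and keep
    the index (or indexes) with the largest total. Assumed weighting:
    w_j = 1 / D(a, b_j), clamped to avoid division by zero."""
    totals = {}
    for idx in tied_indexes:
        totals[idx] = sum(
            1.0 / max(distance(candidate, n["vector"]), 1e-9)
            for n in neighbors if n["index"] == idx
        )
    best = max(totals.values())
    # indexes sharing the maximal weighted result are all returned
    return [idx for idx, t in totals.items() if t == best]

cand = [0]
neighbors = [{"vector": [1], "index": "a"}, {"vector": [3], "index": "b"}]
dist = lambda x, y: abs(x[0] - y[0])
winner = weighted_tie_break(cand, neighbors, ["a", "b"], dist)
```

The closer neighbor (distance 1, weight 1.0) outweighs the farther one (distance 3, weight ~0.33), so index "a" wins the tie.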
S500: evaluating the model to be tuned according to the classification index determination result to generate a classification index evaluation result;
s600: and performing tuning analysis on the model to be tuned according to the classification index evaluation result.
Specifically, after the accurate classification of the model data is completed, the classification index determination result is used to call up the data of the corresponding category to evaluate each classification index; the model to be tuned is then adjusted, or optimized and retrained, for any index that does not meet the requirements, according to the evaluation result. The accurate classification of the model data provides accurate reference data for the model tuning analysis and ensures its efficiency.
In summary, the embodiment of the application has at least the following technical effects:
the application provides a data tuning method and a data tuning system based on big data, which solve the technical problem that the data of each index of an evaluation model in the prior art needs to be set in a self-defined way, so that the intelligent degree of the model tuning analysis process is low. The method and the device achieve the technical effects that vector processing and clustering are carried out on data to be classified, the data to be classified are matched with the adjacent data of the data to be classified, the category of the data to be classified is determined according to the index category to which the adjacent data belongs, the accurate classification of index evaluation data is carried out on the basis of the category, and then the model is subjected to tuning analysis on the basis of the classification result, so that the data processing intelligence of the model tuning process is improved.
Example two
Based on the same inventive concept as the data tuning method based on big data in the foregoing embodiments, as shown in fig. 4, the present application provides a data tuning system based on big data, including:
the data waiting module 41 is configured to obtain data to be classified of the optimal model to be tuned;
the vectorization processing module 42 is configured to perform vectorization processing on the data input word vector model to be classified, and generate a word vector to be classified;
the neighbor word vector matching module 43 is configured to input the word vector to be classified into a sample space for cluster analysis, and obtain a neighbor word vector set;
the index classification judging module 44 is configured to traverse the neighboring word vector set to perform index classification on the data to be classified, and obtain a classification index judging result;
the categorization index evaluation module 45 is configured to evaluate the to-be-tuned model according to the categorization index determination result, and generate a categorization index evaluation result;
the model tuning analysis module 46 is configured to perform tuning analysis on the model to be tuned according to the evaluation result of the classification index.
Further, the vectorization processing module 42 performs the steps of:
based on big data, acquiring text retrieval information and word vector labeling information, wherein the text retrieval information and the word vector labeling information are in one-to-one correspondence;
training a neural network word vector model according to the text retrieval information and the word vector labeling information to generate the word vector model;
and inputting the data to be classified into the word vector model for vectorization processing, and generating the word vector to be classified.
Further, the performing step of the neighbor word vector matching module 43 includes:
according to the sample space, a sample word vector set is obtained;
traversing the sample word vector set to perform similarity distance evaluation based on the word vector to be classified, generating a similarity distance evaluation result;
sorting the sample word vector set from small to large according to the similarity distance evaluation result to obtain a sample word vector sorting result;
and traversing the sample word vector sequencing result to screen a preset number of sample word vectors, and obtaining the neighbor word vector set.
Further, the performing step of the neighbor word vector matching module 43 includes:
based on big data, acquiring a classified data set of the model to be tuned;
inputting the classified data set into the word vector model for vectorization processing to generate a classified word vector set;
and constructing the sample space according to the classified word vector set.
Further, the performing step of the neighbor word vector matching module 43 includes:
constructing a similarity distance evaluation formula:
wherein a denotes the word vector to be classified, b denotes any sample word vector, a_i denotes the i-th dimension parameter of the word vector to be classified, b_i denotes the i-th dimension parameter of the sample word vector, n denotes the highest dimension, D(a, b) denotes the similarity distance between a and b, and α, β, and γ are user-defined correction factors;
and traversing the sample word vector set to perform similarity distance evaluation based on the word vector to be classified according to the similarity distance evaluation formula, and generating a similarity distance evaluation result.
Further, the index categorization determination module 44 performs the steps of:
traversing the neighbor word vector set to obtain neighbor word vector classification identification information;
performing index classification on the neighbor word vector set according to the neighbor word vector classification identification information to obtain a plurality of index classification results;
sorting the index classification results in descending order by the number of neighbor word vectors in each classification to generate an index classification sorting result;
and setting the first index of the index classification sorting result as the classification index judgment result.
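The grouping-and-descending-sort step above is a majority vote over the neighbor set; a minimal sketch:

```python
from collections import Counter

def classify_by_vote(neighbor_labels):
    """Group neighbors by classification identification, sort groups in
    descending order of size, and return every index tied for first place.
    A single-element result is the classification judgment outright; two or
    more tied indexes trigger the weight distribution step."""
    counts = Counter(neighbor_labels)
    ranking = counts.most_common()        # descending by count
    top_count = ranking[0][1]
    return [label for label, c in ranking if c == top_count]
```

For example, `classify_by_vote(["loss", "loss", "node"])` yields the single index `["loss"]`, while `classify_by_vote(["loss", "node"])` yields both tied indexes.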
Further, the steps performed by the index classification determination module 44 include:
when the number of first indexes is greater than or equal to two, obtaining a weight distribution formula:
wherein w_j denotes the classification weight between the word vector a to be classified and the j-th neighbor word vector b_j, and D(a, b_j) denotes the similarity distance between the word vector a to be classified and the j-th neighbor word vector b_j;
traversing the first index to perform weight distribution according to the weight distribution formula to obtain a weight distribution result;
and screening the classification index judgment result according to the weight distribution result.
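The weight distribution formula itself is not reproduced in this text; the sketch below assumes conventional inverse-distance weighting, w_j = 1 / D(a, b_j), which matches the description that each weight is derived from the similarity distance. That form is an assumption, not the patent's confirmed formula:

```python
def weighted_tiebreak(tied_indexes, neighbors, eps=1e-9):
    """Resolve a first-place tie by distance weighting.

    neighbors: list of (label, distance) pairs for the neighbor set.
    Assumes w_j = 1 / D(a, b_j); eps guards against division by zero
    when a neighbor coincides with the query vector.
    Returns the tied index with the largest summed weight.
    """
    weights = {label: 0.0 for label in tied_indexes}
    for label, dist in neighbors:
        if label in weights:
            weights[label] += 1.0 / (dist + eps)
    return max(weights, key=weights.get)

neighbors = [("loss", 0.2), ("node", 0.5), ("loss", 0.9), ("node", 0.4)]
best = weighted_tiebreak(["loss", "node"], neighbors)
```

Here "loss" and "node" each have two neighbors, but the "loss" neighbors sit closer on aggregate (weights ≈ 6.1 vs. 4.5), so the screening keeps "loss".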
Embodiment III
Based on the same inventive concept as the big-data-based data tuning method in the foregoing embodiments, the present application further provides a computer readable storage medium having a computer program stored therein, which, when executed by a processor, implements the steps of the method of the first embodiment.
Embodiment IV
As shown in fig. 5, based on the same inventive concept as the big-data-based data tuning method in the foregoing embodiments, the present application further provides a big-data-based data tuning system (computer device) 5000. The computer device 5000 includes a memory 54 and a processor 51; the memory stores computer executable instructions, and the processor executes those instructions to implement the method described above. In practical applications, the system may also include other necessary elements, including but not limited to any number of input devices 52, output devices 53, processors 51, controllers, and memories 54, and all systems that can implement the big data management method of the embodiments of the present application fall within the scope of the present application.
The memory includes, but is not limited to, random access memory (random access memory, RAM), read-only memory (read-only memory, ROM), erasable programmable read-only memory (erasable programmable read-only memory, EPROM), or portable compact disc read-only memory (compact disc read-only memory, CD-ROM), for storing the associated instructions and data.
The input means 52 are for inputting data and/or signals and the output means 53 are for outputting data and/or signals. The output device 53 and the input device 52 may be separate devices or may be an integral device.
The processor may include one or more processors, for example one or more central processing units (central processing unit, CPU); in the case of a CPU, it may be a single-core CPU or a multi-core CPU. The processor may also include one or more special purpose processors for acceleration, such as GPUs or FPGAs.
The memory is used to store program codes and data for the network device.
The processor is used to call the program code and data in the memory to perform the steps of the method embodiments described above. Reference may be made specifically to the description of the method embodiments, and no further description is given here.
In the several embodiments provided by the present application, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the division of units is merely a logical function division; in actual implementation there may be other division manners, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. The coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, systems or units, and may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable system. The computer instructions may be stored in, or transmitted across, a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (digital subscriber line, DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic medium such as a floppy disk, hard disk, magnetic tape or magnetic disk, an optical medium such as a digital versatile disc (digital versatile disc, DVD), or a semiconductor medium such as a solid state disk (SSD), or the like.
The specification and figures are merely exemplary illustrations of the present application and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the application. It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the scope of the application. Thus, the present application is intended to include such modifications and alterations insofar as they come within the scope of the application or the equivalents thereof.

Claims (8)

1. A data tuning method based on big data, comprising:
obtaining data to be classified of the model to be tuned, wherein the data to be classified refers to parameters usable for evaluating various indexes of the model to be tuned, including: output data, output identification data, model node parameters, model loss parameters and model training process parameters;
inputting the data to be classified into a word vector model for vectorization processing to generate a word vector to be classified;
inputting the word vectors to be classified into a sample space for cluster analysis to obtain a neighbor word vector set;
performing index classification on the data to be classified by traversing the neighbor word vector set to obtain a classification index judgment result;
evaluating the model to be tuned according to the classification index judgment result to generate a classification index evaluation result;
performing tuning analysis on the model to be tuned according to the classification index evaluation result;
the traversing the neighbor word vector set performs index classification on the data to be classified to obtain a classification index judgment result, and the method comprises the following steps:
traversing the neighbor word vector set to obtain neighbor word vector classification identification information;
performing index classification on the neighbor word vector set according to the neighbor word vector classification identification information to obtain a plurality of index classification results;
sorting the index classification results in descending order by the number of neighbor word vectors in each classification to generate an index classification sorting result;
setting the first index of the index classification sorting result as the classification index judgment result;
the step of setting the first index of the index classification and sorting result as the classification index judgment result includes:
when the number of first indexes is greater than or equal to two, obtaining a weight distribution formula:
wherein w_j denotes the classification weight between the word vector a to be classified and the j-th neighbor word vector b_j, and D(a, b_j) denotes the similarity distance between the word vector a to be classified and the j-th neighbor word vector b_j;
traversing the first index to perform weight distribution according to the weight distribution formula to obtain a weight distribution result;
and screening the classification index judgment result according to the weight distribution result.
2. The method of claim 1, wherein the vectorizing the input word vector model of the data to be classified to generate the word vector to be classified comprises:
based on big data, acquiring text retrieval information and word vector labeling information, wherein the text retrieval information and the word vector labeling information are in one-to-one correspondence;
training a neural network word vector model according to the text retrieval information and the word vector labeling information to generate the word vector model;
and inputting the data to be classified into the word vector model for vectorization processing, and generating the word vector to be classified.
3. The method of claim 1, wherein the inputting the word vector to be classified into a sample space for cluster analysis to obtain a set of neighboring word vectors comprises:
according to the sample space, a sample word vector set is obtained;
traversing the sample word vector set to evaluate the similar distance based on the word vector to be classified, and generating a similar distance evaluation result;
sorting the sample word vector set in ascending order of similarity distance according to the similarity distance evaluation result to obtain a sample word vector sorting result;
and traversing the sample word vector sorting result to select a preset number of sample word vectors, obtaining the neighbor word vector set.
4. The method of claim 3, wherein the obtaining a set of sample word vectors from the sample space comprises:
acquiring, based on big data, a classified data set of the model to be tuned;
inputting the classified data set into the word vector model for vectorization processing to generate a classified word vector set;
and constructing the sample space according to the classified word vector set.
5. The method of claim 3, wherein traversing the set of sample word vectors for similarity distance assessment based on the word vectors to be classified, generating a similarity distance assessment result comprises:
constructing a similarity distance evaluation formula:
wherein a denotes the word vector to be classified, b denotes any sample word vector, a_i denotes the i-th dimension parameter of the word vector to be classified, b_i denotes the i-th dimension parameter of the sample word vector, n denotes the highest dimension, D(a, b) denotes the similarity distance between a and b, and α, β and γ are user-defined correction factors;
and traversing the sample word vector set to perform similarity distance evaluation based on the word vector to be classified according to the similarity distance evaluation formula, and generating a similarity distance evaluation result.
6. A big data based data tuning system, comprising:
the data-to-be-classified obtaining module is used for obtaining data to be classified of the model to be tuned, wherein the data to be classified refers to parameters usable for evaluating various indexes of the model to be tuned, including: output data, output identification data, model node parameters, model loss parameters and model training process parameters;
the vectorization processing module is used for inputting the data to be classified into a word vector model for vectorization processing to generate a word vector to be classified;
the neighbor word vector matching module is used for inputting the word vector to be classified into a sample space for cluster analysis to obtain a neighbor word vector set;
the index classification judging module is used for traversing the neighbor word vector set to conduct index classification on the data to be classified, and obtaining a classification index judging result;
the classification index evaluation module is used for evaluating the model to be tuned according to the classification index judgment result to generate a classification index evaluation result;
the model tuning analysis module is used for performing tuning analysis on the model to be tuned according to the classification index evaluation result;
the traversing the neighbor word vector set performs index classification on the data to be classified to obtain a classification index judgment result, and the method comprises the following steps:
traversing the neighbor word vector set to obtain neighbor word vector classification identification information;
performing index classification on the neighbor word vector set according to the neighbor word vector classification identification information to obtain a plurality of index classification results;
sorting the index classification results in descending order by the number of neighbor word vectors in each classification to generate an index classification sorting result;
setting the first index of the index classification sorting result as the classification index judgment result;
the step of setting the first index of the index classification and sorting result as the classification index judgment result includes:
when the number of first indexes is greater than or equal to two, obtaining a weight distribution formula:
wherein w_j denotes the classification weight between the word vector a to be classified and the j-th neighbor word vector b_j, and D(a, b_j) denotes the similarity distance between the word vector a to be classified and the j-th neighbor word vector b_j;
traversing the first index to perform weight distribution according to the weight distribution formula to obtain a weight distribution result;
and screening the classification index judgment result according to the weight distribution result.
7. A big data based data tuning system, characterized in that the data tuning system comprises a memory and a processor, the memory having a computer program stored therein, which, when executed by the processor, implements the steps of the method according to any one of claims 1-5.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1-5.
CN202310035223.0A 2023-01-10 2023-01-10 Data optimization method and system based on big data Active CN116010602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310035223.0A CN116010602B (en) 2023-01-10 2023-01-10 Data optimization method and system based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310035223.0A CN116010602B (en) 2023-01-10 2023-01-10 Data optimization method and system based on big data

Publications (2)

Publication Number Publication Date
CN116010602A CN116010602A (en) 2023-04-25
CN116010602B true CN116010602B (en) 2023-09-29

Family

ID=86024654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310035223.0A Active CN116010602B (en) 2023-01-10 2023-01-10 Data optimization method and system based on big data

Country Status (1)

Country Link
CN (1) CN116010602B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118194004A (en) * 2024-04-07 2024-06-14 广州国曜科技有限公司 Coupling index dynamic tuning method and system based on electromagnetic continuous emission

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807097A (en) * 2018-08-03 2020-02-18 北京京东尚科信息技术有限公司 Method and device for analyzing data
CN113705247A (en) * 2021-10-27 2021-11-26 腾讯科技(深圳)有限公司 Theme model effect evaluation method, device, equipment, storage medium and product
CN114416979A (en) * 2021-12-30 2022-04-29 上海聚均科技有限公司 Text query method, text query equipment and storage medium
CN114757307A (en) * 2022-06-14 2022-07-15 中国电力科学研究院有限公司 Artificial intelligence automatic training method, system, device and storage medium
CN115146865A (en) * 2022-07-22 2022-10-04 中国平安财产保险股份有限公司 Task optimization method based on artificial intelligence and related equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220092452A1 (en) * 2020-09-18 2022-03-24 Tibco Software Inc. Automated machine learning tool for explaining the effects of complex text on predictive results

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807097A (en) * 2018-08-03 2020-02-18 北京京东尚科信息技术有限公司 Method and device for analyzing data
CN113705247A (en) * 2021-10-27 2021-11-26 腾讯科技(深圳)有限公司 Theme model effect evaluation method, device, equipment, storage medium and product
CN114416979A (en) * 2021-12-30 2022-04-29 上海聚均科技有限公司 Text query method, text query equipment and storage medium
CN114757307A (en) * 2022-06-14 2022-07-15 中国电力科学研究院有限公司 Artificial intelligence automatic training method, system, device and storage medium
CN115146865A (en) * 2022-07-22 2022-10-04 中国平安财产保险股份有限公司 Task optimization method based on artificial intelligence and related equipment

Also Published As

Publication number Publication date
CN116010602A (en) 2023-04-25

Similar Documents

Publication Publication Date Title
Fong et al. Accelerated PSO swarm search feature selection for data stream mining big data
CN110636445B (en) WIFI-based indoor positioning method, device, equipment and medium
CN113918753A (en) Image retrieval method based on artificial intelligence and related equipment
CN116010602B (en) Data optimization method and system based on big data
CN112767106B (en) Automatic auditing method, system, computer readable storage medium and auditing equipment
Sana et al. A novel customer churn prediction model for the telecommunication industry using data transformation methods and feature selection
CN111027636A (en) Unsupervised feature selection method and system based on multi-label learning
CN111209469A (en) Personalized recommendation method and device, computer equipment and storage medium
CN113159213A (en) Service distribution method, device and equipment
CN113408301A (en) Sample processing method, device, equipment and medium
CN114722198A (en) Method, system and related device for determining product classification code
CN118250169A (en) Network asset class recommendation method, device and storage medium
CN117668536A (en) Software defect report priority prediction method based on hypergraph attention network
CN116883740A (en) Similar picture identification method, device, electronic equipment and storage medium
CN112711678A (en) Data analysis method, device, equipment and storage medium
CN116089886A (en) Information processing method, device, equipment and storage medium
CN110765393A (en) Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression
CN115758462A (en) Method, device, processor and computer readable storage medium for realizing sensitive data identification in trusted environment
TW202312042A (en) Automatic optimization method and automatic optimization system of diagnosis model
CN111737465A (en) Method and device for realizing multi-level and multi-class Chinese text classification
CN112861974A (en) Text classification method and device, electronic equipment and storage medium
CN111400413A (en) Method and system for determining category of knowledge points in knowledge base
KR20210050362A (en) Ensemble pruning method, ensemble model generation method for identifying programmable nucleases and apparatus for the same
CN113743431B (en) Data selection method and device
TWI759785B (en) System and method for recommending audit criteria based on integration of qualitative data and quantitative data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230525

Address after: No. 50 Jinbi Road, Xishan District, Kunming City, Yunnan Province, 650034

Applicant after: Kunming Liangzhaiyao Network Technology Co.,Ltd.

Address before: No. 77 Zhonglin Road, Lixia District, Jinan City, Shandong Province, 250000

Applicant before: Kong Xiangshan

TA01 Transfer of patent application right

Effective date of registration: 20230830

Address after: No. 546, Luoyu Road, Hongshan District, Wuhan, Hubei Province, 430000

Applicant after: HUBEI CENTRAL CHINA TECHNOLOGY DEVELOPMENT OF ELECTRIC POWER Co.,Ltd.

Address before: No. 50 Jinbi Road, Xishan District, Kunming City, Yunnan Province, 650034

Applicant before: Kunming Liangzhaiyao Network Technology Co.,Ltd.

GR01 Patent grant