Detailed Description
The application provides a big-data-based data tuning method and system, which solve the technical problem in the prior art that the data for each index of an evaluation model must be set in a user-defined way, so that the degree of intelligence of the model tuning analysis process is low. The method and system achieve the following technical effects: the data to be classified are vectorized and clustered, matched against their neighboring data, and assigned a category according to the index category to which the neighboring data belong; index evaluation data are then accurately classified on this basis, and the model is subjected to tuning analysis on the basis of the classification result, thereby improving the intelligence of data processing in the model tuning process.
Example 1
As shown in fig. 1, the present application provides a data tuning method based on big data, which is characterized by comprising the following steps:
S100: obtaining data to be classified of a model to be tuned;
Specifically, the model to be tuned refers to a model that has already been trained, and includes intelligent models such as a neural network model, a decision tree model, a support vector machine, and a random forest model. The data to be classified refers to parameters that can be used to evaluate the various indexes of the model to be tuned, including but not limited to: output data, input data, output identification data, model node parameters, model loss parameters, model training process parameters, and other parameter types; such data can be used to evaluate indexes such as the precision, degree of visualization, accuracy, error rate, precision rate, and recall rate of the model.
S200: inputting the data to be classified into a word vector model for vectorization processing to generate a word vector to be classified;
Further, as shown in fig. 2, the step S200 of inputting the data to be classified into the word vector model for vectorization to generate the word vector to be classified includes the steps of:
S210: based on big data, acquiring text retrieval information and word vector labeling information, wherein the text retrieval information and the word vector labeling information are in one-to-one correspondence;
S220: training a neural network word vector model according to the text retrieval information and the word vector labeling information to generate the word vector model;
S230: inputting the data to be classified into the word vector model for vectorization processing, and generating the word vector to be classified.
In particular, different data to be classified correspond to different model tuning evaluation criteria. For example, the accuracy and error rate of the model can be determined from the deviation between the output identification data and the output data, while the degree of visualization of the model can be evaluated from the parameters of the model training process.
That is, the data required by different index evaluations vary, so the required data must be matched before different indexes are evaluated, and the data to be classified must be classified according to the index dimension to be evaluated. The classification method adopted here is the k-nearest-neighbor algorithm, which requires a distance evaluation between the data to be classified and other data before classification; the data to be classified must therefore first be quantized. The preferred process is as follows:
Based on big data, text retrieval information and word vector labeling information are acquired, wherein the text retrieval information includes the attribute information, the data characteristic description information, and the various parameters of each item of data of the model to be tuned. The word vector labeling information refers to the result of vectorized labeling of this attribute information, characteristic description information, and parameters, and the labeling is preferably performed as follows: based on big data, the attribute information, data characteristics, and parameters of the model are statistically characterized by characters or character strings, ensuring a unique correspondence between each character or character string and the statistical attribute information, data characteristic, or parameter. Thus, any piece of text retrieval information has corresponding word vector labeling information.
Further, with the text retrieval information as input data and the word vector labeling information as output identification information, the neural network word vector model is trained to generate the word vector model. The neural network word vector model is an intelligent word-vector-labeling model of the neural-network-language-model type and is widely applied; the word vector model obtained by training on the text retrieval information and the word vector labeling information can rapidly vectorize the data to be classified, yielding a word vector to be classified that represents the data to be classified by a group of characters or character strings, which facilitates the distance evaluation in the subsequent steps.
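The one-to-one mapping from text retrieval information to word vector labels described above can be sketched as follows. The trained neural word vector model is stood in for by a simple lookup table; all names and vectors are hypothetical, not taken from the patent.

```python
# Sketch of the vectorization step (S210-S230): the trained word vector model
# is approximated by a lookup built from (text retrieval info, vector label)
# pairs; all identifiers and values below are illustrative assumptions.

def build_word_vector_model(retrieval_info, vector_labels):
    """Pair each piece of text retrieval information with its vector label (one-to-one)."""
    return dict(zip(retrieval_info, vector_labels))

def vectorize(model, data_to_classify):
    """Map each token of the data to be classified to its word vector;
    unknown tokens fall back to a zero vector of the same dimension."""
    dim = len(next(iter(model.values())))
    return [model.get(token, [0.0] * dim) for token in data_to_classify]

retrieval_info = ["output_data", "loss_param", "node_param"]  # hypothetical corpus
vector_labels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

model = build_word_vector_model(retrieval_info, vector_labels)
word_vectors = vectorize(model, ["loss_param", "unknown_attr"])
print(word_vectors)  # [[0.0, 1.0], [0.0, 0.0]]
```

In practice the lookup would be replaced by the trained neural network word vector model; the sketch only shows the one-to-one correspondence between retrieval information and vector labels.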
S300: inputting the word vectors to be classified into a sample space for cluster analysis to obtain a neighbor word vector set;
further, the step S300 includes the steps of:
S310: according to the sample space, a sample word vector set is obtained;
further, the step S310 includes the steps of:
S311: based on big data, acquiring a classified data set of the model to be tuned;
S312: inputting the classified data set into the word vector model for vectorization processing to generate a classified word vector set;
S313: constructing the sample space according to the classified word vector set.
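The sample space construction in S311–S313 can be sketched as below. The word vector model is again stood in for by a lookup table, and the data set is a hypothetical example, not from the patent.

```python
# Hypothetical sketch of constructing the sample space (S311-S313): each
# already-classified data item is vectorized and stored together with its
# index category. All names and values are illustrative assumptions.

def build_sample_space(classified_dataset, word_vector_model):
    """Vectorize each classified data item and keep its index category,
    yielding the set of classified word vectors (the sample space)."""
    return [
        {"vector": word_vector_model[item["data"]], "index": item["index"]}
        for item in classified_dataset
    ]

word_vector_model = {"output_data": [1.0, 0.0], "loss_param": [0.0, 1.0]}
classified_dataset = [
    {"data": "output_data", "index": "accuracy"},
    {"data": "loss_param", "index": "visualization"},
]
sample_space = build_sample_space(classified_dataset, word_vector_model)
print(sample_space[0])  # {'vector': [1.0, 0.0], 'index': 'accuracy'}
```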
S320: traversing the sample word vector set to evaluate the similar distance based on the word vector to be classified, and generating a similar distance evaluation result;
further, the step S320 includes the steps of:
S321: constructing a similarity distance evaluation formula:
wherein a represents the word vector to be classified, b represents any sample word vector, a_i represents the i-th dimension parameter of the word vector to be classified, b_i represents the i-th dimension parameter of the sample word vector, n represents the highest dimension, D(a, b) represents the similarity distance between a and b, and α, β, and γ are user-defined correction factors;
S322: traversing the sample word vector set to perform similarity distance evaluation based on the word vector to be classified according to the similarity distance evaluation formula, and generating the similarity distance evaluation result.
S330: sorting the sample word vector set from small to large according to the similarity distance evaluation result to obtain a sample word vector sorting result;
S340: traversing the sample word vector sorting result to screen out a preset number of sample word vectors, and obtaining the neighbor word vector set.
Specifically, the sample space refers to a set of word vectors that have already been classified; the neighbor word vector set refers to a specific number of classified word vectors in the sample space, screened according to their distance from the word vector to be classified, which are used to evaluate the category of the word vector to be classified.
The preferred detailed procedure is as follows:
First, the sample word vector set is acquired, i.e., the word vector set stored in the sample space is extracted; the word vectors in the sample space are stored in groups, and any group of word vectors represents one item of model data.
Further, the similarity distance evaluation formula is constructed, wherein a represents the word vector to be classified, b represents any sample word vector, a_i and b_i represent the i-th dimension parameters of the word vector to be classified and the sample word vector respectively, n represents the highest dimension, D(a, b) represents the similarity distance between a and b, and α, β, and γ are user-defined correction factors.
Any sample word vector or word vector to be classified is an arrangement of a plurality of word vectors, which, in order from head to tail, are the first dimension parameter, …, the i-th dimension parameter, and so on. When the word vector to be classified is compared with a sample word vector, identical word vectors in the two are aligned in the same dimension and different word vectors occupy different dimensions; if a dimension of the word vector to be classified has no word vector, the corresponding position in the same dimension is filled with 0. After this alignment, the word vector to be classified and any sample word vector can be subjected to distance evaluation with the similarity distance evaluation formula.
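The zero-filling alignment described above can be sketched in a few lines; this is a minimal illustration of the padding rule, not the patent's implementation.

```python
def align(a, b):
    """Zero-fill so that both vectors share the same highest dimension n:
    a missing dimension in either vector is filled with 0, as described above."""
    n = max(len(a), len(b))
    return a + [0.0] * (n - len(a)), b + [0.0] * (n - len(b))

a, b = align([0.3, 0.7], [0.1, 0.9, 0.5])
print(a, b)  # [0.3, 0.7, 0.0] [0.1, 0.9, 0.5]
```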
Further, according to the similarity distance evaluation formula, similarity distance evaluation is performed by traversing the sample word vector set with the word vector to be classified, and a similarity distance evaluation result characterizing the similarity distance is output. In this formula, the larger the similarity distance, the lower the similarity, and hence the larger the evaluation result; conversely, the smaller the evaluation result, the higher the similarity.
Further, the sample word vector set is sorted from small to large according to the similarity distance evaluation result, obtaining a sample word vector sorting result that characterizes the order of the sample word vector set. The sorting result is then traversed from head to tail to screen out a preset number of sample word vectors, which are set as the neighbor word vector set; the preset number is the number of neighbor samples used to judge the data to be classified and can be set in a user-defined way, ensuring that the data used for classifying the evaluation indexes are relatively similar to the data to be classified. In this way, intelligent classification of the model data is realized, and accurate reference data are provided for the evaluation of each index.
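The distance evaluation, sorting, and screening of steps S320–S340 can be sketched as follows. Since the patent's similarity distance formula (with correction factors α, β, γ) is given only in the figure, a plain Euclidean distance is assumed here as a stand-in; the sample data are hypothetical.

```python
import math

def similarity_distance(a, b):
    # Stand-in for the patent's D(a, b); the exact formula with the
    # correction factors alpha, beta, gamma is not reproduced here, so
    # plain Euclidean distance over the aligned dimensions is assumed.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, samples, k):
    """S320-S340: evaluate distances, sort from small to large, and
    screen out the first k sample word vectors as the neighbor set."""
    ranked = sorted(samples, key=lambda s: similarity_distance(query, s["vector"]))
    return ranked[:k]

samples = [
    {"vector": [0.0, 1.0], "index": "accuracy"},
    {"vector": [1.0, 1.0], "index": "error_rate"},
    {"vector": [0.1, 0.9], "index": "accuracy"},
]
neighbors = nearest_neighbors([0.0, 1.0], samples, 2)
print([s["index"] for s in neighbors])  # ['accuracy', 'accuracy']
```

A smaller evaluation result means higher similarity, so sorting ascending and taking the head of the list yields the most similar samples, matching the prose above.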
S400: performing index classification on the data to be classified by traversing the neighbor word vector set to obtain a classification index judgment result;
further, as shown in fig. 3, the step S400 includes the steps of:
S410: traversing the neighbor word vector set to obtain neighbor word vector classification identification information;
S420: performing index classification on the neighbor word vector set according to the neighbor word vector classification identification information to obtain a plurality of index classification results;
S430: sorting the plurality of index classification results in descending order according to the number of neighbor word vectors in each classification, and generating an index classification sorting result;
S440: setting the first index of the index classification sorting result as the classification index judgment result.
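Steps S410–S440 amount to a majority vote over the index categories of the neighbors; a minimal sketch with hypothetical category labels follows. When several categories tie for first place, all of them are returned, to be resolved by the weighting of S441–S443.

```python
from collections import Counter

def classify_by_neighbors(neighbor_labels):
    """S410-S440 sketch: count the index category of each neighbor and take
    the most frequent category as the classification index judgment result.
    All categories tied for first place are returned; ties are resolved by
    the distance-weighted step S441-S443."""
    counts = Counter(neighbor_labels).most_common()
    top_count = counts[0][1]
    return [label for label, c in counts if c == top_count]

print(classify_by_neighbors(["accuracy", "accuracy", "error_rate"]))  # ['accuracy']
```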
Further, the step S440 includes the steps of:
S441: when the number of first indexes is greater than or equal to two, obtaining a weight distribution formula:
wherein w_j characterizes the classification weight between the word vector a to be classified and the j-th neighbor word vector b_j, and D(a, b_j) characterizes the similarity distance between the word vector a to be classified and the j-th neighbor word vector b_j;
S442: traversing the first indexes to perform weight distribution according to the weight distribution formula, and obtaining a weight distribution result;
S443: screening the classification index judgment result according to the weight distribution result.
Specifically, the classification index judgment result refers to the result obtained after the data to be classified are subjected to k-nearest-neighbor classification according to the neighbor word vector set.
The preferred categorization procedure is as follows:
First, according to the neighbor word vector set, the neighbor word vector classification identification information recording the index category to which each neighbor word vector belongs is loaded. The stored content of this identification information is then extracted, and index classification is performed on the neighbor word vector set to obtain a plurality of index classification results; these refer to the result of clustering the neighbor word vector set by the index category to which the vectors belong, and any one index classification result contains one or more neighbor word vectors. The plurality of index classification results are then sorted in descending order by the number of neighbor word vectors they contain, generating the index classification sorting result, and the first index of this sorting result, i.e., the index classification result containing the largest number of neighbor word vectors, is set as the classification index judgment result. In this way, the index classification result of the data to be classified is determined, providing accurate reference data for the subsequent index evaluation of the model to be tuned.
Further, when the number of first indexes is greater than or equal to two, the weight distribution formula is obtained, wherein w_j characterizes the classification weight between the word vector a to be classified and the j-th neighbor word vector b_j, and D(a, b_j) characterizes the similarity distance between the word vector a to be classified and the j-th neighbor word vector b_j.
The first indexes are traversed according to the weight distribution formula to perform weight distribution, obtaining a weight distribution result for each of the first indexes, and the weights within each index are summed. The summed weight distribution results of the tied first indexes are then compared, the index corresponding to the maximum value is screened out, and that index is taken as the classification index judgment result; if some indexes still have equal weighted results, the data to be classified can be used to evaluate all of the indexes with the same weighted result.
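The tie-breaking of S441–S443 can be sketched as below. The patent's weight distribution formula appears only in the figure, so the common distance-weighted k-NN choice w_j = 1 / D(a, b_j) is assumed here; distances are assumed to be strictly positive.

```python
def tie_break(candidates, neighbors):
    """S441-S443 sketch: each tied first index accumulates the weights of
    its neighbors, assuming w_j = 1 / D(a, b_j) (a common distance-weighted
    k-NN rule, not necessarily the patent's exact formula); the index with
    the largest weighted sum is screened out. Distances must be > 0."""
    totals = {}
    for label, distance in neighbors:
        if label in candidates:
            totals[label] = totals.get(label, 0.0) + 1.0 / distance
    best = max(totals.values())
    # Return every index reaching the maximum; with a single winner the
    # list has one element, otherwise all equally weighted indexes remain.
    return [label for label, w in totals.items() if w == best]

neighbors = [("accuracy", 0.5), ("error_rate", 0.25), ("accuracy", 1.0)]
print(tie_break(["accuracy", "error_rate"], neighbors))  # ['error_rate']
```

Here "accuracy" accumulates 1/0.5 + 1/1.0 = 3.0 while "error_rate" accumulates 1/0.25 = 4.0, so the closer neighbor's category wins despite having fewer votes.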
S500: evaluating the model to be tuned according to the classification index judgment result, and generating a classification index evaluation result;
S600: performing tuning analysis on the model to be tuned according to the classification index evaluation result.
Specifically, after the accurate classification of the model data is completed, the data of the corresponding category are called according to the classification index judgment result to evaluate the classification index, and the model to be tuned is then adjusted, or optimized and retrained, for any index that does not meet the requirements according to the evaluation result. The accurate classification of the model data provides accurate reference data for model tuning analysis and guarantees the efficiency of the model tuning analysis.
In summary, the embodiment of the application has at least the following technical effects:
the application provides a data tuning method and a data tuning system based on big data, which solve the technical problem that the data of each index of an evaluation model in the prior art needs to be set in a self-defined way, so that the intelligent degree of the model tuning analysis process is low. The method and the device achieve the technical effects that vector processing and clustering are carried out on data to be classified, the data to be classified are matched with the adjacent data of the data to be classified, the category of the data to be classified is determined according to the index category to which the adjacent data belongs, the accurate classification of index evaluation data is carried out on the basis of the category, and then the model is subjected to tuning analysis on the basis of the classification result, so that the data processing intelligence of the model tuning process is improved.
Example 2
Based on the same inventive concept as the data tuning method based on big data in the foregoing embodiments, as shown in fig. 4, the present application provides a data tuning system based on big data, including:
The data obtaining module 41 is configured to obtain the data to be classified of the model to be tuned;
The vectorization processing module 42 is configured to input the data to be classified into the word vector model for vectorization processing, and generate the word vector to be classified;
the neighbor word vector matching module 43 is configured to input the word vector to be classified into a sample space for cluster analysis, and obtain a neighbor word vector set;
the index classification judging module 44 is configured to traverse the neighboring word vector set to perform index classification on the data to be classified, and obtain a classification index judging result;
The classification index evaluation module 45 is configured to evaluate the model to be tuned according to the classification index judgment result, and generate a classification index evaluation result;
The model tuning analysis module 46 is configured to perform tuning analysis on the model to be tuned according to the classification index evaluation result.
Further, the vectorization processing module 42 performs the steps of:
based on big data, acquiring text retrieval information and word vector labeling information, wherein the text retrieval information and the word vector labeling information are in one-to-one correspondence;
training a neural network word vector model according to the text retrieval information and the word vector labeling information to generate the word vector model;
and inputting the data to be classified into the word vector model for vectorization processing, and generating the word vector to be classified.
Further, the performing step of the neighbor word vector matching module 43 includes:
according to the sample space, a sample word vector set is obtained;
traversing the sample word vector set to perform similarity distance evaluation based on the word vector to be classified, and generating a similarity distance evaluation result;
sorting the sample word vector set from small to large according to the similarity distance evaluation result to obtain a sample word vector sorting result;
and traversing the sample word vector sorting result to screen out a preset number of sample word vectors, and obtaining the neighbor word vector set.
Further, the performing step of the neighbor word vector matching module 43 includes:
based on big data, acquiring a classified data set of the model to be tuned;
inputting the classified data set into the word vector model for vectorization processing to generate a classified word vector set;
and constructing the sample space according to the classified word vector set.
Further, the performing step of the neighbor word vector matching module 43 includes:
constructing a similarity distance evaluation formula:
wherein a represents the word vector to be classified, b represents any sample word vector, a_i represents the i-th dimension parameter of the word vector to be classified, b_i represents the i-th dimension parameter of the sample word vector, n represents the highest dimension, D(a, b) represents the similarity distance between a and b, and α, β, and γ are user-defined correction factors;
and traversing the sample word vector set to perform similarity distance evaluation based on the word vector to be classified according to the similarity distance evaluation formula, and generating a similarity distance evaluation result.
Further, the index categorization determination module 44 performs the steps of:
traversing the neighbor word vector set to obtain neighbor word vector classification identification information;
performing index classification on the neighbor word vector set according to the neighbor word vector classification identification information to obtain a plurality of index classification results;
the index classification results are sorted in a descending order according to the neighbor word vector classification number, and index classification sorting results are generated;
and setting the first index of the index classification and sorting result as the classification index judgment result.
Further, the index categorization determination module 44 performs the steps of:
when the number of the first indexes is greater than or equal to two, a weight distribution formula is obtained:
wherein w_j characterizes the classification weight between the word vector a to be classified and the j-th neighbor word vector b_j, and D(a, b_j) characterizes the similarity distance between the word vector a to be classified and the j-th neighbor word vector b_j;
traversing the first index to perform weight distribution according to the weight distribution formula to obtain a weight distribution result;
and screening the classification index judgment result according to the weight distribution result.
Example 3
Based on the same inventive concept as the data tuning method based on big data in the foregoing embodiments, the present application further provides a computer-readable storage medium in which a computer program is stored; when the program is executed by a processor, the steps of the method of Example 1 are implemented.
Example 4
As shown in fig. 5, based on the same inventive concept as the data tuning method based on big data in the foregoing embodiments, the present application further provides a big-data-based data tuning system (computer device) 5000. The computer device 5000 includes a memory 54 and a processor 51; the memory stores computer-executable instructions, and the processor executes those instructions to implement the method described above. In practical applications, the system may also include other necessary elements, including but not limited to any number of input devices 52, output devices 53, processors 51, controllers, and memories 54, and all systems that can implement the big data management method of the embodiments of the present application are within the scope of the present application.
The memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), and is used for the associated instructions and data.
The input means 52 are for inputting data and/or signals and the output means 53 are for outputting data and/or signals. The output device 53 and the input device 52 may be separate devices or may be an integral device.
The processor may include one or more processors, for example one or more central processing units (CPUs); in the case of a CPU, it may be a single-core or multi-core CPU. The processor may also include one or more special-purpose processors, such as GPUs or FPGAs, for accelerated processing.
The memory is used to store program codes and data for the network device.
The processor is used to call the program code and data in the memory to perform the steps of the method embodiments described above. Reference may be made specifically to the description of the method embodiments, and no further description is given here.
In the several embodiments provided by the present application, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the division of units is merely a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. The coupling, direct coupling, or communication connection shown or discussed between components may be an indirect coupling or communication connection through some interface, system, or unit, and may be electrical, mechanical, or in another form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions; when loaded and executed on a computer, these produce, in whole or in part, the flows or functions according to the embodiments of the application. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable system. The computer instructions may be stored in, or transmitted via, a computer-readable storage medium, and may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be read-only memory (ROM), random access memory (RAM), a magnetic medium such as a floppy disk, hard disk, magnetic tape, or magnetic disk, an optical medium such as a digital versatile disc (DVD), or a semiconductor medium such as a solid-state disk (SSD).
The specification and figures are merely exemplary illustrations of the present application and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the application. It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the scope of the application. Thus, the present application is intended to include such modifications and alterations insofar as they come within the scope of the application or the equivalents thereof.