CN117574243A - Data analysis method, device and system - Google Patents


Publication number
CN117574243A
Authority
CN
China
Prior art keywords
data
bin
word
classified
word frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410051642.8A
Other languages
Chinese (zh)
Other versions
CN117574243B (en)
Inventor
闫荣新
孟凡华
谷莉方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Wangxin Digital Technology Co ltd
Original Assignee
Hebei Wangxin Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Wangxin Digital Technology Co ltd filed Critical Hebei Wangxin Digital Technology Co ltd
Priority to CN202410051642.8A
Publication of CN117574243A
Application granted
Publication of CN117574243B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data analysis method, device and system, relating to the field of data processing. The method comprises: obtaining the categories into which data are to be classified; obtaining a data bin for each category; obtaining reference data for each data bin; obtaining unclassified data; deriving the data characteristics of each data bin from the data characteristics of the classified data in the bin and the data characteristics of the corresponding reference data; and determining the category of each item of unclassified data from its data characteristics and the data characteristics of each data bin. By matching unclassified data against the reference data of each category, the invention achieves accurate classification of data.

Description

Data analysis method, device and system
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data analysis method, apparatus, and system.
Background
With the rapid development of internet technology, the data volume has shown an explosive growth. The data contains great value, and how to accurately and rapidly extract useful information from massive, heterogeneous and dynamic data is a great challenge in the field of data mining.
Data classification analysis is an important aspect of data mining: by sorting data into predefined categories, it helps people understand the essential characteristics and inherent relationships of the data. However, computers lack human logical reasoning ability, which makes accurate analysis and classification of data difficult.
Disclosure of Invention
The invention aims to provide a data analysis method, device and system that achieve accurate classification of data by matching unclassified data against the reference data of each category.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention provides a data analysis method, which comprises the following steps,
acquiring the category for classifying the data;
acquiring a data bin of each type;
acquiring reference data of each data bin;
acquiring unclassified data;
obtaining the data characteristics of the data bin according to the data characteristics of the classified data in the data bin and the data characteristics of the corresponding reference data;
and determining the category of each item of data to be classified according to the data characteristics of the unclassified data and the data characteristics of each data bin.
The invention also discloses a data analysis method, which comprises the following steps,
establishing a data bin for each type of data;
receiving classified data and categories thereof;
each item of classified data is stored to the corresponding data bin by category.
The invention also discloses a data analysis device, which comprises,
the data bin reading interface is used for acquiring the category of classifying the data;
acquiring a data bin of each type;
acquiring a plurality of reference data of each data bin;
analyzing a service input interface to obtain unclassified data;
the operation unit is used for obtaining the data characteristics of the data bin according to the data characteristics of the classified data in the data bin and the data characteristics of the corresponding reference data;
and for determining the category of each item of data to be classified according to the data characteristics of the unclassified data and the data characteristics of each data bin;
and the analysis service output interface is used for outputting the category of each item of classified data.
The invention also discloses a data analysis system, which comprises,
a data analysis device for outputting the category of each item of classified data; and
a storage unit for creating a data bin for each kind of data;
receiving classified data and categories thereof;
each item of classified data is stored to the corresponding data bin by category.
According to the method, the data characteristics of each data bin are obtained by analyzing the reference data and the classified data of each data bin, and the data characteristics of the unclassified data are then compared with the data characteristics of each data bin, so that accurate classification of the data is realized.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram illustrating a functional unit and an information flow of a data analysis system according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating steps of a data analysis device according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating the operation of the storage unit according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating the step S5 according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating the step S52 according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a step S528 in an embodiment of the present invention;
FIG. 7 is a flowchart illustrating the step S55 according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating the step S553 according to an embodiment of the present invention;
FIG. 9 is a flowchart illustrating the step S6 according to an embodiment of the present invention;
in the drawings, the list of components represented by the various numbers is as follows:
the system comprises a 1-data bin reading interface, a 2-analysis service input interface, a 3-operation unit, a 4-analysis service output interface and a 5-storage unit.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like herein are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
Computer data analysis refers to the use of computers and related technologies for data processing, analysis, and mining. The process typically involves using computer software and tools to collect, clean, transform, and analyze large amounts of data. Because computer data are large in number and scale, manual item-by-item auditing is impractical. At the same time, limited by current computer recognition technology, fully automatic classification is prone to misjudgment. To improve the accuracy of data classification, the invention provides the following scheme.
Referring to fig. 1 to 3, the present invention provides a data analysis system that offers data analysis services to users, enabling accurate determination of data categories. Functionally, the system is divided into a data bin reading interface 1, an analysis service input interface 2, an operation unit 3, an analysis service output interface 4 and a storage unit 5. The data bin reading interface 1, the analysis service input interface 2, the operation unit 3 and the analysis service output interface 4 work cooperatively to output the category of each item of classified data, and the storage unit 5 is used to store the classified data.
In a specific data analysis process, step S1 is first performed by the data bin reading interface 1 to obtain the categories into which data are classified. Step S2 may then be performed to obtain a data bin for each category, and step S3 to obtain several reference data for each data bin. The reference data here may be exemplary samples selected by the operator so that the system has samples to compare against.
The analysis service input interface 2 then performs step S4 to obtain unclassified data. The operation unit 3 executes step S5 to derive the data characteristics of each data bin from the data characteristics of the classified data in the bin and the data characteristics of the corresponding reference data. Step S6 may then be performed to determine the category of each item of data to be classified based on the data characteristics of the unclassified data and the data characteristics of each data bin. Finally, the analysis service output interface 4 outputs the category of each item of classified data. Of course, in actual operation, the classified data itself may also be output synchronously by the analysis service output interface 4.
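The detailed breakdown of step S6 is described later with reference to fig. 9. As a minimal sketch of the matching idea only (the range representation, the function names and the winner-takes-most rule are illustrative assumptions, not the patented implementation), the category of an unclassified data item can be taken as the data bin in which the largest number of its keyword word frequencies fall inside that bin's valid distribution ranges:

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical representation: each bin maps a keyword to a valid
// word-frequency range [low, high] (an assumption for illustration).
using BinFeatures = std::map<std::string, std::pair<double, double>>;

// Count how many keyword frequencies of the unclassified data fall
// inside the bin's valid ranges.
int matchScore(const std::map<std::string, double>& dataFreqs,
               const BinFeatures& bin) {
    int score = 0;
    for (const auto& [word, range] : bin) {
        auto it = dataFreqs.find(word);
        if (it != dataFreqs.end() &&
            it->second >= range.first && it->second <= range.second) {
            ++score;
        }
    }
    return score;
}

// The bin with the highest score gives the category.
std::string classify(const std::map<std::string, double>& dataFreqs,
                     const std::map<std::string, BinFeatures>& bins) {
    std::string best;
    int bestScore = -1;
    for (const auto& [category, features] : bins) {
        int s = matchScore(dataFreqs, features);
        if (s > bestScore) { bestScore = s; best = category; }
    }
    return best;
}
```

In this sketch a tie is resolved by map order; a real implementation would need an explicit tie-breaking and rejection policy for data matching no bin.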
On the side of the storage unit 5, step S01 may be performed first to create a data bin for each category of data. Step S02 may then be performed to receive the classified data and their categories, and finally step S03 may be performed to store each item of classified data in the corresponding data bin by category. The data bin may be an aggregation concept, i.e., classified data of the same category are stored together, or an abstract concept, i.e., the data are not physically separated but are marked with their category; both are feasible and fall within the protection scope of this scheme.
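The storage flow of steps S01 to S03 can be sketched with the aggregation interpretation of a data bin (the in-memory container and class names are assumptions for illustration; a real deployment would use persistent storage):

```cpp
#include <map>
#include <string>
#include <vector>

// A data bin per category: classified data items grouped by their category.
class BinStore {
public:
    // Step S01: create a data bin for a category.
    void createBin(const std::string& category) {
        bins_.emplace(category, std::vector<std::string>{});
    }
    // Steps S02/S03: receive a classified data item and store it by category.
    void store(const std::string& category, const std::string& data) {
        bins_[category].push_back(data);
    }
    // Number of classified data items held in one bin.
    std::size_t binSize(const std::string& category) const {
        auto it = bins_.find(category);
        return it == bins_.end() ? 0 : it->second.size();
    }
private:
    std::map<std::string, std::vector<std::string>> bins_;
};
```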
Referring to fig. 4, the reference data corresponding to a data bin have a great number of feature dimensions, but from the way humans use data, the semantics of the data are considered first. Therefore, this scheme classifies data according to the semantics they contain; in other words, the semantics with a distinguishing function in the reference data are used as the data features of the corresponding data bin. In a specific operation, step S51 may first be performed to segment each reference data into words, obtaining the segmented words and their counts in each reference data. Step S52 may be performed to obtain the keywords and their word frequencies in each reference data from the segmented words and their counts. Step S53 may next be performed to take the keywords of the reference data corresponding to each data bin as the keywords of the classified data in that bin. Step S54 may then be performed to segment the classified data in each data bin and obtain the word frequency of each keyword in each item of classified data. Finally, step S55 may be executed to obtain, from the keywords and word frequencies of the reference data and the keyword word frequencies of the classified data, the word frequency valid distribution range of each keyword of the classified data in the bin as the data feature of the data bin. The semantics with a distinguishing function are thus expressed as a word frequency valid distribution range per keyword, which a computer can compare by calculation.
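Step S54, counting only the bin's keywords inside each item of classified data, can be sketched as follows (whitespace tokenization is a simplifying assumption; the segmentation method is not limited to it):

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Count the word frequencies of the given keywords in one classified data item.
std::map<std::string, int> keywordFrequencies(const std::string& text,
                                              const std::set<std::string>& keywords) {
    std::map<std::string, int> freqs;
    for (const auto& kw : keywords) freqs[kw] = 0;  // absent keywords keep frequency 0
    std::istringstream stream(text);
    std::string token;
    while (stream >> token) {          // whitespace tokenization (assumption)
        if (keywords.count(token)) ++freqs[token];
    }
    return freqs;
}
```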
Referring to fig. 5, since the number of the segmentation words generated after the semantic segmentation of the reference data is excessive, in order to select important and representative segmentation words as data features, step S521 may be performed to obtain each segmentation word in the whole reference data. Step S522 may then be performed to obtain the number of occurrences of each of the segmented words in the overall reference data. Step S523 may be performed next to acquire the cumulative number of occurrences of all the segmentation words in all the reference data. Step S524 may be performed next to take the ratio of the number of occurrences of each of the divided words in the total reference data to the cumulative number of occurrences of all of the divided words as the global word frequency of each of the divided words. Step S525 may then be performed to obtain the cumulative number of occurrences of all the segmented words within each reference data. Step S526 may then be performed to obtain the number of occurrences of each of the segmented words within each of the reference data. Step S527 may be performed next to take the ratio of the number of occurrences of each of the divided words in each of the reference data to the cumulative number of occurrences of all the divided words in the reference data as the internal word frequency of each of the divided words in each of the reference data. Finally, step S528 may be executed to obtain the keywords and word frequencies of the keywords in each reference data according to the internal word frequencies and the corresponding global word frequencies of each segmentation word in each reference data.
To supplement the implementation of steps S521 to S528 described above, source code of some of the functional modules is provided, with explanatory comments in the code. To avoid leaking data involving trade secrets, portions of the data that do not affect the implementation of the scheme have been desensitized, as follows.
#include <iostream>
#include <vector>
#include <map>
#include <string>
#include <sstream>
#include <iomanip> // for setting output format
// Function to split a string
std::vector<std::string> split(const std::string &text, char delim) {
std::vector<std::string> tokens;
std::string token;
std::istringstream tokenStream(text);
while (std::getline(tokenStream, token, delim)) {
if (!token.empty()) {
tokens.push_back(token);
}
}
return tokens;
}
// Function to calculate word frequency
std::map<std::string, int> calculateWordFrequency(const std::vector<std::string>& words) {
std::map<std::string, int> wordFreq;
for (const std::string& word : words) {
wordFreq[word]++;
}
return wordFreq;
}
int main() {
// Exemplary reference data
std::vector<std::string> referenceData = {
"apple banana apple cherry banana",
"banana apple grape banana cherry",
"cherry banana apple grape"
};
std::map<std::string, int> globalFrequency;
std::vector<std::map<std::string, float>> internalFrequencies;
int totalWordCount = 0;
// Calculate global word frequency
for (const std::string& text : referenceData) {
auto words = split(text, ' ');
auto wordFreq = calculateWordFrequency(words);
for (const auto& pair : wordFreq) {
globalFrequency[pair.first] += pair.second;
totalWordCount += pair.second;
}
}
// Calculate the internal word frequency of segmented words in each reference data
for (const std::string& text : referenceData) {
auto words = split(text, ' ');
auto wordFreq = calculateWordFrequency(words);
int totalWordsInReference = 0;
for (const auto& pair : wordFreq) {
totalWordsInReference += pair.second;
}
std::map<std::string, float> internalFrequency;
for (const auto& pair : wordFreq) {
internalFrequency[pair.first] = static_cast<float>(pair.second) / totalWordsInReference;
}
internalFrequencies.push_back(internalFrequency);
}
// Output keywords and their word frequencies for each reference data
for (size_t i = 0; i < internalFrequencies.size(); ++i) {
std::cout << "Reference Data " << i + 1 << ":\n";
for (const auto& pair : internalFrequencies[i]) {
std::cout << "Keyword: " << std::setw(10) << pair.first
<< ", Internal Frequency: " << std::setw(5) << pair.second
<< ", Global Frequency: " << std::setw(5)
<< static_cast<float>(globalFrequency[pair.first]) / totalWordCount << '\n';
}
std::cout << '\n';
}
return 0;
}
The code first reads a set of reference data strings and segments and counts words in each string. The global word frequency (frequency of occurrence in all reference data) of each word and the internal word frequency (frequency of occurrence in specific reference data) of each word in each reference data text are then calculated. Finally, the code outputs the internal word frequency and the global word frequency of each keyword in each reference data set. These data can be used to identify keywords and their importance.
Referring to fig. 6, since the importance of each segmented word differs, the number of keywords in the reference data is limited, and keywords are significantly more important than the other segmented words. In view of this, when acquiring the keywords of each reference data, step S5281 may first be performed to take the ratio of the internal word frequency of each segmented word to its corresponding global word frequency as the word frequency coefficient of that word. Step S5282 may then be performed to arrange the word frequency coefficients by numerical value to obtain a coefficient list. Step S5283 may then obtain the mean of the differences between each word frequency coefficient and its adjacent coefficient in the list as the coefficient average difference. Step S5284 may then, starting from the largest word frequency coefficient in the list, sequentially compute the difference between each coefficient and the next smaller one and judge whether that difference is greater than the coefficient average difference. If it is, step S5285 stops the traversal; if not, step S5284 continues with the next adjacent pair. Step S5286 may next take the segmented words whose coefficients were involved in the calculation as keywords. Finally, step S5287 may collect the keywords and their word frequencies in each reference data.
To supplement the above-described implementation procedures of steps S5281 to S5287, source codes of part of the functional modules are provided, and a comparison explanation is made in the annotation section.
#include <iostream>
#include <vector>
#include <map>
#include <string>
#include <sstream>
#include <algorithm> // for std::sort and other algorithms
// Function to split a string
std::vector<std::string> split(const std::string &str, char delim) {
std::vector<std::string> elements;
std::stringstream ss(str);
std::string item;
while (std::getline(ss, item, delim)) {
elements.push_back(item);
}
return elements;
}
// Function to calculate word frequency
std::map<std::string, int> calculateFrequency(const std::vector<std::string> &data) {
std::map<std::string, int> freq;
for (const auto &word : data) {
freq[word]++;
}
return freq;
}
// Struct for a word's frequency coefficient
struct WordCoefficient {
std::string word;
float coefficient;
};
// Main program
int main() {
// Exemplary reference data
std::vector<std::string> referenceData = {
"apple banana apple cherry",
"banana apple grape banana cherry",
"cherry banana apple grape"
};
std::map<std::string, int> globalFrequency;
std::vector<std::map<std::string, int>> allFrequencies;
// Calculate the global word frequency and the word frequency of each reference data
for (const auto &data : referenceData) {
auto words = split(data, ' ');
auto freq = calculateFrequency(words);
allFrequencies.push_back(freq);
for (const auto &entry : freq) {
globalFrequency[entry.first] += entry.second;
}
}
// Process each reference data in a loop
for (size_t i = 0; i < referenceData.size(); ++i) {
std::vector<WordCoefficient> coefficients;
for (const auto &entry : allFrequencies[i]) {
WordCoefficient wc;
wc.word = entry.first;
wc.coefficient = static_cast<float>(entry.second) / globalFrequency[entry.first];
coefficients.push_back(wc);
}
// Sort by coefficient, descending
std::sort(coefficients.begin(), coefficients.end(), [](const WordCoefficient &a, const WordCoefficient &b) {
return a.coefficient > b.coefficient;
});
// Calculate the mean difference between adjacent coefficients
float averageDifference = 0.0f;
for (size_t j = 1; j < coefficients.size(); ++j) {
averageDifference += coefficients[j - 1].coefficient - coefficients[j].coefficient;
}
averageDifference /= coefficients.size() - 1;
// Determine keywords: the words before the first gap larger than the average difference
std::vector<std::string> keywords;
for (size_t j = 0; j < coefficients.size() - 1; ++j) {
keywords.push_back(coefficients[j].word);
if (coefficients[j].coefficient - coefficients[j + 1].coefficient > averageDifference) {
break; // the gap exceeds the average difference, stop
}
}
// Output result
std::cout << "Reference dataset " << i + 1 << " keywords:\n";
for (const auto &keyword : keywords) {
std::cout << keyword << '\n';
}
}
return 0;
}
A further simplified implementation, which operates directly on precomputed local and global word frequencies, is as follows.
#include <iostream>
#include <vector>
#include <map>
#include <string>
#include <algorithm>
#include <numeric>
// Struct storing a word's local word frequency and its global word frequency
struct WordFreq {
std::string word;
int local_freq;
int global_freq;
double coefficient = 0.0; // word frequency coefficient, computed below
};
// Comparison function: sort by word frequency coefficient, descending
bool sortByCoefficient(const WordFreq& a, const WordFreq& b) {
double coef_a = a.local_freq / static_cast<double>(a.global_freq);
double coef_b = b.local_freq / static_cast<double>(b.global_freq);
return coef_a > coef_b;
}
int main() {
// Internal word frequencies of segmented words and their corresponding global word frequencies
// Static data is used here as a simulation
std::vector<WordFreq> referenceData = {
{"apple", 2, 10},
{"banana", 3, 5},
{"cherry", 1, 8}
// More data may be added
};
// Calculate the word frequency coefficient of each segmented word
for (auto& wf : referenceData) {
wf.coefficient = static_cast<double>(wf.local_freq) / wf.global_freq;
}
// Sort by word frequency coefficient, descending
std::sort(referenceData.begin(), referenceData.end(), sortByCoefficient);
// Obtain the mean of the differences between adjacent word frequency coefficients in the list
// First, compute all adjacent differences
std::vector<double> differences;
for (size_t i = 0; i < referenceData.size() - 1; ++i) {
double diff = referenceData[i].coefficient - referenceData[i + 1].coefficient;
differences.push_back(diff);
}
// Mean of the differences
double meanDifference = std::accumulate(differences.begin(), differences.end(), 0.0) / differences.size();
// Screen keywords
std::vector<std::string> keywords;
for (size_t i = 0; i < differences.size(); ++i) {
keywords.push_back(referenceData[i].word);
if (differences[i] > meanDifference) {
break; // the gap exceeds the mean difference, stop
}
}
}
// Output keywords and their local word frequencies
std::cout << "Keywords and their local frequencies:" << std::endl;
for (const auto& wf : referenceData) {
if (std::find(keywords.begin(), keywords.end(), wf.word) != keywords.end()) {
std::cout << wf.word << ": " << wf.local_freq << std::endl;
}
}
return 0;
}
This code first defines a structure WordFreq that stores each word's text, local word frequency, and global word frequency. It then uses the standard library function std::sort to order the word frequency coefficients, calculates the mean of the differences between adjacent coefficients, and screens out keywords according to whether a difference exceeds that mean. Finally, it outputs each keyword and its local word frequency. Keywords are thus determined by comparing the gaps between word frequency coefficients with the average gap.
Referring to fig. 7 to 8, since the reference data are few in number, classifying data by the reference data alone may make the classification aperture too narrow, so that a large amount of unclassified data cannot be classified effectively and ends up as dirty data. The classified data within a data bin are therefore also needed to assist classification, i.e., the data features of the classified data are extracted at the same time as data features of the data bin. Specifically, for each data bin, step S551 may first be performed to obtain the word frequencies of the keywords of each item of classified data in the bin. Step S552 may then be performed to arrange the keywords of each item of classified data in the same order, obtaining a multi-dimensional vector composed of the keyword word frequencies as the feature vector of each item of classified data. The classified data may be samples selected by an operator and injected into the data bin, or data previously classified by the system.
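Step S552 can be sketched as follows: once a fixed keyword order is chosen for the whole data bin, each item of classified data becomes a numeric vector of its keyword word frequencies (the function name and integer frequencies are illustrative assumptions):

```cpp
#include <map>
#include <string>
#include <vector>

// Build a feature vector: the word frequencies of one classified data item,
// taken in one fixed keyword order shared by the whole data bin.
std::vector<int> toFeatureVector(const std::vector<std::string>& keywordOrder,
                                 const std::map<std::string, int>& wordFreqs) {
    std::vector<int> features;
    features.reserve(keywordOrder.size());
    for (const auto& kw : keywordOrder) {
        auto it = wordFreqs.find(kw);
        features.push_back(it == wordFreqs.end() ? 0 : it->second);
    }
    return features;
}
```

Because every item uses the same keyword order, the vectors are directly comparable, which is what the clustering in step S553 relies on.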
Step S553 may then be performed to obtain, from the feature vectors of the classified data, several distribution ranges of the word frequency of each keyword of the classified data in the data bin. In this process, step S5531 may first be performed to select several of the feature vectors of the classified data as target feature vectors. Step S5532 may then be performed to calculate the modulus of the vector difference between each target feature vector and every other feature vector. Step S5533 may be performed to group each of the other feature vectors with the target feature vector giving the smallest vector difference modulus, forming vector sets. Step S5534 may be performed to compute the mean vector of all feature vectors in each vector set and take the feature vector with the smallest vector difference modulus from that mean as the updated target feature vector. Step S5535 may be performed to judge whether the updated target feature vectors have changed. If so, steps S5532 to S5535 may be executed again, continuing to update the vector sets and target feature vectors; if not, step S5536 may be executed to obtain the distribution range of the word frequency of each keyword over the classified data corresponding to all feature vectors in each vector set. In this way the data features of the classified data are expanded, widening the comparison aperture of classification.
Since the classified data in a data bin do not necessarily strictly meet the target requirement, calibration against the reference data is needed. Therefore, step S554 may finally be executed to take the distribution range that covers the keywords and word frequencies of the reference data as the word frequency valid distribution range of each keyword of the classified data in the data bin.
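The distribution range obtained in step S5536 for one vector set can be sketched as the minimum and maximum word frequency of each keyword dimension across the set's feature vectors (min/max is an assumption for illustration; the scheme does not fix the range statistic here):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// For one keyword dimension, take the [min, max] of that dimension's values
// across all feature vectors in a vector set (the set is assumed non-empty).
std::pair<int, int> frequencyRange(const std::vector<std::vector<int>>& vectorSet,
                                   std::size_t keywordIndex) {
    int lo = vectorSet[0][keywordIndex], hi = lo;
    for (const auto& vec : vectorSet) {
        lo = std::min(lo, vec[keywordIndex]);
        hi = std::max(hi, vec[keywordIndex]);
    }
    return {lo, hi};
}
```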
To supplement the above-described implementation procedures of step S5531 to step S5536, source codes of part of the functional modules are provided and a comparison explanation is made in the annotation section.
#include <iostream>
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>
// Struct storing a keyword and its word frequencies
struct KeywordFreq {
std::string keyword;
std::vector<int> freqs; // word frequencies of the same keyword in different data
};
// Struct storing a feature vector
struct FeatureVector {
std::vector<int> features; // feature vector
};
// Calculate the Euclidean distance between two feature vectors
double euclideanDistance(const FeatureVector& a, const FeatureVector& b) {
double distance = 0.0;
for (size_t i = 0; i < a.features.size(); ++i) {
distance += std::pow(static_cast<double>(a.features[i] - b.features[i]), 2);
}
return std::sqrt(distance);
}
// Compute the mean vector of a set of feature vectors
FeatureVector calculateMeanVector(const std::vector<FeatureVector>& vectors) {
FeatureVector meanVector;
if (!vectors.empty()) {
meanVector.features.resize(vectors[0].features.size(), 0);
for (const auto& vec : vectors) {
for (size_t i = 0; i < vec.features.size(); ++i) {
meanVector.features[i] += vec.features[i];
}
}
for (size_t i = 0; i < meanVector.features.size(); ++i) {
meanVector.features[i] /= vectors.size();
}
}
return meanVector;
}
int main() {
// Assume the keywords and their word frequencies in the different data items have been extracted into KeywordFreq structures
std::vector<KeywordFreq> keywordData = {
{"apple", {1, 5, 3}},
{"banana", {2, 1, 0}},
{"cherry", {3, 4, 2}}
// more data may be added
};
// Simply choose the first few items here as the initial target feature vectors
std::vector<FeatureVector> targetVectors;
for (size_t i = 0; i < keywordData.size() && i < 3; ++i) { // number of target vectors is at most 3
targetVectors.push_back({keywordData[i].freqs});
}
bool changed;
do {
// Assign every feature vector to the nearest target feature vector
std::vector<std::vector<FeatureVector>> clusters(targetVectors.size());
for (const auto& keyword : keywordData) {
double minDistance = std::numeric_limits<double>::max();
size_t clusterIndex = 0;
for (size_t i = 0; i < targetVectors.size(); ++i) {
double distance = euclideanDistance({keyword.freqs}, targetVectors[i]);
if (distance < minDistance) {
minDistance = distance;
clusterIndex = i;
}
}
clusters[clusterIndex].push_back({keyword.freqs});
}
// Compute a new target feature vector for each set
changed = false;
for (size_t i = 0; i < clusters.size(); ++i) {
FeatureVector newMeanVector = calculateMeanVector(clusters[i]);
if (!std::equal(newMeanVector.features.begin(), newMeanVector.features.end(), targetVectors[i].features.begin())) {
targetVectors[i] = newMeanVector;
changed = true;
}
}
} while (changed);
// Obtain the distribution range of word frequencies corresponding to the feature vectors in each set
for (size_t i = 0; i < keywordData.size(); ++i) {
auto& freqs = keywordData[i].freqs;
std::cout << "Keyword: " << keywordData[i].keyword << ", Freq Range: ";
std::cout << *std::min_element(freqs.begin(), freqs.end()) << " - "
<< *std::max_element(freqs.begin(), freqs.end()) << std::endl;
}
return 0;
}
In this code, the data are organized as keywords with their word frequencies in different data items. The initial target feature vectors are simply the first few items of the dataset. The code then enters a loop that keeps updating the target feature vectors until they no longer change: in each iteration, feature vectors are assigned to the nearest target vector by Euclidean distance, and a new mean vector is computed for each cluster. Once the target feature vectors stabilize, the loop ends. Finally, the code computes and outputs the word-frequency distribution range of each keyword.
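Step S554, the calibration against the reference data, is not part of the listing above. One possible reading of it is that each cluster-derived word-frequency range is clamped to the range observed in the reference data; the sketch below follows that assumption, and the names FreqRange, calibrateRange and calibrateAll are illustrative rather than taken from the original.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>

// A word-frequency range: [min, max].
using FreqRange = std::pair<int, int>;

// Clamp a cluster-derived range to the range observed in the reference
// data; if the two ranges do not overlap, fall back to the reference range.
FreqRange calibrateRange(const FreqRange& clusterRange,
                         const FreqRange& referenceRange) {
    int lo = std::max(clusterRange.first, referenceRange.first);
    int hi = std::min(clusterRange.second, referenceRange.second);
    if (lo > hi) return referenceRange;  // no overlap
    return {lo, hi};
}

// Calibrate keyword by keyword; a keyword missing from the cluster
// statistics simply keeps the reference range.
std::map<std::string, FreqRange> calibrateAll(
        const std::map<std::string, FreqRange>& clusterRanges,
        const std::map<std::string, FreqRange>& referenceRanges) {
    std::map<std::string, FreqRange> effective;
    for (const auto& [keyword, refRange] : referenceRanges) {
        auto it = clusterRanges.find(keyword);
        effective[keyword] = (it != clusterRanges.end())
                                 ? calibrateRange(it->second, refRange)
                                 : refRange;
    }
    return effective;
}
```

The effective ranges produced this way would then serve as the data feature of the bin in the comparison steps that follow.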
Referring to fig. 9, when comparing unclassified data with the data features of the data bins, step S61 may first be performed to segment the unclassified data into words, obtaining the segmented words in the unclassified data and their counts, in order to reduce the amount of computation. Step S62 may then be performed to determine whether the segmented words in the unclassified data cover the keywords in the data features of a data bin. If not, step S63 may be performed next (no processing); if so, step S64 may be performed next, taking that data bin as a candidate data bin. In this way, not every data bin's features need to be compared, reducing the amount of computation without sacrificing the accuracy of comparison and classification.
In the process of comparing the unclassified data with each candidate data bin, step S65 may be executed first: the keywords of the candidate data bin are used as the keywords of the unclassified data, and each keyword of the unclassified data together with its word frequency is used as a data feature. Step S66 may then be performed to determine whether the word frequency of each keyword of the unclassified data falls within the effective word-frequency distribution range of the corresponding keyword in the classified data of the candidate data bin. If so, step S67 may be performed to mark the unclassified data as classified data, taking the category of the candidate data bin as the category of the classified data; if not, steps S65 to S67 may be repeated against the next candidate data bin.
To supplement the implementation of steps S61 to S67 described above, source code for some of the functional modules is provided below, with explanatory notes in the comments.
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
// Data feature structure
struct DataFeature {
std::string category; // data category
std::unordered_map<std::string, std::pair<int, int>> keywordFreqRange; // effective word-frequency range of each keyword
};
// Segment the unclassified data and count word frequencies
std::unordered_map<std::string, int> tokenizeAndCountFreq(const std::string& data) {
std::unordered_map<std::string, int> wordCounts;
// In practice a dedicated tokenizer would be used, e.g.:
// std::vector<std::string> words = splitFunction(data);
// This example splits on whitespace for simplicity; a real application should substitute a proper word segmentation method
size_t prev = 0, pos = 0;
do {
pos = data.find(" ", prev);
if (pos == std::string::npos) pos = data.length();
std::string word = data.substr(prev, pos-prev);
if (!word.empty()) wordCounts[word]++;
prev = pos + 1;
} while (pos < data.length() && prev < data.length());
return wordCounts;
}
// Check whether every segmented word appears among the data bin's keywords
bool areAllKeywordsCovered(const std::unordered_map<std::string, int>& wordCounts, const DataFeature& dataFeature) {
for (const auto& wordCount : wordCounts) {
if (dataFeature.keywordFreqRange.find(wordCount.first) == dataFeature.keywordFreqRange.end()) {
return false;
}
}
return true;
}
// Determine whether each word frequency falls within the effective distribution range
bool isFreqInRange(const std::unordered_map<std::string, int>& wordCounts, const DataFeature& dataFeature) {
for (const auto& wordCount : wordCounts) {
auto rangeIt = dataFeature.keywordFreqRange.find(wordCount.first);
if (rangeIt != dataFeature.keywordFreqRange.end()) {
const auto& range = rangeIt->second;
if (wordCount.second < range.first || wordCount.second > range.second) {
return false;// word frequency is out of range
}
}
}
return true;// all word frequencies are within range
}
int main() {
// Unclassified data
std::string unclassifiedData = "apple banana apple cherry";
// Data features of the data bins
std::vector<DataFeature> dataFeatures = {
{"Fruit", {{"apple", {1, 3}}, {"banana", {1, 2}}, {"cherry", {0, 2}}}},
// more data feature categories and word-frequency ranges may be added
};
// Word-frequency statistics of the unclassified data
auto wordCounts = tokenizeAndCountFreq(unclassifiedData);
// Traverse each data bin, looking for a matching category
std::string category = "Unclassified"; // default category
for (const auto& dataFeature : dataFeatures) {
// Check whether all segmented words appear among the bin's keywords
if (areAllKeywordsCovered(wordCounts, dataFeature)) {
// Determine whether the word frequencies fall within the effective ranges
if (isFreqInRange(wordCounts, dataFeature)) {
category = dataFeature.category; // matched data category
break; // match succeeded, exit the loop
}
}
}
// Output the final classification result
std::cout << "The unclassified data belongs to category: " << category << std::endl;
return 0;
}
The code first defines a structure, DataFeature, for storing data categories and keyword word-frequency ranges. In the main function, an unclassified data string unclassifiedData and a dataFeatures array containing the data features of several bins are defined. The unclassified data is segmented and its word frequencies counted using the tokenizeAndCountFreq function. The data features of each data bin are then traversed, and the areAllKeywordsCovered function checks whether the segmented words of the unclassified data are covered by the keywords in the data features of the bin. If so, the isFreqInRange function further judges whether the word frequency of each keyword in the unclassified data falls within the effective word-frequency distribution range of the corresponding keyword in the data bin. If the frequencies are all in range, the unclassified data is marked as classified and the category of the data bin is taken as its category; otherwise, comparison continues with the next data bin. If no bin matches, the data remains unclassified. Finally, the classification result is output.
In another embodiment of the present invention, there is also provided a data analysis method including obtaining a category for classifying data; acquiring a data bin of each type; acquiring reference data of each data bin; acquiring unclassified data; obtaining the data characteristics of the data bin according to the data characteristics of the classified data in the data bin and the data characteristics of the corresponding reference data; and obtaining the category of each classified data according to the data characteristics of the unclassified data and the data characteristics of each data bin.
In another embodiment of the present invention, there is further provided a data analysis apparatus, including a data bin reading interface for acquiring the categories into which data are classified, acquiring a data bin of each category, and acquiring a plurality of reference data of each data bin; an analysis service input interface for acquiring unclassified data; an operation unit for obtaining the data features of each data bin according to the data features of the classified data in the data bin and the data features of the corresponding reference data, and obtaining the category of each classified data according to the data features of the unclassified data and the data features of each data bin; and an analysis service output interface for outputting the category of each classified data.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks therein, can be implemented by hardware that performs the corresponding functions or acts, such as circuits or ASICs (Application-Specific Integrated Circuits), or by combinations of hardware and software, such as firmware.
Although the invention is described herein in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
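As one more illustrative supplement, the keyword-selection procedure recited in claims 2 and 3 below (the word-frequency coefficient of a segmented word is its internal word frequency divided by its global word frequency, and keywords are taken from the top of the sorted coefficient list until the first adjacent gap exceeds the average gap) might be sketched as follows. The function names are hypothetical, and the handling of the boundary word is an assumption rather than a reading of the claims.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Word-frequency coefficient of each segmented word in one reference
// document: internal word frequency divided by global word frequency.
std::map<std::string, double> freqCoefficients(
        const std::map<std::string, int>& docCounts,
        const std::map<std::string, int>& globalCounts) {
    long docTotal = 0, globalTotal = 0;
    for (const auto& [w, c] : docCounts) docTotal += c;
    for (const auto& [w, c] : globalCounts) globalTotal += c;
    std::map<std::string, double> coeff;
    for (const auto& [w, c] : docCounts) {
        double internal = static_cast<double>(c) / docTotal;
        double global = static_cast<double>(globalCounts.at(w)) / globalTotal;
        coeff[w] = internal / global;
    }
    return coeff;
}

// Sort the coefficients in descending order and take words from the top
// until the gap to the next coefficient exceeds the mean adjacent gap.
std::vector<std::string> selectKeywords(
        const std::map<std::string, double>& coeff) {
    std::vector<std::pair<double, std::string>> sorted;
    for (const auto& [w, c] : coeff) sorted.push_back({c, w});
    std::sort(sorted.rbegin(), sorted.rend());  // descending by coefficient
    if (sorted.size() < 2) {
        return sorted.empty() ? std::vector<std::string>{}
                              : std::vector<std::string>{sorted[0].second};
    }
    // Mean of the adjacent differences in the sorted list.
    double meanGap = (sorted.front().first - sorted.back().first) /
                     static_cast<double>(sorted.size() - 1);
    std::vector<std::string> keywords{sorted[0].second};
    for (size_t i = 1; i < sorted.size(); ++i) {
        if (sorted[i - 1].first - sorted[i].first > meanGap) break;
        keywords.push_back(sorted[i].second);
    }
    return keywords;
}
```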

Claims (8)

1. A data analysis method is characterized by comprising the steps of,
acquiring the category for classifying the data;
acquiring a data bin of each type;
acquiring reference data of each data bin;
acquiring unclassified data;
obtaining the data characteristics of the data bin according to the data characteristics of the classified data in the data bin and the data characteristics of the corresponding reference data;
obtaining the category of each classified data according to the data characteristics of the unclassified data and the data characteristics of each data bin;
wherein the step of obtaining the data characteristics of the data bin according to the data characteristics of the classified data in the data bin and the data characteristics of the corresponding reference data comprises the steps of,
word segmentation is carried out on each piece of reference data, so that segmented words and corresponding quantity in each piece of reference data are obtained;
obtaining keywords and word frequencies thereof in each piece of reference data according to the segmentation words and the corresponding quantity in each piece of reference data;
the keywords of the reference data corresponding to each data bin are used as keywords of classified data in the data bins;
word segmentation is carried out on the classified data in each data bin, so that word frequency of keywords of each classified data in each data bin is obtained;
and obtaining the word frequency effective distribution range of each keyword in the classified data in the data bin as the data characteristic of the data bin according to the keywords and the word frequency of each keyword in the reference data and the word frequency of each keyword in the classified data in the data bin.
2. The method of claim 1, wherein the step of obtaining keywords and their word frequencies in each of the reference data based on the segmented words and the corresponding numbers in each of the reference data comprises,
acquiring each segmentation word in all the reference data;
acquiring the occurrence times of each segmentation word in all the reference data;
acquiring the accumulated occurrence times of all the segmentation words in all the reference data;
taking the ratio of the occurrence frequency of each segmented word in all the reference data to the accumulated occurrence frequency of all the segmented words as the global word frequency of each segmented word;
acquiring the accumulated occurrence times of all the segmentation words in each reference data;
acquiring the occurrence times of each segmentation word in each reference data;
taking the ratio of the occurrence frequency of each segmented word in each reference data to the accumulated occurrence frequency of all segmented words in the reference data as the internal word frequency of each segmented word in each reference data;
and obtaining the keywords and the word frequency thereof in each reference data according to the internal word frequency and the corresponding global word frequency of each segmentation word in each reference data.
3. The method of claim 2, wherein the step of obtaining keywords and their word frequencies in each of the reference data based on the internal word frequencies and the corresponding global word frequencies of each of the divided words in each of the reference data comprises,
for each of the reference data in question,
obtaining the ratio of the internal word frequency of each segmented word to the corresponding global word frequency as the word frequency coefficient of each segmented word,
the word frequency coefficient of each segmented word is arranged according to the numerical value to obtain a coefficient list,
obtaining the average value of the difference value between each word frequency coefficient and the adjacent word frequency coefficient in the coefficient list as the coefficient average difference value,
sequentially calculating the difference value between the word frequency coefficient with the largest value and the adjacent word frequency coefficient in the coefficient list, judging whether the difference value is larger than the coefficient average difference value,
if yes, stopping the execution,
if not, continuing to execute the steps of sequentially calculating the difference value between the word frequency coefficient with the largest value and the adjacent word frequency coefficient in the coefficient list and judging whether the difference value is larger than the coefficient average difference value,
taking the segmentation word corresponding to the word frequency coefficient participating in calculation as the keyword;
and summarizing and acquiring the keywords and word frequency of the keywords in each piece of reference data.
4. The method of claim 1, wherein the step of obtaining the term frequency effective distribution range of each keyword in the classified data in the data bin based on each of the keywords in the reference data and the term frequency thereof and the term frequency of each keyword of the classified data in the data bin comprises,
for each of the data bins,
acquiring word frequency of key words of each classified data in the data bin,
the multi-dimensional vector composed of the numerical values of the word frequency of the keywords of each classified data is obtained by arranging the keywords of each classified data according to the same sequence and is used as the characteristic vector of each classified data,
obtaining a plurality of distribution ranges of word frequencies of each keyword in the classified data in the data bin according to the feature vector of each classified data,
and taking the distribution range of the keywords and the word frequency thereof in the reference data as the word frequency effective distribution range of each keyword in the classified data in the data bin.
5. The method of claim 4, wherein the step of obtaining a plurality of distribution ranges of word frequencies of each keyword in the classified data in the data bin based on the feature vector of each classified data comprises,
selecting a plurality of feature vectors of the classified data as target feature vectors;
calculating and obtaining a vector difference modular length of each target characteristic vector and each other characteristic vector;
forming a vector set by each other feature vector and the target feature vector with the minimum vector difference module length;
calculating and obtaining a characteristic vector with the minimum vector difference modulus length between the characteristic vector and the mean value vector of all the characteristic vectors in each vector set as an updated target characteristic vector;
judging whether the updated target characteristic vector changes or not;
if yes, returning to continuously update the vector set and the target feature vector;
if not, acquiring the distribution range of word frequency of each keyword of the classified data corresponding to all characteristic vectors in each vector set.
6. The method of any one of claims 1 to 5, wherein the step of obtaining and deriving a category of each classified data based on the data characteristics of the unclassified data and the data characteristics of each of the bins comprises,
word segmentation is carried out on the unclassified data, and segmented words and corresponding numbers in the unclassified data are obtained;
judging whether the segmentation words in the unclassified data cover the keywords in the data characteristics of the data bin or not;
if not, not processing;
if yes, taking the data bin as an alternative data bin;
in the process of comparing unclassified data with each of the candidate bins,
the keywords of the candidate data bin are used as keywords of unclassified data, each keyword of the unclassified data and the word frequency thereof are used as data characteristics,
judging whether the word frequency of each keyword of unclassified data falls into the word frequency effective distribution range of each keyword in classified data in the alternative data bin,
if yes, marking the unclassified data as classified data, taking the type of the candidate data bin as the type of the classified data,
if not, comparing with the next alternative data bin.
7. A data analysis device is characterized by comprising,
the data bin reading interface is used for acquiring the category of classifying the data;
acquiring a data bin of each type;
acquiring a plurality of reference data of each data bin;
an analysis service input interface for acquiring unclassified data;
the operation unit is used for obtaining the data characteristics of the data bin according to the data characteristics of the classified data in the data bin and the data characteristics of the corresponding reference data;
obtaining the category of each classified data according to the data characteristics of the unclassified data and the data characteristics of each data bin;
an analysis service output interface for outputting the category of each classified data;
wherein the step of obtaining the data characteristics of the data bin according to the data characteristics of the classified data in the data bin and the data characteristics of the corresponding reference data comprises the steps of,
word segmentation is carried out on each piece of reference data, so that segmented words and corresponding quantity in each piece of reference data are obtained;
obtaining keywords and word frequencies thereof in each piece of reference data according to the segmentation words and the corresponding quantity in each piece of reference data;
the keywords of the reference data corresponding to each data bin are used as keywords of classified data in the data bins;
word segmentation is carried out on the classified data in each data bin, so that word frequency of keywords of each classified data in each data bin is obtained;
and obtaining the word frequency effective distribution range of each keyword in the classified data in the data bin as the data characteristic of the data bin according to the keywords and the word frequency of each keyword in the reference data and the word frequency of each keyword in the classified data in the data bin.
8. A data analysis system, comprising,
a data analysis device as claimed in claim 7, for outputting a category of each classified data; the method comprises the steps of,
a storage unit for creating a data bin for each kind of data;
receiving classified data and categories thereof;
each classified data is stored to the corresponding data warehouse by category.
CN202410051642.8A 2024-01-15 2024-01-15 Data analysis method, device and system Active CN117574243B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410051642.8A CN117574243B (en) 2024-01-15 2024-01-15 Data analysis method, device and system

Publications (2)

Publication Number Publication Date
CN117574243A true CN117574243A (en) 2024-02-20
CN117574243B CN117574243B (en) 2024-04-26

Family

ID=89890401

Country Status (1)

Country Link
CN (1) CN117574243B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012194870A (en) * 2011-03-17 2012-10-11 Ntt Comware Corp Word extraction device, word extraction method and program
CN111079411A (en) * 2019-12-12 2020-04-28 拉扎斯网络科技(上海)有限公司 Text processing method and device, readable storage medium and electronic equipment
CN112115232A (en) * 2020-09-24 2020-12-22 腾讯科技(深圳)有限公司 Data error correction method and device and server
CN113407679A (en) * 2021-06-30 2021-09-17 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium
CN116484829A (en) * 2023-04-26 2023-07-25 日本电气株式会社 Method and apparatus for information processing


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUANG Xiaochun et al., "Text dimensionality-reduction method based on a difference-similarity matrix", Journal of Computer Applications, 31 August 2005, pages 1821-1823 *

Also Published As

Publication number Publication date
CN117574243B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
Li et al. Optimizing generalized pagerank methods for seed-expansion community detection
US7971150B2 (en) Document categorisation system
US20040049499A1 (en) Document retrieval system and question answering system
Jain et al. Machine Learning based Fake News Detection using linguistic features and word vector features
CN110633371A (en) Log classification method and system
CN110008306A (en) A kind of data relationship analysis method, device and data service system
CN111930933A (en) Detection case processing method and device based on artificial intelligence
CN115618014A (en) Standard document analysis management system and method applying big data technology
CN106815209B (en) Uygur agricultural technical term identification method
CN117574243B (en) Data analysis method, device and system
CN110232071A (en) Search method, device and storage medium, the electronic device of drug data
Anand et al. Analysis and prediction of television show popularity rating using incremental K-means algorithm
CN115982316A (en) Multi-mode-based text retrieval method, system and medium
CN110020034A (en) A kind of information citation analysis method and system
Negishi et al. Hardware-trojan detection at gate-level netlists using gradient boosting decision tree models
KR20220041337A (en) Graph generation system of updating a search word from thesaurus and extracting core documents and method thereof
CN114185875A (en) Big data unified analysis and processing system based on cloud computing
Pragarauskaitė et al. Markov Models in the analysis of frequent patterns in financial data
Van Le et al. An efficient pretopological approach for document clustering
Coaquira et al. Applications of rough sets theory in data preprocessing for knowledge discovery
Attik et al. Clustering Quality Measures for Data Samples with Multiple Labels.
Toriah et al. Shots Temporal Prediction Rules for High-Dimensional Data of Semantic Video Retrieval
Sarmento et al. Contextualization for the Organization of Text Documents Streams
Somsakul et al. On the Network and Topological Analyses of Legal Documents using Text Mining Approach
CN112307165A (en) Core patent judgment method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant