CN114117292B

CN114117292B - Internet big data analysis and extraction method

Info

Publication number: CN114117292B
Application number: CN202111298638.4A
Authority: CN
Inventors: 陈大海; 张冰; 徐浩; 葛卫春
Original assignee: China Information Consulting and Designing Institute Co Ltd
Current assignee: China Information Consulting and Designing Institute Co Ltd
Priority date: 2021-11-04
Filing date: 2021-11-04
Publication date: 2024-04-16
Anticipated expiration: 2041-11-04
Also published as: CN114117292A

Abstract

The invention provides an Internet big data analysis and extraction method comprising the following steps: step 1, dividing a data object into different parts and types according to the characteristics of the data to obtain a data range to be extracted; step 2, establishing a regression model, solving each parameter of the model from measured data, evaluating whether the regression model fits the measured data, and, if so, further narrowing the data range to be extracted according to the independent variables; step 3, dividing the data into more than two aggregation classes according to the characteristic attributes of the data, wherein the elements in each aggregation class have the same characteristics, thereby grouping the data to be crawled; step 4, calculating the degree of similarity between two data items using a similarity matching method; step 5, using word frequency as a statistical index to indicate the data segment information fed back by the data; and step 6, obtaining a data analysis result. Feature extraction is completed automatically by a representation learning algorithm based on embedding mapping, and the computational efficiency is high.

Description

Internet big data analysis and extraction method

Technical Field

The invention belongs to the technical field of big data, and particularly relates to an Internet big data analysis and extraction method.

Background

Big data refers to data sets that cannot be captured, managed, and processed by conventional software tools within an acceptable time frame; it is a massive, fast-growing, and diversified information asset that requires new processing modes to deliver stronger decision-making, insight discovery, and process optimization capabilities.

At present, many approaches use web crawlers to capture relevant information from public websites and then structure and store it. The captured data is polluted by large amounts of useless information, such as expired information and phishing-website information, so its accuracy and practicality are low. An Internet data extraction method that improves the reliability and accuracy of the data therefore needs to be studied in depth.

Existing intelligent big data processing systems have at least the following disadvantages: existing data techniques lack analysis of unstructured data and lose a large amount of effective information, which affects the analysis results of the service; existing data analysis and extraction rely excessively on manual feature extraction, so accuracy is low, computational efficiency is poor, responses to user requests are slow, and user experience suffers; different services typically employ different data processing and feature extraction methods, resulting in a large amount of redundant data processing, and the features of the data units of different services are not compatible.

Disclosure of Invention

Object of the invention: to overcome the defects in the prior art, the invention provides an Internet big data analysis and extraction method that eliminates data of low accuracy and reliability to obtain verified data of higher credibility and reliability.

The method specifically comprises the following steps:

step 1, dividing a data object into different parts and types according to the characteristics of the data, and further analyzing to obtain a data range to be extracted;

step 2, determining causal relation between variables by defining dependent variables and independent variables, establishing a regression model, solving each parameter of the model according to measured data, evaluating whether the regression model can fit the measured data, and if so, further reducing the range of the data to be extracted according to the independent variables;

step 3, dividing the data into more than two aggregation classes according to the characteristic attributes of the data, wherein the elements in each aggregation class have the same characteristics, thereby grouping the data to be crawled (the characteristic attributes are used to represent the data and may be derived from statistical analysis; for the Internet text data used in the invention, the characteristic attributes include the source website, topic, words, word frequency statistics, and the like; step 3 performs a preliminary grouping, which serves as initialization before further refinement and extraction);

step 4, calculating the similarity degree of the two data by adopting a similarity matching method;

step 5, extracting the data that appear frequently in steps 1 to 4 (selecting the items whose word frequency statistics rank in the top 20 percent), and, according to the attribute characteristics of the data, using word frequency as a statistical index to indicate the data segment information fed back by the data;

step 6, decomposing the data segments in the data to be analyzed according to a data segment decomposition regular expression to generate data item values (regular expressions are a text processing technique; Internet text contains many format symbols, such as HTML markup, that need to be processed and filtered by means of regular expressions), associating the data item values with the data item name list corresponding to the data segment decomposition regular expression to form intermediate data pairs of data item names and data item values, and performing statistical analysis on the intermediate data pairs according to data statistics rules to obtain a data analysis result.
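The frequency-based selection in step 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `top_fraction_terms`, the simple `\w+` tokenizer, and the alphabetical tie-breaking rule are hypothetical; only the top-20-percent cutoff comes from the text above.

```python
import math
import re
from collections import Counter


def top_fraction_terms(texts, fraction=0.2):
    """Return (term, count) pairs whose frequency ranks in the top fraction.

    The tokenizer and tie-breaking are illustrative assumptions.
    """
    counts = Counter(
        word for text in texts for word in re.findall(r"\w+", text.lower())
    )
    if not counts:
        return []
    # Keep at least one term; round the cutoff up to a whole number of terms.
    keep = max(1, math.ceil(len(counts) * fraction))
    # Sort by descending count, then alphabetically for a stable result.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:keep]
```

Applied to a handful of short job titles, the function keeps only the most frequent words, which can then serve as the statistical index for the later steps.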

Preferably, in step 2, the similarity matching algorithm can be applied to fields such as data cleaning, user input error correction, recommendation statistics, plagiarism detection systems, automatic scoring systems, web page searching, and DNA sequence matching. In step 2, the measured data are the data from actual experimental tests, that is, the input data, and the independent variables are derived from the measured data.

Step 2 comprises the following: let the independent-variable data object be $X = \{x_1, x_2, \ldots, x_m\}$ and the corresponding dependent variable be $y = \{y_1, y_2, \ldots, y_m\}$; the regression model is:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + \mu$$

where $x_m$ and $y_m$ denote the m-th independent variable and the m-th dependent variable, respectively; $w = \{w_0, w_1, w_2, \ldots, w_m\}$ is the set of regression coefficients, $w_m$ denotes the m-th regression coefficient, and $\mu$ is the random error. The fitting error $L(X)$ is measured by the squared error, and minimizing it yields the parameter estimate of the regression coefficients $w$.
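As a hedged illustration, the squared error and its minimizer can be written in the textbook ordinary least squares form. The notation here is assumed rather than taken from the patent: $n$ measured samples indexed by $i$, a design matrix $\mathbf{X}$ whose $i$-th row is $\mathbf{x}_i^{T} = (1, x_{i1}, \ldots, x_{im})$, a response vector $\mathbf{y}$, and the hat on the estimate.

```latex
% Assumed notation: n samples, design matrix X with rows x_i^T, response vector y.
\begin{aligned}
L(X) &= \sum_{i=1}^{n} \bigl(y_i - \mathbf{x}_i^{T} w\bigr)^2
      = (\mathbf{y} - \mathbf{X}w)^{T}(\mathbf{y} - \mathbf{X}w), \\
\frac{\partial L}{\partial w} &= -2\,\mathbf{X}^{T}(\mathbf{y} - \mathbf{X}w) = 0
\;\;\Longrightarrow\;\;
\hat{w} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}.
\end{aligned}
```

The closed form in the last line holds only when $\mathbf{X}^{T}\mathbf{X}$ is invertible.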

Under-fitting is addressed by locally weighted linear regression, which adds a weight $w_i$ to each error term. In the weighted error, $W$ is a diagonal matrix whose entries $W(j,j)$ are given by a Gaussian kernel weight function, where $k$ represents the variance of the Gaussian function; this yields a new regression coefficient estimate, wherein $w = w^{T}W$.
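The locally weighted regression described above can be sketched for a single independent variable. The sketch assumes the textbook Gaussian kernel weight $\exp(-(x_j - x)^2 / (2k^2))$ and the closed-form weighted least squares solution for a line; the function names `lwlr_predict` and `squared_error` are hypothetical, and restricting the model to one independent variable is a simplification made for illustration.

```python
import math


def lwlr_predict(x_query, xs, ys, k=1.0):
    """Locally weighted linear regression for one independent variable.

    Fits y = w0 + w1 * x around x_query, weighting each measured sample
    by a Gaussian kernel of width k (assumed form), and returns the
    prediction at x_query.
    """
    # Gaussian kernel weight for each sample: the diagonal entries W(j, j).
    weights = [math.exp(-((x - x_query) ** 2) / (2.0 * k ** 2)) for x in xs]
    total = sum(weights)
    # Weighted means of x and y.
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    # Closed-form weighted least squares for slope and intercept.
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    w1 = sxy / sxx if sxx > 0 else 0.0
    w0 = y_bar - w1 * x_bar
    return w0 + w1 * x_query


def squared_error(xs, ys, k):
    """Sum of squared errors of the LWLR predictions on the measured samples."""
    return sum((lwlr_predict(x, xs, ys, k) - y) ** 2 for x, y in zip(xs, ys))
```

A small `k` makes the fit follow local structure more closely, while a very large `k` approaches an ordinary global line; comparing `squared_error` across values of `k` is one simple way to judge whether the model fits the measured data.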

In step 4, the similarity between data objects in different groups is required to be low and the similarity between data objects in the same group is required to be high. The objective function J is calculated as the sum of squared errors of all objects in the measured data set, where $x_i$ represents any object in the data set and $u_j$ is the center of the j-th aggregation class (cluster) $C_j$; the goal is for the objective function to converge.
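One common way to drive such an objective to convergence is Lloyd's k-means iteration, sketched below. The text does not name the algorithm, so this is an assumption: J is taken to be the sum over clusters of squared Euclidean distances to the cluster mean, and the function names `sse` and `kmeans`, the caller-supplied initial centers, and the stopping tolerance are all hypothetical.

```python
def sse(points, centers, labels):
    """Objective J: sum of squared distances from each point to its cluster center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centers[label]))
        for p, label in zip(points, labels)
    )


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Lloyd's algorithm (assumed); stops when J decreases by less than tol."""
    centers = [list(c) for c in centers]
    labels, j_value = [], float("inf")
    prev_j = float("inf")
    for _ in range(max_iter):
        # Assignment step: each point joins the nearest center.
        labels = [
            min(
                range(len(centers)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        j_value = sse(points, centers, labels)
        if prev_j - j_value < tol:
            break
        prev_j = j_value
    return centers, labels, j_value
```

Each iteration can only lower or preserve J, so the loop terminates once the decrease falls below the tolerance.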

In step 4, the similarity matching method includes calculation of an average index and a variation index and a graphical representation of the data distribution. The similarity between two data items is measured by calculating the distance between them, jointly considering four distances: the Euclidean distance $D_1(X_i, X_j)$, the Manhattan distance $D_2(X_i, X_j)$, the Minkowski distance $D_3(X_i, X_j)$, and the included-angle cosine distance $D_4(X_i, X_j)$.

Here $X_i = \{x_{i1}, x_{i2}, \ldots, x_{id}\} \in R^d$ and $X_j = \{x_{j1}, x_{j2}, \ldots, x_{jd}\} \in R^d$ represent two data item samples in the data item set; a smaller distance value indicates greater similarity between the samples, and a larger distance indicates lesser similarity; $i, j = 1, 2, 3, \ldots, N$; $x_{id}$ represents the d-th value of the i-th data item sample $X_i$; and $R^d$ represents the real space of dimension d.
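As a hedged sketch, the four distances can be written in their textbook forms. Two choices here are assumptions rather than statements from the text: the Minkowski order $p$, and the use of $1 - \cos\theta$ for the cosine term so that a smaller value means greater similarity, as the text requires. The coordinate index $t$ is also introduced only for illustration.

```latex
% Assumed: Minkowski order p, and 1 - cos(theta) as the cosine "distance".
\begin{aligned}
D_1(X_i, X_j) &= \sqrt{\sum_{t=1}^{d} (x_{it} - x_{jt})^2}, \\
D_2(X_i, X_j) &= \sum_{t=1}^{d} \lvert x_{it} - x_{jt} \rvert, \\
D_3(X_i, X_j) &= \Bigl(\sum_{t=1}^{d} \lvert x_{it} - x_{jt} \rvert^{p}\Bigr)^{1/p}, \quad p \ge 1, \\
D_4(X_i, X_j) &= 1 - \frac{\sum_{t=1}^{d} x_{it}\, x_{jt}}
                        {\sqrt{\sum_{t=1}^{d} x_{it}^2}\,\sqrt{\sum_{t=1}^{d} x_{jt}^2}}.
\end{aligned}
```

With $p = 1$ the Minkowski distance reduces to the Manhattan distance, and with $p = 2$ to the Euclidean distance.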

The weighted-sum distance $D(X_i, X_j)$ is calculated using the following formula:

$$D(X_i, X_j) = a_1 \cdot D_1(X_i, X_j) + a_2 \cdot D_2(X_i, X_j) + a_3 \cdot D_3(X_i, X_j) + a_4 \cdot D_4(X_i, X_j)$$

where $a_1, a_2, a_3, a_4$ are the weights of the Euclidean distance, Manhattan distance, Minkowski distance, and included-angle cosine distance, respectively, each taking a value in $[0, 1]$, with $a_1 + a_2 + a_3 + a_4 = 1$.
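The weighted-sum distance can be sketched in a few lines. The individual distances use their textbook definitions; the Minkowski order `p=3`, the use of `1 - cos` for the cosine term, the equal default weights, and all function names are assumptions made for illustration, since the text fixes only the weighted sum and the constraints on the weights.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p=3):
    # The order p is an assumed parameter; the text does not state it.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:
        return 1.0  # convention for zero vectors (assumption)
    return 1.0 - dot / norm


def weighted_distance(x, y, weights=(0.25, 0.25, 0.25, 0.25), p=3):
    """D = a1*D1 + a2*D2 + a3*D3 + a4*D4 with weights in [0, 1] summing to 1."""
    a1, a2, a3, a4 = weights
    if any(a < 0 or a > 1 for a in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    return (a1 * euclidean(x, y) + a2 * manhattan(x, y)
            + a3 * minkowski(x, y, p) + a4 * cosine_distance(x, y))
```

Because the four terms live on different scales (the cosine term is bounded while the others are not), the weights in practice also act as a rough normalization; how they are chosen is not specified in the text.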

The invention also comprises a method for acquiring a site home page, which comprises: extracting the domain name address from the URL of a web page and performing redirect processing on the domain name address to obtain the site home page corresponding to the web page; extracting the domain name addresses one by one from the URLs of web pages across the whole network, adding them to a domain name address set after deduplication, and performing redirect processing on all domain name addresses in the set to obtain the corresponding site home pages; or statistically analyzing the link anchor text and URL style characteristics of a site home page sample set to construct a home page classifier, and using the home page classifier to analyze web pages to obtain the home pages of all sites.
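The domain extraction and deduplication part of this method can be sketched with the Python standard library. The function names, the default `https` scheme, and the `resolve` callable are hypothetical: `resolve` stands in for the redirect processing, which would require network access and is left as an identity function here.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lower-cased host part of a URL, or None if absent."""
    host = urlsplit(url).hostname
    return host.lower() if host else None


def collect_domains(urls):
    """Extract domains one by one and deduplicate, preserving first-seen order."""
    seen = {}
    for url in urls:
        host = domain_of(url)
        if host and host not in seen:
            seen[host] = True
    return list(seen)


def home_pages(urls, resolve=lambda u: u, scheme="https"):
    """Map each distinct domain to its home page URL.

    resolve is a placeholder for redirect handling (assumption): it receives
    the candidate home URL and returns the final URL after any redirects.
    """
    return {d: resolve(f"{scheme}://{d}/") for d in collect_domains(urls)}
```

In a real crawler, `resolve` would issue a request and follow redirects; keeping it as a parameter lets the deduplication logic be tested offline.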

The invention also includes a method for acquiring the contact page corresponding to a web page, specifically: statistically analyzing the link anchor text, page title, and URL style characteristics of a contact page sample set of sites to construct a contact page classifier, and using the contact page classifier to analyze web pages to obtain the contact information pages of all sites.
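A contact page classifier of this kind could be sketched as a small naive Bayes model over tokens drawn from the anchor text, page title, and URL. The text does not specify the model, so the choice of naive Bayes with add-one smoothing, the tokenizer, and the names `tokens` and `ContactPageClassifier` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokens(anchor_text, title, url):
    """Lower-cased word tokens from the three feature sources named in the text."""
    return re.findall(r"[a-z0-9]+", f"{anchor_text} {title} {url}".lower())


class ContactPageClassifier:
    """Naive Bayes over tokens with add-one smoothing (an assumed model choice)."""

    def fit(self, samples):
        # samples: iterable of ((anchor_text, title, url), is_contact_page);
        # both classes are assumed to appear at least once.
        self.counts = {True: Counter(), False: Counter()}
        self.docs = Counter()
        for features, label in samples:
            self.counts[label].update(tokens(*features))
            self.docs[label] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, toks, label):
        total = sum(self.counts[label].values()) + len(self.vocab)
        prior = math.log(self.docs[label] / sum(self.docs.values()))
        # Tokens never seen in training are ignored.
        return prior + sum(
            math.log((self.counts[label][t] + 1) / total)
            for t in toks if t in self.vocab
        )

    def predict(self, anchor_text, title, url):
        toks = tokens(anchor_text, title, url)
        return self._log_score(toks, True) > self._log_score(toks, False)
```

The same structure would apply to the home page classifier, with the page title dropped from the feature set as in the text.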

The invention has the following beneficial effects:

1. The data structuring module preprocesses the raw big data and converts it into a network form, yielding network data or structured data, so that the representation learning module can use a representation learning algorithm for network data to extract features from the data quickly and uniformly and express them as high-dimensional vectors; the whole feature extraction process is completed automatically by a representation learning algorithm based on embedding mapping, without human participation, and the computational efficiency is high.

2. Structural information (that is, effective information) in the raw big data is largely preserved during feature extraction, which improves the accuracy of tasks such as classification or prediction that use the feature information; furthermore, because a representation learning algorithm based on embedding mapping is adopted, the data features mined from the raw big data can be represented uniformly as high-dimensional vectors, so the intelligent big data processing system is not limited to a specific application service and can provide a unified and effective processing method for various application services.

Drawings

The foregoing and/or other advantages of the invention will become more apparent from the following description taken in conjunction with the accompanying drawings and the detailed description.

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a schematic representation of the results of an embodiment of the present invention.

Detailed Description

Referring to fig. 1, the invention provides an internet big data analysis and extraction method, which comprises the following steps:

S1, dividing a data object into different parts and types according to the characteristics of the data, and further analyzing to obtain a data range to be extracted;

S2, determining the causal relation between variables by specifying dependent variables and independent variables, establishing a regression model, solving each parameter of the model from measured data, and then evaluating whether the regression model fits the measured data well; if so, further narrowing the data range to be extracted according to the independent variables; the similarity matching algorithm can be applied to fields such as data cleaning, user input error correction, recommendation statistics, plagiarism detection systems, automatic scoring systems, web page searching, and DNA sequence matching;

S3, dividing the data into a plurality of aggregation classes according to the characteristic attributes of the data, such that the elements within each aggregation class share the same characteristics as far as possible and the characteristics of different aggregation classes differ as much as possible, thereby grouping the data to be crawled;

S4, calculating the degree of similarity between two data items using a similarity matching method, the degree of similarity usually being expressed as a percentage, wherein the similarity matching method includes calculation of an average index and a variation index and a graphical representation of the data distribution;

S5, extracting the data sets that appear frequently in the above steps and, according to the attribute characteristics of the data, using word frequency as a statistical index to indicate the data segment information fed back by the data;

S6, decomposing the data segments in the data to be analyzed according to a data segment decomposition regular expression to generate data item values, associating the data item values with the data item name list corresponding to the data segment decomposition regular expression to form intermediate data pairs of data item names and data item values, and performing statistical analysis on the intermediate data pairs according to data statistics rules to obtain a data analysis result.
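Step S6 can be illustrated with a small sketch. Everything specific in it is hypothetical: the segment layout (`Post: ...; City: ...; Salary: ...`), the names `ITEM_NAMES`, `SEGMENT_RE`, and `TAG_RE`, and the counting rule used as the statistics rule. It shows only the general shape of the step: strip markup, decompose a segment with a regular expression, pair values with the name list, and aggregate the pairs.

```python
import re
from collections import Counter

# Hypothetical data item name list and its matching decomposition regex.
ITEM_NAMES = ["post_name", "city", "salary"]
SEGMENT_RE = re.compile(
    r"Post:\s*(?P<post_name>[^;]+);\s*City:\s*(?P<city>[^;]+);\s*Salary:\s*(?P<salary>\d+)"
)
TAG_RE = re.compile(r"<[^>]+>")  # strips HTML markup before decomposition


def decompose(segment):
    """Return intermediate (name, value) pairs for one data segment."""
    match = SEGMENT_RE.search(TAG_RE.sub("", segment))
    if match is None:
        return []
    return [(name, match.group(name).strip()) for name in ITEM_NAMES]


def count_by(segments, item_name):
    """Example statistics rule: count segments per value of one data item."""
    return Counter(
        value
        for seg in segments
        for name, value in decompose(seg)
        if name == item_name
    )
```

Keeping the name list next to the regular expression mirrors the association described in S6: each named group yields one data item value, and the list fixes the order of the resulting pairs.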

The method for acquiring a site home page comprises: extracting the domain name address from the URL of a web page and performing redirect processing on the domain name address to obtain the site home page corresponding to the web page; extracting the domain name addresses one by one from the URLs of web pages across the whole network, adding them to a domain name address set after deduplication, and performing redirect processing on all domain name addresses in the set to obtain the corresponding site home pages; or statistically analyzing the link anchor text and URL style characteristics of a site home page sample set to construct a home page classifier, and using the home page classifier to analyze web pages to obtain the home pages of all sites.

The method for acquiring the contact page corresponding to a web page comprises: statistically analyzing the link anchor text, page title, and URL style characteristics of a contact page sample set of sites to construct a contact page classifier, and using the contact page classifier to analyze web pages to obtain the contact information pages of all sites.

The data structuring module preprocesses the raw big data and converts it into a network form, yielding network data or structured data, so that the representation learning module can use a representation learning algorithm for network data to extract features from the data quickly and uniformly and express them as high-dimensional vectors; the whole feature extraction process is completed automatically by a representation learning algorithm based on embedding mapping, without human participation, and the computational efficiency is high.

Structural information (that is, effective information) in the raw big data is largely preserved during feature extraction, which improves the accuracy of tasks such as classification or prediction that use the feature information; furthermore, because a representation learning algorithm based on embedding mapping is adopted, the data features mined from the raw big data can be represented uniformly as high-dimensional vectors, so the intelligent big data processing system is not limited to a specific application service and can provide a unified and effective processing method for various application services.

Using the method, data such as job postings and job seeker information from a recruitment website are extracted and analyzed. After extraction, the recruitment information is divided into the following data dimensions: post name, salary requirements, working city, years of work experience, nature of work, education requirements, recruiters, post description, post responsibilities, post benefits, detailed work location, post publisher name, company name, industry of the company, company personnel, company nature, company description, company official website address, and the like. After extraction, the job-seeking information is divided into the following data dimensions: job seeker's name, gender, date of birth, political affiliation, years of work experience, graduating institution, education level, position, desired salary, technical skills, work experience, honors and certificates, mobile phone number, email address, address, and the like.

Experiments were carried out following the technical steps of the invention: the crawled Internet data set was screened using the above data dimensions as effective features, the screened results (that is, the effective data) were used as the data set for the next experiment, and the accuracy of the experimental results was verified on 0.1% of the extracted data set. The data set is shown in Table 1, and the experimental results are shown in FIG. 2.

TABLE 1

The invention provides an Internet big data analysis and extraction method, and there are many methods and ways to implement this technical solution. The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.

Claims

1. The Internet big data analysis and extraction method is characterized by comprising the following steps:

step 3, dividing the data into more than two aggregation classes according to the characteristic attribute of the data, wherein elements in each aggregation class have the same characteristics, and grouping the data to be grabbed;

step 5, extracting the frequently occurring data in the steps 1 to 4, and according to the attribute characteristics of the data, using word frequency as a statistical index to indicate the data segment information fed back by the data;

step 6, carrying out data decomposition on the data segment in the data to be analyzed according to the data segment decomposition regular expression, generating a data item value, associating the data item value with a data item name list corresponding to the data segment decomposition regular expression, forming an intermediate data pair corresponding to the data item name and the data item value, and carrying out statistical analysis on the intermediate data pair according to a data statistical rule to obtain a data analysis result;

The Manhattan distance $D_2(X_i, X_j)$ is calculated using the following formula:

The Minkowski distance $D_3(X_i, X_j)$ is calculated using the following formula:

where $a_1, a_2, a_3, a_4$ are the weights of the Euclidean distance, Manhattan distance, Minkowski distance, and included-angle cosine distance, respectively, each taking a value in $[0, 1]$, with $a_1 + a_2 + a_3 + a_4 = 1$;

The method further comprises a method for acquiring a site home page, comprising: extracting the domain name address from the URL of a web page and performing redirect processing on the domain name address to obtain the site home page corresponding to the web page; extracting the domain name addresses one by one from the URLs of web pages across the whole network, adding them to a domain name address set after deduplication, and performing redirect processing on all domain name addresses in the set to obtain the corresponding site home pages; or statistically analyzing the link anchor text and URL style characteristics of a site home page sample set to construct a home page classifier, and using the home page classifier to analyze web pages to obtain the home pages of all sites.

2. The method of claim 1, wherein step 2 comprises: letting the independent-variable data object be $X = \{x_1, x_2, \ldots, x_m\}$ and the corresponding dependent variable be $y = \{y_1, y_2, \ldots, y_m\}$, the regression model being:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + \mu$$

where $x_m$ and $y_m$ denote the m-th independent variable and the m-th dependent variable, respectively; $w = \{w_0, w_1, w_2, \ldots, w_m\}$ is the set of regression coefficients, $w_m$ denotes the m-th regression coefficient, and $\mu$ is the random error; the fitting error $L(X)$ is measured by the squared error, and minimizing it yields the parameter estimate of the regression coefficients $w$;

wherein $w = w^{T}W$.

3. The method according to claim 2, wherein in step 4, the similarity between data objects in different groups is required to be low and the similarity between data objects in the same group is required to be high; the objective function J is calculated as the sum of squared errors of all objects in the measured data set, where $x_i$ represents any object in the data set and $u_j$ is the center of the j-th aggregation class $C_j$; the goal is for the objective function to converge.

4. The method according to claim 3, further comprising a method for acquiring the contact page corresponding to a web page, specifically: statistically analyzing the link anchor text, page title, and URL style characteristics of a contact page sample set of sites to construct a contact page classifier, and using the contact page classifier to analyze web pages to obtain the contact information pages of all sites.