CN114117292B - Internet big data analysis and extraction method - Google Patents
- Publication number
- CN114117292B
- Authority
- CN
- China
- Prior art keywords
- data
- distance
- similarity
- domain name
- following
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an Internet big data analysis and extraction method comprising the following steps: step 1, dividing the data objects into different parts and types according to the characteristics of the data, to obtain the data range to be extracted; step 2, establishing a regression model, solving the model parameters from measured data, evaluating whether the regression model fits the measured data and, if so, further narrowing the data range to be extracted according to the independent variables; step 3, dividing the data into more than two aggregation classes according to the characteristic attributes of the data, the elements in each aggregation class having the same characteristics, so as to group the data to be crawled; step 4, calculating the degree of similarity between two data items using a similarity matching method; step 5, using word frequency as a statistical index to indicate the data segment information fed back by the data; and step 6, obtaining a data analysis result. The method is completed automatically using a representation learning algorithm based on embedding mapping and has high computational efficiency.
Description
Technical Field
The invention belongs to the technical field of big data, and particularly relates to an Internet big data analysis and extraction method.
Background
Big data refers to data sets that cannot be captured, managed and processed by conventional software tools within a certain time frame. It is a massive, fast-growing and diversified information asset that requires new processing modes in order to provide stronger decision-making, insight discovery and process optimization capabilities.
At present, many approaches use web crawlers to grab relevant information from public websites and then structure and store it. Such data is polluted by large amounts of useless information, such as expired information and phishing-website information, so its accuracy and practicality are low. An Internet data extraction method that improves the reliability and accuracy of the data therefore needs to be studied in depth.
Existing intelligent big data processing systems have at least the following disadvantages: existing data technology lacks analysis of unstructured data, so a large amount of effective information is lost and the analysis results for the service are affected; existing data analysis and extraction rely excessively on manual feature extraction, so accuracy is low, computational efficiency is poor, responses to user requests are slow, and the user experience suffers; different services typically employ different data processing and feature extraction methods, resulting in a large amount of redundant data processing, and the data unit features of different services are not compatible with each other.
Disclosure of Invention
Object of the invention: to overcome the defects of the prior art, the invention provides an Internet big data analysis and extraction method that eliminates data of low accuracy and reliability in order to obtain verified data of higher accuracy and reliability.
The method specifically comprises the following steps:
step 1, dividing a data object into different parts and types according to the characteristics of the data, and further analyzing to obtain a data range to be extracted;
step 2, determining causal relation between variables by defining dependent variables and independent variables, establishing a regression model, solving each parameter of the model according to measured data, evaluating whether the regression model can fit the measured data, and if so, further reducing the range of the data to be extracted according to the independent variables;
step 3, dividing the data into more than two aggregation classes according to the characteristic attributes of the data, the elements in each aggregation class having the same characteristics, so as to group the data to be crawled. The characteristic attributes are used to represent the data and can be obtained by statistical analysis; for the Internet text data used in the invention, they include the source website, topic, words, word-frequency statistics and the like. Step 3 first performs a preliminary grouping, which amounts to initialization, and is followed by further refinement and extraction;
step 4, calculating the similarity degree of the two data by adopting a similarity matching method;
step 5, extracting the data that occur frequently in steps 1 to 4 (those whose word-frequency statistics rank in the top 20%) and, according to the attribute characteristics of the data, using word frequency as a statistical index to indicate the data segment information fed back by the data;
step 6, decomposing the data segments in the data to be analyzed according to a data segment decomposition regular expression to generate data item values, associating the data item values with the data item name list corresponding to the data segment decomposition regular expression to form intermediate data pairs of data item name and data item value, and performing statistical analysis on the intermediate data pairs according to data statistical rules to obtain a data analysis result. Regular expressions are a computer text-processing technique; Internet text contains many format symbols (such as HTML markup), which need to be processed and filtered by means of regular expressions.
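Step 6 can be illustrated with a minimal Python sketch. The markup-stripping pattern, the segment pattern, the item names (`post_name`, `city`, `salary`), the counting rule and the sample text are all hypothetical; the patent does not specify concrete regular expressions or statistical rules.

```python
import re
from collections import Counter

# Hypothetical pattern that strips HTML-style markup such as <b> or </p>.
TAG_RE = re.compile(r"<[^>]+>")

# Hypothetical data segment decomposition regular expression; it has one
# capture group per entry of the data item name list below.
SEGMENT_RE = re.compile(r"Post:\s*(.+?);\s*City:\s*(.+?);\s*Salary:\s*(\d+)")
ITEM_NAMES = ["post_name", "city", "salary"]


def decompose(segment):
    """Return the (data item name, data item value) pairs of one segment."""
    text = TAG_RE.sub("", segment)
    match = SEGMENT_RE.search(text)
    if match is None:
        return []
    return list(zip(ITEM_NAMES, match.groups()))


def count_values(segments, item_name):
    """Example statistical rule: count how often each value of an item occurs."""
    counts = Counter()
    for segment in segments:
        for name, value in decompose(segment):
            if name == item_name:
                counts[value] += 1
    return counts


segments = [
    "<p>Post: Data Engineer; City: <b>Nanjing</b>; Salary: 15000</p>",
    "<p>Post: Analyst; City: Nanjing; Salary: 12000</p>",
    "<p>no structured content</p>",
]
pairs = decompose(segments[0])
city_counts = count_values(segments, "city")
```

Keeping the name list next to the pattern makes the pairing between capture groups and data item names explicit, which is the association the step describes.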
Preferably, in step 2, the similarity matching algorithm can be applied to fields such as data cleaning, user input error correction, recommendation statistics, plagiarism detection systems, automatic scoring systems, web page search and DNA sequence matching. In step 2, the measured data are the data from actual experimental tests, that is, the input data, and the independent variables are derived from the measured data.
Step 2 comprises: let the independent-variable data object be $X = \{x_1, x_2, \ldots, x_m\}$ and the corresponding dependent variable be $y = \{y_1, y_2, \ldots, y_m\}$; the regression model is:
$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + \mu$$
where $x_m$ and $y_m$ denote the $m$-th independent variable and the $m$-th dependent variable respectively; $w = \{w_0, w_1, w_2, \ldots, w_m\}$ is the set of regression coefficients, $w_m$ is the $m$-th regression coefficient, and $\mu$ is the random error. The fitting error $L(X)$ is measured by the squared error:
$$L(X) = (y - Xw)^{T}(y - Xw)$$
Setting $\partial L(X) / \partial w = 0$ gives:
$$\hat{w} = (X^{T}X)^{-1}X^{T}y$$
where $\hat{w}$ is the parameter estimate of the regression coefficients $w$;
the under-fitting problem is addressed by locally weighted linear regression, which adds a weight $w_i$ to the error; the error becomes:
$$L(X) = (y - Xw)^{T} W (y - Xw)$$
where $W$ is a diagonal matrix holding the weights. A Gaussian kernel is used, and the corresponding weight function $W(j, j)$ is:
$$W(j, j) = \exp\left(-\frac{\lVert x_j - x \rVert^{2}}{2k^{2}}\right)$$
where $x$ is the point at which the prediction is made and $k$ sets the variance (width) of the Gaussian function, giving the new regression coefficient $\hat{w}$:
$$\hat{w} = (X^{T} W X)^{-1} X^{T} W y$$
wherein w=w T W。
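The regression step can be sketched in Python for the simplest case of one independent variable. This is an illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the model is reduced to y = w0 + w1·x, and the Gaussian kernel is assumed to have the common form exp(-(x_j - x)² / (2k²)).

```python
import math


def weighted_line_fit(xs, ys, weights):
    """Weighted least squares for y = w0 + w1 * x; returns (w0, w1)."""
    total = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / total
    y_mean = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for w, x, y in zip(weights, xs, ys))
    w1 = sxy / sxx
    w0 = y_mean - w1 * x_mean
    return w0, w1


def ols_fit(xs, ys):
    """Ordinary least squares: every sample gets weight 1."""
    return weighted_line_fit(xs, ys, [1.0] * len(xs))


def squared_error(xs, ys, w0, w1):
    """Sum of squared residuals, used here as the fitting error L(X)."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))


def lwlr_predict(x_query, xs, ys, k=1.0):
    """Locally weighted prediction at x_query (assumed Gaussian kernel, width k)."""
    weights = [math.exp(-((x - x_query) ** 2) / (2.0 * k ** 2)) for x in xs]
    w0, w1 = weighted_line_fit(xs, ys, weights)
    return w0 + w1 * x_query


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x
w0, w1 = ols_fit(xs, ys)
```

On data that is not linear, a small `k` makes the fit follow nearby samples more closely, which is how the local weighting counters under-fitting; a very large `k` approaches the ordinary least-squares line.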
In step 4, the similarity between data objects in different groups is required to be low and the similarity between data objects in the same group is required to be high; $J$ is calculated by the following objective function:
$$J = \sum_{j} \sum_{x_i \in C_j} \lVert x_i - u_j \rVert^{2}$$
where $J$ is the sum of squared errors of all objects in the measured data set, $x_i$ is any object in the data set, and $u_j$ is the center of the $j$-th aggregation class (cluster) $C_j$. The goal is for the objective function to converge.
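A sum-of-squared-errors objective of this kind is the one minimized by k-means-style clustering. The patent does not name the algorithm, so the alternating assign/update loop below is an assumption, and all function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def sse(points, centers, labels):
    """Objective J: squared distance of every object to its cluster center."""
    return sum(squared_distance(p, centers[j]) for p, j in zip(points, labels))


def assign(points, centers):
    """Assign each object to its nearest center (ties go to the lower index)."""
    return [min(range(len(centers)),
                key=lambda j: squared_distance(p, centers[j]))
            for p in points]


def update(points, labels, centers):
    """Move each center to the mean of its members; empty clusters stay put."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(len(old))))
    return new_centers


def cluster(points, centers, max_iter=100, tol=1e-12):
    """Alternate update and assignment until J stops decreasing."""
    labels = assign(points, centers)
    j_value = sse(points, centers, labels)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        labels = assign(points, centers)
        new_value = sse(points, centers, labels)
        converged = j_value - new_value < tol
        j_value = new_value
        if converged:
            break
    return centers, labels, j_value


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels, j_value = cluster(points, [(0.0, 0.0), (10.0, 10.0)])
```

Each update and each reassignment can only lower J or leave it unchanged, so stopping when J no longer decreases matches the stated goal of convergence.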
In step 4, the similarity matching method includes the calculation of average indices and variation indices and a graphical representation of the data distribution. The similarity between two data items is measured by calculating the distance between them, taking the Euclidean distance, the Manhattan distance, the Minkowski distance and the included-angle cosine distance into account together. The calculation formulas are as follows:
the Euclidean distance $D_1(X_i, X_j)$ is calculated using the following formula:
$$D_1(X_i, X_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^{2}}$$
The Manhattan distance $D_2(X_i, X_j)$ is calculated using the following formula:
$$D_2(X_i, X_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{id} - x_{jd}|$$
The Minkowski distance $D_3(X_i, X_j)$ is calculated using the following formula, where $p \geq 1$ is the order of the distance:
$$D_3(X_i, X_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^{p} \right)^{1/p}$$
The included-angle cosine distance $D_4(X_i, X_j)$ is calculated using the following formula:
$$D_4(X_i, X_j) = 1 - \frac{\sum_{l=1}^{d} x_{il}\, x_{jl}}{\sqrt{\sum_{l=1}^{d} x_{il}^{2}}\; \sqrt{\sum_{l=1}^{d} x_{jl}^{2}}}$$
where $X_i = \{x_{i1}, x_{i2}, \ldots, x_{id}\} \in R^{d}$ and $X_j = \{x_{j1}, x_{j2}, \ldots, x_{jd}\} \in R^{d}$ are two data item samples in the data item set; a smaller distance indicates greater similarity between the samples and a larger distance indicates lesser similarity; $i, j = 1, 2, 3, \ldots, N$; $x_{id}$ is the $d$-th value of the $i$-th data item sample $X_i$; $R^{d}$ is the $d$-dimensional real space;
the weighted sum distance $D(X_i, X_j)$ is calculated using the following formula:
$$D(X_i, X_j) = a_1 \cdot D_1(X_i, X_j) + a_2 \cdot D_2(X_i, X_j) + a_3 \cdot D_3(X_i, X_j) + a_4 \cdot D_4(X_i, X_j)$$
where $a_1, a_2, a_3, a_4$ are the weights of the Euclidean distance, the Manhattan distance, the Minkowski distance and the included-angle cosine distance respectively, each taking a value in $[0, 1]$, with $a_1 + a_2 + a_3 + a_4 = 1$.
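The weighted combination of the four distances can be sketched in Python as follows. Several details are assumptions rather than values from the patent: the Minkowski order `p=3`, the equal default weights, the function names, and the reading of the included-angle cosine distance as one minus the cosine of the angle, chosen so that a smaller value means greater similarity.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def minkowski(a, b, p=3):
    """Minkowski distance of order p (p=3 is an arbitrary illustrative choice)."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)


def cosine_distance(a, b):
    """1 - cos(angle between a and b); undefined for all-zero vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    norm = math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b))
    return 1.0 - dot / norm


def combined_distance(a, b, weights=(0.25, 0.25, 0.25, 0.25), p=3):
    """Weighted sum a1*D1 + a2*D2 + a3*D3 + a4*D4 with weights in [0, 1] summing to 1."""
    if any(w < 0 or w > 1 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    parts = (euclidean(a, b), manhattan(a, b),
             minkowski(a, b, p), cosine_distance(a, b))
    return sum(w * d for w, d in zip(weights, parts))


a, b = (1.0, 0.0), (0.0, 1.0)
d = combined_distance(a, b)
```

Because the four distances live on different scales, the weights also act as an implicit normalization; in practice they would need to be tuned to the data.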
The invention also comprises a method for acquiring site home pages, which comprises: extracting the domain name address from the URL of a web page and performing redirect (jump) processing on the domain name address to obtain the site home page corresponding to the web page; extracting the domain name addresses one by one from the URLs of web pages across the whole network, adding them to a domain name address set after de-duplication, and performing redirect processing on all domain name addresses in the set to obtain the corresponding site home pages; or statistically analyzing the link anchor text and URL style features of a sample set of site home pages to construct a home page classifier, and using the home page classifier to analyze web pages to obtain the home pages of all sites.
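The domain extraction and de-duplication part of this procedure can be sketched with Python's standard `urllib.parse` module. The function names, the `https` default and the example URLs are hypothetical, and the redirect (jump) processing itself is not performed here, since it would require an HTTP client and network access.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lower-cased host name of a URL, or None if it has none."""
    host = urlsplit(url).hostname
    return host.lower() if host else None


def unique_domains(urls):
    """Collect domain addresses one by one, skipping duplicates, keeping order."""
    seen = set()
    ordered = []
    for url in urls:
        domain = domain_of(url)
        if domain and domain not in seen:
            seen.add(domain)
            ordered.append(domain)
    return ordered


def home_page_candidates(urls, scheme="https"):
    """Build one candidate home-page URL per unique domain.

    Each candidate would still have to be requested, following redirects,
    to find the page the site actually serves as its home page.
    """
    return [f"{scheme}://{domain}/" for domain in unique_domains(urls)]


crawled = [
    "https://Example.com/jobs/1?id=2",
    "http://example.com/about",
    "http://example.org:8080/a",
    "not a url",
]
candidates = home_page_candidates(crawled)
```

De-duplicating before any request is made keeps the number of redirect lookups proportional to the number of sites rather than the number of crawled pages.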
The invention also includes a method for acquiring the contact page corresponding to a web page, which specifically comprises: statistically analyzing the link anchor text, page title and URL style features of a sample set of site contact pages to construct a contact page classifier, and using the contact page classifier to analyze web pages to obtain the contact information pages of all sites.
The invention has the following beneficial effects:
1. The data structuring module preprocesses the original big data and converts it into network data or structured data, so that the representation learning module can apply a representation learning algorithm for network data to extract features from the data quickly and uniformly and express them as high-dimensional vectors. The whole feature extraction process is completed automatically by a representation learning algorithm based on embedding mapping, without human participation, and is computationally efficient.
2. The structural information (that is, the effective information) in the original big data is largely preserved during feature extraction, which improves the accuracy of tasks such as classification or prediction that use the feature information. Furthermore, because a representation learning algorithm based on embedding mapping is used, the data features mined from the original big data can be represented uniformly as high-dimensional vectors, so the intelligent big data processing system is not limited to a specific application service and can provide a uniform and effective processing method for various application services.
Drawings
The foregoing and other advantages of the invention will become more apparent from the following description taken in conjunction with the accompanying drawings and the specific embodiments.
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic representation of the results of an embodiment of the present invention.
Detailed Description
Referring to fig. 1, the invention provides an internet big data analysis and extraction method, which comprises the following steps:
S1, dividing the data objects into different parts and types according to the characteristics of the data, and analyzing further to obtain the data range to be extracted;
S2, determining the causal relationship between variables by specifying dependent and independent variables, establishing a regression model, solving the model parameters from measured data, and then evaluating whether the regression model fits the measured data well; if so, further narrowing the data range to be extracted according to the independent variables. The similarity matching algorithm can be applied to fields such as data cleaning, user input error correction, recommendation statistics, plagiarism detection systems, automatic scoring systems, web page search and DNA sequence matching;
S3, dividing the data into a plurality of aggregation classes according to the characteristic attributes of the data, such that the elements within each aggregation class share the same characteristics as far as possible and the characteristics of different aggregation classes differ as much as possible, and grouping the data to be crawled accordingly;
S4, calculating the degree of similarity between two data items using a similarity matching method, the similarity usually being expressed as a percentage; the similarity matching method includes the calculation of average indices and variation indices and a graphical representation of the data distribution;
S5, extracting the data that occur frequently in the above steps and, according to the attribute characteristics of the data, using word frequency as a statistical index to indicate the data segment information fed back by the data;
S6, decomposing the data segments in the data to be analyzed according to a data segment decomposition regular expression to generate data item values, associating the data item values with the data item name list corresponding to the data segment decomposition regular expression to form intermediate data pairs of data item name and data item value, and performing statistical analysis on the intermediate data pairs according to data statistical rules to obtain a data analysis result.
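The word-frequency selection in S5 can be sketched in Python as below. The tokenization rule, the function name and the sample texts are hypothetical, and reading "top 20%" as the highest-ranked fifth of the distinct terms is an assumption; the text does not state how the threshold is applied.

```python
import math
import re
from collections import Counter


def top_frequency_terms(texts, fraction=0.2):
    """Return the most frequent terms, keeping the top `fraction` by rank."""
    counts = Counter()
    for text in texts:
        # Hypothetical tokenization: lower-case runs of letters and digits.
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    if not counts:
        return []
    keep = max(1, math.ceil(len(counts) * fraction))
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:keep]


texts = [
    "data engineer data analyst",
    "data engineer python",
    "java engineer",
]
frequent = top_frequency_terms(texts)
```

For Chinese-language pages the regular-expression tokenizer would have to be replaced by a word segmenter, since words are not separated by spaces.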
The method for acquiring site home pages comprises: extracting the domain name address from the URL of a web page and performing redirect (jump) processing on the domain name address to obtain the site home page corresponding to the web page; extracting the domain name addresses one by one from the URLs of web pages across the whole network, adding them to a domain name address set after de-duplication, and performing redirect processing on all domain name addresses in the set to obtain the corresponding site home pages; or statistically analyzing the link anchor text and URL style features of a sample set of site home pages to construct a home page classifier, and using the home page classifier to analyze web pages to obtain the home pages of all sites.
The method for acquiring the contact page corresponding to a web page comprises: statistically analyzing the link anchor text, page title and URL style features of a sample set of site contact pages to construct a contact page classifier, and using the contact page classifier to analyze web pages to obtain the contact information pages of all sites.
The data structuring module preprocesses the original big data and converts it into network data or structured data, so that the representation learning module can apply a representation learning algorithm for network data to extract features from the data quickly and uniformly and express them as high-dimensional vectors. The whole feature extraction process is completed automatically by a representation learning algorithm based on embedding mapping, without human participation, and is computationally efficient.
The structural information (that is, the effective information) in the original big data is largely preserved during feature extraction, which improves the accuracy of tasks such as classification or prediction that use the feature information. Furthermore, because a representation learning algorithm based on embedding mapping is used, the data features mined from the original big data can be represented uniformly as high-dimensional vectors, so the intelligent big data processing system is not limited to a specific application service and can provide a uniform and effective processing method for various application services.
Using the method, data such as job postings and job-seeker information from a recruitment website were extracted and analyzed. After extraction, the job postings were separated into the following data dimensions: post name, salary requirements, working city, years of work experience, nature of work, educational requirements, number of recruits, post description, post responsibilities, post benefits, detailed work location, name of the post publisher, company name, industry of the company, company headcount, nature of the company, company description, official company website and the like. After extraction, the job-seeker information was separated into the following data dimensions: name of the job seeker, gender, date of birth, political affiliation, years of work experience, graduating institution, educational background, position sought, desired salary, technical skills, work experience, certificates of honor, mobile phone, e-mail address, address and the like.
Experiments were carried out following the technical steps of the invention. The crawled Internet data set was screened using the above data dimensions as effective features, and the screened result (that is, the effective data) was used as the data set for the next experiment. The accuracy of the experimental results was verified on 0.1% of the extracted data set. The data set is shown in Table 1 and the experimental results are shown in FIG. 2.
TABLE 1
The invention provides an Internet big data analysis and extraction method, and there are many methods and ways of implementing this technical solution. The above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art can make several improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also fall within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.
Claims (4)
1. The Internet big data analysis and extraction method is characterized by comprising the following steps:
step 1, dividing a data object into different parts and types according to the characteristics of the data, and further analyzing to obtain a data range to be extracted;
step 2, determining causal relation between variables by defining dependent variables and independent variables, establishing a regression model, solving each parameter of the model according to measured data, evaluating whether the regression model can fit the measured data, and if so, further reducing the range of the data to be extracted according to the independent variables;
step 3, dividing the data into more than two aggregation classes according to the characteristic attribute of the data, wherein elements in each aggregation class have the same characteristics, and grouping the data to be grabbed;
step 4, calculating the similarity degree of the two data by adopting a similarity matching method;
step 5, extracting the frequently occurring data in the steps 1 to 4, and according to the attribute characteristics of the data, using word frequency as a statistical index to indicate the data segment information fed back by the data;
step 6, carrying out data decomposition on the data segment in the data to be analyzed according to the data segment decomposition regular expression, generating a data item value, associating the data item value with a data item name list corresponding to the data segment decomposition regular expression, forming an intermediate data pair corresponding to the data item name and the data item value, and carrying out statistical analysis on the intermediate data pair according to a data statistical rule to obtain a data analysis result;
in step 4, the similarity matching method includes the calculation of average indices and variation indices and a graphical representation of the data distribution; the similarity between two data items is measured by calculating the distance between them, taking the Euclidean distance, the Manhattan distance, the Minkowski distance and the included-angle cosine distance into account together, the calculation formulas being as follows:
the Euclidean distance $D_1(X_i, X_j)$ is calculated using the following formula:
$$D_1(X_i, X_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^{2}}$$
the Manhattan distance $D_2(X_i, X_j)$ is calculated using the following formula:
$$D_2(X_i, X_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{id} - x_{jd}|$$
the Minkowski distance $D_3(X_i, X_j)$ is calculated using the following formula, where $p \geq 1$ is the order of the distance:
$$D_3(X_i, X_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^{p} \right)^{1/p}$$
the included-angle cosine distance $D_4(X_i, X_j)$ is calculated using the following formula:
$$D_4(X_i, X_j) = 1 - \frac{\sum_{l=1}^{d} x_{il}\, x_{jl}}{\sqrt{\sum_{l=1}^{d} x_{il}^{2}}\; \sqrt{\sum_{l=1}^{d} x_{jl}^{2}}}$$
where $X_i = \{x_{i1}, x_{i2}, \ldots, x_{id}\} \in R^{d}$ and $X_j = \{x_{j1}, x_{j2}, \ldots, x_{jd}\} \in R^{d}$ are two data item samples in the data item set, a smaller distance indicating greater similarity between the samples and a larger distance indicating lesser similarity; $i, j = 1, 2, 3, \ldots, N$; $x_{id}$ is the $d$-th value of the $i$-th data item sample $X_i$; $R^{d}$ is the $d$-dimensional real space;
the weighted sum distance $D(X_i, X_j)$ is calculated using the following formula:
$$D(X_i, X_j) = a_1 \cdot D_1(X_i, X_j) + a_2 \cdot D_2(X_i, X_j) + a_3 \cdot D_3(X_i, X_j) + a_4 \cdot D_4(X_i, X_j)$$
where $a_1, a_2, a_3, a_4$ are the weights of the Euclidean distance, the Manhattan distance, the Minkowski distance and the included-angle cosine distance respectively, each taking a value in $[0, 1]$, with $a_1 + a_2 + a_3 + a_4 = 1$;
the method further comprises acquiring site home pages, which comprises: extracting the domain name address from the URL of a web page and performing redirect (jump) processing on the domain name address to obtain the site home page corresponding to the web page; extracting the domain name addresses one by one from the URLs of web pages across the whole network, adding them to a domain name address set after de-duplication, and performing redirect processing on all domain name addresses in the set to obtain the corresponding site home pages; or statistically analyzing the link anchor text and URL style features of a sample set of site home pages to construct a home page classifier, and using the home page classifier to analyze web pages to obtain the home pages of all sites.
2. The method of claim 1, wherein step 2 comprises: letting the independent-variable data object be $X = \{x_1, x_2, \ldots, x_m\}$ and the corresponding dependent variable be $y = \{y_1, y_2, \ldots, y_m\}$, the regression model being:
$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + \mu$$
where $x_m$ and $y_m$ denote the $m$-th independent variable and the $m$-th dependent variable respectively; $w = \{w_0, w_1, w_2, \ldots, w_m\}$ is the set of regression coefficients, $w_m$ is the $m$-th regression coefficient, and $\mu$ is the random error; the fitting error $L(X)$ is measured by the squared error:
$$L(X) = (y - Xw)^{T}(y - Xw)$$
setting $\partial L(X) / \partial w = 0$ gives:
$$\hat{w} = (X^{T}X)^{-1}X^{T}y$$
where $\hat{w}$ is the parameter estimate of the regression coefficients $w$;
the under-fitting problem is addressed by locally weighted linear regression, which adds a weight $w_i$ to the error, the error becoming:
$$L(X) = (y - Xw)^{T} W (y - Xw)$$
where $W$ is a diagonal matrix holding the weights; a Gaussian kernel is used, and the corresponding weight function $W(j, j)$ is:
$$W(j, j) = \exp\left(-\frac{\lVert x_j - x \rVert^{2}}{2k^{2}}\right)$$
where $x$ is the point at which the prediction is made and $k$ sets the variance (width) of the Gaussian function, giving the new regression coefficient $\hat{w}$:
$$\hat{w} = (X^{T} W X)^{-1} X^{T} W y$$
wherein w=w T W。
3. The method according to claim 2, wherein in step 4 the similarity between data objects in different groups is required to be low and the similarity between data objects in the same group is required to be high, $J$ being calculated by the following objective function:
$$J = \sum_{j} \sum_{x_i \in C_j} \lVert x_i - u_j \rVert^{2}$$
where $J$ is the sum of squared errors of all objects in the measured data set, $x_i$ is any object in the data set, and $u_j$ is the center of the $j$-th aggregation class $C_j$; the goal is for the objective function to converge.
4. A method according to claim 3, further comprising: the method for acquiring the contact page corresponding to the webpage specifically comprises the following steps: and utilizing a contact page sample set of the sites to statistically analyze the link anchor text, the page title and the website style characteristics of the contact page sample set to construct a contact page classifier, and utilizing the contact page classifier to analyze the webpage to obtain contact information pages of all the sites.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111298638.4A CN114117292B (en) | 2021-11-04 | 2021-11-04 | Internet big data analysis and extraction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111298638.4A CN114117292B (en) | 2021-11-04 | 2021-11-04 | Internet big data analysis and extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114117292A (en) | 2022-03-01 |
CN114117292B (en) | 2024-04-16 |
Family
ID=80381256
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111298638.4A Active CN114117292B (en) | 2021-11-04 | 2021-11-04 | Internet big data analysis and extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114117292B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003186901A (en) * | 2001-12-21 | 2003-07-04 | Nippon Telegr & Teleph Corp <Ntt> | Web SITE RETRIEVAL METHOD AND SYSTEM, EXECUTION PROGRAM FOR THE METHOD, AND RECORDING MEDIUM WITH ITS PROGRAM RECORDED THEREON |
CN103514234A (en) * | 2012-06-30 | 2014-01-15 | 北京百度网讯科技有限公司 | Method and device for extracting page information |
CN107657032A (en) * | 2017-09-28 | 2018-02-02 | 佛山市南方数据科学研究院 | A kind of internet big data analyzes extracting method |
WO2018068360A1 (en) * | 2016-10-11 | 2018-04-19 | 国云科技股份有限公司 | Method for obtaining regression relationships between dependent variables and independent variables during data analysis |
CN109241446A (en) * | 2018-10-17 | 2019-01-18 | 重庆聚焦人才服务有限公司 | A kind of position recommended method and system |
Non-Patent Citations (2)
Title |
---|
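Research on mobile user credit scoring based on the LightGBM algorithm; 国强强; 朱振方; Computer Technology and Development (计算机技术与发展); 2020-09-10 (No. 09); full text * |
Forum body-text extraction based on a text-frequency page segmentation algorithm; 马凯凯; 钱亚赫; 阮东跃; China Water Transport, second half of the month (中国水运(下半月)); 2018-02-15 (No. 02); full text * |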
Also Published As
Publication number | Publication date |
---|---|
CN114117292A (en) | 2022-03-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||