CN107657032A

CN107657032A - Internet big data analysis and extraction method

Info

Publication number: CN107657032A
Application number: CN201710897053.1A
Authority: CN
Inventors: 冯焕霞; 张剑; 彭金祥; 李正淳
Original assignee: Foshan Academy Of South Data Sciences
Current assignee: Foshan Academy Of South Data Sciences
Priority date: 2017-09-28
Filing date: 2017-09-28
Publication date: 2018-02-02

Abstract

The invention belongs to the technical field of data processing and discloses an internet big data analysis and extraction method. The method determines the basic elements and basic direction of the analysis so that analysis and extraction are accurately targeted. Contact details of physical organizations are extracted from website homepages and contact pages; multi-source data are verified, integrated, and associated, and visual segmentation is used to obtain the web page blocks that contain address information, yielding structured geographic location information. This improves the accuracy of the information, allows information about physical organizations across the whole internet to be recalled automatically, reduces labor cost, and improves the information recall rate. In addition, a data segment decomposition regular expression and a corresponding data item name list are set according to the text classification analysis requirements, so that the data to be analyzed can be decomposed and extracted for different requirements, which improves the generality of the data analysis.

Description

An internet big data analysis and extraction method

Technical field

The invention belongs to the technical field of data processing, and in particular relates to an internet big data analysis and extraction method.

Background technology

Modern society is developing at high speed: science and technology are flourishing, information circulates freely, communication between people is ever closer, and life is increasingly convenient. Big data is a product of this network era. However, existing information extraction methods are usually template-based structured information extraction methods, in which manually written template rules are used to mine page information on the internet and obtain structured information. This approach not only consumes a great deal of labor but also requires the mined objects to share the same structural features in the page. It is therefore limited by labor cost and by the uniformity of page structure, and cannot be applied at large scale. Meanwhile, in existing data analysis systems, a dedicated data analysis method has to be designed for each different text classification analysis requirement. As a result, data analysis is inefficient and costly.

Developing a text data rule model mainly comprises model development, model testing, and model monitoring. Model development means writing rules from the development corpus; the rules, together with their corresponding classifications, form the model. Model testing uses sample data to test the model in order to evaluate its performance and accuracy. While the model is running, it must also be tested periodically to monitor its behavior in the actual production environment.

The existing model development process involves many people: business personnel who manage samples, model developers who develop the model, model testers who test the model, and operations personnel who maintain the model after it goes online. These people are scattered across departments, and offline communication among them is inconvenient. Moreover, rule model development is a continuously iterating, continuously optimized process, which further increases the communication workload.

In other words, rule model development is a development process that requires close collaboration and continuous iteration, whereas the existing development process is functionally fragmented and hard to coordinate. A reasonable and scientific engineering method is therefore urgently needed to optimize the rule model development process, so that developers need only concentrate on the model itself.

In summary, the problems of the prior art are as follows: existing information mining consumes a great deal of labor, and the mined objects are limited by labor cost and page structure uniformity, so the approach cannot be applied at large scale; data analysis in existing data analysis systems is inefficient and costly; and the existing text data rule model development process is functionally fragmented and hard to coordinate.

Summary of the invention

In view of the problems of the prior art, the invention provides an internet big data analysis and extraction method.

The invention is implemented as follows:

An internet big data analysis and extraction method, comprising the following steps:

Step 1, determine the basic elements of the extraction; specifically:

1) Sample the development samples and label them by manual interpretation to generate the development corpus:

Determine the sampling data source according to the specific business requirements; determine the sampling scheme and the sample size, using a sampling algorithm that covers the data uniformly; implement the sampling algorithm and extract the data to a local file or temporary database for use in subsequent labeling and modeling; manually label which text classification the content of each sample belongs to;

2) Develop rules from the development corpus and generate the rule model: the rule model developers extract text data rules from the development corpus, the text data rules including entities, concepts, and sentence patterns; develop the source code or scripts that implement the rule model; compile the rule model source code to generate a binary model file to be called by the downstream model execution platform;

3) Debug the model; if the classification needs to be adjusted, readjust the classification tree and return to 1): compile and execute the rule model in the development and test environment; check the execution results of the rule model and optimize the rule model source code or scripts; analyze, according to the specific business requirements, whether the classification model is reasonable, and modify it if it is not; if the classification model has been modified, repeat 1), resampling according to the new classification and labeling the samples;

4) Repeat 1) to 3) until the optimal rule model is output;

5) Sample the test samples and label them by manual interpretation to generate the test corpus;

6) The model testers test the model on the test corpus;

7) Optimize the model; if the classification needs to be adjusted, readjust the classification tree and return to 5); the iterative processing model used for the optimization is: [formula not reproduced]

where X is the target classification data, M is the system matrix, G is the input classification data, i is the iteration number, X_i is the result obtained after the i-th iteration, λ is the convergence coefficient with λ ∈ (0,1), and M^T is the transpose of the matrix M;

8) Repeat 5) to 7) until the optimal rule model is output;

9) Put the model online, periodically test it with the test corpus, and analyze the test results; if the model is no longer accurate on real-time data, repeat 1) to 8) to optimize it;
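The formula for the iterative model in 7) is not reproduced in this text. One update that is consistent with the listed symbols (X, M, G, λ, M^T) is the Landweber iteration X_{i+1} = X_i + λ M^T (G - M X_i); treating the source formula as this update is an assumption. The sketch below illustrates that assumed update in plain Python, with a hypothetical function name `landweber` and toy data:

```python
def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]


def landweber(m, g, lam=0.5, iterations=200):
    """Assumed update (not confirmed by the source): x <- x + lam * M^T (g - M x)."""
    mt = transpose(m)
    x = [0.0] * len(m[0])
    for _ in range(iterations):
        residual = [gi - yi for gi, yi in zip(g, mat_vec(m, x))]
        step = mat_vec(mt, residual)
        x = [xi + lam * si for xi, si in zip(x, step)]
    return x


# Toy system: M x = G has the exact solution x = [2, 2].
M = [[1.0, 0.0], [0.0, 0.5]]
G = [2.0, 1.0]
X = landweber(M, G)
```

With λ in (0,1), as the text requires, this update converges whenever the largest singular value of M is at most 1; larger matrices would need to be scaled first.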

Step 2, obtain the relevant data information; specifically:

(1) Initialization: the selected text classification set is empty and the candidate text classification set is the full set of text classifications. As scheduling proceeds, the elements of both sets are updated continuously. The candidate set and the selected set at the end of the n-th iteration are each denoted by a symbol [symbols not reproduced]; n = 1, ..., N_T is the iteration index, initialized to n = 1;

(2) The internet base station performs singular value decomposition on the channel information matrix H_k fed back for text classification k [formula not reproduced], where λ_k,1 is the singular value of the channel matrix of the k-th text classification, [symbol not reproduced] is a zero vector of dimension 1 × (N_T - 1), and the two remaining factors [symbols not reproduced] consist of the right singular vectors corresponding to the non-zero singular value λ_k,1 and to the zero singular values, respectively. Because rank(H_k) = 1, [formula not reproduced], where v_i,1 is the first column vector of V_i and rank() denotes the rank of a matrix;

(3) The internet base station constructs intermediate matrices from the decomposed matrices [formulas not reproduced], where diag() denotes the diagonalization operation;

(4) The internet base station constructs the correlation matrix R, an L × L square matrix whose element in row i and column j is [formula not reproduced], where |·| denotes the modulus operation;

(5) The n-1 columns corresponding to the n-1 already scheduled text classifications are selected from R to form a matrix [symbol not reproduced]; the elements of each row of the remaining part are sorted in ascending order to obtain a second matrix [symbol not reproduced], i.e. [formula not reproduced];

(6) The correlation factor of each text classification is calculated [formula not reproduced], i.e. the first ξ elements of each row of R_n are summed and the reciprocal is taken, which is equivalent to a simplified form and yields the column vector ψ_n = [ψ_1,n … ψ_L,n]^H;

(7) The n-th text classification is selected according to the formula;

(8) If n < N_T, return to (5); otherwise scheduling is complete, and the correlation factor ψ_k is calculated from the actual interference experienced by each scheduled text classification; for a text classification that is not scheduled, ψ_k = 0. The average correlation factor of each text classification k, k ∈ {1, ..., L}, is then updated according to the formula and used to calculate the text classification scheduling weights for the next transmission cycle. After scheduling is complete, the base station notifies the activated text classifications and carries out downlink data communication; in the overhead slot phase of the next transmission cycle (t+1), (1) to (7) are repeated;

In (6), the influence on a candidate text classification of both the already selected text classifications and the text classifications that may potentially be scheduled is taken into account, and the correlation factor of the text classification is calculated according to formula (1) [formula not reproduced], i.e. the first ξ elements of each row of R_n are summed and the reciprocal is taken, which is equivalent to the simplified formula (2) [formula not reproduced];

This yields the column vector ψ_n = [ψ_1,n … ψ_L,n]^H, where [symbol not reproduced] denotes the set of potential text classifications that may be selected subsequently and card() denotes the number of elements in a set;

In (7), the n-th text classification is selected according to the formula [formula not reproduced];

where s_n is the index of the selected text classification, [symbol not reproduced] is the scheduling weight of text classification k, and [symbol not reproduced] is the average correlation factor of text classification k at the end of the previous transmission cycle; then update n = n + 1;

As a text classification is scheduled over time, its average correlation factor increases, its weight μ_k decreases accordingly, and its scheduling priority decreases;

The formula for updating the average correlation factor of text classification k in (8) is [formula not reproduced]

where δ_c = 0.99;
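The update formula for the average correlation factor is not reproduced in this text. The sketch below assumes an exponential moving average with forgetting factor δ_c = 0.99, and a hypothetical weight function that decreases as the average grows. Both are assumptions, chosen only to illustrate the stated behavior: a classification that keeps being scheduled gains average correlation factor and loses scheduling priority.

```python
DELTA_C = 0.99  # forgetting factor stated in the text


def update_average(avg, psi, delta_c=DELTA_C):
    """Assumed exponential moving average; the source formula is not reproduced."""
    return delta_c * avg + (1.0 - delta_c) * psi


def weight(avg):
    """Hypothetical decreasing map from average correlation factor to weight."""
    return 1.0 / (1.0 + avg)


avg = 0.0
weights = []
for _ in range(3):  # the same classification is scheduled in three cycles
    avg = update_average(avg, psi=1.0)
    weights.append(weight(avg))
```

Under these assumptions the weight falls a little after every cycle in which the classification is scheduled, which matches the priority behavior described above.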

Step 3, obtain web pages from the internet;

Step 4, obtain the website homepage or contact page corresponding to each web page;

Step 5, extract the point-of-interest names and the corresponding address information from the website homepage or contact page;

Step 6, associate the extracted point-of-interest names with the address information to obtain structured information;

Step 7, obtain the structured information data, the data segment decomposition regular expression, and the data item name list corresponding to the data segment decomposition regular expression;

Step 8, decompose the data segments in the data to be analyzed according to the data segment decomposition regular expression to generate data item values, and associate the data item values with the data item name list to form intermediate data pairs in which each data item name corresponds to a data item value; perform statistical analysis on the intermediate data pairs according to the set statistical rules to obtain the data analysis result;
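Steps 7 and 8 can be illustrated with a short sketch. The segment layout (`name|city|phone`), the pattern `SEGMENT_RE`, the list `ITEM_NAMES`, and the sample data are all hypothetical; the source does not specify a concrete expression or name list.

```python
import re
from collections import Counter

# Hypothetical data segment layout: "name|city|phone"
SEGMENT_RE = re.compile(r"^([^|]+)\|([^|]+)\|([\d-]+)$")
# Data item name list; entry i names capture group i + 1 of SEGMENT_RE.
ITEM_NAMES = ["name", "city", "phone"]


def decompose(segment):
    """Return a dict pairing each data item name with its value, or None."""
    match = SEGMENT_RE.match(segment)
    if match is None:
        return None
    return dict(zip(ITEM_NAMES, match.groups()))


segments = [
    "Acme Hardware|Foshan|0757-1234567",
    "Lotus Cafe|Guangzhou|020-7654321",
    "Pearl Books|Foshan|0757-7654321",
    "malformed line",
]

# Intermediate data pairs: segments that do not match are skipped.
pairs = [p for p in map(decompose, segments) if p is not None]

# Example statistical rule: number of records per city.
city_counts = Counter(p["city"] for p in pairs)
```

Because the expression and the name list are plain data, a different analysis requirement only needs a different pair of the two, not a new analysis program.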

The method for obtaining the website homepage includes:

The domain name address is extracted from the URL of the web page, and redirect processing is performed on the domain name address to obtain the website homepage corresponding to the web page;

Domain name addresses are extracted one by one from the URLs of web pages across the whole network, deduplicated, and added to a domain name address set; redirect processing is performed on all domain name addresses in the set to obtain the corresponding website homepages; or, using a homepage sample set of websites, the link anchor text and URL style features of the samples are statistically analyzed to construct a homepage classifier, and the web pages are analyzed with the homepage classifier to obtain all website homepages;
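The domain extraction and deduplication described above can be sketched with the standard library. The helper names `domain_root` and `candidate_homepages` and the sample URLs are hypothetical. The redirect step (requesting each root and following HTTP redirects to the final homepage) is left out so that the sketch runs without network access.

```python
from urllib.parse import urlsplit


def domain_root(url):
    """Return scheme://host/ for a page URL, or None if the URL has no host."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return None
    return f"{parts.scheme}://{parts.netloc.lower()}/"


def candidate_homepages(urls):
    """Collect deduplicated domain roots, preserving first-seen order."""
    seen = set()
    ordered = []
    for url in urls:
        root = domain_root(url)
        if root is not None and root not in seen:
            seen.add(root)
            ordered.append(root)
    return ordered


pages = [
    "http://www.example.com/products/item?id=3",
    "http://www.example.com/contact.html",
    "https://shop.example.org/a/b",
]
roots = candidate_homepages(pages)
```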

The method for obtaining the contact page corresponding to the web page includes:

Using a contact page sample set of websites, the link anchor text, page title, and URL style features of the samples are statistically analyzed to construct a contact page classifier, and the web pages are analyzed with the contact page classifier to obtain the contact information pages of all websites;
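The source builds the contact page classifier statistically from a labeled sample set. As a hand-written stand-in (not the patent's classifier), the sketch below scores the same three features (link anchor text, page title, and URL) against a hypothetical keyword list `CONTACT_HINTS`; a trained model would replace `contact_score`.

```python
CONTACT_HINTS = ("contact", "about", "联系")  # hypothetical keyword list


def contact_score(anchor_text, title, url):
    """Count how many of the three features contain a contact hint."""
    features = (anchor_text.lower(), title.lower(), url.lower())
    return sum(any(hint in feature for hint in CONTACT_HINTS) for feature in features)


def is_contact_page(anchor_text, title, url, threshold=2):
    """Classify a page as a contact page when enough features agree."""
    return contact_score(anchor_text, title, url) >= threshold


a = is_contact_page("Contact us", "Contact - Acme", "http://www.example.com/contact.html")
b = is_contact_page("Products", "Catalog", "http://www.example.com/list?page=2")
```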

Performing statistical analysis on the intermediate data pairs to obtain the data analysis result includes:

The result fields in the statistical analysis result field list are obtained, each result field including a statistical formula and the list including at least one result field; statistics are computed on the corresponding data in the intermediate data pairs according to the statistical formula;
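A result field list in which every field carries its own statistical formula can be modeled as a list of name and function pairs. The field names, formulas, and sample records below are hypothetical and only show the shape of the mechanism.

```python
from statistics import mean

# Intermediate data pairs: data item name -> data item value (hypothetical sample).
records = [
    {"city": "Foshan", "price": "12.5"},
    {"city": "Foshan", "price": "7.5"},
    {"city": "Guangzhou", "price": "10.0"},
]

# Result field list: each result field carries its own statistical formula.
RESULT_FIELDS = [
    {"name": "record_count", "formula": lambda rows: len(rows)},
    {"name": "mean_price", "formula": lambda rows: mean(float(r["price"]) for r in rows)},
    {"name": "distinct_cities", "formula": lambda rows: len({r["city"] for r in rows})},
]


def analyze(rows, result_fields):
    """Apply every result field's formula to the intermediate data pairs."""
    return {field["name"]: field["formula"](rows) for field in result_fields}


result = analyze(records, RESULT_FIELDS)
```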

Further, the development sample sampling in 1) is used to sample data from the data source for use in sample labeling;

Sample labeling is used to classify samples by manual interpretation and to generate the development corpus and the test corpus;

The development corpus is used by model developers to write rules and output the model;

The test corpus is used for model testing: the labels of the test corpus are compared with the model results to evaluate the accuracy of the model;

Model debugging: after the model is output, sample data are input for testing.
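Comparing test corpus labels with model results amounts to an accuracy computation. In the sketch below, the one-rule `rule_model`, the class names, and the four-item test corpus are hypothetical placeholders for a real rule model and a manually labeled corpus.

```python
def accuracy(labels, predictions):
    """Fraction of samples whose predicted class equals the manual label."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


def rule_model(text):
    """Hypothetical one-rule model: texts mentioning a refund are complaints."""
    return "complaint" if "refund" in text.lower() else "other"


# Test corpus: (text, manually assigned class).
test_corpus = [
    ("I want a refund", "complaint"),
    ("Great service", "other"),
    ("Refund not received", "complaint"),
    ("Broken on arrival", "complaint"),
]

predictions = [rule_model(text) for text, _ in test_corpus]
acc = accuracy([label for _, label in test_corpus], predictions)
```

The misclassified sample ("Broken on arrival") is the kind of case that sends the developer back to rule writing in the next iteration.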

Further, the value of the weight in (7) is determined by the average correlation factor of the text classification [formula not reproduced]

Further, in step 8, the data decomposition of the data segments in the data to be analyzed according to the data segment decomposition regular expression is performed by a decomposition module; specifically:

The data values are solved using the conjugate gradient method;

For the equation Lx = b, the key to the conjugate gradient method is constructing the conjugate vector p and computing the corresponding residual. The conjugate gradient method solves iteratively; in each iteration, the new conjugate vector is obtained from the following formula: [formula not reproduced]

Further, the coefficient of the conjugate direction is obtained from the following formula: [formula not reproduced]

The new x value and the residual are obtained from the following formulas:

x_k = x_{k-1} + s_k p_k

r_k = r_{k-1} + s_k L p_k;

The element q_i of the vector Lp corresponding to point i is obtained from the following formula: [formula not reproduced]

W_k is the neighborhood corresponding to data k, ‖w_k‖ is the size of the neighborhood, i is a data item in the neighborhood W_k surrounding data k, q_i is the i-th element of the vector q, I_i is the 3-dimensional vector corresponding to data i, representing the three channels R, G, and B, p_i is the element of the conjugate vector corresponding to data i, μ_k is a 3-dimensional vector equal to the mean of the vectors I_i in the neighborhood W_k, [symbol not reproduced] is the mean of the conjugate vector elements p_i corresponding to the elements i in the neighborhood W_k, [symbol not reproduced] is the 3-dimensional vector corresponding to data k, and [symbol not reproduced] is the scalar corresponding to data k.
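For reference, a textbook conjugate gradient solver for Lx = b with a symmetric positive definite L is sketched below in plain Python. It is not the patent's implementation: it forms Lp by an explicit matrix-vector product rather than by the neighborhood-based formula, which is not reproduced in this text. It also defines the residual as r = b - Lx, so its residual update carries a minus sign; the plus sign in the update above corresponds to the opposite convention r = Lx - b.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [dot(row, v) for row in m]


def conjugate_gradient(L, b, tol=1e-10, max_iter=100):
    """Solve L x = b for symmetric positive definite L (textbook CG)."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - L x for the initial guess x = 0
    p = list(r)            # first conjugate direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        q = mat_vec(L, p)                  # q = L p
        s = rs_old / dot(p, q)             # step length s_k
        x = [xi + s * pi for xi, pi in zip(x, p)]
        r = [ri - s * qi for ri, qi in zip(r, q)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Toy system with exact solution x = [1/11, 7/11].
A = [[4.0, 1.0], [1.0, 3.0]]
rhs = [1.0, 2.0]
solution = conjugate_gradient(A, rhs)
```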

The advantages and positive effects of the invention are as follows. The invention uses the organizational structure features and information semantic features of physical organizations on the internet to determine the basic elements and basic direction of the internet big data analysis and extraction method, so that analysis and extraction are accurately targeted. Contact details of physical organizations are extracted from website homepages and contact pages; multi-source data are verified, integrated, and associated, and visual segmentation is used to obtain the web page blocks containing address information, yielding structured geographic location information. This improves the accuracy of the information, allows information about physical organizations across the whole internet to be recalled automatically, reduces labor cost, and improves the information recall rate. At the same time, by setting the data segment decomposition regular expression and the corresponding data item name list according to the text classification analysis requirements, the invention decomposes and extracts the data to be analyzed for different requirements, improving the generality of the data analysis.

The invention organizes the rule model development process into sequential and cyclic flows, so that each functional module iterates internally until it reaches its optimum; the functional modules are combined in sequence, which avoids errors; and the inputs and outputs of each functional module are handled effectively. Compared with the original method, the invention offers better collaboration and higher efficiency;

Multi-person collaborative development of text rule models: in a complex text data mining project, the classification model has at least hundreds and at most thousands of classification nodes. Model development of this complexity must be completed jointly by a team of several people. In this method, the development work can be divided by classification, with different people responsible for different classifications, so that development proceeds in parallel and project efficiency improves;

Multiple rounds of iteration continuously optimize the model: a complex model needs continuous iterative optimization to meet the accuracy and coverage requirements, a process in which data exploration and model optimization alternate. This method provides multi-round iterative development and optimization; through multiple rounds of development and testing, the accuracy and coverage of the model are continuously improved until the business requirements are met;

The test phase guarantees the overall model quality: this test phase differs from the tests within the iterative process. Tests within the iterative process still belong to the verification and debugging of model development, whereas this separate test phase is the acceptance test performed after model development is complete and before the model is formally released to the production environment. Through this phase the method performs a final check of the model's effectiveness before release; the model can be released and put online only if it meets the expected accuracy and coverage requirements;

The go-online and go-offline phases ensure that the model version is correct: over the whole life cycle of a model, the model is adjusted and optimized continually, so many different model versions arise. The go-online and go-offline phases of this method check and confirm the model version, ensuring that the wrong version is not used;

Regular monitoring after going online safeguards the accuracy and coverage of the model: as the model runs over the long term, new data continually enter it, and cases inevitably arise that the existing model rules do not fully cover, so the accuracy and coverage of the model tend to decline. This method provides a mechanism for monitoring the model regularly; when the model's effectiveness drops to a certain threshold, a new round of model optimization is started, which safeguards the effectiveness of model processing.

The method by which the invention obtains relevant data information can obtain accurate information in real time, with an accuracy of 96.73%, nearly 4 percentage points higher than the prior art.

The decomposition method provided by the invention achieves good results in complex regions.

Brief description of the drawings

Fig. 1 is a flow chart of the internet big data analysis and extraction method provided by an embodiment of the invention.

Embodiment

In order to make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

The application principle of the invention is further described below with reference to the accompanying drawing.

As shown in Fig. 1, the internet big data analysis and extraction method provided by the invention comprises the following steps:

S101, determine the basic elements of the internet big data analysis and extraction method;

S102, determine the basic direction of the internet big data analysis and extraction method, which facilitates obtaining the relevant data information;

S103, obtain web pages from the internet;

S104, obtain the website homepage or contact page corresponding to each web page;

S105, extract the point-of-interest names and the corresponding address information from the website homepage or contact page;

S106, associate the extracted point-of-interest names with the address information to obtain structured information;

S107, obtain the structured information data, the data segment decomposition regular expression, and the data item name list corresponding to the data segment decomposition regular expression;

S108, decompose the data segments in the data to be analyzed according to the data segment decomposition regular expression to generate data item values, and associate the data item values with the data item name list to form intermediate data pairs in which each data item name corresponds to a data item value; perform statistical analysis on the intermediate data pairs according to the set statistical rules to obtain the data analysis result.

The method for obtaining the website homepage provided by the invention is as follows:

Domain name addresses are extracted one by one from the URLs of web pages across the whole network, deduplicated, and added to a domain name address set; redirect processing is performed on all domain name addresses in the set to obtain the corresponding website homepages; or, using a homepage sample set of websites, the link anchor text and URL style features of the samples are statistically analyzed to construct a homepage classifier, and the web pages are analyzed with the homepage classifier to obtain all website homepages.

The method for obtaining the contact page corresponding to the web page provided by the invention is as follows:

Using a contact page sample set of websites, the link anchor text, page title, and URL style features of the samples are statistically analyzed to construct a contact page classifier, and the web pages are analyzed with the contact page classifier to obtain the contact information pages of all websites.

The statistical analysis of the intermediate data pairs provided by the invention, which yields the data analysis result, includes:

The result fields in the statistical analysis result field list are obtained, each result field including a statistical formula and the list including at least one result field; statistics are computed on the corresponding data in the intermediate data pairs according to the statistical formula.

The invention is further described below with reference to a specific analysis.

1. Determine the basic elements of the extraction; specifically:

4) Repeat 1) to 3) until the optimal rule model is output;

8) Repeat 5) to 7) until the optimal rule model is output;

2. Obtain the relevant data information; specifically:

(7) The n-th text classification is selected according to the formula;

(8) If n < N_T, return to (5); otherwise scheduling is complete, and the correlation factor ψ_k is calculated from the actual interference experienced by each scheduled text classification; for a text classification that is not scheduled, ψ_k = 0. The average correlation factor of each text classification k, k ∈ {1, ..., L}, is then updated according to the formula and used to calculate the text classification scheduling weights for the next transmission cycle. After scheduling is complete, the base station notifies the activated text classifications and carries out downlink data communication; in the overhead slot phase of the next transmission cycle (t+1), (1) to (7) are repeated;

In (7), the n-th text classification is selected according to the formula [formula not reproduced];

where δ_c = 0.99;

The development sample sampling in 1) is used to sample data from the data source for use in sample labeling;

Model debugging: after the model is output, sample data are input for testing.

The value of the weight in (7) is determined by the average correlation factor of the text classification [formula not reproduced]

In step 8, the data decomposition of the data segments in the data to be analyzed according to the data segment decomposition regular expression is performed by a decomposition module; specifically:

The data values are solved using the conjugate gradient method;

The new x value and the residual are obtained from the following formulas:

x_k = x_{k-1} + s_k p_k

r_k = r_{k-1} + s_k L p_k;

The above are merely preferred embodiments of the invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within the scope of protection of the invention.

Claims

1. An internet big data analysis and extraction method, characterized in that the method comprises the following steps:

Determine the sampling data source according to the specific business requirements; determine the sampling scheme and the sample size, using a sampling algorithm that covers the data uniformly; implement the sampling algorithm and extract the data to a local file or temporary database for use in subsequent labeling and modeling; manually label which text classification the content of each sample belongs to;

2) Develop rules from the development corpus and generate the rule model: the rule model developers extract text data rules from the development corpus, the text data rules including entities, concepts, and sentence patterns; develop the source code or scripts that implement the rule model; compile the rule model source code to generate a binary model file to be called by the downstream model execution platform;

3) Debug the model; if the classification needs to be adjusted, readjust the classification tree and return to 1): compile and execute the rule model in the development and test environment; check the execution results of the rule model and optimize the rule model source code or scripts; analyze, according to the specific business requirements, whether the classification model is reasonable, and modify it if it is not; if the classification model has been modified, repeat 1), resampling according to the new classification and labeling the samples;

4) Repeat 1) to 3) until the optimal rule model is output;

7) Optimize the model; if the classification needs to be adjusted, readjust the classification tree and return to 5); the iterative processing model used for the optimization is: [formula not reproduced]

where X is the target classification data, M is the system matrix, G is the input classification data, i is the iteration number, X_i is the result obtained after the i-th iteration, λ is the convergence coefficient with λ ∈ (0,1), and M^T is the transpose of the matrix M;

8) Repeat 5) to 7) until the optimal rule model is output;

9) Put the model online, periodically test it with the test corpus, and analyze the test results; if the model is no longer accurate on real-time data, repeat 1) to 8) to optimize it;

Step 2, obtain the data message of correlation；Specifically include：

(1) this category set of selection is combined into empty set for initialization, and candidate's text classification collection is combined into text classification complete or collected works, with dispatching The progress of journey, this category set of selection is closed and the element of candidate's text classification set is constantly updated,WithRespectively in n-th Candidate and this category set of selection conjunction, n=1 ..., N at the end of iteration_TFor the number of iteration, n=1 is initialized；

(2) the channel information matrix H that Internet base stations feed back to text classification k_kCarry out singular value decompositionWherein, λ_k,1Represent the strange of the channel matrix of k-th of text classification Different value,Represent dimension 1 × (N_T- 1) null vector,WithRespectively by with non-zero singular value λ_k,1With zero singular value pair The right singular value vector answered is formed, because rank (H_k)=1, sov_i,1For V_iFirst column vector, wherein, Rank () represents to seek rank of matrix；

(3) Internet base stations are according to the matrix construction intermediary matrix after decompositionWithAndWherein, diag () represents diagonalization operation；

(4) Internet base stations construction correlation matrix R, R are L × L square formations, and the element of the i-th row jth row is Wherein, | | represent modulus computing；

(5) n-1 row corresponding with n-1 scheduled text classification are selected from R, form matrixWill be remaining Part carries out ascending order arrangement to row element respectively, obtains matrixI.e.

(6) correlation factor of text classification is calculatedI.e. to R_nIn preceding ξ element per a line sum respectively and inverted, etc. Imitate and obtain column vector ψ in simplification_n=[ψ_1,n … ψ_L,n]^Η；

(7) n-th of text classification is selected according to formula；

(8) if n ＜ N_T, return (5)；Otherwise scheduling is completed, according to scheduled text classificationThe actual interference meter being subject to Calculate correlation factor ψ_k；If text classification is not scheduledψ_k=0, and according to formula update text classification k, k ∈ 1 ..., L } be averaged correlation factor, for calculating the text classification dispatch weight in next transmission cycle, after the completion of scheduling, base station notice Activation text classification simultaneously carries out downlink data communication, in the overhead slot stage of next transmission cycle (t+1), repeats (1)~(7)；

In (6), the influence on a candidate text classification of both the already selected text classifications and the text classifications that may potentially be scheduled is taken into account, and the correlation factor of the text classification is calculated according to formula (1), that is, the first ξ elements of each row of R_n are summed and the sum is inverted, which is equivalent to simplified formula (2);

This yields the column vector ψ_n = [ψ_1,n … ψ_L,n]^H, where the set appearing in the formula is the set of potential text classifications that may be selected subsequently, and card(·) denotes the number of elements in a set;

In (7), the n-th text classification is selected according to the formula;

s_n denotes the index of the selected text classification; the remaining symbols in the formula denote the scheduling weight of text classification k and the average correlation factor of text classification k at the end of the previous transmission cycle; then update n = n+1;

where δ_c = 0.99;
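The correlation-factor calculation in (5) and (6) can be illustrated with a minimal sketch. Formulas (1) and (2) are not reproduced in this text, so the sketch follows only the verbal description. It assumes that R_n is formed by placing the columns of the already scheduled text classifications first, followed by the remaining columns sorted in ascending order within each row. The function name `correlation_factors` and its arguments are hypothetical.

```python
def correlation_factors(R, scheduled, xi):
    """Reciprocal of the sum of the first xi elements of each row of R_n.

    R         -- L x L correlation matrix as a list of rows (non-negative moduli)
    scheduled -- column indices of the already scheduled text classifications
    xi        -- number of leading elements summed per row
    Assumption: R_n = [scheduled columns | remaining columns sorted ascending].
    """
    size = len(R)
    remaining = [j for j in range(size) if j not in scheduled]
    factors = []
    for row in R:
        fixed = [row[j] for j in scheduled]        # scheduled columns, kept in order
        rest = sorted(row[j] for j in remaining)   # remaining elements, ascending
        r_n_row = fixed + rest
        # A zero sum would raise ZeroDivisionError; the sketch does not guard against it.
        factors.append(1.0 / sum(r_n_row[:xi]))
    return factors
```

For example, `correlation_factors([[1, 4, 2], [3, 1, 5], [2, 6, 1]], [0], 2)` returns one factor per candidate row. The selection rule of (7) is not sketched because its formula is not available here.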

Step 3: obtain web pages from the Internet;

Step 4: obtain the website homepage or contact page corresponding to each web page;

Step 7: obtain the structured information data, the data-segment decomposition regular expression, and the data item name list corresponding to the data-segment decomposition regular expression;

Step 8: decompose the data segments in the data to be analyzed according to the data-segment decomposition regular expression to generate data item values, and associate the data item values with the data item name list to form intermediate data pairs in which each data item name corresponds to a data item value; perform statistical analysis on the intermediate data pairs according to the configured statistical rules to obtain the data analysis result;

The method of obtaining the website homepage includes:

extracting the domain name address from the URL of the web page and performing redirect processing on the domain name address to obtain the website homepage corresponding to the web page;

extracting domain name addresses one by one from the URLs of web pages across the whole web, deduplicating them and adding them to a domain name address set, and performing redirect processing on all domain name addresses in the set to obtain the corresponding website homepages; or using a homepage sample set of websites, statistically analyzing their link anchor text and URL style features to construct a homepage classifier, and analyzing the web pages with the homepage classifier to obtain all website homepages;
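The domain-extraction and deduplication part of the homepage method can be sketched as follows. This is an illustrative sketch only: the function name `homepage_candidates` is hypothetical, the site root is taken as scheme plus host, and the redirect processing named above is not performed because it would require network access.

```python
from urllib.parse import urlsplit

def homepage_candidates(urls):
    """Return deduplicated site roots in first-seen order (hypothetical helper)."""
    seen = set()
    roots = []
    for url in urls:
        parts = urlsplit(url)
        if not parts.hostname:
            continue  # skip URLs without a recognizable host
        # hostname is lower-cased by urlsplit, so case variants collapse
        root = f"{parts.scheme}://{parts.hostname}/"
        if root not in seen:
            seen.add(root)
            roots.append(root)
    return roots
```

Each returned root would then be requested and any redirect followed to reach the actual homepage.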

The method of obtaining the contact page corresponding to the web page includes:

using a contact page sample set of websites, statistically analyzing their link anchor text, page title, and URL style features to construct a contact page classifier, and analyzing the web pages with the contact page classifier to obtain the contact information pages of all websites;

obtaining the result fields in a statistical analysis result field list, where each result field includes a statistical formula and the list includes at least one result field; and computing statistics on the corresponding data in the intermediate data pairs according to the statistical formula.
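Steps 7 and 8 can be illustrated with a minimal sketch. The regular expression, the data item names, and the two statistics below are hypothetical examples chosen for illustration; the claim does not prescribe any particular pattern or statistical formula.

```python
import re
from collections import Counter

# Hypothetical decomposition regex and matching data item name list.
SEGMENT_PATTERN = re.compile(r"(\S+)\s+(\d{3})\s+(\d+)")
ITEM_NAMES = ["path", "status", "bytes"]

def decompose(segment):
    """Split one data segment into name/value pairs, or None if it does not match."""
    match = SEGMENT_PATTERN.fullmatch(segment.strip())
    if match is None:
        return None
    # Associate each captured data item value with its data item name.
    return dict(zip(ITEM_NAMES, match.groups()))

def analyze(segments):
    """Apply two example statistical rules to the intermediate data pairs."""
    pairs = [p for p in map(decompose, segments) if p is not None]
    return {
        "count_by_status": dict(Counter(p["status"] for p in pairs)),
        "total_bytes": sum(int(p["bytes"]) for p in pairs),
    }
```

Changing the analysis requirement then amounts to swapping `SEGMENT_PATTERN`, `ITEM_NAMES`, and the result fields, while the decomposition code stays the same.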

2. The Internet big data analysis and extraction method according to claim 1, characterized in that, in 1), development sample sampling is used to sample data from the data source for sample annotation and use;

the test corpus is used for model testing, and the annotations of the test corpus are compared with the model results to assess the accuracy of the model;

in model debugging, after the model is output, sample data are input for testing.

3. The Internet big data analysis and extraction method according to claim 1, characterized in that, in (7), the value of the weight is determined by the average correlation factor of the text classification.

4. The Internet big data analysis and extraction method according to claim 1, characterized in that, in step 8, the data decomposition of the data segments in the data to be analyzed according to the data-segment decomposition regular expression is performed by a decomposition module, and specifically includes:

solving for the data values using the conjugate gradient method;

for the equation Lx = b, the key to the conjugate gradient method is to construct the conjugate vectors p and to compute the corresponding residuals; the conjugate gradient method solves iteratively, and in each iteration the new conjugate vector is solved by the following formula:

5. The Internet big data analysis and extraction method according to claim 4, characterized in that the coefficient of the conjugate direction is solved by the following formula:

The new x value and the residual are solved by the following formulas:

x_k = x_{k-1} + s_k p_k

r_k = r_{k-1} + s_k L p_k;

W_k is the neighborhood corresponding to data k, ‖w_k‖ is the size of the neighborhood, i is a data element in the neighborhood W_k surrounding data k, q_i is the i-th element of the vector q, I_i is the 3-dimensional vector corresponding to data i and represents the three channels R, G, and B, p_i is the element of the conjugate vector corresponding to data i, and μ_k is a 3-dimensional vector equal to the mean of the vectors I_i in the neighborhood W_k; the remaining symbols denote, respectively, the mean of the conjugate vector elements p_i corresponding to the elements i in the neighborhood W_k, the 3-dimensional vector corresponding to data k, and the scalar corresponding to data k.
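The iteration for Lx = b can be sketched as follows. The formulas for the new conjugate vector and for the coefficient of the conjugate direction are not reproduced in this text, so the sketch substitutes the standard textbook conjugate gradient formulas, which is an assumption. The residual is taken as r = Lx - b so that the update r_k = r_{k-1} + s_k L p_k given above holds. L is assumed symmetric positive definite, and the neighborhood-based evaluation of L p_k is replaced by a plain dense product.

```python
def conjugate_gradient(L, b, tol=1e-10, max_iter=1000):
    """Solve L x = b for a symmetric positive definite L given as a list of rows."""
    n = len(b)

    def matvec(v):
        return [sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))

    x = [0.0] * n
    r = [-bi for bi in b]        # residual r = Lx - b at x = 0
    p = [-ri for ri in r]        # first conjugate direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        Lp = matvec(p)
        s = rr / dot(p, Lp)      # step length s_k (textbook formula, assumed)
        x = [xi + s * pi for xi, pi in zip(x, p)]       # x_k = x_{k-1} + s_k p_k
        r = [ri + s * lpi for ri, lpi in zip(r, Lp)]    # r_k = r_{k-1} + s_k L p_k
        rr_new = dot(r, r)
        beta = rr_new / rr       # conjugate direction coefficient (assumed)
        p = [-ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x
```

For instance, `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` solves a small 2 × 2 system whose exact solution is x = (1/11, 7/11).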