CN107657032A - An internet big data analysis and extraction method - Google Patents

An internet big data analysis and extraction method

Info

Publication number
CN107657032A
Authority
CN
China
Prior art keywords
data
model
text classification
sample
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710897053.1A
Other languages
Chinese (zh)
Inventor
冯焕霞
张剑
彭金祥
李正淳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Academy Of South Data Sciences
Original Assignee
Foshan Academy Of South Data Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Academy Of South Data Sciences
Priority to CN201710897053.1A
Publication of CN107657032A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data processing and discloses an internet big data analysis and extraction method. The method determines the basic elements and the basic direction of the extraction, so that analysis and extraction are accurately targeted. Contact information of physical organizations is extracted from website homepages and contact pages. By verifying, integrating and associating multi-source data, and by using visual segmentation to obtain the web page blocks that contain address information, the method obtains structured geographic location information and improves the accuracy of the information. It can also recall information automatically for physical organizations across the whole internet, which reduces labor cost and improves the information recall rate. In addition, according to the needs of the text classification analysis, the invention sets a data segment decomposition regular expression and a data item name list corresponding to that regular expression. This allows the data to be analyzed to be decomposed and extracted for different requirements and improves the versatility of the data analysis.

Description

An internet big data analysis and extraction method
Technical field
The invention belongs to the technical field of data processing, and in particular relates to an internet big data analysis and extraction method.
Background art
Modern society is developing at high speed. Science and technology are advanced, information flows freely, exchange between people is ever closer and life is ever more convenient; big data is a product of this network age. However, existing information extraction methods are usually template-based structured information extraction methods: template rules are written by hand, internet page information is mined with them, and structured information is obtained. This approach not only consumes a large amount of labor, but also requires the mined objects to share the same structural features within the page. Limited by labor cost and by the need for uniform page structure, it cannot be applied at large scale. Meanwhile, in existing data analysis systems, each different text classification analysis requirement needs its own dedicated data analysis method. As a result, data analysis is inefficient and costly.
Developing a text data rule model mainly comprises model development, model testing and model monitoring. Model development means writing rules according to the development corpus; the rules together with their corresponding classes form the model. Model testing means testing the model with sample data in order to assess its performance and accuracy. While the model is running, it must also be tested periodically to monitor its behavior in the actual production environment.
The existing model development process involves many people: business personnel who manage samples, model developers who develop the model, model testers who test it, and operations and maintenance personnel who look after the model once it goes online. These people are spread across departments, and offline exchange and communication are inconvenient. Moreover, rule model development is a process of continuous iteration and optimization, which further increases the communication workload.
In other words, rule model development is a process that requires close collaboration and continuous iteration, whereas the functions of the existing development process are scattered and hard to coordinate. There is therefore an urgent need for a reasonable, scientific engineering method that optimizes the rule model development process so that developers only need to concentrate on the model itself.
In summary, the problems of the prior art are: existing information data mining consumes a large amount of labor and, being limited by labor cost and page structure uniformity, cannot be applied at large scale; data analysis in existing data analysis systems is inefficient and costly; and the functions of the existing text data rule model development process are scattered and hard to coordinate.
Summary of the invention
In view of the problems of the prior art, the invention provides an internet big data analysis and extraction method.
The invention is realized as follows:
An internet big data analysis and extraction method comprises the following steps:
Step 1, determine the basic elements of the extraction. This specifically includes:
1) Sample the development samples and label them by manual interpretation to generate the development corpus:
Determine the sampling data source according to the specific business requirements; determine the sampling plan and the amount of data to sample, with a sampling algorithm that covers the data uniformly; implement the sampling algorithm and extract the data to a local file or temporary database for use in the subsequent labeling and modeling work; manually label which text classification the context of each sample belongs to;
2) Develop rules for the development corpus and generate the rule model: the rule model developers extract text data rules from the development corpus, the text data rules comprising entities, concepts and sentence patterns; develop the rule model source code or script; compile the rule model source code into a binary model file to be called by the subsequent model execution platform;
3) Debug the model; if the classification needs adjusting, readjust the classification tree and return to 1): compile and execute the rule model in the development test environment; check the execution results and optimize the rule model source code or script; analyze, according to the specific business requirements, whether the classification model is reasonable and, if not, modify it; if the classification model has been modified, repeat 1), resampling according to the new classification and labeling the samples;
4) Repeat 1) to 3) until the optimal rule model is output;
5) Sample the test samples and label them by manual interpretation to generate the test corpus;
6) The model testers test the model against the test corpus;
7) Optimize the model; if the classification needs adjusting, readjust the classification tree and return to 5). The iterative model used for the optimization is: [formula missing in source]
where X is the target classification data, M is the system matrix, G is the input classification data, i is the iteration number, X_i is the iteration result obtained after the i-th iteration, λ is the convergence coefficient with λ ∈ (0, 1), and M^T is the transpose of the matrix M;
8) Repeat 5) to 7) until the optimal rule model is output;
9) Bring the model online, test it periodically with the test corpus and analyze the test results; if the model is no longer accurate for real-time data, repeat 1) to 8) to optimize the model;
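The iterative formula in item 7) is not reproduced in the text. The symbols it defines (system matrix M, input data G, convergence coefficient λ ∈ (0, 1), transpose M^T) are consistent with a Landweber-type iteration, X_{i+1} = X_i + λ·M^T(G − M·X_i). The sketch below assumes that form for illustration only; it is not confirmed as the patent's formula, and the matrix, data and function names are hypothetical.

```python
def mat_vec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]


def landweber(m, g, lam=0.5, iterations=200):
    """Assumed update: x_{i+1} = x_i + lam * M^T (g - M x_i)."""
    mt = transpose(m)
    x = [0.0] * len(m[0])
    for _ in range(iterations):
        residual = [gi - yi for gi, yi in zip(g, mat_vec(m, x))]
        correction = mat_vec(mt, residual)
        x = [xi + lam * ci for xi, ci in zip(x, correction)]
    return x


# Hypothetical 2x2 system whose exact solution is x = [1, 2].
M = [[1.0, 0.0], [0.0, 0.5]]
G = [1.0, 1.0]
X = landweber(M, G)
```

For an iteration of this form to converge, λ must be smaller than 2 divided by the square of the largest singular value of M, so a range of (0, 1) is only safe when M is scaled accordingly.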
Step 2, obtain the relevant data information. This specifically includes:
(1) Initialization: the selected text classification set is the empty set, and the candidate text classification set is the complete set of text classifications. As scheduling proceeds, the elements of both sets are updated continuously. The candidate set and the selected set at the end of the n-th iteration are denoted by their own symbols [symbols missing in source]; n = 1, ..., N_T is the iteration number, initialized to n = 1;
(2) The internet base station performs singular value decomposition on the channel information matrix H_k fed back for text classification k [formula missing in source], where λ_{k,1} is the singular value of the channel matrix of the k-th text classification, the zero vector has dimension 1 × (N_T − 1), and the two right singular matrices consist of the right singular vectors corresponding to the non-zero singular value λ_{k,1} and to the zero singular values respectively. Because rank(H_k) = 1, v_{i,1} is the first column vector of V_i, where rank(·) denotes the rank of a matrix;
(3) From the decomposed matrices, the internet base station constructs the intermediate matrices [formulas missing in source], where diag(·) denotes the diagonalization operation;
(4) The internet base station constructs the correlation matrix R, an L × L square matrix whose element in row i, column j is [formula missing in source], where |·| denotes the modulus operation;
(5) Select from R the n−1 columns corresponding to the n−1 text classifications already scheduled to form one matrix; sort the row elements of the remaining part in ascending order to obtain a second matrix [formula missing in source];
(6) Compute the correlation factor of each text classification, that is, sum the first ξ elements of each row of R_n and take the reciprocal, which simplifies to the column vector ψ_n = [ψ_{1,n} … ψ_{L,n}]^H;
(7) Select the n-th text classification according to the selection formula;
(8) If n < N_T, return to (5); otherwise scheduling is complete, and the correlation factor ψ_k is computed from the actual interference experienced by each scheduled text classification. If a text classification is not scheduled, ψ_k = 0. The average correlation factor of each text classification k, k ∈ {1, ..., L}, is then updated according to the update formula and used to compute the text classification scheduling weights in the next transmission cycle. After scheduling is complete, the base station notifies and activates the text classifications and carries out downlink data communication; in the overhead slot stage of the next transmission cycle (t+1), (1) to (7) are executed again;
In (6), the correlation factor of a candidate text classification takes into account the influence of both the text classifications already selected and those that may potentially be scheduled. It is computed according to formula (1), that is, by summing the first ξ elements of each row of R_n and taking the reciprocal, which is equivalent to the simplified formula (2);
This gives the column vector ψ_n = [ψ_{1,n} … ψ_{L,n}]^H, where one set symbol [missing in source] denotes the set of text classifications that may potentially be selected later, and card(·) denotes the number of elements in a set;
In (7), the n-th text classification is selected according to the formula [formula missing in source];
s_n denotes the index of the selected text classification, μ_k is the scheduling weight of text classification k, and the average correlation factor is that of text classification k at the end of the previous transmission cycle; then update n = n + 1;
When a text classification is scheduled, its average correlation factor increases over time, its weight μ_k decreases accordingly, and its scheduling priority drops;
The formula for updating the average correlation factor of text classification k in (8) is [formula missing in source]
where δ_c = 0.99;
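The update formula for the average correlation factor in (8) is not reproduced in the text. With δ_c = 0.99 acting as a forgetting factor, one common form for such an update is an exponential moving average, and the sketch below assumes that form purely for illustration. The reciprocal weight and all names are hypothetical; they only serve to show the behaviour described above, where a classification that keeps being scheduled accumulates a larger average and therefore a smaller weight.

```python
DELTA_C = 0.99  # forgetting factor given in the text


def update_average(previous_avg, current_factor, delta=DELTA_C):
    """Assumed exponential moving average of the correlation factor."""
    return delta * previous_avg + (1.0 - delta) * current_factor


def weight(avg, eps=1e-9):
    """Hypothetical scheduling weight that shrinks as the average grows."""
    return 1.0 / (avg + eps)


scheduled_avg = 0.1
idle_avg = 0.1
for _ in range(50):
    # A scheduled classification reports a positive correlation factor.
    scheduled_avg = update_average(scheduled_avg, 1.0)
    # An unscheduled classification has psi_k = 0, so its average decays.
    idle_avg = update_average(idle_avg, 0.0)
```

Under these assumptions the scheduled classification ends with the larger average and the smaller weight, which lowers its priority in the next cycle.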
Step 3, obtain the web pages of the internet;
Step 4, obtain the website homepage or contact page corresponding to each web page;
Step 5, extract the point-of-interest name and the corresponding address information from the website homepage or contact page;
Step 6, associate the extracted point-of-interest name with the address information to obtain structured information;
Step 7, obtain the structured information data, the data segment decomposition regular expression, and the data item name list corresponding to the data segment decomposition regular expression;
Step 8, decompose the data segments in the data to be analyzed according to the data segment decomposition regular expression to generate data item values, and associate the data item values with the data item name list to form intermediate data pairs in which each data item name corresponds to a data item value; then perform statistical analysis on the intermediate data pairs according to the set statistical rules to obtain the data analysis results;
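Steps 7 and 8 can be sketched with Python's `re` module. The regular expression, the data item name list and the sample record below are hypothetical, since the text does not specify them. Each capture group of the decomposition expression yields one data item value, which is paired with the name at the same position in the name list.

```python
import re

# Hypothetical decomposition expression: one capture group per data item.
SEGMENT_PATTERN = re.compile(r"^(.+?)\s*\|\s*(.+?)\s*\|\s*(\d{3,4}-\d{7,8})$")
# Data item names, in the same order as the capture groups.
ITEM_NAMES = ["name", "address", "phone"]


def decompose(segment):
    """Split one data segment into (item name, item value) pairs."""
    match = SEGMENT_PATTERN.match(segment)
    if match is None:
        return None  # the segment does not fit the expression
    return list(zip(ITEM_NAMES, match.groups()))


record = "Example Bookshop | 12 Example Road, Foshan | 0757-12345678"
pairs = decompose(record)
```

Keeping the name list separate from the expression means a different analysis requirement only needs a new pattern and a new list, not new decomposition code.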
The method for obtaining the website homepage includes:
Extract the domain name address from the URL of the web page and perform redirect processing on the domain name address to obtain the website homepage corresponding to the web page;
Extract the domain name addresses one by one from the URLs of web pages across the whole network, deduplicate them and add them to a domain name address set, then perform redirect processing on every domain name address in the set to obtain the corresponding website homepages; or, use a sample set of website homepages, statistically analyze their link anchor text and URL style features to construct a homepage classifier, and analyze the web pages with the homepage classifier to obtain all website homepages;
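The domain extraction and deduplication described above can be sketched with the standard `urllib.parse` module. The function name and the sample URLs are hypothetical, and the sketch stops before any network access: it only builds the deduplicated set of root addresses that would then be fetched.

```python
from urllib.parse import urlsplit


def homepage_candidates(urls):
    """Return deduplicated scheme://host/ addresses, in first-seen order."""
    seen = set()
    result = []
    for url in urls:
        parts = urlsplit(url)
        if not parts.scheme or not parts.hostname:
            continue  # skip strings that are not absolute URLs
        root = f"{parts.scheme}://{parts.hostname}/"
        if root not in seen:
            seen.add(root)
            result.append(root)
    return result


pages = [
    "https://www.example.com/products/item?id=3",
    "https://www.example.com/contact.html",
    "http://shop.example.org/about",
]
roots = homepage_candidates(pages)
```

Requesting each root address and following any HTTP redirects (which `urllib.request.urlopen` does by default) would then give the page actually served as the homepage.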
The method for obtaining the contact page corresponding to the web page includes:
Use a sample set of website contact pages, statistically analyze their link anchor text, page titles and URL style features to construct a contact page classifier, and analyze the web pages with the contact page classifier to obtain the contact information pages of all websites;
Performing statistical analysis on the intermediate data pairs to obtain the data analysis results includes:
Obtain the result fields in the statistical analysis result field list, where each result field includes a statistical formula and the list includes at least one result field; then compute statistics over the corresponding data in the intermediate data pairs according to the statistical formula;
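The statistical step can be sketched as a list of result fields, each carrying its own statistical formula, applied to the intermediate data pairs. The field names, the formulas and the sample pairs below are hypothetical.

```python
from collections import Counter

# Intermediate data: one list of (item name, item value) pairs per record.
records = [
    [("city", "Foshan"), ("category", "bookshop")],
    [("city", "Foshan"), ("category", "restaurant")],
    [("city", "Guangzhou"), ("category", "bookshop")],
]


def values_of(name):
    """Collect every value stored under the given data item name."""
    return [value for record in records for key, value in record if key == name]


# Result field list: each field pairs a name with a statistical formula.
RESULT_FIELDS = [
    ("record_count", lambda: len(records)),
    ("records_per_city", lambda: dict(Counter(values_of("city")))),
    ("distinct_categories", lambda: len(set(values_of("category")))),
]

results = {field: formula() for field, formula in RESULT_FIELDS}
```

Adding a new statistic then amounts to appending one entry to the field list.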
Further, the development sample sampling in 1) samples data from the data source for use in sample labeling;
Sample labeling classifies the samples by manual interpretation and generates the development corpus and the test corpus;
The development corpus is used by the model developers to write rules and output the model;
The test corpus is used for model testing: the labels of the test corpus are compared with the model results to assess the accuracy of the model;
Model debugging tests the output model with input sample data.
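Comparing the test corpus labels with the model results can be sketched as follows. The text names accuracy and coverage as the quality measures but does not define them, so the definitions used here are assumptions: coverage is the share of samples for which some rule fired, and accuracy is the share of those covered samples whose predicted class matches the manual label. The sample data is hypothetical.

```python
def evaluate(labels, predictions):
    """Compare manual labels with model output.

    predictions[i] is None when no rule matched sample i.
    Returns (accuracy over covered samples, coverage over all samples).
    """
    covered = [(lab, pred) for lab, pred in zip(labels, predictions) if pred is not None]
    coverage = len(covered) / len(labels) if labels else 0.0
    correct = sum(1 for lab, pred in covered if lab == pred)
    accuracy = correct / len(covered) if covered else 0.0
    return accuracy, coverage


test_labels = ["sports", "finance", "sports", "travel"]
model_output = ["sports", "finance", "travel", None]
accuracy, coverage = evaluate(test_labels, model_output)
```

The same function could be run periodically on freshly labeled data to decide when a new optimization round is needed.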
Further, the value of the weight in (7) is determined by the average correlation factor of the text classification.
Further, in step 8, the decomposition of the data segments in the data to be analyzed according to the data segment decomposition regular expression is performed by a decomposition module, and specifically includes:
Solving for the data values using the conjugate gradient method;
For the equation Lx = b, the key to the conjugate gradient method is to construct the conjugate vector p and to find the corresponding residual. The conjugate gradient method solves the equation iteratively; in each iteration the new conjugate vector is solved by the following formula: [formula missing in source]
Further, the coefficient of the conjugate direction is solved by the following formula: [formula missing in source]
The new value of x and the residual are solved by the following formulas:
x_k = x_{k-1} + s_k p_k
r_k = r_{k-1} + s_k L p_k
The element q_i of the vector Lp corresponding to point i is solved by the following formula: [formula missing in source]
W_k is the neighborhood corresponding to data k, ‖w_k‖ is the size of the neighborhood, i is a data item in the neighborhood W_k surrounding data k, q_i is the i-th element of the vector q, I_i is the 3-dimensional vector corresponding to data i, representing the three channels R, G and B, p_i is the element of the conjugate vector corresponding to data i, and μ_k is a 3-dimensional vector, the mean of the vectors I_i in the neighborhood W_k. Three further symbols [missing in source] denote, respectively, the mean of the conjugate vector elements p_i over the elements i of the neighborhood W_k, a 3-dimensional vector corresponding to data k, and a scalar corresponding to data k.
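The conjugate gradient iteration for Lx = b can be sketched as follows. The formulas for the conjugate vector and its coefficient are not reproduced in the text, so the sketch uses the standard textbook form with the residual defined as r = Lx − b, which is the sign convention under which the two update formulas for x_k and r_k hold as written. The small matrix is a hypothetical example; in the method described here the product Lp would be computed element by element over the neighborhoods W_k rather than from an explicit matrix.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [dot(row, v) for row in m]


def conjugate_gradient(l_matrix, b, tol=1e-12, max_iter=100):
    """Solve L x = b for symmetric positive-definite L (residual r = L x - b)."""
    x = [0.0] * len(b)
    r = [-bi for bi in b]      # r_0 = L x_0 - b with x_0 = 0
    p = [-ri for ri in r]      # first conjugate direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr < tol:
            break
        lp = mat_vec(l_matrix, p)
        s = rr / dot(p, lp)    # step length along p
        x = [xi + s * pi for xi, pi in zip(x, p)]     # x_k = x_{k-1} + s_k p_k
        r = [ri + s * qi for ri, qi in zip(r, lp)]    # r_k = r_{k-1} + s_k L p_k
        rr_new = dot(r, r)
        # New direction: negative residual plus a multiple of the old direction.
        p = [-ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


L = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(L, b)
```

In exact arithmetic the method reaches the solution of an n × n system in at most n iterations, so this 2 × 2 example needs two steps.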
The advantages and positive effects of the invention are as follows. The invention uses the organizational structure features and the information semantic features of physical organizations on the internet to determine the basic elements and basic direction of the internet big data analysis and extraction method, so that analysis and extraction are accurately targeted. Contact information of physical organizations is extracted from website homepages and contact pages. By verifying, integrating and associating multi-source data, and by using visual segmentation to obtain the web page blocks that contain address information, the method obtains structured geographic location information and improves the accuracy of the information. It can also recall information automatically for physical organizations across the whole internet, which reduces labor cost and improves the information recall rate. In addition, according to the needs of the text classification analysis, the invention sets a data segment decomposition regular expression and a data item name list corresponding to that regular expression, which allows the data to be analyzed to be decomposed and extracted for different requirements and improves the versatility of the data analysis.
The invention organizes the rule model development process into sequential and cyclic flows, so that each functional module iterates internally until it reaches its optimum; the functional modules are combined sequentially, which avoids errors; and the inputs and outputs of every functional module are handled effectively. Compared with the original method, the invention offers better collaboration and higher efficiency;
Multi-person collaborative development of text rule models: in a complex text data mining project the classification model has from hundreds to thousands of class nodes, and developing such a complex model requires a team working together. In this method the model development work can be divided by class, with different people responsible for different classes, so that development proceeds in parallel and project efficiency improves;
Multiple rounds of iteration continuously optimize the model: a complex model needs continuous iterative optimization to meet the accuracy and coverage requirements, a process that alternates data exploration and model optimization. This method provides a multi-round iterative development approach: through multiple rounds of development and testing, the accuracy and coverage of the model are raised steadily until the business requirements are met;
The test stage guarantees the overall effect of the model: this test stage differs from the testing within the iteration process, which still belongs to the verification and debugging of model development. The separate test stage is the acceptance test performed after model development is complete and before the model is formally released to the production environment. Through it, the method makes a final check of the model's effect before release; the model can be released and go online only if it meets the expected accuracy and coverage;
The online and offline stages ensure that the model version is correct: over the whole life cycle of a model it is continuously adjusted and optimized, so many different model versions arise. In the online and offline stages this method checks and confirms the model version, ensuring that the wrong version is not used;
Regular monitoring after going online guarantees the accuracy and coverage of the model: as the model runs over a long period, new data keeps entering it, situations that the existing model rules cannot fully cover inevitably arise, and the accuracy and coverage of the model tend to decline. This method provides a mechanism for monitoring the model regularly; when the model's effect falls to a certain threshold, a new round of model optimization is started, which safeguards the effect of the model's processing.
The method of the invention for obtaining the relevant data information can obtain accurate information in real time, with an accuracy of 96.73%, nearly 4 percentage points higher than the prior art.
The decomposition method provided by the invention achieves good results in complex regions.
Brief description of the drawings
Fig. 1 is a flow chart of the internet big data analysis and extraction method provided by an embodiment of the invention.
Detailed description of the embodiments
In order to make the purpose, technical solution and advantages of the invention clearer, the invention is further described below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
The application principle of the invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, the internet big data analysis and extraction method provided by the invention comprises the following steps:
S101, determine the basic elements of the internet big data analysis and extraction method;
S102, determine the basic direction of the internet big data analysis and extraction method, which facilitates obtaining the relevant data information;
S103, obtain the web pages of the internet;
S104, obtain the website homepage or contact page corresponding to each web page;
S105, extract the point-of-interest name and the corresponding address information from the website homepage or contact page;
S106, associate the extracted point-of-interest name with the address information to obtain structured information;
S107, obtain the structured information data, the data segment decomposition regular expression, and the data item name list corresponding to the data segment decomposition regular expression;
S108, decompose the data segments in the data to be analyzed according to the data segment decomposition regular expression to generate data item values, and associate the data item values with the data item name list to form intermediate data pairs in which each data item name corresponds to a data item value; then perform statistical analysis on the intermediate data pairs according to the set statistical rules to obtain the data analysis results.
The method provided by the invention for obtaining the website homepage is as follows:
Extract the domain name address from the URL of the web page and perform redirect processing on the domain name address to obtain the website homepage corresponding to the web page;
Extract the domain name addresses one by one from the URLs of web pages across the whole network, deduplicate them and add them to a domain name address set, then perform redirect processing on every domain name address in the set to obtain the corresponding website homepages; or, use a sample set of website homepages, statistically analyze their link anchor text and URL style features to construct a homepage classifier, and analyze the web pages with the homepage classifier to obtain all website homepages.
The method provided by the invention for obtaining the contact page corresponding to the web page is as follows:
Use a sample set of website contact pages, statistically analyze their link anchor text, page titles and URL style features to construct a contact page classifier, and analyze the web pages with the contact page classifier to obtain the contact information pages of all websites.
The statistical analysis of the intermediate data pairs provided by the invention, which yields the data analysis results, includes:
Obtain the result fields in the statistical analysis result field list, where each result field includes a statistical formula and the list includes at least one result field; then compute statistics over the corresponding data in the intermediate data pairs according to the statistical formula.
The invention is further described below with reference to a concrete analysis.
1. Determine the basic elements of the extraction. This specifically includes:
1) Sample the development samples and label them by manual interpretation to generate the development corpus:
Determine the sampling data source according to the specific business requirements; determine the sampling plan and the amount of data to sample, with a sampling algorithm that covers the data uniformly; implement the sampling algorithm and extract the data to a local file or temporary database for use in the subsequent labeling and modeling work; manually label which text classification the context of each sample belongs to;
2) Develop rules for the development corpus and generate the rule model: the rule model developers extract text data rules from the development corpus, the text data rules comprising entities, concepts and sentence patterns; develop the rule model source code or script; compile the rule model source code into a binary model file to be called by the subsequent model execution platform;
3) Debug the model; if the classification needs adjusting, readjust the classification tree and return to 1): compile and execute the rule model in the development test environment; check the execution results and optimize the rule model source code or script; analyze, according to the specific business requirements, whether the classification model is reasonable and, if not, modify it; if the classification model has been modified, repeat 1), resampling according to the new classification and labeling the samples;
4) Repeat 1) to 3) until the optimal rule model is output;
5) Sample the test samples and label them by manual interpretation to generate the test corpus;
6) The model testers test the model against the test corpus;
7) Optimize the model; if the classification needs adjusting, readjust the classification tree and return to 5). The iterative model used for the optimization is: [formula missing in source]
where X is the target classification data, M is the system matrix, G is the input classification data, i is the iteration number, X_i is the iteration result obtained after the i-th iteration, λ is the convergence coefficient with λ ∈ (0, 1), and M^T is the transpose of the matrix M;
8) Repeat 5) to 7) until the optimal rule model is output;
9) Bring the model online, test it periodically with the test corpus and analyze the test results; if the model is no longer accurate for real-time data, repeat 1) to 8) to optimize the model;
2nd, related data message is obtained;Specifically include:
(1) this category set of selection is combined into empty set for initialization, and candidate's text classification collection is combined into text classification complete or collected works, with tune The progress of journey is spent, this category set of selection is closed and the element of candidate's text classification set is constantly updated,WithRespectively exist Candidate and this category set of selection conjunction, n=1 ..., N at the end of nth iterationTFor the number of iteration, n=1 is initialized;
(2) the channel information matrix H that Internet base stations feed back to text classification kkCarry out singular value decompositionWherein, λk,1Represent the channel matrix of k-th of text classification Singular value,Represent dimension 1 × (NT- 1) null vector,WithRespectively by with non-zero singular value λk,1With zero singular value Corresponding right singular value vector is formed, because rank (Hk)=1, sovi,1For ViFirst column vector, its In, rank () represents to seek rank of matrix;
(3) Internet base stations are according to the matrix construction intermediary matrix after decompositionWithWith AndWherein, diag () represents diagonalization operation;
(4) Internet base stations construction correlation matrix R, R are L × L square formations, and the element of the i-th row jth row isWherein, | | represent modulus computing;
(5) n-1 row corresponding with n-1 scheduled text classification are selected from R, form matrixWill be surplus Remaining part carries out ascending order arrangement to row element respectively, obtains matrixI.e.
(6) correlation factor of text classification is calculatedI.e. to RnIn sum and take down respectively per the preceding ξ element of a line Number, it is equivalent to simplification and obtains column vector ψn=[ψ1,n … ψL,n]Η
(7) n-th of text classification is selected according to formula;
(8) if n < NT, return (5);Otherwise scheduling is completed, according to scheduled text classificationThe reality being subject to Interference calculation correlation factor ψk;If text classification is not scheduledψk=0, and update text classification k, k ∈ according to formula { 1 ..., L } average correlation factor, for calculating the text classification dispatch weight in next transmission cycle, after the completion of scheduling, base Notice of standing activation text classification simultaneously carries out downlink data communication, in the overhead slot stage of next transmission cycle (t+1), repeats Perform (1)~(7);
In (6), the influence on each candidate text classification of the already selected text classifications and of the text classifications that may potentially be scheduled is taken into account, and the correlation factor of the text classification is calculated according to formula (1): the first ξ elements of each row of R_n are summed and the reciprocal is taken, which is equivalent to simplified formula (2);
This yields the column vector ψ_n = [ψ_{1,n} … ψ_{L,n}]^H; the set appearing in the formula is the set of potential text classifications that may subsequently be selected, and card(·) denotes the number of elements in a set;
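The sort, sum, and reciprocal operations described for steps (5) and (6) can be sketched as follows. This is a minimal illustration, not the patent's formulas (1) and (2), which are not reproduced in the text. It assumes that R_n places the columns of the already scheduled text classifications first and the ascending-sorted remainder after them; the function names `build_rn` and `correlation_factors` are hypothetical.

```python
def build_rn(R, scheduled):
    """Assumed layout of R_n: scheduled columns first, then the
    remaining elements of each row sorted in ascending order."""
    rows = []
    for row in R:
        selected = [row[j] for j in scheduled]
        remaining = sorted(v for j, v in enumerate(row) if j not in scheduled)
        rows.append(selected + remaining)
    return rows


def correlation_factors(R, scheduled, xi):
    """Sum the first xi elements of each row of R_n and take the reciprocal."""
    factors = []
    for row in build_rn(R, scheduled):
        total = sum(row[:xi])
        # Guard against division by zero for an all-zero prefix.
        factors.append(1.0 / total if total > 0 else float("inf"))
    return factors


# Toy 3 x 3 correlation matrix (illustrative values only).
R = [
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]
psi = correlation_factors(R, scheduled=[0], xi=2)
```

A larger sum of correlations gives a smaller factor, so candidates that correlate strongly with the scheduled set score lower under this sketch.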
In (7), the n-th text classification is selected according to the formula;
s_n denotes the label of the selected text classification; the weight in the formula is the scheduling weight of text classification k, and the averaged quantity is the average correlation factor of text classification k at the end of the previous transmission cycle; n is then updated to n+1;
As a text classification is scheduled over time, its average correlation factor increases, its weight μ_k decreases accordingly, and its scheduling priority falls;
In (8), the average correlation factor of text classification k is updated according to the formula,
where δ_c = 0.99;
In 1), development sample sampling draws data from the data source for use in sample annotation;
Sample annotation classifies the samples by manual judgment and generates the development corpus and the test corpus;
The development corpus is used by model developers to write rules and output the model;
The test corpus is used for model testing: the annotations of the test corpus are compared with the model results to assess the accuracy of the model;
Model debugging tests the output model on input sample data.
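The comparison of test corpus annotations with model results can be sketched as below. This is only an illustration under stated assumptions: the corpus is represented as (text, label) pairs, and `toy_rule_model` is a hypothetical stand-in for a compiled rule model, not the model the patent describes.

```python
def accuracy(test_corpus, classify):
    """Fraction of test samples whose model output matches the manual label."""
    if not test_corpus:
        raise ValueError("test corpus is empty")
    correct = sum(1 for text, label in test_corpus if classify(text) == label)
    return correct / len(test_corpus)


def toy_rule_model(text):
    """Hypothetical one-rule classifier used only for this example."""
    return "sports" if "match" in text else "other"


# Manually annotated test corpus (illustrative data).
test_corpus = [
    ("the match ended in a draw", "sports"),
    ("stock prices fell", "other"),
    ("a close match", "sports"),
    ("the team won", "sports"),
]
score = accuracy(test_corpus, toy_rule_model)  # 3 of 4 labels agree
```

A score that drops on fresh test samples is the kind of signal that would trigger another round of rule development.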
In (7), the value of the weight is determined by the average correlation factor of the text classification.
In step 8, data decomposition is performed on the data segments of the data to be analyzed according to the data segment decomposition regular expression, using a decomposition module; this specifically includes:
Solving for the data values with the conjugate gradient method;
For the equation Lx = b, the key to the conjugate gradient method is to construct the conjugate vector p and to compute the corresponding residual; the method is solved iteratively, and in each iteration the new conjugate vector is given by the following formula:
Further, the coefficient of the conjugate direction is given by the following formula:
The new value of x and the residual are given by the following formulas:
x_k = x_{k-1} + s_k p_k
r_k = r_{k-1} + s_k L p_k
The element q_i of the vector Lp corresponding to point i is given by the following formula:
W_k is the neighborhood corresponding to data k, ‖w_k‖ is the size of the neighborhood, i is a data point in the neighborhood W_k surrounding data k, q_i is the i-th element of the vector q, I_i is the 3-dimensional vector corresponding to data i and represents the three channels R, G, and B, p_i is the element of the conjugate vector corresponding to data i, and μ_k is a 3-dimensional vector, the mean of the vectors I_i in the neighborhood W_k; the remaining quantities are the mean of the conjugate vector elements p_i over the neighborhood W_k, a 3-dimensional vector corresponding to data k, and a scalar corresponding to data k.
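The iteration described above can be sketched in plain Python. The updates for x and r follow the two formulas given in the text, which correspond to the residual convention r = Lx - b. The formulas for the new conjugate vector and for the step coefficient are not reproduced in the text, so the sketch substitutes the standard textbook choices (an assumption), and it uses a small dense matrix rather than the neighborhood-based computation of q = Lp. L is assumed symmetric positive definite; all function names are hypothetical.

```python
def matvec(L, v):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in L]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(L, b, tol=1e-10, max_iter=None):
    """Solve L x = b for symmetric positive definite L."""
    n = len(b)
    x = [0.0] * n
    r = [-bi for bi in b]            # residual r = L x - b at x = 0
    p = [-ri for ri in r]            # first conjugate direction
    rs_old = dot(r, r)
    for _ in range(max_iter or 10 * n):
        if rs_old ** 0.5 < tol:
            break
        q = matvec(L, p)             # q = L p
        s = rs_old / dot(p, q)       # step coefficient (textbook choice)
        x = [xi + s * pi for xi, pi in zip(x, p)]   # x_k = x_{k-1} + s_k p_k
        r = [ri + s * qi for ri, qi in zip(r, q)]   # r_k = r_{k-1} + s_k L p_k
        rs_new = dot(r, r)
        beta = rs_new / rs_old       # direction coefficient (textbook choice)
        p = [-ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small example: the exact solution is x = [1/11, 7/11].
solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

For an n × n symmetric positive definite system, the method converges in at most n iterations in exact arithmetic, which is why a loose iteration cap suffices here.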
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (5)

1. An internet big data analysis and extraction method, characterized in that the method comprises the following steps:
Step 1: determining the basic elements to be extracted; specifically comprising:
1) sampling development samples and annotating the samples by manual judgment to generate a development corpus:
determining the sampling data source according to the specific business requirements; determining the sampling scheme and the amount of sampled data, with a sampling algorithm that covers the data uniformly; developing and implementing the sampling algorithm, and extracting the data to a local file or temporary database for use in subsequent annotation and modeling work; manually annotating which text classification the context of each sample belongs to;
2) developing rules for the development corpus and generating a rule model: the rule model developer extracts text data rules from the development corpus, the text data rules including entities, concepts, and clauses; developing the rule model source code or script; compiling the rule model source code to generate a binary pattern file to be called by the subsequent model execution platform;
3) debugging the model and, if the classification needs adjusting, readjusting the classification tree and returning to 1): compiling and executing the rule model in the development test environment; checking the rule model execution results and optimizing the rule model source code or script; analyzing, according to the specific business requirements, whether the classification model is reasonable and, if not, modifying it; if the classification model has been modified, repeating 1), resampling according to the new classification, and annotating the samples;
4) repeating 1) to 3) until the optimal rule model is output;
5) sampling test samples and annotating the samples by manual judgment to generate a test corpus;
6) model testers testing the model against the test corpus;
7) optimizing the model and, if the classification needs adjusting, readjusting the classification tree and returning to 5); wherein the iterative processing model used to optimize the model is:
wherein X is the target classification data, M is the system matrix, G is the input classification data, i denotes the iteration number, X_i denotes the iteration result obtained after the i-th iteration, λ denotes the convergence coefficient with λ ∈ (0,1), and M^T denotes the transpose of the matrix M;
8) repeating 5) to 7) until the optimal rule model is output;
9) putting the model online, periodically testing the model with the test corpus, and analyzing the test results; if the model is no longer accurate for real-time data, repeating 1) to 8) to optimize the model;
Step 2: obtaining the relevant data information; specifically comprising:
(1) initialization: the selected text classification set is the empty set and the candidate text classification set is the full set of text classifications; as the scheduling process proceeds, the elements of the selected text classification set and of the candidate text classification set are continually updated; the candidate and selected text classification sets are those at the end of the n-th iteration, where n = 1, …, N_T counts the iterations; n is initialized to 1;
(2) the Internet base station performs singular value decomposition on the channel information matrix H_k fed back by text classification k, where λ_{k,1} denotes the singular value of the channel matrix of the k-th text classification, the zero vector has dimension 1 × (N_T - 1), and the two right singular matrices are formed from the right singular vectors corresponding to the nonzero singular value λ_{k,1} and to the zero singular values, respectively; because rank(H_k) = 1, the decomposition has a single nonzero singular value; v_{i,1} is the first column vector of V_i, where rank(·) denotes the rank of a matrix;
(3) the Internet base station constructs the intermediate matrices from the decomposed matrices, where diag(·) denotes the diagonalization operation;
(4) the Internet base station constructs the correlation matrix R, an L × L square matrix; in the expression for its element in row i, column j, |·| denotes the modulus operation;
(5) the n-1 columns of R that correspond to the n-1 already scheduled text classifications are selected to form one matrix; the elements of each row of the remaining part are sorted in ascending order to obtain a second matrix;
(6) the correlation factor of each text classification is calculated: the first ξ elements of each row of R_n are summed and the reciprocal of each sum is taken, which is equivalent to a simplified form and yields the column vector ψ_n = [ψ_{1,n} … ψ_{L,n}]^H;
(7) the n-th text classification is selected according to the formula;
(8) if n < N_T, return to (5); otherwise scheduling is complete; the correlation factor ψ_k is calculated from the actual interference experienced by each scheduled text classification; if a text classification is not scheduled, ψ_k = 0; the average correlation factor of each text classification k, k ∈ {1, …, L}, is then updated according to the formula and used to calculate the text classification scheduling weights in the next transmission cycle; after scheduling is complete, the base station notifies and activates the scheduled text classifications and carries out downlink data communication; in the overhead slot stage of the next transmission cycle (t+1), steps (1) to (7) are repeated;
in (6), the influence on each candidate text classification of the already selected text classifications and of the text classifications that may potentially be scheduled is taken into account, and the correlation factor of the text classification is calculated according to formula (1): the first ξ elements of each row of R_n are summed and the reciprocal is taken, which is equivalent to simplified formula (2);
this yields the column vector ψ_n = [ψ_{1,n} … ψ_{L,n}]^H; the set appearing in the formula is the set of potential text classifications that may subsequently be selected, and card(·) denotes the number of elements in a set;
in (7), the n-th text classification is selected according to the formula;
s_n denotes the label of the selected text classification; the weight in the formula is the scheduling weight of text classification k, and the averaged quantity is the average correlation factor of text classification k at the end of the previous transmission cycle; n is then updated to n+1;
as a text classification is scheduled over time, its average correlation factor increases, its weight μ_k decreases accordingly, and its scheduling priority falls;
in (8), the average correlation factor of text classification k is updated according to the formula,
where δ_c = 0.99;
Step 3: obtaining web pages from the internet;
Step 4: obtaining the website homepage or contact page corresponding to each web page;
Step 5: extracting point-of-interest names and the corresponding address information from the website homepage or contact page;
Step 6: associating the extracted point-of-interest names with the address information to obtain structured information;
Step 7: obtaining the structured information data, a data segment decomposition regular expression, and a data item name list corresponding to the data segment decomposition regular expression;
Step 8: decomposing the data segments of the data to be analyzed according to the data segment decomposition regular expression to generate data item values, and associating the data item values with the data item name list to form intermediate data pairs in which each data item name corresponds to a data item value; performing statistical analysis on the intermediate data pairs according to set statistical rules to obtain data analysis results;
the method of obtaining the website homepage comprises:
extracting the domain name address from the URL of the web page and performing redirect processing on the domain name address to obtain the website homepage corresponding to the web page;
extracting the domain name addresses one by one from the URLs of web pages across the whole web, adding them to a domain name address set after deduplication, and performing redirect processing on all domain name addresses in the set to obtain the corresponding website homepages; or using a homepage sample set of websites, statistically analyzing its link anchor text and URL style features to construct a homepage classifier, and analyzing the web pages with the homepage classifier to obtain all website homepages;
the method of obtaining the contact page corresponding to the web page comprises:
using a contact page sample set of websites, statistically analyzing its link anchor text, page title, and URL style features to construct a contact page classifier, and analyzing the web pages with the contact page classifier to obtain the contact information pages of all websites;
performing statistical analysis on the intermediate data pairs to obtain data analysis results comprises:
obtaining a result field from a statistical analysis result field table, the result field including a statistical formula and the statistical analysis result field table including at least one result field; and performing statistics on the corresponding data in the intermediate data pairs according to the statistical formula.
2. The internet big data analysis and extraction method according to claim 1, characterized in that in 1), development sample sampling draws data from the data source for use in sample annotation;
sample annotation classifies the samples by manual judgment and generates the development corpus and the test corpus;
the development corpus is used by model developers to write rules and output the model;
the test corpus is used for model testing: the annotations of the test corpus are compared with the model results to assess the accuracy of the model;
model debugging tests the output model on input sample data.
3. The internet big data analysis and extraction method according to claim 1, characterized in that in (7) the value of the weight is determined by the average correlation factor of the text classification.
4. The internet big data analysis and extraction method according to claim 1, characterized in that in step 8, the data decomposition of the data segments of the data to be analyzed according to the data segment decomposition regular expression is performed using a decomposition module; specifically comprising:
solving for the data values with the conjugate gradient method;
for the equation Lx = b, the key to the conjugate gradient method is to construct the conjugate vector p and to compute the corresponding residual; the method is solved iteratively, and in each iteration the new conjugate vector is given by the following formula:
5. The internet big data analysis and extraction method according to claim 4, characterized in that the coefficient of the conjugate direction is given by the following formula:
the new value of x and the residual are given by the following formulas:
x_k = x_{k-1} + s_k p_k
r_k = r_{k-1} + s_k L p_k
the element q_i of the vector Lp corresponding to point i is given by the following formula:
W_k is the neighborhood corresponding to data k, ‖w_k‖ is the size of the neighborhood, i is a data point in the neighborhood W_k surrounding data k, q_i is the i-th element of the vector q, I_i is the 3-dimensional vector corresponding to data i and represents the three channels R, G, and B, p_i is the element of the conjugate vector corresponding to data i, and μ_k is a 3-dimensional vector, the mean of the vectors I_i in the neighborhood W_k; the remaining quantities are the mean of the conjugate vector elements p_i over the neighborhood W_k, a 3-dimensional vector corresponding to data k, and a scalar corresponding to data k.
CN201710897053.1A 2017-09-28 2017-09-28 A kind of internet big data analyzes extracting method Pending CN107657032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710897053.1A CN107657032A (en) 2017-09-28 2017-09-28 A kind of internet big data analyzes extracting method

Publications (1)

Publication Number Publication Date
CN107657032A true CN107657032A (en) 2018-02-02

Family

ID=61117070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710897053.1A Pending CN107657032A (en) 2017-09-28 2017-09-28 A kind of internet big data analyzes extracting method

Country Status (1)

Country Link
CN (1) CN107657032A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110022550A1 (en) * 2009-07-24 2011-01-27 Yahoo! Inc. Mixing knowledge sources with auto learning for improved entity extraction
CN102880647A (en) * 2012-08-24 2013-01-16 北京百度网讯科技有限公司 Method and device for acquiring another name of organization
CN103763782A (en) * 2014-01-13 2014-04-30 西安电子科技大学 Dispatching method for MU-MIMO down link based on fairness related to weighting users
CN105760492A (en) * 2016-02-18 2016-07-13 广东睿江云计算股份有限公司 Data analysis method and device
CN106407357A (en) * 2016-09-07 2017-02-15 深圳市中易科技有限责任公司 Engineering method for developing text data rule model

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694462A (en) * 2018-03-30 2018-10-23 北京合享智慧科技有限公司 Patent value degree evaluation method and computer readable storage medium
CN108694462B (en) * 2018-03-30 2022-03-15 北京合享智慧科技有限公司 Patent retrieval result sorting method and computer-readable storage medium
CN108763507A (en) * 2018-05-30 2018-11-06 北京百度网讯科技有限公司 Enterprise's incidence relation method for digging and device
US11397778B2 (en) 2018-05-30 2022-07-26 Beijing Baidu Netcom Service and Technology Co., Ltd. Method and device for mining an enterprise relationship
CN109101531A (en) * 2018-06-22 2018-12-28 联想(北京)有限公司 Document handling method, apparatus and system
CN109101531B (en) * 2018-06-22 2022-05-31 联想(北京)有限公司 File processing method, device and system
CN109738438A (en) * 2018-12-29 2019-05-10 扬州大学 A kind of Wheat Leaves Senescence degree method for fast measuring
CN114117292A (en) * 2021-11-04 2022-03-01 中通服咨询设计研究院有限公司 Internet big data analysis and extraction method
CN114117292B (en) * 2021-11-04 2024-04-16 中通服咨询设计研究院有限公司 Internet big data analysis and extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180202