CN107657032A - Internet big data analysis and extraction method - Google Patents
- Publication number
- CN107657032A, CN201710897053.1A
- Authority
- CN
- China
- Prior art keywords
- data
- model
- text classification
- sample
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of data processing and discloses an internet big data analysis and extraction method. The method first determines the basic elements and basic direction of the extraction, so that analysis and extraction are accurately targeted. Contact information of physical organizations is extracted from site home pages and contact pages. Multi-source data are verified, integrated, and associated, and visual page segmentation is used to locate the web page blocks that contain address information, yielding structured geographic location information. This improves information accuracy and allows information to be recalled automatically for physical organizations across the whole internet, which reduces labor cost and improves recall. In addition, according to the needs of the text classification analysis, the invention sets a data segment decomposition regular expression and a data item name list corresponding to that regular expression, so that the data to be analyzed can be decomposed and extracted for different requirements, improving the generality of the data analysis.
Description
Technical field
The invention belongs to the technical field of data processing, and in particular relates to an internet big data analysis and extraction method.
Background
Modern society is developing rapidly: science and technology are advanced, information flows freely, communication between people is increasingly close, and life is more and more convenient. Big data is a product of this internet era. However, existing information extraction methods are usually template-based structured information extraction methods, in which manually written template rules are used to mine internet pages and obtain structured information. This approach consumes a great deal of labor and also requires the mined objects to share the same structural features within the page. Being limited by labor cost and by the need for uniform page structure, it cannot be applied at large scale. Moreover, in existing data analysis systems, each different text classification analysis requirement needs its own dedicated data analysis method; that is, a separate data analysis method must be designed for every analysis requirement. As a result, data analysis is inefficient and costly.
Developing a text data rule model mainly comprises model development, model testing, and model monitoring. In model development, rules are written from a development corpus, and the rules together with their corresponding classifications form the model. In model testing, sample data are used to test the model in order to assess its performance and accuracy. While the model is running, it also needs to be tested regularly to monitor its behavior in the actual production environment.
The existing model development process involves many people: business personnel who manage the samples, model developers who build the model, model testers who test it, and operations staff who maintain it after it goes online. These people are spread across departments, and offline exchange and communication among them is inconvenient. Moreover, rule model development is a process of continuous iteration and optimization, which further increases the communication workload.
In other words, rule model development requires close collaboration and continuous iteration, whereas the functions of the existing development process are scattered and hard to coordinate. A reasonable, systematic engineering method is therefore urgently needed to optimize the rule model development process, so that developers need only concentrate on the model itself.
In summary, the problems of the prior art are: existing information data mining consumes a great deal of labor, and because it is limited by labor cost and by the need for uniform page structure, it cannot be applied at large scale; data analysis in existing data analysis systems is inefficient and costly; and the functions of the existing text data rule model development process are scattered and hard to coordinate.
Summary of the invention
In view of the problems of the prior art, the invention provides an internet big data analysis and extraction method.
The invention is realized as follows:
An internet big data analysis and extraction method, comprising the following steps:
Step 1: determine the basic elements of the extraction. This specifically includes:
1) Sample the development samples and label them by manual interpretation to generate the development corpus: determine the sampling data source according to the specific business requirements; determine the sampling scheme and the sample size, with a sampling algorithm that gives uniform coverage of the data; implement the sampling algorithm and extract the data to a local file or temporary database for the subsequent labeling and modeling work; manually label which text classification the content of each sample belongs to.
2) Develop rules from the development corpus to generate the rule model: the rule model developer extracts text data rules from the development corpus, the text data rules including entities, concepts, and sentence patterns; implements the rule model as source code or script; and compiles the rule model source code into a binary model file to be called by the model execution platform.
3) Debug the model; if the classification needs adjusting, readjust the classification tree and return to 1): compile and execute the rule model in the development test environment; inspect the execution results and optimize the rule model source code or script; analyze, against the specific business requirements, whether the classification model is reasonable and modify it if it is not; if the classification model has been modified, repeat 1), resampling according to the new classification and labeling the samples.
4) Repeat 1) to 3) until an optimal rule model is output.
5) Sample the test samples and label them by manual interpretation to generate the test corpus.
6) Model testers test the model against the test corpus.
7) Optimize the model; if the classification needs adjusting, readjust the classification tree and return to 5). The iterative processing model used to optimize the model is given by the following formula:
where $X$ is the target classification data, $M$ is the system matrix, $G$ is the input classification data, $i$ is the iteration number, $X_i$ is the iteration result obtained after the $i$-th iteration, $\lambda$ is the convergence coefficient with $\lambda \in (0, 1)$, and $M^T$ is the transpose of matrix $M$.
8) Repeat 5) to 7) until an optimal rule model is output.
9) Bring the model online, periodically test it using the test corpus, and analyze the test results; if the model is no longer accurate for real-time data, repeat 1) to 8) to optimize the model.
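The update formula for step 7) is not reproduced in the text above. The symbol definitions (a system matrix $M$, its transpose $M^T$, input data $G$, and a convergence coefficient $\lambda \in (0, 1)$) are consistent with a Landweber-type iteration, $X_{i+1} = X_i + \lambda M^T (G - M X_i)$. The sketch below assumes that form; the update rule, the function name `landweber`, and the example values are illustrative assumptions and are not taken from the patent.

```python
def matvec(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    """Return the transpose of matrix A (a list of rows)."""
    return [list(col) for col in zip(*A)]


def landweber(M, G, lam, iterations):
    """Assumed update: X_{i+1} = X_i + lam * M^T (G - M X_i), starting from X_0 = 0."""
    Mt = transpose(M)
    X = [0.0] * len(M[0])
    for _ in range(iterations):
        residual = [g - mx for g, mx in zip(G, matvec(M, X))]
        correction = matvec(Mt, residual)
        X = [x + lam * c for x, c in zip(X, correction)]
    return X


# Hypothetical 2x2 system whose exact solution is X = [1, 2].
M = [[1.0, 0.0], [0.0, 0.5]]
G = [1.0, 1.0]
X = landweber(M, G, lam=0.9, iterations=200)
```

An iteration of this form converges when $\lambda$ is smaller than $2 / \sigma_{\max}(M)^2$, so restricting $\lambda$ to $(0, 1)$ is sufficient whenever the largest singular value of $M$ does not exceed $\sqrt{2}$.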
Step 2: obtain the related data information. This specifically includes:
(1) Initialize the selected text classification set to the empty set and the candidate text classification set to the full set of text classifications. As scheduling proceeds, the elements of the selected set and of the candidate set are continually updated; both sets are tracked as of the end of the $n$-th iteration, where $n = 1, \dots, N_T$ is the iteration index. Initialize $n = 1$.
(2) The internet base station performs singular value decomposition on the channel information matrix $H_k$ fed back by text classification $k$, where $\lambda_{k,1}$ denotes the singular value of the channel matrix of the $k$-th text classification, the zero vector has dimension $1 \times (N_T - 1)$, and the two right singular matrices are formed from the right singular vectors corresponding to the non-zero singular value $\lambda_{k,1}$ and to the zero singular values, respectively. Because $\mathrm{rank}(H_k) = 1$, $v_{i,1}$ is the first column vector of $V_i$, where $\mathrm{rank}(\cdot)$ denotes the rank of a matrix.
(3) The internet base station constructs intermediate matrices from the decomposed matrices, where $\mathrm{diag}(\cdot)$ denotes the diagonalization operation.
(4) The internet base station constructs the correlation matrix $R$, an $L \times L$ square matrix whose element in row $i$ and column $j$ is defined using $|\cdot|$, the modulus operation.
(5) Select from $R$ the $n - 1$ columns corresponding to the $n - 1$ text classifications already scheduled to form a matrix, and sort the elements of each row of the remaining part in ascending order to obtain a sorted matrix.
(6) Compute the correlation factor of each text classification: sum the first $\xi$ elements of each row of $R_n$ and take the reciprocal; in simplified form this yields the column vector $\psi_n = [\psi_{1,n} \dots \psi_{L,n}]^H$.
(7) Select the $n$-th text classification according to the selection formula.
(8) If $n < N_T$, return to (5). Otherwise scheduling is complete: compute the correlation factor $\psi_k$ from the actual interference experienced by each scheduled text classification; for a text classification that was not scheduled, $\psi_k = 0$. Then update, according to the update formula, the average correlation factor of each text classification $k$, $k \in \{1, \dots, L\}$, which is used to compute the scheduling weights for the next transmission cycle. After scheduling is complete, the base station notifies and activates the scheduled text classifications and carries out downlink data communication. In the overhead slot stage of the next transmission cycle $(t + 1)$, (1) to (7) are repeated.
In (6), the effect on a candidate text classification of both the already selected text classifications and the text classifications that may potentially be scheduled is taken into account. The correlation factor of the text classification is computed according to formula (1), that is, by summing the first $\xi$ elements of each row of $R_n$ and taking the reciprocal, which is equivalent to the simplified formula (2).
This gives the column vector $\psi_n = [\psi_{1,n} \dots \psi_{L,n}]^H$. The formulas also refer to the set of potential text classifications that may be selected later, and $\mathrm{card}(\cdot)$ denotes the number of elements in a set.
In (7), the $n$-th text classification is selected according to the selection formula, where $s_n$ denotes the label of the selected text classification, $\mu_k$ is the scheduling weight of text classification $k$, and the weight depends on the average correlation factor of text classification $k$ at the end of the previous transmission cycle. Then update $n = n + 1$.
When a text classification is scheduled, its average correlation factor increases over time, its weight $\mu_k$ decreases accordingly, and its scheduling priority falls.
In (8), the average correlation factor of text classification $k$ is updated by a formula in which $\delta_c = 0.99$.
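The formulas of Step 2 are not reproduced in the text, so the following is only a sketch of the two operations that the prose does describe: building an $L \times L$ correlation matrix from rank-1 channels, and taking the reciprocal of the sum of the first $\xi$ elements of each ascending-sorted row. It assumes that each $H_k$ is a $1 \times N_T$ row vector (so its first right singular vector is the normalized conjugate of that row) and that $R_{ij} = |v_i^H v_j|$. It also ignores the column selection of (5). All function names and channel values are hypothetical.

```python
import math


def unit_right_vector(h):
    """First right singular vector of a rank-1 row channel h (1 x N_T)."""
    norm = math.sqrt(sum(abs(x) ** 2 for x in h))
    return [x.conjugate() / norm for x in h]


def correlation_matrix(channels):
    """Assumed form: R[i][j] = |v_i^H v_j| for the unit vectors v_i."""
    vs = [unit_right_vector(h) for h in channels]
    return [[abs(sum(a.conjugate() * b for a, b in zip(vi, vj))) for vj in vs]
            for vi in vs]


def correlation_factors(R, xi):
    """Reciprocal of the sum of the first xi elements of each ascending-sorted row."""
    return [1.0 / sum(sorted(row)[:xi]) for row in R]


# Hypothetical channels for L = 3 text classifications and N_T = 2.
channels = [[1 + 0j, 0j], [1 + 0j, 1 + 0j], [0j, 1j]]
R = correlation_matrix(channels)
psi = correlation_factors(R, xi=2)
```

In this simplified form a row whose smallest $\xi$ entries sum to zero would cause a division by zero, so a real implementation would need a guard for mutually orthogonal channels.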
Step 3: obtain web pages from the internet.
Step 4: obtain the site home page or contact page corresponding to each web page.
Step 5: extract point-of-interest names and the corresponding address information from the site home page or contact page.
Step 6: associate the extracted point-of-interest names with the address information to obtain structured information.
Step 7: obtain the structured information data, the data segment decomposition regular expression, and the data item name list corresponding to the data segment decomposition regular expression.
Step 8: decompose the data segments in the data to be analyzed according to the data segment decomposition regular expression to generate data item values; associate the data item values with the data item name list to form intermediate data pairs in which each data item name corresponds to a data item value; then perform statistical analysis on the intermediate data pairs according to the set statistical rules to obtain the data analysis result.
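Steps 7 and 8 can be sketched as follows: a regular expression with one capture group per data item splits a data segment, and the captured values are zipped with the data item name list to form the intermediate data pairs. The regular expression, the item names, and the sample record are hypothetical; the patent does not specify them.

```python
import re

# Hypothetical data segment decomposition regular expression: one group per data item.
SEGMENT_PATTERN = re.compile(r"^(.+?)\s*\|\s*(.+?)\s*\|\s*([\d-]+)$")
# Hypothetical data item name list, in the same order as the groups above.
ITEM_NAMES = ["name", "address", "phone"]


def decompose(segment):
    """Split one data segment into (item name, item value) intermediate pairs."""
    match = SEGMENT_PATTERN.match(segment)
    if match is None:
        return []  # the segment does not fit the expression
    return list(zip(ITEM_NAMES, match.groups()))


pairs = decompose("Example Bookstore | 12 Sample Road | 010-5550100")
```

Keeping the expression and the name list as data rather than code is what lets the same decomposition routine serve different analysis requirements: a new requirement only needs a new expression and a matching name list.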
The method of obtaining the site home page includes:
extracting the domain name address from the URL of the web page and following redirects for that domain name address to obtain the site home page corresponding to the web page;
extracting the domain name addresses one by one from the URLs of web pages across the whole web, deduplicating them and adding them to a domain name address set, and following redirects for all domain name addresses in the set to obtain the corresponding site home pages; or, using a sample set of site home pages, statistically analyzing their link anchor text and URL style features to construct a home page classifier, and analyzing the web pages with the home page classifier to obtain all site home pages.
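The first two variants (extract the domain name address, deduplicate, then resolve) can be sketched with the standard library. The function name `homepage_candidates` and the example URLs are hypothetical, and the redirect-following request itself is left out.

```python
from urllib.parse import urlsplit


def homepage_candidates(page_urls):
    """Return deduplicated scheme://host/ roots for the given page URLs."""
    seen = set()
    roots = []
    for url in page_urls:
        parts = urlsplit(url)
        if not parts.scheme or not parts.hostname:
            continue  # skip relative or malformed URLs
        root = f"{parts.scheme}://{parts.hostname}/"
        if root not in seen:
            seen.add(root)
            roots.append(root)
    return roots


urls = [
    "http://www.example.com/news/2017/a.html",
    "http://www.example.com/contact.html",
    "https://shop.example.org/item?id=3",
]
roots = homepage_candidates(urls)
```

Each root would then be requested with an HTTP client that follows redirects, and the final URL taken as the site home page.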
The method of obtaining the contact page corresponding to the web page includes:
using a sample set of site contact pages, statistically analyzing their link anchor text, page title, and URL style features to construct a contact page classifier, and analyzing the web pages with the contact page classifier to obtain the contact information pages of all sites.
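As a rough stand-in for such a classifier, the sketch below scores a page by how many of the three named features (anchor text, page title, URL) contain a contact cue word. The cue words and the threshold are hypothetical; the patent describes deriving the features statistically from a sample set, which this keyword rule does not do.

```python
# Hypothetical cue words; a real classifier would learn these from the sample set.
CONTACT_CUES = ("contact", "联系我们", "联系方式")


def contact_score(anchor_text, title, url):
    """Count how many of the three features contain a contact cue word."""
    features = (anchor_text.lower(), title.lower(), url.lower())
    return sum(any(cue in feature for cue in CONTACT_CUES) for feature in features)


def is_contact_page(anchor_text, title, url, threshold=2):
    """Treat a page as a contact page when enough features agree."""
    return contact_score(anchor_text, title, url) >= threshold
```

For example, `is_contact_page("Contact us", "Contact - Example", "http://www.example.com/contact.html")` is true under these assumptions, because all three features contain a cue word.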
Performing statistical analysis on the intermediate data pairs to obtain the data analysis result includes:
obtaining the result fields in the statistical analysis result field list, where each result field includes a statistical formula and the list includes at least one result field; and computing statistics over the corresponding data in the intermediate data pairs according to the statistical formula.
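One way to read this is that each result field names a data item and a statistic to apply to it. The sketch below assumes that reading; the field names, the two statistic identifiers, and the sample records are all hypothetical.

```python
from collections import Counter

# Hypothetical intermediate data pairs produced by the decomposition step.
records = [
    [("city", "Beijing"), ("category", "bookstore")],
    [("city", "Beijing"), ("category", "cafe")],
    [("city", "Shanghai"), ("category", "bookstore")],
]

# Hypothetical result field list: each field names a data item and a statistic.
RESULT_FIELDS = [
    {"field": "records_per_city", "item": "city", "formula": "count_by_value"},
    {"field": "distinct_categories", "item": "category", "formula": "count_distinct"},
]


def run_statistics(records, result_fields):
    """Evaluate each result field's statistic over the matching item values."""
    results = {}
    for spec in result_fields:
        values = [v for pairs in records for name, v in pairs if name == spec["item"]]
        if spec["formula"] == "count_by_value":
            results[spec["field"]] = dict(Counter(values))
        elif spec["formula"] == "count_distinct":
            results[spec["field"]] = len(set(values))
    return results


stats = run_statistics(records, RESULT_FIELDS)
```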
Further, in 1): development sample sampling draws data from the data source for use in sample labeling;
sample labeling classifies the samples by manual interpretation to generate the development corpus and the test corpus;
the development corpus is used by model developers to write rules and output the model;
the test corpus is used for model testing, in which the labels of the test corpus are compared with the model results to assess the accuracy of the model;
model debugging tests the output model with input sample data.
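The patent asks for a sampling algorithm that covers the data uniformly but does not name one. Stratified sampling is one common way to achieve that, and the sketch below assumes it: records are grouped by a key and the same number is drawn from every group. The function name, the `source` key, and the sample records are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_group, seed=0):
    """Draw up to per_group records from every group so each group is covered."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Hypothetical records tagged with the channel they came from.
records = [{"source": s, "text": f"{s}-{i}"}
           for s in ("forum", "news", "blog") for i in range(50)]
sample = stratified_sample(records, key=lambda r: r["source"], per_group=10)
```

The same routine can be run twice with different seeds or disjoint inputs to draw the development samples and the test samples separately.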
Further, in (7), the value of the weight is determined by the average correlation factor of the text classification.
Further, in step 8, the decomposition of the data segments in the data to be analyzed according to the data segment decomposition regular expression is performed by a decomposition module, and specifically includes:
solving for the data values using the conjugate gradient method;
for the equation $Lx = b$, the key to the conjugate gradient method is to construct the conjugate vector $p$ and compute the corresponding residual; the conjugate gradient method solves iteratively, and in each iteration the new conjugate vector is solved by the following formula:
Further, the coefficient of the conjugate direction is solved by the following formula:
The new value of $x$ and the residual are solved by the following formulas:
$x_k = x_{k-1} + s_k p_k$
$r_k = r_{k-1} + s_k L p_k$;
The element $q_i$ of the vector $Lp$ corresponding to point $i$ is solved by the following formula:
$W_k$ is the neighborhood corresponding to data $k$, $\|w_k\|$ is the size of the neighborhood, $i$ is a data point in the neighborhood $W_k$ surrounding data $k$, $q_i$ is the $i$-th element of the vector $q$, $I_i$ is the 3-dimensional vector corresponding to data $i$, representing the three channels R, G, and B, $p_i$ is the element of the conjugate vector corresponding to data $i$, and $\mu_k$ is a 3-dimensional vector, the mean of the vectors $I_i$ in neighborhood $W_k$. The formula also uses the mean of the conjugate vector elements $p_i$ over the elements $i$ of neighborhood $W_k$, a 3-dimensional vector corresponding to data $k$, and a scalar corresponding to data $k$.
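The formulas for the new conjugate vector and for the coefficient are not reproduced in the text. The sketch below therefore uses the textbook conjugate gradient recurrences, written with the residual defined as $r = Lx - b$ so that the residual update carries the plus sign shown above. It also replaces the neighborhood-based evaluation of $Lp$ with a plain dense matrix-vector product. Both choices, the function name, and the example system are assumptions.

```python
def conjugate_gradient(L, b, iterations=50, tol=1e-10):
    """Solve L x = b for symmetric positive definite L, with residual r = L x - b."""
    n = len(b)
    x = [0.0] * n
    r = [-bi for bi in b]          # r_0 = L x_0 - b with x_0 = 0
    p = [-ri for ri in r]          # first conjugate direction
    rr = sum(ri * ri for ri in r)  # squared norm of the residual
    for _ in range(iterations):
        if rr < tol:
            break
        Lp = [sum(a * pj for a, pj in zip(row, p)) for row in L]
        s = rr / sum(pi * qi for pi, qi in zip(p, Lp))      # step size s_k
        x = [xi + s * pi for xi, pi in zip(x, p)]           # x_k = x_{k-1} + s_k p_k
        r = [ri + s * qi for ri, qi in zip(r, Lp)]          # r_k = r_{k-1} + s_k L p_k
        rr_new = sum(ri * ri for ri in r)
        p = [-ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


# Hypothetical 2x2 system with exact solution x = [1/11, 7/11].
L = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(L, b)
```

Only products of $L$ with a vector are needed, which is why the method suits cases where $Lp$ is computed element by element from local neighborhoods rather than from an explicitly stored matrix.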
The advantages and positive effects of the invention are as follows. The invention uses the organizational structure features and the information semantic features of physical organizations on the internet to determine the basic elements and basic direction of the internet big data analysis and extraction method, so that analysis and extraction are accurately targeted. Contact information of physical organizations is extracted from site home pages and contact pages; multi-source data are verified, integrated, and associated, and visual page segmentation is used to obtain the web page blocks containing address information, yielding structured geographic location information. This improves information accuracy and allows information to be recalled automatically for physical organizations across the whole internet, reducing labor cost and improving recall. At the same time, according to the needs of the text classification analysis, the invention sets a data segment decomposition regular expression and a corresponding data item name list, so that the data to be analyzed can be decomposed and extracted for different requirements, improving the generality of the data analysis.
The invention organizes the rule model development process into sequential and cyclic flows, so that each functional module iterates internally until it is optimal, the functional modules are combined in sequence to avoid errors, and the inputs and outputs of every functional module are handled effectively. Compared with the original method, the invention offers better collaboration and higher efficiency.
Multi-person collaborative development of the text rule model: for a complex text data mining project, the classification model has anywhere from hundreds to thousands of classification nodes, and model development of this complexity needs to be completed jointly by a team. In this method the model development work can be divided by classification, with different people responsible for different classifications, which enables parallel collaborative development and improves the development efficiency of the project.
Multi-round iteration to continually optimize the model: a complex model needs repeated iterative optimization to meet its accuracy and coverage requirements, a process in which data exploration and model optimization alternate. This method provides multi-round iterative development and optimization; through multiple rounds of development and testing, the accuracy and coverage of the model are continually raised until they meet the business requirements.
A test stage to guarantee the overall effect of the model: this test stage differs from the testing done during iteration, which still belongs to verification and debugging within model development. The separate test stage is the acceptance test performed after model development is complete and before the model is formally released to the production environment. Through this stage the method makes a final check of the model's effect before release; the model can be released and brought online only if it meets the expected accuracy and coverage.
Online and offline stages to guarantee that the model version is correct: over the whole life cycle of a model it is continually adjusted and optimized, so many different model versions arise. The online and offline stages of this method check and confirm the model version, ensuring that the wrong version is not used.
Regular monitoring after going online to guarantee the accuracy and coverage of the model: as the model runs over a long period, new data continually enter it, and cases inevitably arise that the existing model rules cannot fully cover, so accuracy and coverage tend to decline. This method provides a mechanism for regularly monitoring the model; when the model's effect falls to a certain threshold, a new round of model optimization is started, which maintains the effect of model processing.
The method of the invention for obtaining related data information can obtain accurate information in real time, with an accuracy of 96.73%, nearly 4 percentage points higher than the prior art.
The decomposition method provided by the invention achieves good results in complex regions.
Brief description of the drawings
Fig. 1 is a flow chart of the internet big data analysis and extraction method provided by an embodiment of the invention.
Detailed description
In order to make the purpose, technical solution, and advantages of the invention clearer, the invention is further described below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
The application principle of the invention is further described below with reference to the accompanying drawing.
As shown in Fig. 1, the internet big data analysis and extraction method provided by the invention comprises the following steps:
S101: determine the basic elements of the internet big data analysis and extraction method;
S102: determine the basic direction of the internet big data analysis and extraction method, which helps in obtaining the related data information;
S103: obtain web pages from the internet;
S104: obtain the site home page or contact page corresponding to each web page;
S105: extract point-of-interest names and the corresponding address information from the site home page or contact page;
S106: associate the extracted point-of-interest names with the address information to obtain structured information;
S107: obtain the structured information data, the data segment decomposition regular expression, and the data item name list corresponding to the data segment decomposition regular expression;
S108: decompose the data segments in the data to be analyzed according to the data segment decomposition regular expression to generate data item values; associate the data item values with the data item name list to form intermediate data pairs in which each data item name corresponds to a data item value; then perform statistical analysis on the intermediate data pairs according to the set statistical rules to obtain the data analysis result.
The method of obtaining the site home page provided by the invention is as follows:
extracting the domain name address from the URL of the web page and following redirects for that domain name address to obtain the site home page corresponding to the web page;
extracting the domain name addresses one by one from the URLs of web pages across the whole web, deduplicating them and adding them to a domain name address set, and following redirects for all domain name addresses in the set to obtain the corresponding site home pages; or, using a sample set of site home pages, statistically analyzing their link anchor text and URL style features to construct a home page classifier, and analyzing the web pages with the home page classifier to obtain all site home pages.
The method of obtaining the contact page corresponding to the web page provided by the invention is as follows:
using a sample set of site contact pages, statistically analyzing their link anchor text, page title, and URL style features to construct a contact page classifier, and analyzing the web pages with the contact page classifier to obtain the contact information pages of all sites.
The statistical analysis of the intermediate data pairs provided by the invention, which obtains the data analysis result, includes:
obtaining the result fields in the statistical analysis result field list, where each result field includes a statistical formula and the list includes at least one result field; and computing statistics over the corresponding data in the intermediate data pairs according to the statistical formula.
The invention is further described below with reference to a specific analysis.
1. Determine the basic elements of the extraction. This specifically includes:
1) Sample the development samples and label them by manual interpretation to generate the development corpus: determine the sampling data source according to the specific business requirements; determine the sampling scheme and the sample size, with a sampling algorithm that gives uniform coverage of the data; implement the sampling algorithm and extract the data to a local file or temporary database for the subsequent labeling and modeling work; manually label which text classification the content of each sample belongs to.
2) Develop rules from the development corpus to generate the rule model: the rule model developer extracts text data rules from the development corpus, the text data rules including entities, concepts, and sentence patterns; implements the rule model as source code or script; and compiles the rule model source code into a binary model file to be called by the model execution platform.
3) Debug the model; if the classification needs adjusting, readjust the classification tree and return to 1): compile and execute the rule model in the development test environment; inspect the execution results and optimize the rule model source code or script; analyze, against the specific business requirements, whether the classification model is reasonable and modify it if it is not; if the classification model has been modified, repeat 1), resampling according to the new classification and labeling the samples.
4) Repeat 1) to 3) until an optimal rule model is output.
5) Sample the test samples and label them by manual interpretation to generate the test corpus.
6) Model testers test the model against the test corpus.
7) Optimize the model; if the classification needs adjusting, readjust the classification tree and return to 5). The iterative processing model used to optimize the model is given by the following formula:
where $X$ is the target classification data, $M$ is the system matrix, $G$ is the input classification data, $i$ is the iteration number, $X_i$ is the iteration result obtained after the $i$-th iteration, $\lambda$ is the convergence coefficient with $\lambda \in (0, 1)$, and $M^T$ is the transpose of matrix $M$.
8) Repeat 5) to 7) until an optimal rule model is output.
9) Bring the model online, periodically test it using the test corpus, and analyze the test results; if the model is no longer accurate for real-time data, repeat 1) to 8) to optimize the model.
2nd, related data message is obtained;Specifically include:
(1) this category set of selection is combined into empty set for initialization, and candidate's text classification collection is combined into text classification complete or collected works, with tune
The progress of journey is spent, this category set of selection is closed and the element of candidate's text classification set is constantly updated,WithRespectively exist
Candidate and this category set of selection conjunction, n=1 ..., N at the end of nth iterationTFor the number of iteration, n=1 is initialized;
(2) the channel information matrix H that Internet base stations feed back to text classification kkCarry out singular value decompositionWherein, λk,1Represent the channel matrix of k-th of text classification
Singular value,Represent dimension 1 × (NT- 1) null vector,WithRespectively by with non-zero singular value λk,1With zero singular value
Corresponding right singular value vector is formed, because rank (Hk)=1, sovi,1For ViFirst column vector, its
In, rank () represents to seek rank of matrix;
(3) Internet base stations are according to the matrix construction intermediary matrix after decompositionWithWith
AndWherein, diag () represents diagonalization operation;
(4) Internet base stations construction correlation matrix R, R are L × L square formations, and the element of the i-th row jth row isWherein, | | represent modulus computing;
(5) n-1 row corresponding with n-1 scheduled text classification are selected from R, form matrixWill be surplus
Remaining part carries out ascending order arrangement to row element respectively, obtains matrixI.e.
(6) correlation factor of text classification is calculatedI.e. to RnIn sum and take down respectively per the preceding ξ element of a line
Number, it is equivalent to simplification and obtains column vector ψn=[ψ1,n … ψL,n]Η;
(7) n-th of text classification is selected according to formula;
(8) if n < NT, return (5);Otherwise scheduling is completed, according to scheduled text classificationThe reality being subject to
Interference calculation correlation factor ψk;If text classification is not scheduledψk=0, and update text classification k, k ∈ according to formula
{ 1 ..., L } average correlation factor, for calculating the text classification dispatch weight in next transmission cycle, after the completion of scheduling, base
Notice of standing activation text classification simultaneously carries out downlink data communication, in the overhead slot stage of next transmission cycle (t+1), repeats
Perform (1)~(7);
Candidate's text classification is considered in (6) by the text that has selected text classification and may potentially be scheduled
The influence of classification calculates the correlation factor of text classification according to formula (1)I.e. to RnIn asked respectively per the preceding ξ element of a line
With and it is inverted, be equivalent to simplified style (2);
Obtain column vector ψn=[ψ1,n … ψL,n]Η, wherein,Represent potential, subsequently possible selected text point
Class set, card () represent the number of the element in set;
In (7), the n-th text classification is selected according to the formula [formula omitted];
$s_n$ denotes the index of the selected text classification, $\mu_k$ is the scheduling weight of text classification k, and [symbol omitted] is the average correlation factor of text classification k at the end of the previous transmission cycle; n is then updated to n + 1;
As a text classification continues to be scheduled over time, its average correlation factor [symbol omitted] increases, its weight $\mu_k$ decreases accordingly, and its scheduling priority falls;
In (8), the average correlation factor of text classification k is updated according to the formula [formula omitted], where $\delta_c = 0.99$;
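The update formula for the average correlation factor is not reproduced in the text; only the constant $\delta_c = 0.99$ survives. The sketch below therefore assumes, purely for illustration, a common exponential moving average in which $\delta_c$ acts as a forgetting factor. The function name `update_average_factor` and the assumed form `delta_c * previous + (1 - delta_c) * psi_k` are hypothetical and may differ from the patent's actual formula.

```python
def update_average_factor(avg_prev: float, psi_k: float,
                          delta_c: float = 0.99) -> float:
    """Assumed exponential moving average of the correlation factor.

    avg_prev: average factor at the end of the previous transmission cycle.
    psi_k:    factor measured in the current cycle (0 when not scheduled).
    """
    return delta_c * avg_prev + (1.0 - delta_c) * psi_k


# Two cycles in which classification k is scheduled, then one in which it is not.
avg = 1.0
history = []
for psi_k in [2.0, 2.0, 0.0]:
    avg = update_average_factor(avg, psi_k)
    history.append(avg)
```

Under this assumed form the average drifts upward while the classification keeps being scheduled with a factor above the current average and decays slowly when it is skipped, which matches the qualitative behaviour the text describes for the weight $\mu_k$.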
In 1), development-sample sampling draws data from the data source for use in sample labeling;
Sample labeling classifies the samples by manual interpretation and produces the development corpus and the test corpus;
The development corpus is used by model developers to write rules and output the model;
The test corpus is used for model testing: its labels are compared with the model's results to assess the accuracy of the model;
Model debugging: after the model is output, sample data is fed in for testing.
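Comparing the test-corpus labels with the model's results amounts to computing an agreement rate. The following minimal sketch shows one way to do that; the function name `accuracy` and the sample category names are illustrative assumptions, not part of the source.

```python
from typing import Sequence


def accuracy(labels: Sequence[str], predictions: Sequence[str]) -> float:
    """Fraction of test samples whose model result equals the manual label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one labeled sample is required")
    matches = sum(1 for y, p in zip(labels, predictions) if y == p)
    return matches / len(labels)


# Hypothetical manual labels and model output for four test samples.
test_labels = ["finance", "sports", "sports", "health"]
model_output = ["finance", "sports", "health", "health"]
score = accuracy(test_labels, model_output)  # three of the four results agree
```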
In (7), the value of the weight is determined by the average correlation factor of the text classification.
In Step 8, the data segments in the data to be analyzed are decomposed according to the data-segment decomposition regular expression, using a decomposition module. This specifically includes:
Solving for the data values using the conjugate gradient method;
For the equation Lx = b, the key to the conjugate gradient method is to construct the conjugate vector p and compute the corresponding residual. The method is solved iteratively; in each iteration, the new conjugate vector is obtained from the following formula: [formula omitted]
Further, the coefficient of the conjugate direction is obtained from the following formula: [formula omitted]
The new value of x and the residual are obtained from the following formulas:
$x_k = x_{k-1} + s_k p_k$
$r_k = r_{k-1} + s_k L p_k$;
The element $q_i$ of the vector Lp corresponding to point i is obtained from the following formula: [formula omitted]
$W_k$ is the neighborhood corresponding to data k; $\lVert w_k \rVert$ is the size of the neighborhood; i is a data point in the neighborhood $W_k$ surrounding data k; $q_i$ is the i-th element of the vector q; $I_i$ is the 3-dimensional vector corresponding to data i, representing the three channels R, G, and B; $p_i$ is the element of the conjugate vector corresponding to data i; $\mu_k$ is a 3-dimensional vector, the mean of the vectors $I_i$ in neighborhood $W_k$; [symbol omitted] is the mean of the conjugate-vector elements $p_i$ over the elements i of neighborhood $W_k$; [symbol omitted] is the 3-dimensional vector corresponding to data k; and [symbol omitted] is the scalar corresponding to data k.
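The iteration described above can be illustrated with a small dense solver. This is a sketch under stated assumptions: the formulas for the new conjugate vector and for the coefficient $s_k$ are not reproduced in the text, so the standard textbook choices are used, with the residual defined as r = Lx − b so that the update r ← r + s·Lp carries the same plus sign as the source. The function names are hypothetical, the matrix is assumed symmetric positive definite, and the neighborhood-based computation of $q_i$ is replaced by a plain matrix-vector product.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(L: Matrix, p: Vector) -> Vector:
    """Dense product Lp (stands in for the neighborhood-based q_i)."""
    return [sum(a * b for a, b in zip(row, p)) for row in L]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(L: Matrix, b: Vector,
                       tol: float = 1e-10, max_iter: int = 1000) -> Vector:
    """Solve Lx = b for a symmetric positive definite L."""
    x = [0.0] * len(b)
    r = [lx - bi for lx, bi in zip(matvec(L, x), b)]  # residual r = Lx - b
    p = [-ri for ri in r]                             # first conjugate vector
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        Lp = matvec(L, p)
        s = rr / dot(p, Lp)                           # coefficient s_k (textbook form)
        x = [xi + s * pi for xi, pi in zip(x, p)]     # x_k = x_{k-1} + s_k p_k
        r = [ri + s * qi for ri, qi in zip(r, Lp)]    # r_k = r_{k-1} + s_k L p_k
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [-ri + beta * pi for ri, pi in zip(r, p)] # next conjugate vector
        rr = rr_new
    return x


solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

For an n × n system the method needs at most n iterations in exact arithmetic, so the 2 × 2 example terminates after two updates.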
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (5)
1. An internet big data analysis and extraction method, characterized in that the method comprises the following steps:
Step 1: determine the basic elements to be extracted, specifically including:
1) Sample the development samples and label them by manual interpretation to produce the development corpus: determine the data source to sample according to the specific business requirements; determine the sampling scheme and sample size, using a sampling algorithm that covers the data uniformly; implement the sampling algorithm and extract the data to a local file or temporary database for use in subsequent labeling and modeling; manually label which text classification the context of each sample belongs to;
2) Develop rules from the development corpus and generate the rule model: the rule-model developer extracts text-data rules from the development corpus, the text-data rules including entities, concepts, and sentence patterns; develops the rule-model source code or script; and compiles the rule-model source code to generate a binary pattern file that the model execution platform subsequently calls;
3) Debug the model; if the classification needs adjusting, readjust the classification tree and return to 1): compile and execute the rule model in the development and test environment; check the execution results of the rule model and optimize the rule-model source code or script; analyze, according to the specific business requirements, whether the classification model is reasonable, and modify it if it is not; if the classification model has been modified, repeat 1), resampling according to the new classification and labeling the samples;
4) Repeat 1) to 3) until the optimal rule model is output;
5) Sample the test samples and label them by manual interpretation to produce the test corpus;
6) Model testers test the model against the test corpus;
7) Optimize the model; if the classification needs adjusting, readjust the classification tree and return to 5); the iterative processing model used to optimize the model is: [formula omitted]
where X is the target classification data, M is the system matrix, G is the input classification data, i is the iteration index, and $X_i$ is the result obtained after the i-th iteration; λ is the convergence coefficient, with λ ∈ (0, 1), and $M^T$ denotes the transpose of the matrix M;
8) Repeat 5) to 7) until the optimal rule model is output;
9) Put the model online, test it periodically with the test corpus, and analyze the test results; if the model is no longer accurate for real-time data, repeat 1) to 8) to optimize the model;
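The iterative processing model in sub-step 7) is not reproduced in the text. Its symbols (a system matrix M, its transpose, input data G, and a convergence coefficient λ ∈ (0, 1)) are consistent with a Landweber-type iteration of the form $X_{i+1} = X_i + \lambda M^T (G - M X_i)$. The sketch below assumes that form purely for illustration; the function name `landweber`, the zero starting point, and the fixed iteration count are hypothetical, and the patent's actual formula may differ.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def landweber(M: Matrix, G: Vector, lam: float = 0.5,
              iterations: int = 200) -> Vector:
    """Assumed update: X <- X + lam * M^T (G - M X), starting from X = 0."""
    rows, cols = len(M), len(M[0])
    X = [0.0] * cols
    for _ in range(iterations):
        # Residual between the input data and the current prediction M X.
        residual = [G[r] - sum(M[r][c] * X[c] for c in range(cols))
                    for r in range(rows)]
        # Back-project the residual through M^T and take a damped step.
        X = [X[c] + lam * sum(M[r][c] * residual[r] for r in range(rows))
             for c in range(cols)]
    return X


estimate = landweber([[1.0, 0.0], [0.0, 0.5]], [2.0, 1.0], lam=0.9)
```

An iteration of this kind converges only when λ times the largest eigenvalue of $M^T M$ is below 2; the toy matrix here satisfies that condition, and the estimate approaches X = (2, 2).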
Step 2: obtain the relevant data information, specifically including:
(1) Initialize the selected text classification set to the empty set and the candidate text classification set to the full set of text classifications. As scheduling proceeds, the elements of the selected set and of the candidate set are continually updated; [symbol omitted] and [symbol omitted] are respectively the candidate set and the selected set at the end of the n-th iteration, n = 1, ..., $N_T$, where $N_T$ is the number of iterations; initialize n = 1;
(2) The Internet base station performs singular value decomposition on the channel information matrix $H_k$ fed back by text classification k: [formula omitted], where $\lambda_{k,1}$ denotes the singular value of the channel matrix of the k-th text classification, [symbol omitted] denotes a zero vector of dimension 1 × ($N_T$ − 1), and [symbol omitted] and [symbol omitted] are formed from the right singular vectors corresponding to the non-zero singular value $\lambda_{k,1}$ and to the zero singular values, respectively; because rank($H_k$) = 1, [formula omitted]; $v_{i,1}$ is the first column vector of $V_i$, and rank(·) denotes the rank of a matrix;
(3) The Internet base station constructs intermediate matrices from the decomposed matrices: [formulas omitted], where diag(·) denotes the diagonalization operation;
(4) The Internet base station constructs a correlation matrix R, an L × L square matrix whose element in row i, column j is [formula omitted], where |·| denotes the modulus operation;
(5) The n−1 columns of R corresponding to the n−1 text classifications already scheduled are selected to form a matrix [symbol omitted]; the elements of each row of the remaining part are sorted in ascending order to obtain a matrix [symbol omitted], that is, $R_n$ [formula omitted];
(6) The correlation factor of each text classification is computed [formula omitted]: the first ξ elements of each row of $R_n$ are summed and the reciprocal of each sum is taken, which in simplified form yields the column vector $\psi_n = [\psi_{1,n} \;\dots\; \psi_{L,n}]^{H}$;
(7) The n-th text classification is selected according to the selection formula;
(8) If $n < N_T$, return to (5); otherwise scheduling is complete, and the correlation factor $\psi_k$ is computed from the actual interference experienced by each scheduled text classification [symbol omitted]. If text classification k is not scheduled, $\psi_k = 0$. The average correlation factor of each text classification k, k ∈ {1, ..., L}, is then updated according to the update formula and is used to compute the scheduling weights of the text classifications in the next transmission cycle. After scheduling completes, the base station notifies and activates the scheduled text classifications and carries out downlink data communication; in the overhead-slot stage of the next transmission cycle (t+1), (1) to (7) are repeated;
In (6), the influence on a candidate text classification of both the text classifications already selected and those that may potentially be scheduled is taken into account. The correlation factor of the text classification is computed according to formula (1) [formula omitted]; that is, the first ξ elements of each row of $R_n$ are summed and the reciprocal is taken, which is equivalent to the simplified formula (2) [formula omitted];
This yields the column vector $\psi_n = [\psi_{1,n} \;\dots\; \psi_{L,n}]^{H}$, where [symbol omitted] denotes the set of potential text classifications that may subsequently be selected, and card(·) denotes the number of elements in a set;
In (7), the n-th text classification is selected according to the formula [formula omitted];
$s_n$ denotes the index of the selected text classification, $\mu_k$ is the scheduling weight of text classification k, and [symbol omitted] is the average correlation factor of text classification k at the end of the previous transmission cycle; n is then updated to n + 1;
As a text classification continues to be scheduled over time, its average correlation factor [symbol omitted] increases, its weight $\mu_k$ decreases accordingly, and its scheduling priority falls;
In (8), the average correlation factor of text classification k is updated according to the formula [formula omitted], where $\delta_c = 0.99$;
Step 3: obtain web pages from the internet;
Step 4: obtain the website homepage or contact page corresponding to each web page;
Step 5: extract point-of-interest names and the corresponding address information from the website homepage or contact page;
Step 6: associate the extracted point-of-interest names with the address information to obtain structured information;
Step 7: obtain the structured information data, the data-segment decomposition regular expression, and the data item name list corresponding to the data-segment decomposition regular expression;
Step 8: decompose the data segments in the data to be analyzed according to the data-segment decomposition regular expression to generate data item values; associate the data item values with the data item name list to form intermediate data pairs in which each data item name corresponds to a data item value; and perform statistical analysis on the intermediate data pairs according to the set statistical rules to obtain the data analysis result;
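Steps 7 and 8 pair the capture groups of a regular expression with a parallel list of item names. A minimal Python sketch of that idea follows; the pattern, the item names `site`, `date`, and `visits`, and the sample segments are hypothetical examples, not values taken from the source.

```python
import re
from typing import Dict, Optional

# Hypothetical decomposition expression and its matching data item name list.
SEGMENT_PATTERN = re.compile(r"(\S+)\s+(\d{4}-\d{2}-\d{2})\s+(\d+)")
ITEM_NAMES = ["site", "date", "visits"]


def decompose(segment: str) -> Optional[Dict[str, str]]:
    """Split one data segment into name/value pairs, or None if it does not match."""
    match = SEGMENT_PATTERN.fullmatch(segment.strip())
    if match is None:
        return None
    # Associate each captured data item value with its data item name.
    return dict(zip(ITEM_NAMES, match.groups()))


segments = [
    "example.com 2017-09-28 120",
    "example.org 2017-09-28 80",
    "malformed line",
]
pairs = [p for p in (decompose(s) for s in segments) if p is not None]
total_visits = sum(int(p["visits"]) for p in pairs)
```

Because the expression and the name list are both data rather than code, a different analysis requirement only needs a different pattern and a different list, which is the generality the method aims for.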
The method for obtaining the website homepage includes:
extracting the domain name address from the URL of the web page and performing redirect processing on the domain name address to obtain the website homepage corresponding to the web page;
extracting the domain name addresses one by one from the URLs of web pages across the whole web, adding them to a domain name address set after deduplication, and performing redirect processing on all domain name addresses in the set to obtain the corresponding website homepages; or
using a sample set of website homepages, statistically analyzing their link anchor text and URL style features to construct a homepage classifier, and analyzing the web pages with the homepage classifier to obtain all website homepages;
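The domain-extraction and deduplication part of the homepage method can be sketched with the Python standard library. The helper name `domain_addresses` and the sample URLs are assumptions for illustration; the redirect step, which would require fetching each candidate address over the network, is deliberately left out.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def domain_addresses(urls: Iterable[str]) -> List[str]:
    """Return the distinct host names found in `urls`, in first-seen order."""
    seen = set()
    ordered = []
    for url in urls:
        host = urlsplit(url).hostname  # None when the string has no network location
        if host and host not in seen:
            seen.add(host)
            ordered.append(host)
    return ordered


urls = [
    "http://www.example.com/about/team.html",
    "https://www.example.com/contact",
    "http://shop.example.org/item?id=3",
    "not a url",
]
domains = domain_addresses(urls)
# Candidate homepage addresses to be resolved by the redirect step.
candidates = ["http://" + d + "/" for d in domains]
```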
The method for obtaining the contact page corresponding to the web page includes:
using a sample set of website contact pages, statistically analyzing their link anchor text, page title, and URL style features to construct a contact-page classifier, and analyzing the web pages with the contact-page classifier to obtain the contact information pages of all websites;
Performing statistical analysis on the intermediate data pairs to obtain the data analysis result includes:
obtaining the result fields in a statistical analysis result field list, where each result field includes a statistical formula and the list includes at least one result field; and computing statistics over the corresponding data in the intermediate data pairs according to the statistical formula.
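A result field list of this kind can be modelled as a mapping from each result field to the data item it reads and the formula it applies. The sketch below is one hypothetical arrangement; the field names, the `analyze` helper, and the sample pairs are illustrative assumptions rather than the structure used in the patent.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Formula = Callable[[Sequence[float]], float]

# Hypothetical result field list: field name -> (data item name, statistical formula).
RESULT_FIELDS: Dict[str, Tuple[str, Formula]] = {
    "total_visits": ("visits", sum),
    "mean_visits": ("visits", mean),
    "max_visits": ("visits", max),
}


def analyze(pairs: List[Dict[str, str]],
            fields: Dict[str, Tuple[str, Formula]]) -> Dict[str, Optional[float]]:
    """Apply each field's formula to the matching values in the data pairs."""
    results: Dict[str, Optional[float]] = {}
    for field_name, (item_name, formula) in fields.items():
        values = [float(p[item_name]) for p in pairs if item_name in p]
        results[field_name] = formula(values) if values else None
    return results


pairs = [
    {"site": "example.com", "visits": "120"},
    {"site": "example.org", "visits": "80"},
]
report = analyze(pairs, RESULT_FIELDS)
```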
2. The internet big data analysis and extraction method according to claim 1, characterized in that, in 1), development-sample sampling draws data from the data source for use in sample labeling;
sample labeling classifies the samples by manual interpretation and produces the development corpus and the test corpus;
the development corpus is used by model developers to write rules and output the model;
the test corpus is used for model testing: its labels are compared with the model's results to assess the accuracy of the model;
model debugging: after the model is output, sample data is fed in for testing.
3. The internet big data analysis and extraction method according to claim 1, characterized in that, in (7), the value of the weight is determined by the average correlation factor of the text classification.
4. The internet big data analysis and extraction method according to claim 1, characterized in that, in Step 8, the data segments in the data to be analyzed are decomposed according to the data-segment decomposition regular expression, using a decomposition module; this specifically includes:
solving for the data values using the conjugate gradient method;
for the equation Lx = b, the key to the conjugate gradient method is to construct the conjugate vector p and compute the corresponding residual; the method is solved iteratively, and in each iteration the new conjugate vector is obtained from the following formula: [formula omitted]
5. The internet big data analysis and extraction method according to claim 4, characterized in that the coefficient of the conjugate direction is obtained from the following formula: [formula omitted]
The new value of x and the residual are obtained from the following formulas:
$x_k = x_{k-1} + s_k p_k$
$r_k = r_{k-1} + s_k L p_k$;
The element $q_i$ of the vector Lp corresponding to point i is obtained from the following formula: [formula omitted]
$W_k$ is the neighborhood corresponding to data k; $\lVert w_k \rVert$ is the size of the neighborhood; i is a data point in the neighborhood $W_k$ surrounding data k; $q_i$ is the i-th element of the vector q; $I_i$ is the 3-dimensional vector corresponding to data i, representing the three channels R, G, and B; $p_i$ is the element of the conjugate vector corresponding to data i; $\mu_k$ is a 3-dimensional vector, the mean of the vectors $I_i$ in neighborhood $W_k$; [symbol omitted] is the mean of the conjugate-vector elements $p_i$ over the elements i of neighborhood $W_k$; [symbol omitted] is the 3-dimensional vector corresponding to data k; and [symbol omitted] is the scalar corresponding to data k.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710897053.1A CN107657032A (en) | 2017-09-28 | 2017-09-28 | A kind of internet big data analyzes extracting method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710897053.1A CN107657032A (en) | 2017-09-28 | 2017-09-28 | A kind of internet big data analyzes extracting method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107657032A (en) | 2018-02-02 |
Family
ID=61117070
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710897053.1A Pending CN107657032A (en) | 2017-09-28 | 2017-09-28 | A kind of internet big data analyzes extracting method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107657032A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108694462A (en) * | 2018-03-30 | 2018-10-23 | 北京合享智慧科技有限公司 | Patent value degree evaluation method and computer readable storage medium |
CN108763507A (en) * | 2018-05-30 | 2018-11-06 | 北京百度网讯科技有限公司 | Enterprise's incidence relation method for digging and device |
CN109101531A (en) * | 2018-06-22 | 2018-12-28 | 联想(北京)有限公司 | Document handling method, apparatus and system |
CN109738438A (en) * | 2018-12-29 | 2019-05-10 | 扬州大学 | A kind of Wheat Leaves Senescence degree method for fast measuring |
CN114117292A (en) * | 2021-11-04 | 2022-03-01 | 中通服咨询设计研究院有限公司 | Internet big data analysis and extraction method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110022550A1 (en) * | 2009-07-24 | 2011-01-27 | Yahoo! Inc. | Mixing knowledge sources with auto learning for improved entity extraction |
CN102880647A (en) * | 2012-08-24 | 2013-01-16 | 北京百度网讯科技有限公司 | Method and device for acquiring another name of organization |
CN103763782A (en) * | 2014-01-13 | 2014-04-30 | 西安电子科技大学 | Dispatching method for MU-MIMO down link based on fairness related to weighting users |
CN105760492A (en) * | 2016-02-18 | 2016-07-13 | 广东睿江云计算股份有限公司 | Data analysis method and device |
CN106407357A (en) * | 2016-09-07 | 2017-02-15 | 深圳市中易科技有限责任公司 | Engineering method for developing text data rule model |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108694462A (en) * | 2018-03-30 | 2018-10-23 | 北京合享智慧科技有限公司 | Patent value degree evaluation method and computer readable storage medium |
CN108694462B (en) * | 2018-03-30 | 2022-03-15 | 北京合享智慧科技有限公司 | Patent retrieval result sorting method and computer-readable storage medium |
CN108763507A (en) * | 2018-05-30 | 2018-11-06 | 北京百度网讯科技有限公司 | Enterprise's incidence relation method for digging and device |
US11397778B2 (en) | 2018-05-30 | 2022-07-26 | Beijing Baidu Netcom Service and Technology Co., Ltd. | Method and device for mining an enterprise relationship |
CN109101531A (en) * | 2018-06-22 | 2018-12-28 | 联想(北京)有限公司 | Document handling method, apparatus and system |
CN109101531B (en) * | 2018-06-22 | 2022-05-31 | 联想(北京)有限公司 | File processing method, device and system |
CN109738438A (en) * | 2018-12-29 | 2019-05-10 | 扬州大学 | A kind of Wheat Leaves Senescence degree method for fast measuring |
CN114117292A (en) * | 2021-11-04 | 2022-03-01 | 中通服咨询设计研究院有限公司 | Internet big data analysis and extraction method |
CN114117292B (en) * | 2021-11-04 | 2024-04-16 | 中通服咨询设计研究院有限公司 | Internet big data analysis and extraction method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107657032A (en) | A kind of internet big data analyzes extracting method | |
Ren et al. | Spatially explicit simulation of land use/land cover changes: Current coverage and future prospects | |
Huppmann et al. | The MESSAGEix Integrated Assessment Model and the ix modeling platform (ixmp): An open framework for integrated and cross-cutting analysis of energy, climate, the environment, and sustainable development | |
Jain et al. | Manufacturing data analytics using a virtual factory representation | |
Balci | How to assess the acceptability and credibility of simulation results | |
Zhang et al. | Improving probabilistic load forecasting using quantile regression NN with skip connections | |
CN106407357B (en) | A kind of engineering method of text data rule model exploitation | |
Zhang et al. | Forecasting the transport energy demand based on PLSR method in China | |
CN104572449A (en) | Automatic test method based on case library | |
CN106022614A (en) | Data mining method of neural network based on nearest neighbor clustering | |
CN107918639A (en) | Based on electric power big data main transformer peak load forecasting method and data warehouse | |
Agami et al. | A neural network based dynamic forecasting model for Trend Impact Analysis | |
Feng et al. | [Retracted] Design and Simulation of Human Resource Allocation Model Based on Double‐Cycle Neural Network | |
CN107515822B (en) | Software defect positioning method based on multiple-objection optimization | |
CN104732087A (en) | GIS-based neural-network cellular automaton infectious-disease transmission simulation analysis method | |
CN107402859A (en) | Software function verification system and verification method thereof | |
Saremi et al. | Website structure improvement: Quadratic assignment problem approach and ant colony meta-heuristic technique | |
Mitrovic et al. | Gp cc-opf: Gaussian process based optimization tool for chance-constrained optimal power flow | |
Pang et al. | Early termination strategies with asynchronous parallel optimization in application to automatic calibration of groundwater PDE models | |
Maidantchik et al. | Managing a worldwide software process | |
CN116959249A (en) | City information management platform and method based on CIM | |
Sun et al. | Optimal path for overcoming barriers in developing China’s wind energy industry | |
Schlenzig et al. | MESAP-III: An information and decision support system for energy and environmental planning | |
Islam et al. | An Efficient Framework For Software Maintenance Cost Estimation Using Genetic Hybrid Algorithm: OOPs Prospective | |
Muhammad et al. | A decision process for the applications of artificial intelligence in sustainable operations and supply chain management |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180202 |