CN106960002A

CN106960002A - A cross-domain information extraction method based on a feature model

Info

Publication number: CN106960002A
Application number: CN201710076390.4A
Authority: CN
Inventors: 朱文浩; 姚滕俊; 胡冠男; 金鑫; 周资力
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2017-02-13
Filing date: 2017-02-13
Publication date: 2017-07-18

Abstract

The invention discloses a cross-domain information extraction method based on a feature model. The method has three parts: building a multi-level feature model, feature selection and combination, and a feedback iteration mechanism. It introduces a genetic algorithm together with a feature space generation algorithm that uses a support vector machine (SVM) for cross-validation, and generates, from the atomic features of different domains, a feature model that meets the requirements of the extraction task. This avoids the poor domain adaptability of traditional Web information extraction methods. Extensive numerical experiments show that, compared with similar methods, the method achieves higher accuracy and stability, and that the algorithm itself scales well.

Description

A cross-domain information extraction method based on a feature model

Technical field

The present invention relates to the field of Web information extraction, and in particular to a cross-domain information extraction method based on a feature model.

Background art

Web information extraction uses extraction rules to obtain, from unstructured web page text, the content that satisfies those rules, and then converts that content into a form with clearer structure and semantics (XML, relational data, object-oriented data, and so on) for storage. Web information extraction is not identical to information extraction from plain text: current web pages are generally semi-structured, and one of their key properties is that the presentation of a page varies widely, which makes Web information extraction difficult. Research on cross-domain information extraction for web pages aims to solve this generality problem. Overall, cross-domain information extraction currently faces three main challenges:

1. Massive volumes of semi-structured text

With the rapid development of the Internet industry, the Web has become a huge information repository. According to the 33rd Statistical Report on Internet Development in China, issued by the China Internet Network Information Center (CNNIC), China had 150 billion web pages as of December 2013, an increase of 22.2% over the same period in 2012. The average number of pages per website and the average number of bytes per page both continued to grow, showing that content on the Chinese Internet is becoming richer. Meanwhile, in 2014 Internet users accounted for 44% of China's population and their number was still growing rapidly. The Internet has become the main channel through which people spread and share information on business, education, news, and scientific research.

2. Dynamic web pages

The dynamic nature of web pages means that the style and content of a page are generated dynamically by programs. Early web pages are called static pages: their style and content were assembled on the back end and no longer changed once the text and style had been sent to the browser. Today, dynamic web technologies, chiefly JavaScript, are widely used. JavaScript code can change page layout and style dynamically, and can also change page content dynamically by requesting back-end data. This poses a new problem for cross-domain information extraction research: earlier extraction systems cannot adapt to the real-time changes of current web pages, and once a page changes they can no longer extract information effectively.

3. Heterogeneity of web pages

Heterogeneity of web pages mainly refers to the differences in text style and subject content between pages. Web pages can present information in many forms; even within one website, different pages with the same content may be styled differently. Across different websites, the differences in presentation are greater still. In summary, because page layouts differ between websites and information is presented in varied ways within a single website, heterogeneity is another difficulty for cross-domain information extraction.

Some research groups have already worked on cross-domain information extraction and have developed a small number of tools. These methods each have advantages, but each also has limitations, and none fully meets the needs of Web information extraction. Rule-based methods have a low degree of automation, require a large amount of manual work, are effective only for specific web pages, and generalize poorly. Methods that use machine learning to generate the extraction inference model automatically solve the problem of hand-written rules to some extent, but because they need a large number of training samples, the inference model must still be retrained whenever the website structure or the extraction task changes, sometimes with manual intervention, which makes them hard to deploy in practice. The feature model proposed in the present method adapts well and quickly to domain changes in the information extraction task, and is more general.

Summary of the invention

To solve the above problems, the object of the invention is to provide a cross-domain information extraction method based on a feature model. The method decomposes domain-related features into sub-features that are only weakly related to the domain, and builds a feature model from them. With this model, the matching degree and the discrimination degree between a feature and an information extraction task can be evaluated. Based on the feature model, an information extraction method that adapts quickly to domain changes is proposed. The method uses a feedback iteration mechanism to optimize the inference model, so that for a given information extraction task it can quickly obtain a feature combination suited to the domain and thereby adapt to domain changes.

To achieve this object, the concept of the invention is as follows. Features that are related to a domain are first decomposed, and features are divided into compound features (features composed of one or more atomic features in some form or logic, carrying domain characteristics) and atomic features (independent features that contain no other feature and carry no, or only a few, domain characteristics). Reducing the domain relevance of the features reduces the domain dependence of the information extraction method. Feature modules are then instantiated using feature parameterization methods, and the bottom-level feature blocks are selected and combined one by one to form extraction templates suited to each domain or each extraction target. When the domain changes, the features only need to be logically recombined (for example, feature A is present but feature B is absent) according to the domain-related features of the target domain's text (for example, context relations or writing style) to form feature vectors. An inference model is obtained by training at the same time, and the generated vector space is continually optimized by feedback iteration to achieve better extraction results.

Based on the above concept, the invention adopts the following technical solution:

A cross-domain information extraction method based on a feature model, with the following specific steps:

a. Build a multi-level feature model. The features used in existing information extraction methods are summarized and decomposed into atomic features with low domain dependence, and a multi-level feature model is built according to the degree of decomposition. The features are parameterized with reference to existing feature parameterization methods. Finally, an evaluation system for feature domain adaptability is established for the parameterized features: each feature has an initial fitness value for each domain, and this value serves as the initialization basis for feature selection;

b. Feature selection and combination. From the parameterized results of the features obtained in step a, the domain fitness value of each feature is calculated with a method similar to TF-IDF. Suitable features are selected according to their domain fitness values, and the feature vector space is constructed;

c. Feedback iteration. Using the feature vector space obtained in step b, cross-validation is performed on the training sample set. The extraction performance of the resulting inference model is used as feedback, and according to this feedback a feature selection method based on a genetic algorithm corrects the feature vector space.

In step b, the domain fitness evaluates a feature with two indicators, the feature matching degree and the feature discrimination degree. They are calculated as follows:

b-1. The feature matching degree measures how often a feature matches the extraction target. It is calculated with the following formula:

MD_{i,j} = \frac{n_{i,j}}{|S_{i,j}|}

where n_{i,j} is the number of samples in sample set j that feature i matches correctly, |S_{i,j}| is the total number of samples in sample set j that feature i matches, and MD_{i,j} is the matching degree of feature i in sample set j;

b-2. The feature discrimination degree reflects how frequently samples containing a feature occur in the sample set. It is calculated with the following formula:

DD_{i} = \lg \frac{|S|}{|\{j : f_{i} \in s_{j}\}|}

where |S| is the total number of samples in the sample set, |\{j : f_{i} \in s_{j}\}| is the number of samples s_{j} that contain feature i, and DD_{i} is the discrimination degree of feature i, derived from how frequently samples containing feature i occur in the sample set;

b-3. The domain fitness of feature i is calculated as MD_{i,j} \cdot DD_{i}.
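As an illustration of steps b-1 to b-3, the following minimal Python sketch computes the two indicators and their product. The function names are hypothetical, `lg` is assumed to mean the base-10 logarithm, and returning 0.0 when a denominator would be zero is an assumption that the text does not address.

```python
import math


def matching_degree(n_correct, n_matched):
    """MD_{i,j}: correctly matched samples divided by all samples the feature matched."""
    # Assumption: a feature that matched nothing gets a matching degree of 0.0.
    return n_correct / n_matched if n_matched else 0.0


def discrimination_degree(total_samples, samples_with_feature):
    """DD_i = lg(|S| / number of samples containing the feature), base-10 logarithm."""
    # Assumption: a feature that occurs in no sample gets a discrimination degree of 0.0.
    if samples_with_feature == 0:
        return 0.0
    return math.log10(total_samples / samples_with_feature)


def domain_fitness(n_correct, n_matched, total_samples, samples_with_feature):
    """Domain fitness of a feature: MD_{i,j} * DD_i."""
    return (matching_degree(n_correct, n_matched)
            * discrimination_degree(total_samples, samples_with_feature))
```

For example, a feature that matches 10 samples, 8 of them correctly, and that occurs in 10 of 100 samples has a matching degree of 0.8 and a discrimination degree of lg(10) = 1, so its domain fitness is 0.8. A feature that occurs in every sample has a discrimination degree of lg(1) = 0, which mirrors the IDF term of TF-IDF.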

In step b, the feature vector space is constructed as follows:

A semi-random, semi-guided construction method is used. Half of the individuals in the initial feature vector space are generated randomly, which preserves the global optimality of the result. The other half are generated under manual intervention so that they select, as far as possible, candidate features with high domain fitness values, which optimizes the initial feature vector space. The manual intervention follows the roulette wheel selection operation proposed by Holland: the selection probability of a feature is determined by its share of the total domain fitness of all features. The probability that feature i is selected is calculated as follows:

P_{i} = \frac{F_{i}}{\sum_{k=1}^{20} F_{k}}

where P_{i} is the probability that feature i is selected and F_{i} is the domain fitness value of feature i.
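The semi-random, semi-guided initialization can be sketched in Python as follows. This is only a sketch under stated assumptions: the function names, the per-feature probability `p_one` used for the random half, and the number of roulette spins `draws` used for the guided half are all hypothetical, and fitness values are assumed to be positive.

```python
import random


def selection_probabilities(fitness):
    """P_i = F_i / sum_k F_k for every feature i (fitness values assumed positive)."""
    total = sum(fitness)
    return [value / total for value in fitness]


def roulette_pick(fitness, rng):
    """Pick one feature index with probability proportional to its fitness."""
    threshold = rng.random() * sum(fitness)
    cumulative = 0.0
    for index, value in enumerate(fitness):
        cumulative += value
        if threshold < cumulative:
            return index
    return len(fitness) - 1  # guard against floating-point round-off


def init_population(fitness, size, p_one, draws, rng):
    """Half random individuals, half individuals biased toward high-fitness features."""
    n_features = len(fitness)
    population = []
    for k in range(size):
        if k < size // 2:
            # Random half: each feature is switched on with probability p_one.
            individual = [1 if rng.random() < p_one else 0 for _ in range(n_features)]
        else:
            # Guided half: `draws` roulette spins decide which features are switched on.
            individual = [0] * n_features
            for _ in range(draws):
                individual[roulette_pick(fitness, rng)] = 1
        population.append(individual)
    return population
```

Passing a seeded `random.Random` instance as `rng` makes the initialization reproducible. Because several spins can land on the same feature, a guided individual selects at most `draws` features.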

In step c, the feedback iteration based on the genetic algorithm is specifically as follows:

c-1. The fitness value of each feature is adjusted according to the return values of the feature vector fitness function in each generation of the population, which provides a basis for the genetic operations on the next round of feature vectors. The domain fitness value of a feature is the average of the fitness function return values of all feature vectors in which the feature appears. It is calculated with the following formula:

F_{j} = \frac{\sum_{i=1}^{m} f(G_{i}) \cdot G_{i,j}}{\sum_{i=1}^{m} G_{i,j}}

where F_{j} is the domain fitness value of feature j after feedback, f(G_{i}) is the fitness function return value of feature vector G_{i}, G_{i,j} is the value (0 or 1) of the j-th feature in feature vector i, and m is the maximum number of individuals in the population;

c-2. The fitness value of each feature is adjusted according to the number of times the feature appears in the optimal feature vector of each round of iteration, which provides a basis for the genetic operations on the next round of feature vectors. The domain fitness value of a feature is the total number of times it appears in the optimal feature vector of each round, divided by the current number of iteration rounds. It is calculated with the following formula:

H_{t,j} = \frac{\sum_{k=1}^{t} B_{k,j}}{t}

where H_{t,j} is the domain fitness value of feature j fed back after the t-th round of iteration, t is the current number of iteration rounds, and B_{k,j} is the value (0 or 1) of the j-th feature in the optimal feature vector of round k.
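The two feedback indicators of step c can be illustrated with a short Python sketch. The function names are hypothetical, and assigning 0.0 to a feature that no vector selects is an assumption made to avoid a division by zero.

```python
def feedback_from_population(population, scores):
    """F_j: mean fitness score of the feature vectors in which feature j is selected."""
    n_features = len(population[0])
    fitness = []
    for j in range(n_features):
        used = sum(vector[j] for vector in population)
        weighted = sum(score * vector[j] for vector, score in zip(population, scores))
        # Assumption: a feature selected by no vector gets 0.0 instead of dividing by zero.
        fitness.append(weighted / used if used else 0.0)
    return fitness


def feedback_from_best_vectors(best_vectors):
    """H_{t,j}: fraction of the t rounds so far whose best vector selected feature j."""
    t = len(best_vectors)
    n_features = len(best_vectors[0])
    return [sum(best[j] for best in best_vectors) / t for j in range(n_features)]
```

Here `population` is a list of 0-1 feature vectors, `scores` holds the fitness function return value of each vector, and `best_vectors` holds the optimal feature vector of each completed round in order.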

In step c, the feature selection method based on the genetic algorithm uses a direct ranking selection algorithm, as follows:

The candidate feature vectors from before and after crossover and mutation are mixed into one set and sorted in descending order of quality, and the top-ranked half are kept for the next generation. The feature vectors are not evaluated by the return value of the fitness function; instead, the historical performance index of a feature vector serves as its evaluation criterion. The historical performance index is the average, over all selected features in the feature vector, of the number of times each feature appeared in the optimal feature vectors of previous iterations. It is calculated with the following formula:

HE_{i} = \frac{\sum_{j=1}^{20} N_{i,j} \cdot G_{i,j}}{\sum_{j=1}^{20} G_{i,j}}

where HE_{i} is the historical performance index of feature vector i, N_{i,j} is the number of times the j-th feature of feature vector i has appeared in the optimal feature vectors of previous rounds, and G_{i,j} is the value (0 or 1) of the j-th feature in feature vector i.
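A minimal Python sketch of this ranking selection is given below. The function names are hypothetical, `best_counts[j]` is assumed to hold the number of previous rounds in which feature j appeared in the optimal feature vector, and a vector that selects no feature is assumed to score 0.0.

```python
def historical_performance(vector, best_counts):
    """HE_i: mean count of past best-vector appearances over the selected features."""
    selected = sum(vector)
    if selected == 0:
        return 0.0  # assumption: an empty feature vector gets the lowest score
    return sum(count * bit for count, bit in zip(best_counts, vector)) / selected


def select_next_generation(parents, offspring, best_counts):
    """Mix parents and offspring, rank by HE_i in descending order, keep the top half."""
    pool = parents + offspring
    ranked = sorted(pool,
                    key=lambda vector: historical_performance(vector, best_counts),
                    reverse=True)
    return ranked[: len(pool) // 2]
```

With `best_counts = [4, 0, 2]`, the vector `[1, 0, 1]` scores (4 + 2) / 2 = 3.0, while `[1, 1, 1]` scores 6 / 3 = 2.0, so a vector is penalized for carrying features that rarely appeared in past optimal vectors.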

Compared with the prior art, the method of the invention has the following prominent substantive features and notable advantages:

The method builds its feature model by decomposing domain-related features into sub-features that are weakly related to the domain, which avoids the excessive domain dependence of traditional Web information extraction methods. The method achieves high precision and is also time-efficient. It is flexibly extensible: given the features of a different domain, it can quickly adapt to that domain's information extraction task.

Brief description of the drawings

Fig. 1 is a schematic diagram of the framework for cross-domain information extraction based on a feature model.

Fig. 2 is a schematic diagram of the SVM-based evaluation process of the feature vector fitness function.

Fig. 3 is a comparison chart of information extraction across different content within the same website.

Fig. 4 shows the experimental results of cross-website information extraction for the same extraction content.

Fig. 5 shows the experimental results of information extraction across website types for the same extraction content.

Detailed description of the embodiments

The preferred embodiments of the invention are described further below with reference to the drawings.

As shown in Fig. 1, the cross-domain information extraction method based on a feature model has three main parts: building the multi-level feature model, feature selection and combination, and the feedback iteration mechanism. The model-building stage is further divided into three blocks: the multi-level feature model, the feature parameterization methods, and the evaluation system for the feature parameterization methods. The feature selection and combination stage has three parts: the feature parameterization results, the calculation of the feature matching degree, and the construction of the feature space. The feedback iteration mechanism draws on the classical genetic algorithm to optimize the feature space.

The feature selection algorithm borrows the idea of the genetic algorithm. Its main steps are as follows:

C1. Select the initial features according to the feature matching degree index, determine the dimension N of the feature vector, and combine the features randomly to form the initial feature space F0;

C2. Start the i-th iteration (i starts from 0); the feature space of this generation is Fi;

C3. Randomly generate a training sample set Si and a test sample set Ti from the samples;

C4. Perform machine learning with Si and Fi to obtain the model Mi;

C5. Evaluate the model Mi with Ti;

C6. If Mi is better than the current best result, record Fmax = Fi;

C7. If the stop condition is met, stop iterating and output Fmax;

C8. Recombine the features by natural selection, mutation, crossover, and similar means to form the new feature space Fi+1;

C9. Go to step C2 and start a new iteration.

Because a genetic algorithm is used as the iteration engine, the method can effectively obtain a feature combination adapted to the application domain.
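The loop formed by steps C1 to C9 can be sketched in Python as follows. This is a simplified illustration, not the exact procedure of the embodiment: the function `evolve` and its parameters are hypothetical, the callback `evaluate` stands in for steps C3 to C5 (splitting the samples, training the model, and scoring it), a fixed number of generations serves as the stop condition, and one-point crossover with a single bit flip is used for brevity.

```python
import random


def evolve(n_features, evaluate, pop_size=8, generations=20, seed=0):
    """Return the best 0-1 feature vector found and its score."""
    rng = random.Random(seed)
    # C1: form the initial feature space F0 by random combination.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best_vector, best_score = None, float("-inf")
    for _ in range(generations):  # C2, C7: a fixed budget acts as the stop condition
        # C3-C5: evaluate() is assumed to train and test a model for each vector.
        scored = [(evaluate(vector), vector) for vector in population]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:  # C6: record the best feature vector so far
            best_score, best_vector = scored[0][0], scored[0][1][:]
        # C8: selection keeps the better half; crossover and mutation refill the rest.
        survivors = [vector for _, vector in scored[: pop_size // 2]]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]
            flip = rng.randrange(n_features)
            child[flip] = 1 - child[flip]
            children.append(child)
        population = survivors + children  # C9: start the next iteration
    return best_vector, best_score
```

The sketch requires at least two features and a population of at least four individuals. Any scoring function can be passed as `evaluate`, for instance one that returns the cross-validated accuracy of a classifier trained on the selected features.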

Population initialization: in a genetic algorithm the number of individuals is normally chosen between 30 and 160, usually about 4 times the dimension of an individual. Based on experimental analysis, this method fixes the number of individuals at 80.

The initial vector dimension is normally between 10 and 100. The optimization results of many feature selection algorithms reported in China and abroad show that the final optimized feature vector space rarely exceeds 30 dimensions and usually clusters around 15. The feature adaptability test results for this method show that, for most information extraction tasks, no more than 20 features have high fitness values. Therefore, at domain initialization, the 20 features rated highest in the domain fitness evaluation for the current extraction task are selected as the 20 dimensions of the feature vector. Choosing 20 dimensions neither produces an excessively high feature matching that leads to overfitting and local optima, nor includes so many features that convergence and algorithm efficiency suffer.

The feedback mechanism evaluates the effect of features from two aspects, with the following indicators:

E1. The fitness value of each feature is adjusted according to the return values of the feature vector fitness function in each generation of the population, which provides a basis for the genetic operations on the next round of feature vectors. The domain fitness value of a feature is the average of the fitness function return values of all feature vectors in which the feature appears.

E2. The fitness value of each feature is adjusted according to the number of times the feature appears in the optimal feature vector of each round of iteration, which provides a basis for the genetic operations on the next round of feature vectors. The domain fitness value of a feature is the total number of times it appears in the optimal feature vector of each round, divided by the current number of iteration rounds.

Machine learning cross-validation: the invention uses a support vector machine (SVM) as the inference model for information extraction. The accuracy of a feature combination, obtained by cross-validation, is returned to the genetic algorithm as its fitness value, and the subsequent genetic operations are carried out according to that fitness value. The main process is shown in Fig. 2. For an individual feature vector Fi in the population and the training sample set S, the feature parameterization function corresponding to each feature bit whose value is 1 is called, giving the value that the feature returns on each sample. In the figure, for example, the feature vector contains five 1s, corresponding to feature 1, feature 5, feature 11, feature 13, and feature 16. The number of rows of the input matrix is then the number of samples in the sample set, and the number of columns is 5 + 1 = 6, where the first column is the sample label that distinguishes positive from negative samples. The SVM is called on the input matrix with cross-validation as the inference model of the information extraction, and the output accuracy is the fitness value input to the genetic operations.
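The construction of the input matrix can be illustrated with the following Python sketch. The function name and the representation of feature parameterization functions as plain Python callables are assumptions; the SVM call itself is omitted.

```python
def build_input_matrix(feature_vector, feature_functions, samples, labels):
    """Build one row per sample: the label first, then one column per selected feature."""
    # Indices of the feature bits whose value is 1.
    selected = [j for j, bit in enumerate(feature_vector) if bit == 1]
    matrix = []
    for sample, label in zip(samples, labels):
        # Call the parameterization function of every selected feature on this sample.
        row = [label] + [feature_functions[j](sample) for j in selected]
        matrix.append(row)
    return matrix
```

For a feature vector with five 1s, every row has 5 + 1 = 6 entries, which matches the column count described above.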

In this embodiment, since run time is not a concern, the experiments on the cross-domain information extraction method based on a feature model were carried out on a PC with a quad-core CPU (Intel Core i5-321M, 2.50 GHz) and 8 GB of memory.

For testing, a data set of 533 web pages from eight different websites was selected.

The eight source websites, their URLs, and their page counts are shown in the following table:

The specific steps of the experiment on this test set are as follows:

1. Build the parameterization methods for N features; under a given extraction target, each method has a corresponding initial fitness value. According to the extraction target, select the 20 features with the highest fitness values.

2. Initialize an 80×20 0-1 matrix F of feature vectors. The 80 rows represent the 80 chromosomes of the genetic algorithm, and the 20 columns represent the 20 genes in each chromosome. For each row, F(i, j) = 1 means that the j-th feature is selected. To improve the efficiency of the algorithm, the inventors consider that as few features as possible should be selected, so at initialization the probability of a 1 appearing in each row is set to P ≤ 0.4.

3. For the feature vector F(i, *) of each row, initialize and build an S×n matrix R, where S is the number of samples and n is the number of selected features (the number of 1s) in F(i, *). If F(i, j) = 1, then for the a-th sample in the web page sample set, call the j-th feature parameterization method and store the returned value in R(a, j). When the loop ends, the SVM input matrix R is obtained.

4. For the matrix R corresponding to the i-th row feature vector, add a column of labels (1 or -1) at the front, where 1 marks a positive sample and -1 marks a negative sample. Then call the libsvm package to perform cross-validation (with a 3:7 split) and obtain the extraction accuracy of the feature vector. Repeat this 5 times and take the average as the fitness value (fitness score) of the feature vector for this round. Create a one-dimensional array S[80] and save the fitness return value of the feature vector in S[i].

5. If the stop condition (for example, the maximum number of iterations) is reached, stop iterating and output the optimal feature vector and its fitness value.

6. Selection: sort the array S[80] in descending order and keep the feature vector corresponding to the first, maximum value S[0]; this retained optimal solution does not undergo mutation or crossover. Delete the feature vector corresponding to the last, minimum value S[N] and replace it with a copy of the first, optimal one.

7. Crossover: set the crossover probability to 0.4 (that is, 32 of the 80 chromosomes are chosen for pairwise crossover). Randomly select two chromosomes and perform a two-point crossover (if the selected points are 7 and 13, genes 7 to 13 of the first chromosome are exchanged with genes 7 to 13 of the second chromosome);

8. Mutation: set the mutation probability to 0.1 (that is, 8 of the 80 chromosomes are chosen for mutation) and the mutation factor to 0.05 (that is, 1 of the 20 genes in each chosen chromosome mutates). The mutation used here is a 0-1 (bit-flip) mutation.

9. Return to step 3.
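The matrix initialization of step 2 and the genetic operators of steps 7 and 8 can be sketched in Python as follows. The function names are hypothetical, gene positions are assumed to be 1-based and inclusive, and the mutation is modeled as an independent flip of each gene with probability `gene_rate`, which gives one flipped gene out of 20 on average at a rate of 0.05.

```python
import random


def init_matrix(rows, cols, p_one, rng):
    """Step 2: a rows x cols 0-1 matrix in which each entry is 1 with probability p_one."""
    return [[1 if rng.random() < p_one else 0 for _ in range(cols)]
            for _ in range(rows)]


def two_point_crossover(a, b, start, end):
    """Step 7: swap genes start..end (1-based, inclusive) between two chromosomes."""
    lo, hi = start - 1, end
    child_a = a[:lo] + b[lo:hi] + a[hi:]
    child_b = b[:lo] + a[lo:hi] + b[hi:]
    return child_a, child_b


def flip_mutation(chromosome, rng, gene_rate=0.05):
    """Step 8: flip each gene (0 to 1 or 1 to 0) independently with probability gene_rate."""
    return [1 - gene if rng.random() < gene_rate else gene for gene in chromosome]
```

With crossover points 7 and 13, `two_point_crossover` exchanges seven genes, so crossing an all-zero chromosome with an all-one chromosome yields children that contain 7 and 13 ones respectively.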

The extraction accuracy and generality of the method were evaluated with the results of three different extraction tasks: Fig. 3 shows the comparison of information extraction across different content within the same website; Fig. 4 shows the experimental results of cross-website information extraction for the same extraction content; Fig. 5 shows the experimental results of information extraction across website types for the same extraction content. For all of these extraction task types, the method obtained good experimental results.

Claims

1. A cross-domain information extraction method based on a feature model, characterized in that the specific steps are as follows:

a. Build a multi-level feature model: the features used in existing information extraction methods are summarized and decomposed into atomic features with low domain dependence, and a multi-level feature model is built according to the degree of decomposition; the features are parameterized with reference to existing feature parameterization methods; finally, an evaluation system for feature domain adaptability is established for the parameterized features, that is, each feature has an initial fitness value for each domain, and this value serves as the initialization basis for feature selection;

b. Feature selection and combination: from the parameterized results of the features obtained in step a, the domain fitness value of each feature is calculated with a method similar to TF-IDF, suitable features are selected according to their domain fitness values, and the feature vector space is constructed;

c. Feedback iteration: using the feature vector space obtained in step b, cross-validation is performed on the training sample set, the extraction performance of the resulting inference model is used as feedback, and according to this feedback a feature selection method based on a genetic algorithm corrects the feature vector space.

2. The cross-domain information extraction method based on a feature model according to claim 1, characterized in that the domain fitness in step b evaluates a feature with two indicators, the feature matching degree and the feature discrimination degree, calculated as follows:

b-1. The feature matching degree measures how often a feature matches the extraction target, and is calculated with the following formula:

MD_{i,j} = \frac{n_{i,j}}{|S_{i,j}|}

where n_{i,j} is the number of samples in sample set j that feature i matches correctly, |S_{i,j}| is the total number of samples in sample set j that feature i matches, and MD_{i,j} is the matching degree of feature i in sample set j;

b-2. The feature discrimination degree reflects how frequently samples containing a feature occur in the sample set, and is calculated with the following formula:

DD_{i} = \lg \frac{|S|}{|\{j : f_{i} \in s_{j}\}|}

where |S| is the total number of samples in the sample set, |\{j : f_{i} \in s_{j}\}| is the number of samples s_{j} that contain feature i, and DD_{i} is the discrimination degree of feature i, derived from how frequently samples containing feature i occur in the sample set;

b-3. The domain fitness of feature i is calculated as MD_{i,j} \cdot DD_{i}.

3. The cross-domain information extraction method based on a feature model according to claim 1, characterized in that the feature vector space in step b is constructed as follows:

A semi-random, semi-guided construction method is used: half of the individuals in the initial feature vector space are generated randomly, which preserves the global optimality of the result; the other half are generated under manual intervention so that they select, as far as possible, candidate features with high domain fitness values, which optimizes the initial feature vector space; the manual intervention follows the roulette wheel selection operation proposed by Holland, in which the selection probability of a feature is determined by its share of the total domain fitness of all features, and the probability that feature i is selected is calculated as follows:

P_{i} = \frac{F_{i}}{\sum_{k=1}^{20} F_{k}}

4. The cross-domain information extraction method based on a feature model according to claim 1, characterized in that the feedback iteration based on the genetic algorithm in step c is specifically as follows:

c-1. The fitness value of each feature is adjusted according to the return values of the feature vector fitness function in each generation of the population, which provides a basis for the genetic operations on the next round of feature vectors; the domain fitness value of a feature is the average of the fitness function return values of all feature vectors in which the feature appears, and is calculated with the following formula:

F_{j} = \frac{\sum_{i=1}^{m} f(G_{i}) \cdot G_{i,j}}{\sum_{i=1}^{m} G_{i,j}}

where F_{j} is the domain fitness value of feature j after feedback, f(G_{i}) is the fitness function return value of feature vector G_{i}, G_{i,j} is the value (0 or 1) of the j-th feature in feature vector i, and m is the maximum number of individuals in the population;

c-2. The fitness value of each feature is adjusted according to the number of times the feature appears in the optimal feature vector of each round of iteration, which provides a basis for the genetic operations on the next round of feature vectors; the domain fitness value of a feature is the total number of times it appears in the optimal feature vector of each round, divided by the current number of iteration rounds, and is calculated with the following formula:

H_{t,j} = \frac{\sum_{k=1}^{t} B_{k,j}}{t}

5. The cross-domain information extraction method based on a feature model according to claim 1, characterized in that the feature selection method based on the genetic algorithm in step c uses a direct ranking selection algorithm, as follows:

The candidate feature vectors from before and after crossover and mutation are mixed into one set and sorted in descending order of quality, and the top-ranked half are kept for the next generation; the feature vectors are not evaluated by the return value of the fitness function; instead, the historical performance index of a feature vector serves as its evaluation criterion; the historical performance index is the average, over all selected features in the feature vector, of the number of times each feature appeared in the optimal feature vectors of previous iterations, and is calculated with the following formula:

HE_{i} = \frac{\sum_{j=1}^{20} N_{i,j} \cdot G_{i,j}}{\sum_{j=1}^{20} G_{i,j}}

where HE_{i} is the historical performance index of feature vector i, N_{i,j} is the number of times the j-th feature of feature vector i has appeared in the optimal feature vectors of previous rounds, and G_{i,j} is the value (0 or 1) of the j-th feature in feature vector i.