CN104408173B - A kernel keyword extraction method based on a B2B platform
- Publication number
- CN104408173B CN104408173B CN201410765503.8A CN201410765503A CN104408173B CN 104408173 B CN104408173 B CN 104408173B CN 201410765503 A CN201410765503 A CN 201410765503A CN 104408173 B CN104408173 B CN 104408173B
- Authority
- CN
- China
- Prior art keywords
- word
- product
- name
- dictionary
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a kernel keyword extraction method based on a B2B platform that extracts kernel keywords from English product names on the basis of English grammar and semantics. The method provides clear advantages in concurrent computation over big data, in converting the various tenses of English words to their base forms, and in rule-based word processing with self-learning.
Description
Technical field
The present invention relates to a kernel keyword extraction method based on a B2B platform.
Background technology
E-commerce has by now accumulated a massive amount of information and a large number of users, including visitors, traders and information providers; highly duplicated information occupies a large amount of server resources.
When a search engine performs a keyword search, the keyword must be submitted to the server; the server searches the mass data for the keyword and returns a group of relevant information as the search result. Highly concurrent searches can place a heavy load on the server. The quality of the keyword strongly affects both search efficiency (search speed) and search quality (the relevance of the results). It is therefore desirable to establish a method that extracts kernel keywords automatically: the keyword (combined with other data) is put through a series of filtering, word segmentation, matching and recombination steps to derive kernel keywords, and the server then searches on the kernel keywords, improving both the efficiency and the quality of the search.
The keywords and the batch of high-quality related terms that a product information supplier sets for its products reflect the product attributes accurately and comprehensively, and are therefore very helpful. In theory, after the supplier-set keywords, related terms and product names are processed with segmentation algorithms, stemming algorithms, word recombination algorithms and the like, valuable words can be extracted and indexed, and kernel keywords can finally be derived.
Existing domestic segmentation methods are mostly simplistic. In particular, for automatic extraction of English kernel keywords they only match consecutive single words; they cannot match consecutive phrases or non-consecutive words, and so easily miss many valuable kernel keywords. For example:
Chinese patent CN200710122439.1 describes a word segmentation system and method that splits a character string using segmentation markers, then identifies consecutive single words in the mechanical segmentation result, and finally extracts core words. However, this processing can lose some core valuable words, and matching the re-split character strings by mechanical segmentation is very inefficient on large data volumes.
Chinese patent CN200910083775.9 describes a word segmentation method and a text retrieval method. It creates a new segmentation system based on database feature items, adds the feature items to the new system, and uses them as a vocabulary to segment the query words submitted by users and generate a segmentation result set. The method segments selected database fields as feature items, exploiting the association between database feature items and the text in the database, and effectively improves the segmentation accuracy of traditional unigram, bigram and preset-vocabulary methods. However, it performs segmentation against a dictionary built from a preset vocabulary, which is ineffective when processing English words, and it does not address stem (base-form) extraction.
Accurate segmentation and accurate stem matching of English words are essential both to the automatic extraction of English kernel keywords from mass data and to improving the efficiency and quality of searching that data.
Summary of the invention
Object of the invention: to overcome the deficiencies of the prior art, the present invention provides a kernel keyword extraction method based on a B2B platform that extracts kernel keywords from English product names on the basis of English grammar and semantics.
Technical scheme: to achieve the above object, the present invention adopts the following technical solution:
A kernel keyword extraction method based on a B2B platform comprises the following steps:
(1) Take the user-set product names, search words and industry hot words on the B2B platform as dictionary sources; preprocess the dictionary sources and save them in the data mart to form the product-name core word bank. The dictionary sources are preprocessed as follows:
For user-set product names, apply the principle that high-frequency user-set product names are used first, and reject user-set product names with low usage counts; then save the user-set keywords of the remaining user-set product names in the user-set keyword bank.
For search words, first filter out stop words, including punctuation and special symbols; then apply the principle that high-frequency search words are used first, and reject search words used infrequently in the last six months; then preprocess them with the core word-segmentation processor to form search keywords, which are saved in the search high-frequency dictionary.
For industry hot words, grouped by industry, first filter out stop words, including punctuation and special symbols; then apply the principle that high-frequency industry hot words are used first, and reject industry hot words with low usage counts; then preprocess them with the core word-segmentation processor to form industry hot keywords, which are saved in the industry high-frequency dictionary.
(2) For all valid product names on the current site, first filter out stop words, including punctuation and special symbols; then preprocess them with the core word-segmentation processor and save the resulting product names in the product high-frequency dictionary.
(3) Match the product names in the product high-frequency dictionary against the product-name core word bank; deduplicate the matched product names, output one record per product name in order of first appearance within the product name, and save the records in the data mart as the product name's kernel keywords. The matching rules are:
1. a search keyword appears in the product name, and that search keyword is a user-set keyword;
2. a search keyword appears in the product name, and that search keyword is an industry hot keyword.
A search keyword that appears in a product name and satisfies either of the above matching rules is defined as a kernel keyword of that product name.
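The two matching rules can be illustrated with a minimal sketch. This is a simplification, not the patented implementation: keywords are treated as single lower-cased tokens, and the function name and parameters are hypothetical.

```python
def extract_kernel_keywords(product_name, search_keywords,
                            user_keywords, industry_keywords):
    """Return the kernel keywords of a product name, in order of first
    appearance, applying matching rules (1) and (2) of step (3):
    a token qualifies if it is a search keyword AND either a user-set
    keyword or an industry hot keyword."""
    found = []
    for token in product_name.lower().split():
        if token in search_keywords and (
                token in user_keywords or token in industry_keywords):
            if token not in found:          # deduplicate, keep first-seen order
                found.append(token)
    return found
```

For example, `extract_kernel_keywords("Collapsible Silicone Lunch Box", {"silicone", "lunch"}, {"silicone"}, {"lunch"})` yields `["silicone", "lunch"]`: both tokens are search keywords, one matched via rule 1 and the other via rule 2.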
In step (2), the product high-frequency dictionary is derived from product information, and the industry high-frequency dictionary contains industry information; the product information must be correlated by the product association processor. Product information comprises product IDs and product keywords; industry information comprises industry IDs and industry hot keywords.
The product keywords corresponding to each product name are classified by industry type in the following steps:
(21) Match the product keywords against the industry hot keywords by word matching, and determine the industry category of the product from the features they have in common.
(22) According to the determined industry category, output the product keywords to the synonym corpus and expand them with the words they have in common with the synonym corpus.
(23) First reject product keywords not found in the dictionary, then output uncommon and unmatchable product keywords to the learning database, and output the remaining product keywords to the core word-segmentation processor for preprocessing.
The core word-segmentation processor comprises a word-segmentation processor, an affix processor, a root processor, a singular/plural processor, a tense processor, a similarity processor, a word recombination module, a keyword index bank and a learning database, where:
The word-segmentation processor splits English product names on spaces, performs segmentation by word and by phrase, combines the results into <product name, keyword> sequences, and sorts them by product ID.
The affix processor takes the data produced by the word-segmentation processor, removes the prefixes/suffixes of each word, converts other word forms into nouns (or derivatives into nouns), and matches the resulting nouns against the dictionary. Words that cannot be matched against the dictionary are output to the learning database; for words that can be matched, the <product name, keyword> sequence is updated.
The root processor takes the data produced by the affix processor, extracts roots according to the root algorithm based on each word's part of speech, and matches the extracted roots against the dictionary. Words that cannot be matched against the dictionary are output to the learning database; for words that can be matched, the <product name, keyword> sequence is updated.
The singular/plural processor takes the data produced by the root processor, performs singular/plural processing to convert words to their base forms, and updates the <product name, keyword> sequence.
The tense processor takes the data produced by the singular/plural processor, performs tense processing to convert words to their base forms, and updates the <product name, keyword> sequence.
The similarity processor, when a matched word has two or more meanings, computes the meaning with the maximum similarity.
The word recombination module takes the data produced by the tense processor. It first rejects misspelled words through word-formation dictionary checks, word-shape distance computed as minimum edit distance, and adjacent-key rules; then, through the learning database, it supplies correctly spelled words, assembles the data into the proper data structure and stores it in a cache; finally it indexes the cached data by industry type and outputs it to the kernel keyword index bank.
The keyword index bank builds a kernel keyword index text file from the cached data. At the same time, it builds an industry core-word index text file for the industry core words output by the word recombination module, and a search core-word index text file for the search core words output by the word recombination module. The high-frequency words in the search core word bank constitute the search high-frequency dictionary described above; the high-frequency words in the industry core word bank constitute the industry high-frequency dictionary described above; and the high-frequency words in the keyword bank constitute the product-name high-frequency dictionary described above.
The learning database comprises four essential parts: a learner, a knowledge base, an executor and a scorer. When the affix processor, root processor, product association processor or word recombination module outputs data to the learning database, the data first enters the learner. The learner learns from the input data using the knowledge in the knowledge base: it first establishes a group of rules, then computes rule weights and variable weights, and outputs the established rules and computed quantities to the knowledge base. The knowledge base subjects the input data to a series of reasoning processes to obtain knowledge, where knowledge means a set of rule algorithms. If an obtained algorithm already exists in the knowledge base, the update condition of the knowledge base is checked: if it is met, the knowledge base is updated; otherwise the data is returned to the learner. The executor executes the knowledge obtained by the knowledge base, and the scorer scores the executor's results; if the score is qualified, the knowledge meets the condition for updating the knowledge base.
The word-segmentation processor splits English product names on spaces in the following steps:
1. Split the product name into words on spaces.
2. Remove stop words, including punctuation and special symbols, and number the remaining words 0, 1, 2, ..., N.
3. For the n-th word, match the n-th word against the (n+i)-th word: if the n-th and (n+i)-th words form a phrase, set n = n+1, until n = N; otherwise the n-th and (n+i)-th words are single words, and i = i+1, until n+i = N; with n = 0, 1, 2, ..., N and i = 1, 2, ....
Beneficial effects: compared with the prior art, the kernel keyword extraction method based on a B2B platform provided by the present invention has the following advantages:
1. A clear advantage in concurrent computation over big data: a distributed in-memory database provides users with high-performance, highly available and scalable data computation services; data is distributed to multiple computing service nodes and computed, managed and maintained directly in memory, with a unified external access interface and an optional redundancy/backup mechanism.
2. A clear advantage in converting the various tenses of English words to their base forms: a series of algorithms converts English words in various tenses to their base forms.
3. A clear advantage in rule-based word processing and self-learning: a method of English spelling correction is provided that addresses the characteristics of English words and common misspellings.
Brief description of the drawings
Fig. 1 is a structural block diagram of the learning database;
Fig. 2 is an architecture diagram of the method of the invention;
Fig. 3 is an implementation flowchart of the method of the invention.
Detailed description of the invention
Each component of the core word-segmentation processor is described in detail below.
The word-segmentation processor splits English product names on spaces in the following steps:
1. Split the product name into words on spaces.
2. Remove stop words, including punctuation and special symbols, and number the remaining words 0, 1, 2, ..., N.
3. For the n-th word, match the n-th word against the (n+i)-th word: if the n-th and (n+i)-th words form a phrase, set n = n+1, until n = N; otherwise the n-th and (n+i)-th words are single words, and i = i+1, until n+i = N; with n = 0, 1, 2, ..., N and i = 1, 2, ....
For example, the product name Collapsible Silicone Lunch Box Cooker Food Container is split on spaces into: Collapsible/Silicone/Lunch/Box/Cooker/Food/Container.
Phrase lookup: for the word Collapsible, first judge whether Collapsible Silicone is a phrase. If Collapsible Silicone is a phrase, end this round and start judging Silicone; if Collapsible Silicone is not a phrase, judge whether Collapsible Lunch is a phrase. Applying this rule yields the following table:
Table 1 Word/phrase split result produced by the word-segmentation processor

| Product ID | Word (keyword) | Type |
| --- | --- | --- |
| 1 | Collapsible | 0 |
| 1 | Silicone | 0 |
| 1 | Lunch Box | 1 |
| 1 | Cooker | 0 |
| 1 | Food Container | 1 |

In the Type column, 1 denotes a phrase and 0 a single word; it is assumed here that Lunch Box and Food Container are phrases.
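The segmentation steps above can be sketched as follows. This is an illustrative simplification, not the patented implementation: the phrase lookup is restricted to adjacent word pairs (i = 1), the phrase dictionary is passed in as a set, and the function name is hypothetical.

```python
def segment(product_name, phrase_dict, stop_chars=",.!?&#"):
    """Split an English product name on spaces, drop stop tokens made of
    punctuation/special symbols, then greedily pair adjacent words that
    form a known two-word phrase. Returns (text, type) pairs where
    type 1 is a phrase and type 0 a single word, as in Table 1."""
    words = [w for w in product_name.split()
             if w and not all(c in stop_chars for c in w)]
    result, n = [], 0
    while n < len(words):
        pair = f"{words[n]} {words[n + 1]}" if n + 1 < len(words) else None
        if pair is not None and pair in phrase_dict:
            result.append((pair, 1))        # type 1: phrase
            n += 2
        else:
            result.append((words[n], 0))    # type 0: single word
            n += 1
    return result
```

Run on the example above with the assumed phrase dictionary {"Lunch Box", "Food Container"}, this reproduces Table 1.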
The affix processor takes the data produced by the word-segmentation processor, removes the prefixes/suffixes of each word, and converts other word forms into nouns, or derivatives into nouns; for example, pronounce is converted into pronunciation and explain into explanation. The resulting nouns are matched against the dictionary: words that cannot be matched against the dictionary are output to the learning database; for words that can be matched, the <product name, keyword> sequence is updated.
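A toy version of this conversion can be sketched as follows. The suffix rules and the irregular map are illustrative assumptions, not the patent's actual affix tables; only the pronounce/explain examples come from the text.

```python
# Hypothetical affix processor: an irregular derivation map plus a few
# generic suffix-stripping rules, checked against a noun dictionary.
IRREGULAR = {"pronounce": "pronunciation", "explain": "explanation"}
SUFFIX_RULES = ["ment", "ness", "ing", "er"]   # assumed, not from the patent

def to_noun(word, dictionary):
    """Reduce a word toward a dictionary noun; return None when no
    dictionary entry can be matched (such words go to the learning DB)."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w in dictionary:
        return w
    for suffix in SUFFIX_RULES:
        if w.endswith(suffix):
            candidate = w[: -len(suffix)]
            if candidate in dictionary:
                return candidate
    return None                                # unmatched -> learning database
```

Here `to_noun("cooking", {"cook"})` returns "cook", while an unmatchable token returns None, mirroring the split between dictionary updates and learning-database output.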
The root processor takes the data produced by the affix processor, extracts roots according to the root algorithm based on each word's part of speech, and matches the extracted roots against the dictionary. Words that cannot be matched against the dictionary are output to the learning database; for words that can be matched, the <product name, keyword> sequence is updated.
The singular/plural processor takes the data produced by the root processor, performs singular/plural processing and converts words to their base forms. The conversion proceeds as follows:
1. A general word forms its plural by adding the suffix -s at the end, pronounced [s] after a voiceless consonant and [z] after a voiced consonant or vowel. The plural of such a word is converted to the base form by removing the final letter s; for example, shoes is converted to shoe.
2. A general word ending in s, z, x, ch or sh forms its plural by adding the suffix -es. The plural of such a word is converted to the base form by removing the final es; for example, buses is converted to bus.
3. A general word ending in a consonant plus y forms its plural by changing the final y to i and adding -es. The plural of such a word is converted to the base form by changing the final ies back to y; for example, candies is converted to candy.
4. A general word ending in o forms its plural by adding -es if it is a special word or abbreviation, and by adding -s otherwise. The plural of such a word is first matched to determine whether it is a special word or abbreviation; if so, the final es is removed to obtain the base form, otherwise the final s is removed; for example, tomatoes is converted to tomato.
5. A special dictionary stores words with special singular/plural transformations, such as piano, photo, roof, affix, fish, men and child. In the cases below, a word is first matched against the special dictionary; if it cannot be matched there, it is converted case by case.
6. A general word ending in the consonant f or in fe forms its plural by changing the f or fe to ves. The plural of such a word is converted to the base form by changing the final ves back to f or fe; for example, knives is converted to knife.
7. A general word ending in is forms its plural by changing the is to es. The plural of such a word is converted to the base form by changing the final es back to is; for example, axes is converted to axis.
8. A general word ending in ix forms its plural by changing the ix to ices. The plural of such a word is converted to the base form by changing the final ices back to ix; for example, appendices is converted to appendix.
9. A semantic analyzer handles special words that end in s but are not plural: if analysis shows the word has no plural form, the base form is retained as-is, otherwise the s is removed; for example, goods is interpreted as goods.
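The numbered rules above can be sketched as a single function. This is a simplified illustration under stated assumptions: the special dictionary below contains only a few sample entries (including tomatoes, so that rule 1 can safely strip a bare -s from words like shoes), and rule 7 (axes -> axis) is omitted because -xes is ambiguous without a part-of-speech check.

```python
# Hypothetical special dictionary (rule 5) plus uncountable words (rule 9).
SPECIAL = {"men": "man", "children": "child", "fish": "fish",
           "tomatoes": "tomato",   # rule 4: o-ending word taking -es
           "goods": "goods"}       # rule 9: no singular form, keep as-is

def singularize(word):
    """Reduce an English plural to its base form, following the
    singular/plural processor's numbered rules (simplified)."""
    w = word.lower()
    if w in SPECIAL:
        return SPECIAL[w]
    if w.endswith("ices"):                       # rule 8: appendices -> appendix
        return w[:-4] + "ix"
    if w.endswith("ves"):                        # rule 6: knives -> knife
        return w[:-3] + "fe"
    if w.endswith("ies"):                        # rule 3: candies -> candy
        return w[:-3] + "y"
    for suffix in ("ses", "zes", "xes", "ches", "shes"):
        if w.endswith(suffix):                   # rule 2: buses -> bus
            return w[:-2]
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                            # rule 1: shoes -> shoe
    return w
```

The rule order matters: longer suffixes are checked before the bare -s of rule 1, so that knives does not degrade to "knive".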
The tense processor takes the data produced by the singular/plural processor, performs tense processing and converts words to their base forms. The invention processes the simple present tense and then handles features that change over time, modeled as a tree with branches of various probabilities, divided into the following cases:
1. The simple past tense is processed by an algorithm handling features whose time point is a change in the past.
2. The simple future tense is processed by an algorithm handling features whose time point is a change in the future.
3. The past future tense is processed by an algorithm handling features whose time point is a change in the future as seen from the past.
The invention defines Always, Sometime and Until as temporal-sequence state operators to control the logic: when the program matches the word Always, the remaining content after the Always word is processed; when the program matches Sometime, the context of Sometime is taken out and processed according to the logic; when the program matches Until, it is handled in the same way as Sometime.
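The conversion of inflected verb forms back to their base form can be sketched as follows. This is a minimal illustration assuming only regular -ed/-ing inflections and a dictionary of known base forms; irregular verbs and the Always/Sometime/Until operators are outside its scope.

```python
def to_base_form(verb, dictionary):
    """Map a regularly inflected English verb to its base form by
    stripping -ed/-ing and testing candidate stems against a dictionary
    of known base forms. Irregular verbs would need a lookup table."""
    w = verb.lower()
    if w in dictionary:
        return w
    for suffix in ("ed", "ing"):
        if not w.endswith(suffix):
            continue
        stem = w[: -len(suffix)]
        candidates = [stem, stem + "e"]          # used -> use
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            candidates.append(stem[:-1])         # stopped -> stop (undouble)
        for candidate in candidates:
            if candidate in dictionary:
                return candidate
    return w                                     # unrecognized: keep as-is
```

With a dictionary {"stop", "use", "run"}, "stopped", "used" and "running" reduce to "stop", "use" and "run" respectively.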
The similarity processor, when a matched word has two or more meanings, computes the meaning with the maximum similarity. For example, for the matched word park we cannot tell whether it refers to a public park or a parking lot; the similarity processor determines the word's meaning.
The invention uses a high-dimensional processing rule based on a vector-space word similarity algorithm: a weight matrix is added and feature vectors are extracted, reducing the complexity of the similarity computation and improving efficiency.
Particularly as follows: represent characteristic item with T, refer to the basic language list occurring in text D and text D content can being represented
Position, such text D just can use the incompatible expression of collection of characteristic item T, i.e. D (T1,T2,…,Tk,…,Tn);Such as one text
Have tetra-characteristic items of a, b, c and d, then the text just can be expressed as D (a, b, c, d).For the text containing n characteristic item
For, it will usually give certain weight to each characteristic item and represent its significance level, i.e. D=D (T1w1,T2w2,…,
Tkwk,…,Tnwn), it is abbreviated as D=DW=D (w1,w2,…,wk,…,wn), D is referred to as the vector representation of text D, wkRepresent Tk
Weight.In the example above, it is assumed that the weight of tetra-characteristic items of a, b, c and d is respectively 30,20,20 and 10, then give literary composition
This vector representation is D=D (30,20,20,10).In vector space model, two text D1And D2Between content degree of association
Sim(D1,D2) represent with the cosine of the angle between vector, formula is:
The larger the value of Sim(D1, D2), the greater the similarity between D1 and D2.
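As a worked sketch of the cosine measure just described (plain Python, treating each text as its term-weight vector):

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)

# The worked example from the text: D = (30, 20, 20, 10) for features a, b, c, d.
d1 = [30, 20, 20, 10]
d2 = [30, 20, 20, 10]
print(cosine_similarity(d1, d2))  # identical vectors -> 1.0
```

Two identical vectors give similarity 1.0; orthogonal vectors (no shared weighted features) give 0.0.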
The word recombination module, for the data produced after the tense processor, first performs the word-formation dictionary check, the morphological-distance and smallest-edit-distance calculation and the similar-key-rule processing, rejects misspelled words, provides the correct spellings, and builds an unambiguous misspelling dictionary. During spell checking, the misspelling dictionary is searched; if a match is found, the word is misspelled and error correction is performed on it. The method of building the misspelling dictionary is as follows.
For all words in the training sample, the total number of occurrences is counted; from the number of occurrences and the frequency of each word in the corpus, the prior probability is computed. For a word that does not occur in the corpus, smoothing is applied and the probability is taken as 1/N, where N is the total number of occurrences of all words in the training sample.
The conditional probability is taken as 1/M, where M is the total number of possible candidate words; for example, for the word "light", each guessed word has conditional probability 1/290, where 290 is the number of possible guesses at edit distance 1.
The 26 letters are represented as a matrix, and the distance between letters on the keyboard is computed by the algorithm.
The conditional probability p(D|h) — the probability of the observed input given that word h was intended — is computed using the concept of edit distance, enumerating all possible edits at edit distance 1.
According to Bayes' rule, the posterior probability is independent of the generation probability p(D) of the input, so p(h|D) ∝ P(h) × p(D|h), from which the most probable spelling is computed.
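The prior/conditional-probability scheme above amounts to a Bayesian edit-distance-1 corrector; a minimal sketch follows (the toy corpus is illustrative, not the patent's training sample):

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All candidates at edit distance 1: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Pick argmax P(h)·p(D|h): with a uniform p(D|h) over edit-1 candidates,
    this reduces to the most frequent known candidate."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts] or [word]
    return max(candidates, key=lambda w: counts[w])

corpus_counts = Counter("the light is on the right side of the night sky".split())
print(correct("lignt", corpus_counts))  # -> light
```

Here "lignt" has exactly one known edit-1 neighbour ("light"), so the prior decides among a single candidate; with several candidates the corpus frequencies break the tie.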
When processing English words, there is considerable redundancy, with the following statistical characteristics:
1. the initial letter carries important information about an English word;
2. in most misspellings, the probability that a unique letter in the word is changed is small;
3. the unique consonant sequence is more characteristic than the unique vowel sequence;
4. doubled letters have a relatively high probability of being misspelled; typical errors include transposition (e.g. transposition → tranpsosition) and insertion (e.g. insertion → insertrion).
Based on the above statistical characteristics, the fault-tolerant function is constructed as:
initial letter + unique vowel sequence + unique consonant sequence
Description:
Let the letter set Σ = {'a','b',…,'z','A','B',…,'Z'};
an English word is written L1L2…Lm, where Li (1 ≤ i ≤ m) ∈ Σ and m is the word length;
the vowels are V = {'a','e','i','o','u','A','E','I','O','U'};
the consonants are C = Σ − V;
the first letter of the word is fLetter = L1;
the unique sequence of vowel letters in the word is V_seq and the number of vowel letters is Vm; the unique sequence of consonant letters is C_seq and the number of consonants is Cm.
Fault-tolerant function = fLetter + V_seq + C_seq. If the fault-tolerant value of word Wi in the dictionary is Si, and the fault-tolerant value of the word being proofread in preprocessing is Sp, then when Sp = Si the word corresponding to Si in the dictionary is the correction term.
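The fault-tolerant value can be sketched directly from the definition, following the fLetter + V_seq + C_seq order of the formula (the example words are illustrative):

```python
VOWELS = set("aeiouAEIOU")

def unique_seq(letters):
    """Keep the first occurrence of each letter, preserving order."""
    seen, out = set(), []
    for ch in letters:
        if ch not in seen:
            seen.add(ch)
            out.append(ch)
    return "".join(out)

def fault_tolerant_value(word):
    """fLetter + V_seq + C_seq, with the sequences built over the letters
    after the initial (the initial itself is carried separately)."""
    first = word[0]
    rest = word[1:]
    v_seq = unique_seq(ch for ch in rest if ch in VOWELS)
    c_seq = unique_seq(ch for ch in rest if ch not in VOWELS)
    return first + v_seq + c_seq

# A doubled-letter misspelling collapses to the same value as the correct word:
print(fault_tolerant_value("necessary"), fault_tolerant_value("neccessary"))
```

Both spellings map to the same signature, so the dictionary word whose value matches the misspelled input is found as the correction term.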
The average number of scans with this method: to generate the fault-tolerant value of a misspelled word of length m, every letter other than the initial must be scanned, i.e. m − 1 scans.
The average number of comparisons with this method: generating the fault-tolerant value requires building the unique vowel original sequence and the unique consonant original sequence. Excluding the initial letter, the number of vowel letters in the word is Vm and the number of consonants is Cm, where Vm + Cm = m − 1. The average number of comparisons is:

CT = (1 + 2 + … + (Vm − 1)) + (1 + 2 + … + (Cm − 1))
   = [(m − 1)² − (m − 1)]/2 − Vm·Cm
   ≤ [(m − 1)² + (m − 1)]/2

If the average length of an English word is taken as m = 7, the number of comparisons is at most 21.
The training-corpus construction method used by the present invention is: a large original corpus is input; the vocabulary of each correct sentence is first evaluated automatically and given a score; the correct vocabulary is then sorted by quality score, and, on the basis of the quality score, the coverage problem is also considered; a set is dynamically chosen from the original corpus, and the chosen set is output as the training corpus.
The whole framework is divided into two parts: a quality-evaluation part and a coverage-based corpus extraction part.
For the quality-evaluation part, the present invention chooses a relatively small, high-quality training set from the existing corpus. Quality is considered first; a high-quality pair is defined as one satisfying the following conditions:
both the source sentence and the target sentence are fluent sentences;
the intertranslation between the source sentence and the target sentence is accurate.
For the quality evaluation, Q(f, e) denotes the quality of the text pair (f, e):

Q(f, e) = Σi=1..k wi·Pi(f, e)

where k is the number of features integrated by the model, e denotes the source sentence, f the target sentence, and wi the weight of each corresponding feature; the weights can be obtained automatically on a manually constructed training set. For k = 5, P1 through P5 are, in order, Pdic(f, e), PLM(e), PLM(f), PTM(f|e), PTM(e|f).
The coverage measure compares three indicators: word coverage, n-gram coverage and translation coverage.
The first pair in the candidate training corpus is taken as the first element of the subset of the selected corpus, and the scan then proceeds backwards: if the current sentence pair contributes a phrase translation that is new to the selected corpus, the sentence pair is preferentially added to the corpus subset.
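The quality-then-coverage selection can be sketched greedily; here coverage is simplified to unseen words only (word coverage, one of the three indicators named above), and the scored pairs are illustrative:

```python
def select_corpus(scored_pairs, max_size):
    """Greedy selection: rank (score, source, target) pairs by quality score,
    then keep a pair only while it still contributes unseen words."""
    ranked = sorted(scored_pairs, key=lambda p: -p[0])
    chosen, covered = [], set()
    for score, src, tgt in ranked:
        words = set(src.split()) | set(tgt.split())
        if words - covered:            # brings at least one new word
            chosen.append((src, tgt))
            covered |= words
        if len(chosen) >= max_size:
            break
    return chosen
```

A pair that duplicates an already-covered vocabulary is skipped even if its quality score is high, which is the coverage consideration the text describes.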
The learning database is divided into four essential parts: learner, knowledge base, executor and scorer; the relation between the parts is as shown in Figure 1.
The purpose of the learner used by the present invention is to evaluate the learning step with the algorithm and recommend hypotheses. It is composed of a series of rules; executing an algorithm may require calling many rules, and the learner takes knowledge out of the knowledge base, obtains it through the rules, and finally executes it through the executor.
The purpose of the knowledge base used by the present invention is to hold the representation forms of knowledge, storing feature vectors, rule-based algorithms, production rules, processing functions, semantic networks and frames. It follows alterability and expandability, is also a metadata knowledge base, and is a model-based method.
The purpose of the executor used by the present invention is to perform operations through a group of logic: when the learner obtains a new piece of knowledge, it is executed by the executor, which finally updates and maintains the knowledge base.
The purpose of the scorer used by the present invention is to compute the document scores in the knowledge base in real time when a user retrieves: the frequency with which a keyword occurs in a document is computed, the score of every document is related to the keyword, and the operation is performed in real time; the more frequently a keyword occurs in the knowledge base, the higher the score of that keyword. The vector space model and the Boolean model of information retrieval are applied in combination.
The learner: the present invention uses a group of learning models; the model approximates a complex nonlinear model with a group of linear constraints, using the Gradient Boost framework. GBDT is a very widely applied algorithm that can be used for classification and regression and performs well on many data sets. Each iteration is intended to reduce the residual of the previous iteration, and to eliminate the residual, a new model is built in the gradient direction along which the residual decreases. Thus, in Gradient Boost, each new model is built so that the residual of the previous models decreases along the gradient direction, which differs greatly from traditional Boost, which re-weights correct and incorrect samples. For the specific algorithm, see the TreeBoost paper "TreeBoost.MH: A boosting algorithm for multi-label hierarchical text categorization" (2006), by Andrea Esuli, Tiziano Fagni, Fabrizio Sebastiani, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Via G. Moruzzi 1, 56124 Pisa, Italy. The specific flow is described in the following steps:
1) a given initial value is set; the processed core word is set as the initial value;
2) M decision trees are built (M iterations);
3) a Logistic transformation is applied to the function estimate F(x);
4) the operations are performed for each of the K classes; each sample point xi has K possible classes yi, so yi, F(xi) and p(xi) are K-dimensional vectors;
5) the gradient direction in which the residual decreases is obtained;
6) for each sample point x, a decision tree consisting of J leaf nodes is obtained along the gradient direction in which its residual decreases;
7) after the decision tree has been built, the gain of each leaf node is obtained from the final formula.
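As a rough illustration of the residual-fitting idea behind Gradient Boost — not the TreeBoost.MH algorithm itself — here is a minimal regression sketch on one-dimensional data, with single-split stumps as the weak learners:

```python
def fit_stump(xs, residuals):
    """Best single threshold split minimizing squared error (a tiny tree)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x, t=t, lm=lmean, rm=rmean: lm if x <= t else rm

def gradient_boost(xs, ys, rounds=20, lr=0.5):
    """F starts at the mean; each round fits a stump to the current residuals
    (the negative gradient of squared loss) and adds it, scaled by lr."""
    f0 = sum(ys) / len(ys)
    stumps, preds = [], [f0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)
```

Each round shrinks the remaining residual along its gradient direction, which is the step-5/step-6 behaviour described above in miniature.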
The scorer: the score value of the scorer is calculated with the following formula:

score(q, d) = coord(q, d) · queryNorm(q) · Σt in q (tf(t in d) · idf(t)² · t.getBoost() · norm(t, d))

tf(t in d) denotes the term frequency, i.e. the number of times term t occurs in document d.
idf(t) denotes the inverse document frequency, related to the document frequency docFreq, the number of documents in which term t occurs: the smaller docFreq, the higher idf; its value is identical within the same query.
coord(q, d) denotes a scoring factor based on the number of query terms occurring in the document: the more query terms from the query string are hit in a document, the larger the value coord computes, indicating a higher match for that document. By default it is the percentage of query terms that occur.
queryNorm(q) is the query normalization factor, which makes different queries comparable. This factor does not affect the document ranking, because all documents are multiplied by the same factor; it only affects the magnitude of the overall score.
The norm value is encoded into one byte by the present invention and saved in the index database when the index is built; when it is taken out, the norm in the index is decoded back into a float value.
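The scoring formula can be sketched as follows; the idf, queryNorm and length-norm choices here are simplified assumptions for illustration, not the scorer's exact defaults:

```python
import math

def score(query_terms, doc, docs, boosts=None):
    """Sketch of coord · queryNorm · Σ tf·idf²·boost·norm over the query terms.
    doc and docs are token lists; boosts maps term -> t.getBoost() (default 1)."""
    boosts = boosts or {}
    matched = [t for t in query_terms if t in doc]
    coord = len(matched) / len(query_terms)              # fraction of terms hit
    idf = {}
    for t in query_terms:
        doc_freq = sum(1 for d in docs if t in d)
        idf[t] = 1.0 + math.log(len(docs) / (doc_freq + 1))  # rarer -> higher
    query_norm = 1.0 / math.sqrt(sum(idf[t] ** 2 for t in query_terms))
    norm = 1.0 / math.sqrt(len(doc))                     # length normalisation
    total = sum(doc.count(t) * idf[t] ** 2 * boosts.get(t, 1.0) * norm
                for t in matched)
    return coord * query_norm * total
```

A document that repeats the query term scores above one that mentions it once, and a document hitting no query term scores exactly zero (coord = 0).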
This patent establishes, through the similarity-computation rule, the relationship network between texts. To predict the score of an existing text u on an item i ∈ I, the nearest-neighbour set Tu of u is found first, and the prediction is then made from the score values that the texts in Tu have given to i.
A = Rx ∩ Ry denotes the set of items scored in common by documents x and y. Because different score values represent different degrees of association, and because items not scored by both documents are not comparable when comparing the similarity difference of two documents, the mean of the absolute values of the score differences over the items in A is taken to represent the similarity difference between them, written Dxy.
The present invention considers three cases for the score difference:
when |Rx ∩ Ry| = 0, x and y have no commonly scored document, so there is no similarity between them and the similarity difference is 0;
when Dxy = 0 and |Rx ∩ Ry| ≠ 0, x and y have commonly scored documents, and their similarity is determined by their common neighbour nodes;
when Dxy ≠ 0 and |Rx ∩ Ry| ≠ 0, their similarity has common neighbour nodes and an identical distance.
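The three cases can be sketched over rating dictionaries; returning None for the no-overlap case (rather than 0) is an illustrative choice that keeps it distinct from the identical-ratings case:

```python
def similarity_difference(ratings_x, ratings_y):
    """D_xy over the common scoring item set Rx ∩ Ry:
    - no common items  -> None (no basis for similarity, case 1)
    - identical scores -> 0 (case 2)
    - otherwise the mean absolute score difference (case 3)."""
    common = set(ratings_x) & set(ratings_y)
    if not common:                       # |Rx ∩ Ry| = 0
        return None
    diffs = [abs(ratings_x[i] - ratings_y[i]) for i in common]
    return sum(diffs) / len(diffs)
```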
The knowledge base obtains external information by searching the environment, obtains knowledge through thinking processes such as analysis, synthesis, analogy and induction, and stores this knowledge in the knowledge base.
The present invention performs association-rule mining and learning oriented to a new subject. The knowledge base is updated within the scope of naming logic. Here T denotes the knowledge base, a finite set of propositional formulas; p denotes the new knowledge to be added; and T ∘ p denotes adding the new knowledge data p to the knowledge base.
Update method: let T be a knowledge base satisfying the condition and p a new piece of knowledge satisfying the condition, and let W(p, T) be the set of all maximal sets of formulas in T that are consistent with p. Then:

T ∘ p = { T' ∪ {p} | T' ∈ W(p, T) }

In a knowledge base the pieces of knowledge are closely connected, and judging and eliminating contradictions when adding new knowledge is complicated, so the connections between pieces of knowledge need to be restricted.
First the constraint graph of T ∪ {p} is constructed. The constraint graph may have several connected components; the addition of p can only conflict with formulas in the component containing p and has no effect on the others, so only the component containing p needs to be considered. If the structure of this component is a tree, then p belongs to the updated knowledge base, so p is treated as the root node.
For a subtree of the tree rooted at p, this patent takes the root to be a formula R connected to each subtree through the shared variables C1, C2, C3, …, Ck. For one fixed assignment of values to C1, C2, …, Ck, the whole tree can be regarded as composed of the independent parts R, T1, T2, …, Tk, and the union of the deletion sets of these parts is the deletion set of the whole tree for that assignment of C1, C2, …, Ck. Traversing all values of these deletion sets yields the deletion set for each particular assignment, so the deletion sets of the subtrees yield the deletion set of the whole tree.
Therefore, for a satisfiable knowledge base T and satisfiable new knowledge p, the required T ∘ p can be computed when the constraint graph of T ∪ {p} is a tree; hence the update complexity of the knowledge base is related to the structure of the knowledge base.
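A minimal sketch of the update T ∘ p, assuming for illustration that the knowledge base holds simple literals (q / ~q) rather than arbitrary propositional formulas (a brute-force enumeration, not the tree-structured algorithm above):

```python
from itertools import combinations

def consistent(literals):
    """A set of literals is consistent iff it never contains both q and ~q."""
    return not any(("~" + l if not l.startswith("~") else l[1:]) in literals
                   for l in literals)

def update(kb, p):
    """T ∘ p = { T' ∪ {p} | T' ∈ W(p, T) }, where W(p, T) is the set of
    inclusion-maximal subsets of T consistent with the new knowledge p."""
    subsets = [set(c) for size in range(len(kb) + 1)
               for c in combinations(sorted(kb), size)
               if consistent(set(c) | {p})]
    maximal = [s for s in subsets if not any(s < t for t in subsets)]
    return [frozenset(s | {p}) for s in maximal]
```

Adding "c" to {a, b, ~c} forces the contradicting ~c out, which is exactly the contradiction elimination the update method describes.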
The executor: for a single-input single-output process, the following model exists:

y(k+1) = f[y(k), y(k−1), …, y(k−P+1), u(k), u(k−1), …, u(k−Q)]

where y is the output, u is the input, k is the discrete time index, P and Q are positive integers, and f[·] is a function.
The input u of the executor of the object is limited in amplitude, having both a lower limit um and an upper limit uM; for any k:

um ≤ u(k) ≤ uM

Assuming the process object described is invertible and a write-back document exists in the knowledge base, a function g[·] can exist; this patent sets:

u(k) = g[y(k+1), y(k), …, y(k−P+1), u(k−1), u(k−2), …, u(k−Q)]

as the inverse model of the described object. Further, with the input an m-dimensional vector Xc and the output Uc, the output–input relation is expressed as:

Uc = ψc(Xc)

where ψc is the input–output mapping; that is, a document can be input into the executor from the knowledge base and then written back into the knowledge base by the executor.
If the output of ψc(·) approaches the output of g(·), it can be regarded as the inverse model of the executor. At time k, assume the input xc(k) is:

xc(k) = [r(k+1), y(k), …, y(k−p+1), u(k−1), …, u(k−q)]T

the unknown y(k+1) being replaced by the given input r(k+1); p and q are the estimated values of P and Q respectively.
When the executor NC in the knowledge base has output:

uc(k) = ψc[r(k+1), y(k), …, y(k−p+1), u(k−1), …, u(k−q)]

input is taken from the knowledge base to the executor, and when the scorer outputs to the executor, the training result suffices to make the output deviation e(k) = r(k) − y(k) remain a very small value; then:

xc(k) = [r(k+1), r(k), …, r(k−p+1), u(k−1), …, u(k−q)]T

that is, y(t) is replaced by r(t). Since the output of the learner needs to be reflected to the executor, this patent gives this formula a feed-forward characteristic, generally requiring that a deviation function J defined on the object-output deviation e(k) = r(k) − y(k) be minimized. To improve the calculation accuracy, the differential of the text scorer with respect to the output knowledge base must also be calculated; once S is obtained, the weight coefficients of NC can be improved by the BP algorithm. On this basis, direct inverse execution, direct adaptive execution and indirect adaptive execution structures are considered.
In the executor, the present invention improves the procedure of taking a document out of the knowledge base, scoring it with the scorer and writing it back; moreover, each extraction cycle performs one round of learning, which reduces the learning time. With real-time learning the learning cycle is TL; since the learning cycle TL is determined only by the program time, in this patent the learning cycle TL is generally much smaller than the cycle Ts of extracting samples from the scorer and the knowledge base.
The present invention is further described below in conjunction with examples.
The present invention combines the access behaviour of users on the e-commerce platform, extracting kernel keywords from users' product keywords, industry hot search words and industry keywords; it optimizes users' search information and keyword settings, expands high-quality rankings through optimization approaches such as trade information, achieves a self-learning effect in the core search library and the word-processing modules, and provides the most crucial keywords for user products.
Figure 2 shows the architecture diagram of the present invention, comprising a data source module, a data storage module and a statistics mining module. The data source module is used to save basic data such as network logs, product information, search information and trade information, as the data source for data analysis and data mining; it comprises a web-log unit, a product-information unit and a trade-information unit. The web-log unit saves web-log information, recording users' accesses and users' search records; the product-information unit saves the various information of products; and the trade-information unit saves the various information of industries. The data storage module saves the data processed by each processor and, after cleaning and filtering, the data-mart unit generates several intermediate databases and relational databases. The data storage module is composed of a data-warehouse unit, a cleaning filter and a data-mart unit, connected in sequence. The data-warehouse unit saves the data processed by the ETL processor.
Figure 3 shows the implementation flow chart of the present invention, specifically including the following steps:
Step 1: the site search logs and product information of the B2B e-commerce platform website are taken as the data source for the keywords and their related words.
In detail, this is divided into the following steps:
(1) The product-name attributes, industry keywords and search keywords in the web logs and product information of the B2B e-commerce platform website are selected as the data source.
(2) The extracted data undergo ETL processing to form the keywords in users' search behaviour and product information together with their related usage information, including the keywords in the network log, the search time, the client IP of the search, the product keywords and so on, and are stored in the data warehouse.
Here ETL refers to extracting distributed, heterogeneous data sources into a temporary intermediate layer, then cleaning, converting and integrating them, and finally loading them into the data warehouse or data mart, forming the basis of online analytical processing and data mining.
(3) The product keyword data are processed by the word filter to remove non-word characters.
1. A line of data records is taken, the number of words is counted, and the line is split on each word space.
2. Each character is examined in turn: spaces are not counted; non-space characters forming a word are counted.
3. When the previous character is a space and the following one is a letter, one word is counted.
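The three sub-steps of the word filter can be sketched as follows (the sample record is illustrative):

```python
def filter_words(line):
    """Split a log record on spaces and keep only the letter characters of
    each token, dropping tokens with no letters (non-word characters removed)."""
    words = []
    for token in line.split(" "):
        cleaned = "".join(ch for ch in token if ch.isalpha())
        if cleaned:
            words.append(cleaned)
    return words, len(words)

words, count = filter_words("stainless steel  pipe-fitting 304!")
print(words)  # ['stainless', 'steel', 'pipefitting']
```

Purely numeric or symbolic tokens (such as "304!") are discarded entirely, matching the non-word-letter removal described.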
Step 2: the product dictionary is derived from the product information, and the related industry dictionary contains the trade information. The product-related processor processes the product information by industry and classifies the keywords corresponding to each product by industry type.
(1) The product keywords are matched against the industry keywords by the word-matching method, and according to the features they have in common, the class to which the keywords corresponding to the product belong is determined.
(2) After the product keywords have been matched against the industry words, the matched result is output to the synonym corpus for processing, and the matching range is expanded with the corpus according to the words that the keywords have in common.
(3) Keywords not present in the product dictionary are deleted, and the data are returned to the relational database.
The product-information data comprise: product ID and product keywords; the industry keyword database comprises: industry ID and industry keywords. In addition, it is also used to save the intermediate transition data produced in the statistics mining process.
Step 3: after the product processor, if the keyword of the product is an uncommon word, it needs to enter the learning database for processing; otherwise it goes directly to the word-segmentation processor.
In detail, this is divided into the following steps:
(1) Keywords not present in the product dictionary are deleted, thereby simplifying the product-information intermediate database.
(2) It is judged whether the current keyword is an uncommon word or a keyword that cannot be matched; if so, it is output to the learning database for processing; otherwise it is output to the word-segmentation processor.
Step 4: the word segmenter receives the data from the product-related processor and the comprehensive related dictionary and judges whether they are English words; English words are split by traversing the spaces and combined to form <product name, keyword> sequences.
(1) Each <product name, keyword> sequence pair is sorted by product ID and stored in the data cache, forming the complete product-related dictionary.
(2) When a word is judged to be non-English, it is output to the learning database for learning; if it is an English word, the traversal loop continues and one item of product information is processed.
Step 5: the affix processor accepts the data from the word segmenter, a group of data sets being input to the affix processor. If, after affix processing, a word cannot be matched against the dictionary, it is output to the learning database; otherwise it is combined to form a <product name, keyword> sequence.
Step 6: after the affix processor has finished, the result is output to the root processor, which accepts and processes the data from the affix processor, matches the data further against the dictionary, then inputs them to the similarity processor to compute the root with the maximum similarity, and finally returns the result and combines it into a <product name, keyword> sequence.
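Steps 5 and 6 can be sketched together as dictionary-checked suffix stripping; the dictionary and suffix list here are illustrative placeholders, not the patent's actual dictionaries:

```python
# Hypothetical root dictionary; a real system would use the product dictionary.
DICTIONARY = {"connect", "machine", "pack", "seal"}

# Suffixes tried longest-first, a deliberately small illustrative list.
SUFFIXES = ["ation", "ment", "ness", "ing", "ers", "er", "ed", "s"]

def strip_affixes(word):
    """Strip a known suffix so the remainder matches the dictionary; words
    with no dictionary match would be routed to the learning database."""
    if word in DICTIONARY:
        return word
    for suf in SUFFIXES:
        if word.endswith(suf) and word[:-len(suf)] in DICTIONARY:
            return word[:-len(suf)]
    return None   # no match -> send to learning database

print(strip_affixes("sealing"))  # seal
```

A word like "widget", absent from the root dictionary under every suffix, returns None, which corresponds to the learning-database branch of steps 5 and 6.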
Step 7: after the root processor has finished, the result sequence is output to the singular/plural processor; upon accepting a piece of data, it loops over each letter of each word and processes it, returns the word to the root processor if an exception occurs, and on success combines the data into a <product name, keyword> sequence.
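The singular/plural conversion of step 7 can be sketched as follows (the irregular table and suffix rules shown are a small illustrative subset, not a complete inventory):

```python
def singularize(word):
    """Reduce common English plural forms to the prototype; irregular plurals
    come from a lookup table (only a few shown here for illustration)."""
    irregular = {"feet": "foot", "men": "man", "children": "child"}
    if word in irregular:
        return irregular[word]
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"                  # batteries -> battery
    if word.endswith(("ches", "shes", "xes", "ses")):
        return word[:-2]                        # boxes -> box
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                        # valves -> valve
    return word                                 # glass stays glass
```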
Step 8: the singular/plural processor outputs the data sequence to the tense processor. After receiving the data, the tense processor decides which tense the data belong to according to the temporal-sequence status sample temporal logic, calls different results accordingly, and invokes the four kinds of processing Always, Sometime, Until and Next according to the type.
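The word-level part of step 8 — converting inflected forms back to the prototype — can be sketched as follows; the Always/Sometime/Until/Next dispatch is omitted, and the dictionary is a placeholder:

```python
def normalize_tense(word, dictionary):
    """Undo regular -ing/-ed inflection, checking each candidate stem against
    the dictionary; unresolved words would go to the learning database."""
    if word in dictionary:
        return word
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            candidates = [stem, stem + "e"]      # making -> mak -> make
            if len(stem) > 1 and stem[-1] == stem[-2]:
                candidates.append(stem[:-1])     # shipped -> shipp -> ship
            for c in candidates:
                if c in dictionary:
                    return c
    return None   # irregular or unknown -> learning database
```

Irregular verbs would need a lookup table, analogous to the irregular-plural table in the singular/plural sketch.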
Step 9: the output of the tense processor enters the word recombination module, which first performs the word-formation dictionary check, the morphological-distance and smallest-edit-distance calculation and the similar-key-rule processing, rejecting misspelled words; it then processes them through the learning database, provides the corresponding correct spellings, recombines them into the original data structure and deposits them into the cache; then, according to the different types, the data in the cache are indexed and output to the index database.
Step 10: the word recombination module outputs data to the kernel-keyword index database, where the cached data are created as kernel-keyword index text files. If the type output from word recombination is an industry core word, an index text file of industry core words is built; if the type output from word recombination is a search core word, an index text file of search core words is built.
Step 11: the outputs of the affix processor, product-related processor, root processor and word recombination module go to the learning database. After the learning database receives these data, the data first enter the learner for learning: a group of rule bases is established first, then the rule weights and variable weights are computed to determine the input–output space of the model, then the matching-degree measure; the number of unmatched models is limited according to the data, and if no corresponding rule can be found, the degree of association of that rule is reduced. The learning database outputs the rule data to the knowledge base; when the knowledge base receives the data from the learning database, it obtains knowledge through a series of thinking processes and stores this knowledge in the knowledge base.
Step 12: if the current knowledge data already exist in the knowledge base, it is re-examined whether the update condition is satisfied; if the update condition is satisfied the update is performed, otherwise the data are returned to the learning database.
Step 13: after the learning database has accepted and processed the data, if one piece of data has several results, the data are first output to the scorer; the whole scoring process is performed by the executor, which first takes the document out of the knowledge base, scores it with the scorer, and then writes it back into the knowledge base.
The above is only the preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can also be made without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (3)
1. A kernel keyword extraction method based on a B2B platform, characterised by comprising the steps of:
(1) taking the user-set product names, search words and industry hot words on the B2B platform as the dictionary source; the dictionary source is preprocessed and then saved in the data mart, constituting the product-name core word bank; the method of preprocessing the dictionary source is:
for user-set product names, the principle of high-frequency use of user-set product names is applied first, rejecting the user-set product names used fewer times; the user-set keywords of the corresponding user-set product names are then saved in the user-set keyword database;
for search words, stop words including punctuation and special symbols are first filtered out; the principle of high-frequency use of search words is then applied, rejecting the search words used less frequently in the most recent half year; preprocessing is then performed by the core word-segmentation processor, forming search keywords that are saved in the search high-frequency dictionary;
for industry hot words, classified by industry, stop words including punctuation and special symbols are first filtered out; the principle of high-frequency use of industry hot words is then applied, rejecting the industry hot words used fewer times; preprocessing is then performed by the core word-segmentation processor, forming industry hot keywords that are saved in the industry high-frequency dictionary;
(2) for all valid product names in the current site, non-word characters including punctuation and special symbols are first filtered out; preprocessing is then performed by the core word-segmentation processor, and the resulting product names are saved in the product high-frequency dictionary;
(3) the product names in the product high-frequency dictionary are matched against the product-name core word bank; the product names obtained by matching are deduplicated, one record per product name, output in the order of first occurrence in the product name, and saved in the data mart, constituting the kernel keywords of the product names; the matching rules are:
1. a search keyword occurs in the product name, and that search keyword is a user-set keyword;
2. a search keyword occurs in the product name, and that search keyword is an industry hot keyword;
a search keyword occurring in a product name that satisfies either of the above matching rules is defined as a kernel keyword of that product name;
Described core word segmentation processing device includes word segmentation processing device, affixe processor, root process device, DANFU number processor, tense
Processor, similarity processor, word recombination module, keyword index storehouse and learning database, wherein:
Described word segmentation processing device, to English name of product, is split by traversal space, carries out according to word and phrase
Word segmentation processing, combination forms<name of product, key word>sequence, and is ranked up according to product IDs;
Described affixe processor, the data produced after word segmentation processing device is processed, remove that each word is front/rear to be sewed, by its of word
His form is converted into noun, or derivative is converted into noun, is mated with dictionary by the noun obtained;For cannot be with word
The word that allusion quotation matches, by corresponding word output to learning database;For the word that can match with dictionary, more it is newly formed
<name of product, key word>sequence;
Described root process device, the data produced after affixe processor is processed, enter according to the part of speech of word according to root algorithm
Row root extracts, then is mated with dictionary by the root of extraction;For the word that cannot match with dictionary, will be single accordingly
Word exports to learning database;For the word that can match with dictionary, more it is newly formed<name of product, key word>sequence;
Described DANFU number processor, the data produced after root process device is processed, carry out single complex processing, word is converted to
Prototype, is more newly formed<name of product, key word>sequence;
Described tense processor, the data produced after DANFU number processor is processed, carry out tense process, word is converted to former
Type, is more newly formed<name of product, key word>sequence;
When a matched word has two or more meanings, the similarity processor computes which meaning has the highest similarity to the context and selects it;
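The patent does not specify the similarity measure; a simplified Lesk-style overlap between each candidate sense and the other words of the product name is one plausible reading. The sense inventory below is purely illustrative.

```python
# Word-sense selection sketch: score each sense of an ambiguous word
# by its overlap with the surrounding product-name words and keep the
# highest-scoring sense.

SENSES = {
    "pump": [
        ("machine for moving fluid", {"water", "fluid", "valve", "pressure"}),
        ("type of shoe", {"shoe", "heel", "leather", "fashion"}),
    ],
}

def best_sense(word, context_words, senses=SENSES):
    candidates = senses.get(word, [])
    if not candidates:
        return None
    # similarity = size of the overlap between context and sense signature
    return max(candidates, key=lambda s: len(s[1] & set(context_words)))[0]
```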
The word recombination module takes the data produced by the tense processor. It first applies word-formation dictionary checks, minimum-edit-distance calculation of word distance, and adjacent-key rules to reject misspelled words; it then uses the learning database to supply the correct spellings, assembles the words into data of the proper structure, and stores them in a cache; finally, the cached data are indexed by industry type and output to the core-keyword index repository;
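The minimum-edit-distance check named above is the classic Levenshtein distance; a sketch follows. The distance threshold and the dictionary are illustrative choices, not values given by the patent.

```python
# Reject/repair misspelled words by minimum edit distance: a token
# within a small distance of a dictionary word is treated as a
# spelling error of that word.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def correct_spelling(word, dictionary, max_dist=1):
    """Return the closest dictionary word within max_dist, else None."""
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else None
```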
The keyword index repository builds the cached data into a core-keyword index text file; in addition, it builds an industry core-word index text file for the industry core words output by the word recombination module, and a search core-word index text file for the search core words output by the word recombination module;
The learning database comprises four essential parts: a learner, a knowledge base, an executor, and a scorer. When data produced by the affix processor, the stemming processor, the product-association processor, and the word recombination module are output to the learning database, the data first enter the learner. Drawing on the knowledge in the knowledge base, the learner learns from the input data: it first establishes a set of rules, then computes rule weights and variable weights, and outputs the established rules and computed quantities to the knowledge base. The knowledge base applies a series of reasoning steps to the input data to obtain knowledge, where knowledge refers to a set of rule-based algorithms. If an obtained algorithm already exists in the knowledge base, the system checks whether the conditions for updating the knowledge base are satisfied; if they are, the knowledge base is updated, otherwise the data are returned to the learner. The executor executes the knowledge obtained by the knowledge base, and the scorer scores the executor's results; if the score is acceptable, the knowledge satisfies the condition for updating the knowledge base.
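The learner–executor–scorer loop can be sketched at a high level as follows. All names, the callback interfaces, and the passing threshold are assumptions made for the illustration; the patent does not specify them.

```python
# High-level sketch of the learning-database cycle: the learner
# proposes a rule, the executor applies it, the scorer grades the
# result, and the knowledge base is updated only when the score
# passes a threshold; otherwise the data return to the learner.

def learning_cycle(sample, propose_rule, execute, score,
                   knowledge_base, pass_mark=0.8):
    rule = propose_rule(sample)            # learner establishes a rule
    result = execute(rule, sample)         # executor performs the knowledge
    if score(result) >= pass_mark:         # scorer marks the result
        knowledge_base[rule] = result      # condition met: update knowledge base
        return True
    return False                           # data returned to the learner
```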
The kernel keyword extraction method based on a B2B platform according to claim 1, characterized in that: in step (2), the product high-frequency dictionary is derived from product information, and the industry high-frequency dictionary contains industry information; the product information must be correlated by the product-association processor. Product information includes product IDs and product keywords; industry information includes industry IDs and industry hot keywords;
The product keywords are classified according to the industry type corresponding to each product name, which specifically includes the following steps:
(21) match the product keywords against the industry hot keywords by word matching, and determine the industry category of the product from the features they have in common;
(22) according to the determined industry category, output the product keywords into the synonym corpus, and expand the product keywords according to the words that the product keywords and the synonym corpus have in common;
(23) first reject the product keywords that are absent from the dictionary, then output the uncommon and unmatchable product keywords to the learning database, and pass the remaining product keywords to the core word-segmentation processor for preprocessing.
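Step (21) can be sketched as a keyword-overlap vote: the product is assigned to the industry whose hot-keyword set shares the most words with the product's keywords. The industry data below are illustrative.

```python
# Classify a product into an industry by the overlap between its
# keywords and each industry's hot keywords (step 21 of the method).

INDUSTRY_HOT_WORDS = {
    "machinery": {"pump", "valve", "motor", "bearing"},
    "apparel": {"shirt", "cotton", "fabric", "zipper"},
}

def classify_industry(product_keywords, industries=INDUSTRY_HOT_WORDS):
    """Return the industry whose hot-keyword set has the largest
    intersection with the product's keywords."""
    kw = set(product_keywords)
    return max(industries, key=lambda ind: len(industries[ind] & kw))
```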
The kernel keyword extraction method based on a B2B platform according to claim 1, characterized in that the word-segmentation processor splits the English product name by traversing its spaces, comprising the following steps:
1. split the product name into words at the spaces;
2. remove stop words, including punctuation and special symbols, and number the remaining words 0, 1, 2, ..., N;
3. for the n-th word, match the n-th word against the (n+i)-th word: if the n-th and (n+i)-th words form a phrase, then n = n+1, until n = N; otherwise the n-th and (n+i)-th words are separate words and i = i+1, until n+i = N; n = 0, 1, 2, ..., N; i = 1, 2, ....
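The three steps above can be sketched as follows, with a greedy adjacent-pair merge standing in for the phrase-matching loop; the stop-word list and phrase dictionary are illustrative assumptions.

```python
# Space-traversal segmentation sketch: split on spaces, drop stop
# words and punctuation, then merge adjacent words that form a known
# phrase into a single keyword.

import re

STOP_WORDS = {"for", "of", "the", "and", "with"}
PHRASES = {("stainless", "steel"), ("water", "pump")}

def segment(product_name, phrases=PHRASES):
    # steps 1-2: split at spaces, strip punctuation, remove stop words
    tokens = [re.sub(r"[^\w]", "", t).lower() for t in product_name.split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    # step 3: merge the n-th and (n+1)-th words when they form a phrase
    out, n = [], 0
    while n < len(tokens):
        if n + 1 < len(tokens) and (tokens[n], tokens[n + 1]) in phrases:
            out.append(tokens[n] + " " + tokens[n + 1])
            n += 2
        else:
            out.append(tokens[n])
            n += 1
    return out
```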
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410765503.8A CN104408173B (en) | 2014-12-11 | 2014-12-11 | A kind of kernel keyword extraction method based on B2B platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104408173A CN104408173A (en) | 2015-03-11 |
CN104408173B true CN104408173B (en) | 2016-12-07 |
Family
ID=52645804
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410765503.8A Active CN104408173B (en) | 2014-12-11 | 2014-12-11 | A kind of kernel keyword extraction method based on B2B platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104408173B (en) |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104978404B (en) * | 2015-06-04 | 2018-07-20 | 无锡天脉聚源传媒科技有限公司 | A kind of generation method and device of video album title |
CN104978400A (en) * | 2015-06-04 | 2015-10-14 | 无锡天脉聚源传媒科技有限公司 | Method for generating video album name and apparatus |
CN104978403B (en) * | 2015-06-04 | 2018-08-24 | 无锡天脉聚源传媒科技有限公司 | A kind of generation method and device of video album title |
CA2976001A1 (en) * | 2015-08-11 | 2017-02-16 | Vissie CHEN | Method for producing, ensuring, accessing and selling quality data |
WO2017045186A1 (en) * | 2015-09-17 | 2017-03-23 | 深圳市世强先进科技有限公司 | Keyword defining method and system |
CN106803197B (en) * | 2015-11-26 | 2021-09-03 | 北京嘀嘀无限科技发展有限公司 | Order splicing method and equipment |
CN107203542A (en) * | 2016-03-17 | 2017-09-26 | 阿里巴巴集团控股有限公司 | Phrase extracting method and device |
CN105893592B (en) * | 2016-04-12 | 2019-06-21 | Oppo广东移动通信有限公司 | A kind of searching method and device |
CN106383910B (en) * | 2016-10-09 | 2020-02-14 | 合一网络技术(北京)有限公司 | Method for determining search term weight, and method and device for pushing network resources |
CN108121754B (en) * | 2016-11-30 | 2020-11-24 | 北京国双科技有限公司 | Method and device for acquiring keyword attribute combination |
CN108241699B (en) * | 2016-12-26 | 2022-03-11 | 百度在线网络技术(北京)有限公司 | Method and device for pushing information |
CN106980961A (en) * | 2017-03-02 | 2017-07-25 | 中科天地互联网科技(苏州)有限公司 | A kind of resume selection matching process and system |
CN108984554B (en) * | 2017-06-01 | 2021-06-29 | 北京京东尚科信息技术有限公司 | Method and device for determining keywords |
CN107391481A (en) * | 2017-06-29 | 2017-11-24 | 清远墨墨教育科技有限公司 | A kind of vocabulary screening technique for vocabulary test |
CN108038100A (en) * | 2017-11-30 | 2018-05-15 | 四川隧唐科技股份有限公司 | engineering keyword extracting method and device |
CN109947947B (en) * | 2019-03-29 | 2021-11-23 | 北京泰迪熊移动科技有限公司 | Text classification method and device and computer readable storage medium |
CN111782760A (en) * | 2019-05-09 | 2020-10-16 | 北京沃东天骏信息技术有限公司 | Core product word recognition method, device and equipment |
CN111339763B (en) * | 2020-02-26 | 2022-06-28 | 四川大学 | English mail subject generation method based on multi-level neural network |
CN111859972B (en) * | 2020-07-28 | 2024-03-15 | 平安科技(深圳)有限公司 | Entity identification method, entity identification device, computer equipment and computer readable storage medium |
CN112163421B (en) * | 2020-10-09 | 2022-05-17 | 厦门大学 | Keyword extraction method based on N-Gram |
CN112818693A (en) * | 2021-02-07 | 2021-05-18 | 深圳市世强元件网络有限公司 | Automatic extraction method and system for electronic component model words |
CN113032683B (en) * | 2021-04-28 | 2021-12-24 | 玉米社(深圳)网络科技有限公司 | Method for quickly segmenting words in network popularization |
CN114860865A (en) * | 2022-05-05 | 2022-08-05 | 北京达佳互联信息技术有限公司 | Index construction and resource recall method and device, electronic equipment and storage medium |
CN117272991A (en) * | 2022-09-30 | 2023-12-22 | 上海寰通商务科技有限公司 | Method, device and medium for identifying target object in pharmaceutical industry to be identified |
CN115470323B (en) * | 2022-10-31 | 2023-03-10 | 中建电子商务有限责任公司 | Method for improving searching precision of building industry based on word segmentation technology |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103226618A (en) * | 2013-05-21 | 2013-07-31 | 焦点科技股份有限公司 | Related word extracting method and system based on data market mining |
CN103377190A (en) * | 2012-04-11 | 2013-10-30 | 阿里巴巴集团控股有限公司 | Trading platform based supplier information searching method and device |
CN103745012A (en) * | 2014-01-28 | 2014-04-23 | 广州一呼百应网络技术有限公司 | Method and system for intelligently matching and showing recommended information of web page according to product title |
CN103942347A (en) * | 2014-05-19 | 2014-07-23 | 焦点科技股份有限公司 | Word separating method based on multi-dimensional comprehensive lexicon |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104408173B (en) | A kind of kernel keyword extraction method based on B2B platform | |
CN111611361B (en) | Intelligent reading, understanding, question answering system of extraction type machine | |
CN110633409B (en) | Automobile news event extraction method integrating rules and deep learning | |
CN107992597B (en) | Text structuring method for power grid fault case | |
CN109271505B (en) | Question-answering system implementation method based on question-answer pairs | |
Guo et al. | Improving multilingual semantic interoperation in cross-organizational enterprise systems through concept disambiguation | |
CN110377715A (en) | Reasoning type accurate intelligent answering method based on legal knowledge map | |
Ru et al. | Using semantic similarity to reduce wrong labels in distant supervision for relation extraction | |
CN109783806B (en) | Text matching method utilizing semantic parsing structure | |
Korobkin et al. | Methods of statistical and semantic patent analysis | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
CN112036178A (en) | Distribution network entity related semantic search method | |
CN116822625A (en) | Divergent-type associated fan equipment operation and detection knowledge graph construction and retrieval method | |
CN107818081A (en) | Sentence similarity appraisal procedure based on deep semantic model and semantic character labeling | |
Ahmed et al. | Named entity recognition by using maximum entropy | |
Kilias et al. | Idel: In-database entity linking with neural embeddings | |
CN114841353A (en) | Quantum language model modeling system fusing syntactic information and application thereof | |
CN112862569B (en) | Product appearance style evaluation method and system based on image and text multi-modal data | |
Alian et al. | Paraphrasing identification techniques in English and Arabic texts | |
Ahkouk et al. | Comparative study of existing approaches on the Task of Natural Language to Database Language | |
Nethravathi et al. | Structuring natural language to query language: a review | |
Zhang et al. | An approach for named entity disambiguation with knowledge graph | |
Hossen et al. | Bert model-based natural language to nosql query conversion using deep learning approach | |
Angermann et al. | Taxonomy Matching Using Background Knowledge | |
Liu et al. | Keywords extraction method for technological demands of small and medium-sized enterprises based on LDA |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |