CN104408173A - Method for automatically extracting kernel keyword based on B2B platform

Info

Publication number
CN104408173A
Authority
CN
China
Legal status
Granted
Application number
CN201410765503.8A
Other languages
Chinese (zh)
Other versions
CN104408173B (en)
Inventor
Xu Fei (徐飞)
Current Assignee
Focus Technology Co Ltd
Original Assignee
Focus Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Focus Technology Co Ltd
Priority to CN201410765503.8A
Publication of CN104408173A
Application granted
Publication of CN104408173B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques


Abstract

The invention discloses a method for automatically extracting kernel keywords based on a B2B platform. The method extracts kernel keywords from English product names on the basis of English grammar and semantics. It shows clear advantages in concurrent big-data computation, in converting the various tenses of English words back to their base forms, and in word processing and rule-based self-learning.

Description

A kernel keyword extraction method based on a B2B platform
Technical field
The present invention relates to a kernel keyword extraction method based on a B2B platform.
Background technology
E-commerce has by now accumulated massive amounts of information and a large number of users, including visitors, dealers and information providers; highly repetitive information occupies a large amount of server resources.
When a search engine is used for keyword search, the keyword is submitted to the server, which searches the mass data according to the keyword and returns results after finding a group of relevant information; concurrent searches can place a heavy load on the server. The quality of the keyword strongly affects both the efficiency (search speed) and the quality (relevance of the results) of the search. A method for automatically extracting kernel keywords is therefore needed, in which keywords (combined with other data) are passed through filtering, word segmentation, matching, recombination and similar processing to obtain kernel keywords, so that the server can search on the kernel keywords and thereby improve search efficiency and quality.
The keywords that a product information supplier sets for its product, together with a batch of high-quality related terms, are very helpful for reflecting product characteristics accurately and comprehensively. In theory, after applying word segmentation, stemming and word recombination algorithms to the supplier's keywords, related terms and product names, valuable words can be extracted and indexed, and kernel keywords can finally be obtained.
Existing domestic word segmentation methods are rather limited; in particular, for the automatic extraction of English kernel keywords they only match and extract consecutive single words, cannot match consecutive phrases or discontinuous words, and therefore easily miss many valuable kernel keywords. For example:
Chinese patent CN200710122439.1 discloses a word segmentation system and method that uses segmentation marks to separate character strings, then identifies consecutive single words in the machine segmentation result, and finally extracts core words. However, the result may lose some valuable core words, and matching the separated character strings again with a mechanical segmentation method is very inefficient on large data volumes.
Chinese patent CN200910083775.9 discloses a word segmentation processing method and a text search method that create a new word segmentation system based on database feature items, add the database feature items to the new segmentation system, and use the feature items as a vocabulary to segment the query words submitted by the user and generate a segmentation result set. The method selects database fields as feature items for segmentation and exploits the association between the feature items and the texts in the database, effectively improving the segmentation accuracy of traditional unigram, bigram and preset-vocabulary methods. However, it segments words on the basis of a dictionary built from a preset vocabulary, works poorly on English text, and does not cover stem (word base form) extraction.
Accurate word segmentation and accurate matching of English words and stems are important parts of automatic kernel keyword extraction from mass data, and are also important for improving the efficiency and quality of mass-data search.
Summary of the invention
Object of the invention: in order to overcome the deficiencies of the prior art, the invention provides a kernel keyword extraction method based on a B2B platform that extracts kernel keywords from English product names on the basis of English grammar and semantics.
Technical solution: to achieve the above object, the invention adopts the following technical solution.
A kernel keyword extraction method based on a B2B platform comprises the following steps:
(1) Take the product names set by users on the B2B platform, search words and industry hot words as dictionary sources, preprocess the dictionary sources, and save the result in a data mart to form a product-name core word bank. The dictionary sources are preprocessed as follows:
For user-set product names, first apply the high-frequency-use principle and discard the user-set product names that are used rarely; then save the user-set keywords corresponding to the remaining user-set product names in the user keyword database.
For search words, first filter out stop words, including punctuation and special symbols; then apply the high-frequency-use principle and discard search words that have been used rarely in the most recent half year; then preprocess the remainder with the core word segmentation processor to form search keywords, which are saved in the search high-frequency dictionary.
For industry hot words, grouped by industry, first filter out stop words, including punctuation and special symbols; then apply the high-frequency-use principle and discard industry hot words that are used rarely; then preprocess the remainder with the core word segmentation processor to form industry hot keywords, which are saved in the industry high-frequency dictionary.
(2) For all valid product names on the current site, first filter out stop words, including punctuation and special symbols; then preprocess them with the core word segmentation processor and save the resulting product names in the product high-frequency dictionary.
(3) Match the product names in the product high-frequency dictionary against the product-name core word bank, de-duplicate the matched product names, output them in the order in which they appear in the product name, one record per product name, and save them in the data mart to form the kernel keywords of the product name. The matching rules are (a minimal code sketch follows the rules):
① a search keyword appears in the product name and is also a user-set keyword;
② a search keyword appears in the product name and is also an industry hot keyword.
A search keyword that appears in the product name and satisfies either matching rule is defined as a kernel keyword of the product name.
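The following Python sketch illustrates the matching rules of step (3); the container names and the example data are illustrative assumptions, not part of the patent.

```python
def extract_kernel_keywords(product_name_terms, user_keywords,
                            industry_hot_keywords, search_keywords):
    """Sketch of step (3): a search keyword that appears in the product name and
    is also a user-set keyword or an industry hot keyword is kept as a kernel
    keyword, in order of first appearance and without duplicates."""
    kernel, seen = [], set()
    for term in product_name_terms:                 # preserves order of appearance
        if term in seen or term not in search_keywords:
            continue
        if term in user_keywords or term in industry_hot_keywords:   # rule ① or ②
            kernel.append(term)
            seen.add(term)
    return kernel

# illustrative usage with invented data
print(extract_kernel_keywords(
    ["collapsible", "silicone", "lunch box", "food container"],
    user_keywords={"lunch box"},
    industry_hot_keywords={"food container"},
    search_keywords={"lunch box", "food container", "silicone"}))
# -> ['lunch box', 'food container']
```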
In step (2), the product high-frequency dictionary is derived from product information and the industry high-frequency dictionary contains industry information; the product information must be processed for relevance by the product association processor. Product information comprises a product ID and product keywords, and industry information comprises an industry ID and industry hot keywords.
The product keywords are classified by the industry type corresponding to each product name, specifically as follows (a sketch of the classification follows this list):
(21) Match the product keywords against the industry hot keywords by word matching and determine the industry category of the product from the words they have in common.
(22) According to the determined industry category, export the product keywords to the synonym corpus and expand them using the words that the product keywords and the synonym corpus have in common.
(23) First discard product keywords that do not appear in the dictionary, then export rare product keywords and those that cannot be matched to the learning database, and export the remaining product keywords to the core word segmentation processor for preprocessing.
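A rough sketch of the industry classification in step (21), under the assumption that industry hot keywords are held in a mapping from industry ID to word list; the synonym expansion of step (22) and the filtering of step (23) are not shown.

```python
from collections import Counter

def classify_by_industry(product_keywords, industry_hot_keywords):
    """Pick the industry whose hot-keyword list shares the most words with the
    product keywords; returns None when there is no overlap at all."""
    overlap = Counter()
    for industry_id, hot_words in industry_hot_keywords.items():
        overlap[industry_id] = len(set(product_keywords) & set(hot_words))
    if not overlap:
        return None
    industry_id, score = overlap.most_common(1)[0]
    return industry_id if score > 0 else None
```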
The core word segmentation processor comprises a word segmentation processor, an affix processor, a root processor, a singular/plural processor, a tense processor, a similarity processor, a word recombination module, a keyword index database and a learning database, wherein:
the word segmentation processor splits an English product name by traversing spaces, performs segmentation on words and phrases, combines the result into <product name, keyword> sequences, and sorts them by product ID;
the affix processor takes the output of the word segmentation processor, removes prefixes and suffixes from each word, converts other word forms and derivatives into nouns, and matches the resulting nouns against the dictionary; words that cannot be matched are exported to the learning database, and for words that can be matched the <product name, keyword> sequence is updated;
the root processor takes the output of the affix processor, extracts the root of each word with a root (stemming) algorithm according to its part of speech, and matches the extracted root against the dictionary; words that cannot be matched are exported to the learning database, and for words that can be matched the <product name, keyword> sequence is updated;
the singular/plural processor takes the output of the root processor, converts plural forms to their base form, and updates the <product name, keyword> sequence;
the tense processor takes the output of the singular/plural processor, converts tensed forms to their base form, and updates the <product name, keyword> sequence;
the similarity processor, when a matched word has two or more meanings, computes the meaning with the greatest similarity;
the word recombination module takes the output of the tense processor, first discards misspelled words by checking the spelling dictionary, computing lexical distance and minimum edit distance and applying adjacent-key rules; it then obtains correctly spelled words through the learning database, combines them into data with the correct structure and stores them in a cache; finally it indexes the cached data by industry type and exports the index to the kernel keyword index database;
the keyword index database creates a kernel keyword index text file from the data in the cache; at the same time it creates an industry core word index text file for the industry core words output by the word recombination module and a search core word index text file for the search core words output by the word recombination module; the high-frequency words in the search core word bank constitute the search high-frequency dictionary described above, the high-frequency words in the industry core word bank constitute the industry high-frequency dictionary described above, and the high-frequency words in the keyword dictionary constitute the product-name high-frequency dictionary described above;
the learning database comprises four essential parts: a learner, a knowledge base, an actuator and a scorer. Data produced by the affix processor, the root processor, the product association processor and the word recombination module are exported to the learning database and first enter the learner. The learner learns from the input data in combination with the knowledge in the knowledge base: it first establishes a group of rules, then computes rule weights and variable weights, and exports the established rules and computed quantities to the knowledge base. The knowledge base performs a series of reasoning processes on the input data to acquire knowledge, where knowledge means a series of rule algorithms; if an acquired algorithm already exists in the knowledge base, the knowledge-base update conditions are checked: if they are met the knowledge base is updated, otherwise the data are returned to the learner. The actuator executes the knowledge acquired by the knowledge base, and the scorer scores the result of the execution; if the score is acceptable, the knowledge satisfies the knowledge-base update conditions.
The word segmentation processor splits an English product name by traversing spaces, as follows:
① split the product name into words at the spaces;
② remove stop words, including punctuation and special symbols, and number the remaining words 0, 1, 2, ..., N;
③ for the n-th word, match the n-th word with the (n+i)-th word: if the n-th word and the (n+i)-th word form a phrase, set n = n+1 and continue until n = N; otherwise the n-th word and the (n+i)-th word are separate words, so set i = i+1 and continue until n+i = N; here n = 0, 1, 2, ..., N and i = 1, 2, ...
Beneficial effects: compared with the prior art, the kernel keyword extraction method based on a B2B platform provided by the invention has the following advantages:
1. It has clear advantages in concurrent big-data computation: a distributed in-memory database provides high-performance, highly available and scalable data computation services; data are distributed to multiple computation service nodes and computed, administered and maintained directly in memory, with a unified access interface and an optional redundancy backup mechanism provided externally.
2. It has clear advantages in converting the various tenses of English words to their base forms: a series of algorithms converts words in various English tenses into their base forms.
3. It has clear advantages in word processing and rule-based self-learning: for the characteristics of English words and their common misspellings, a method of English spelling correction is provided.
Brief description of the drawings
Fig. 1 is a block diagram of the learning database;
Fig. 2 is an architecture diagram of the method of the invention;
Fig. 3 is a flow chart of an implementation of the method of the invention.
Embodiment
The embodiment carries out steps (1) to (3) and uses the core word segmentation processor exactly as described in the Summary of the invention above. Each component of the core word segmentation processor is described in detail below.
The word segmentation processor splits an English product name by traversing spaces and grouping phrases, following steps ① to ③ above. For example, for the product name Collapsible Silicone Lunch Box Cooker Food Container:
Splitting at the spaces gives: Collapsible / Silicone / Lunch / Box / Cooker / Food / Container.
Phrase grouping: for the word Collapsible, first judge whether Collapsible Silicone is a phrase; if it is, end this iteration and start judging from Silicone; if it is not, judge whether Collapsible Lunch is a phrase. Applying this rule yields the following table:
Table 1: word/phrase split result obtained by the word segmentation processor

Product ID   Word (keyword)    Type
1            Collapsible       0
1            Silicone          0
1            Lunch Box         1
1            Cooker            0
1            Food Container    1

In the Type column, 1 denotes a phrase and 0 denotes a single word; Lunch Box and Food Container are assumed to be phrases here (a code sketch of this traversal follows).
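The following sketch reproduces the split and phrase grouping for the example above. The phrase dictionary is an assumed stand-in for the patent's dictionary, and only adjacent word pairs are checked here, whereas the patent also matches the n-th word against later, non-adjacent words.

```python
STOP_CHARS = set(",.;:!?()[]{}'\"-/")
PHRASE_DICT = {"lunch box", "food container"}        # assumed phrase dictionary

def segment(product_name):
    """Split on spaces, drop stop characters, then pair word n with word n+1 and
    keep the pair when it is a known phrase (type 1), otherwise keep the single
    word (type 0)."""
    words = ["".join(c for c in w if c not in STOP_CHARS) for w in product_name.split()]
    words = [w for w in words if w]
    out, n = [], 0
    while n < len(words):
        if n + 1 < len(words) and f"{words[n]} {words[n+1]}".lower() in PHRASE_DICT:
            out.append((f"{words[n]} {words[n+1]}", 1))   # phrase
            n += 2
        else:
            out.append((words[n], 0))                     # single word
            n += 1
    return out

print(segment("Collapsible Silicone Lunch Box Cooker Food Container"))
# [('Collapsible', 0), ('Silicone', 0), ('Lunch Box', 1), ('Cooker', 0), ('Food Container', 1)]
```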
The affix processor takes the output of the word segmentation processor, removes prefixes and suffixes from each word and converts other word forms and derivatives into nouns; for example, pronounce is converted into pronunciation and explain into explanation. The resulting nouns are matched against the dictionary: words that cannot be matched are exported to the learning database; for words that can be matched, the <product name, keyword> sequence is updated.
The root processor takes the output of the affix processor, extracts the root of each word with a root (stemming) algorithm according to its part of speech, and matches the extracted root against the dictionary: words that cannot be matched are exported to the learning database; for words that can be matched, the <product name, keyword> sequence is updated.
The singular/plural processor takes the output of the root processor and converts plural forms to their base form as follows (a code sketch follows this list):
① For regular words, the plural is formed by adding the suffix -s, pronounced [s] after a voiceless consonant and [z] after a voiced consonant or vowel; the plural of such a word is converted to the base form by removing the final s. For example, shoes is converted to shoe.
② For regular words ending in s, z, x, ch or sh, the plural is formed by adding the suffix -es; the plural is converted to the base form by removing the final es. For example, buses is converted to bus.
③ For regular words ending in a consonant plus y, the plural is formed by changing the final y to i and adding -es; the plural is converted to the base form by changing the final ies back to y. For example, candies is converted to candy.
④ For regular words ending in o, the plural is formed by adding -es if the word is a special word or an abbreviation, and by adding -s otherwise; for the plural of such a word, first check whether it is a special word or an abbreviation and remove the final es if so, otherwise remove the final s. For example, tomatoes is converted to tomato.
⑤ A special dictionary stores words with irregular singular/plural forms, such as piano, photo, roof, affix, fish, men, child; in all of the cases below the word is first matched against the special dictionary, and if it matches, it is converted (or left unchanged) according to that entry.
⑥ For regular words ending in a consonant plus f or fe, the plural is formed by changing f or fe to ves; the plural is converted to the base form by changing the final ves back to f or fe. For example, knives is converted to knife.
⑦ For regular words ending in is, the plural is formed by changing is to es; the plural is converted to the base form by changing the final es back to is. For example, axes is converted to axis.
⑧ For regular words ending in ix, the plural is formed by changing ix to ices; the plural is converted to the base form by changing the final ices back to ix. For example, appendices is converted to appendix.
⑨ A semantic analyzer is used for special words ending in s that are not plurals: if the analysis shows the word has no plural form, the form is kept as-is, otherwise the s is removed. For example, goods is interpreted as goods.
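A rough sketch of rules ① to ⑨. The special dictionary and the uncountable-word list are illustrative assumptions rather than the patent's actual dictionaries, and the dictionary check needed for rule ⑦ (axes to axis) is omitted.

```python
SPECIAL_PLURALS = {"men": "man", "children": "child", "fish": "fish",
                   "tomatoes": "tomato", "heroes": "hero"}   # assumed entries (rules ④/⑤)
UNCOUNTABLE = {"goods", "news"}                              # assumed entries (rule ⑨)

def to_singular(word):
    """Convert a plural form to its base form using the suffix rules above."""
    w = word.lower()
    if w in UNCOUNTABLE:                      # rule ⑨: e.g. 'goods' stays as-is
        return w
    if w in SPECIAL_PLURALS:                  # rules ④/⑤: special dictionary first
        return SPECIAL_PLURALS[w]
    if w.endswith("ices"):                    # rule ⑧: appendices -> appendix
        return w[:-4] + "ix"
    if w.endswith("ves"):                     # rule ⑥: knives -> knife
        return w[:-3] + "fe"
    if w.endswith("ies"):                     # rule ③: candies -> candy
        return w[:-3] + "y"
    if w.endswith(("ses", "zes", "xes", "ches", "shes")):
        return w[:-2]                         # rule ②: buses -> bus
    if w.endswith("s"):                       # rule ①: shoes -> shoe
        return w[:-1]
    return w

print([to_singular(w) for w in ["shoes", "buses", "candies", "tomatoes",
                                "knives", "appendices", "goods"]])
# ['shoe', 'bus', 'candy', 'tomato', 'knife', 'appendix', 'goods']
```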
The tense processor takes the output of the singular/plural processor and converts tensed word forms to their base form. The invention handles the simple present tense and models the way word forms change over time as a tree of branching possibilities, divided into the following cases:
① an algorithm handles the simple past tense, i.e. changes anchored at a point in the past;
② an algorithm handles the simple future tense, i.e. changes anchored at a point in the future;
③ an algorithm handles the future-in-the-past tense, i.e. changes anchored at a future point seen from the past.
The invention defines Always, Sometime and Until as temporal-logic operators that control the processing logic: when the program matches the word Always, the Always token is removed and the remaining content is processed; when it matches Sometime, the context of Sometime is taken out and processed according to the logic; when it matches Until, the handling is the same as for Sometime. A minimal dispatch sketch follows.
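A minimal dispatch sketch for the Always/Sometime/Until operators, assuming the input is already tokenised; normalize_tense is a hypothetical placeholder for the tense-to-base-form step, not the patent's algorithm.

```python
OPERATORS = {"always", "sometime", "until"}

def normalize_tense(token):
    # placeholder: strip a regular past-tense suffix; the real handling follows the rules above
    return token[:-2] if token.lower().endswith("ed") and len(token) > 4 else token

def handle_temporal(tokens):
    """Remove the temporal operator tokens and pass the remaining words on for
    tense normalisation, as described in the dispatch above."""
    rest = [t for t in tokens if t.lower() not in OPERATORS]
    return [normalize_tense(t) for t in rest]

print(handle_temporal(["Always", "Packed", "Boxes"]))   # -> ['Pack', 'Boxes']
```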
The similarity processor is used when a matched word has two or more meanings; it computes the meaning with the greatest similarity. For example, for the matched word park we cannot tell whether it denotes a park or a parking lot; the similarity processor determines the intended meaning.
The invention adopts a high-dimensional processing rule based on a vector-space word similarity algorithm; a weight matrix is added and feature vectors are extracted in order to reduce the complexity of the similarity computation and improve efficiency.
Specifically, T denotes a feature item, i.e. a basic language unit that appears in a text D and can represent the content of D, so that the text D can be represented by its set of feature items, D(T1, T2, ..., Tk, ..., Tn). For example, a text with the four feature items a, b, c and d can be written as D(a, b, c, d). For a text with n feature items, each feature item is usually given a weight representing its importance, i.e. D = D(T1 w1, T2 w2, ..., Tk wk, ..., Tn wn), abbreviated D = DW = D(w1, w2, ..., wk, ..., wn), which is called the vector representation of the text D, where wk is the weight of Tk. In the example above, if the weights of a, b, c and d are 30, 20, 20 and 10 respectively, the text vector is D = D(30, 20, 20, 10). In the vector space model, the content correlation Sim(D1, D2) between two texts D1 and D2 is expressed by the cosine of the angle between their vectors:

Sim(D1, D2) = cos θ = ( Σ_{k=1}^{n} w_{1k} · w_{2k} ) / sqrt( ( Σ_{k=1}^{n} w_{1k}² ) · ( Σ_{k=1}^{n} w_{2k}² ) )

The larger the value of Sim(D1, D2), the greater the similarity between D1 and D2. A code sketch of this computation follows.
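A direct sketch of the cosine similarity above applied to two weight vectors; the first example vector uses the illustrative weights from the text, the second is an assumed comparison vector.

```python
import math

def cosine_similarity(w1, w2):
    """Sim(D1, D2): w1 and w2 are the weight vectors (w11, ..., w1n) and (w21, ..., w2n)."""
    num = sum(a * b for a, b in zip(w1, w2))
    den = math.sqrt(sum(a * a for a in w1)) * math.sqrt(sum(b * b for b in w2))
    return num / den if den else 0.0

print(cosine_similarity([30, 20, 20, 10], [25, 20, 15, 0]))
```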
The word recombination module takes the output of the tense processor and first discards misspelled words by checking the spelling dictionary, computing lexical distance and minimum edit distance and applying adjacent-key rules; it supplies the correct spelling and builds an unambiguous misspelling dictionary. During spell checking the misspelling dictionary is consulted: a match indicates a spelling error, and the word is corrected. The misspelling dictionary is built as follows.
The total number of occurrences of all words in the training sample is counted; the frequency of each word in the corpus is computed from its number of occurrences, giving the prior probability. For a word that does not occur in the corpus, smoothing assigns 1/N, where N is the total number of word occurrences in the training sample.
The conditional probability is taken as 1/M, where M is the number of all possible candidate words; for example, each edit-distance-1 conjecture for the word light has conditional probability 1/290, there being about 290 possible conjectures.
The 26 letters are represented as a matrix, and the distance between letters on the keyboard is computed algorithmically.
The conditional probability p(D|h), i.e. the probability that the hypothesised word h produced the observed input D, is computed using the notion of edit distance, enumerating all candidate edits at distance 1.
By Bayes' rule, the posterior probability is independent of the generation probability p(D) of the input, so p(h|D) ∝ P(h) × p(D|h), from which the most probable spelling is computed. A minimal sketch of this correction step follows.
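A minimal sketch of the Bayesian correction step, assuming a uniform conditional probability over all edit-distance-1 candidates (the keyboard-distance weighting described above is not modelled); the toy corpus is an assumption.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def train(corpus_words):
    """Prior P(h): word frequencies from the training corpus."""
    return Counter(corpus_words)

def edits1(word):
    """All candidates at edit distance 1 (deletes, transposes, replaces, inserts)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word, freq):
    """p(h|D) ∝ P(h) · p(D|h); with uniform p(D|h) over the edit-distance-1
    candidates, the most probable correction is the most frequent candidate."""
    n = sum(freq.values())
    candidates = [w for w in edits1(word) if w in freq] or [word]
    return max(candidates, key=lambda w: freq.get(w, 1) / n)

freq = train("light light lights like tight night".split())   # toy corpus
print(correct("lighr", freq))   # -> 'light'
```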
English words carry a great deal of redundancy and exhibit the following statistical properties:
① the first letter carries important information about the word;
② in most misspellings, the set of distinct letters in the word is unlikely to change;
③ the sequence of distinct consonants is more characteristic than the sequence of distinct vowels;
④ doubled letters have a higher probability of being misspelled, e.g. insertion -> insertrion.
Based on these statistical properties, the fault-tolerant function is constructed as:
first letter + original sequence of distinct vowels + original sequence of distinct consonants
Described formally:
Let the letter set be Σ = { 'a', 'b', ..., 'z', 'A', 'B', ..., 'Z' }.
An English word is written L1 L2 ... Lm, where Li ∈ Σ (1 ≤ i ≤ m) and m is the word length.
The vowels are V = { 'a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U' };
the consonants are C = Σ − V.
The first letter of the word is
fLetter = L1.
The sequence of distinct vowel letters in the word is V_seq, with Vm vowel letters; the sequence of distinct consonant letters is C_seq, with Cm consonant letters.
The fault-tolerant function is fLetter + V_seq + C_seq. If the fault-tolerant value of a dictionary word Wi is Si and the fault-tolerant value of the word being checked is Sp, then when Sp = Si the dictionary word corresponding to Si is the correction. A sketch of this key function follows.
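A sketch of the fault-tolerant key fLetter + V_seq + C_seq; matching a misspelling's key against the keys of dictionary words yields the correction candidate.

```python
VOWELS = set("aeiou")

def fault_tolerant_key(word):
    """First letter, then the distinct vowels and distinct consonants of the rest
    of the word, each in order of first appearance."""
    word = word.lower()
    first, rest = word[0], word[1:]
    seen_v, seen_c, v_seq, c_seq = set(), set(), [], []
    for ch in rest:
        if ch in VOWELS:
            if ch not in seen_v:
                seen_v.add(ch); v_seq.append(ch)
        elif ch.isalpha():
            if ch not in seen_c:
                seen_c.add(ch); c_seq.append(ch)
    return first + "".join(v_seq) + "".join(c_seq)

# a misspelling and its dictionary word share the same key, so the word is the correction
print(fault_tolerant_key("insertrion") == fault_tolerant_key("insertion"))  # True
```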
The average number of scans required by this method is as follows: when a misspelling is converted to its fault-tolerant value, if the word length is m, all letters except the first must be scanned, i.e. m − 1 scans.
The average number of comparisons required by this method is as follows: generating the fault-tolerant value requires building the distinct-vowel sequence and the distinct-consonant sequence. Excluding the first letter, the word contains Vm vowel letters and Cm consonant letters, with Vm + Cm = m − 1. The average number of comparisons is computed as:

CT = (1 + 2 + ... + Vm) + (1 + 2 + ... + Cm)
   = [ (m − 1)² + (m − 1) ] / 2 − Vm·Cm
   ≤ [ (m − 1)² + (m − 1) ] / 2

If the average length of an English word is taken as 7, the average number of comparisons is therefore at most 21.
The training corpus is selected as follows: the original large-scale corpus is taken as input; each correct vocabulary entry is first evaluated automatically and given a score; the entries are then ranked by quality score, coverage is taken into account on top of the quality score, and a set is chosen dynamically from the original corpus and output as the training corpus.
The framework has two parts: quality assessment and coverage-based corpus extraction.
For the quality assessment part, a relatively small high-quality training set is chosen from the existing corpus. Quality is considered first; high quality is defined by the following conditions:
both the source sentence and the target sentence are fluent;
the source sentence and the target sentence follow the translation rules accurately.
For the quality evaluation, Q(f, e) denotes the quality of the text pair (f, e):

log Q(f, e) = Σ_{i=1}^{k} Wi · log(Pi)

where k is the number of features integrated into the model, e is the source sentence, f is the target sentence, and Wi is the weight of the corresponding feature; the weights can be obtained automatically on a manually constructed training set. When k = 5, P1 to P5 are, in order, Pdic(f, e), PLM(e), PLM(f), PTM(f|e) and PTM(e|f).
Coverage is measured by three indicators: word coverage, n-gram coverage and translation coverage.
The first pair in the candidate training corpus is taken as the first element of the selected corpus subset; the remaining pairs are then scanned in order, and if the current sentence pair contains a phrase translation that is new to the selected subset, the pair is preferentially added to the subset. A code sketch of this selection follows.
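A rough sketch of the quality scoring and coverage-based selection, assuming each candidate sentence pair is a dict with precomputed feature probabilities and its phrase translations; the feature models (dictionary, language and translation models) are not implemented here.

```python
import math

def quality_score(features, weights):
    """log Q(f, e) = Σ_i Wi · log(Pi); features and weights are aligned lists of
    the k positive feature probabilities and their weights."""
    return sum(w * math.log(p) for w, p in zip(weights, features))

def select_training_subset(candidates, weights, size):
    """Rank sentence pairs by quality score, then greedily keep a pair only when
    it contributes a phrase translation not yet covered."""
    ranked = sorted(candidates,
                    key=lambda c: quality_score(c["features"], weights),
                    reverse=True)
    covered, subset = set(), []
    for cand in ranked:
        if set(cand["phrases"]) - covered:          # adds a new phrase translation
            subset.append(cand)
            covered.update(cand["phrases"])
        if len(subset) >= size:
            break
    return subset
```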
The learning database consists of four essential parts: the learner, the knowledge base, the actuator and the scorer; the relationship between them is shown in Fig. 1.
The purpose of the learner is to evaluate the hypotheses recommended by the learning step. It is made up of a series of rules; when an algorithm requiring many rules is executed, the learner takes knowledge out of the knowledge base, obtains it rule by rule, and finally has it executed by the actuator.
The purpose of the knowledge base is to represent knowledge: it stores feature vectors, rule-based algorithms, production rules, processing functions, semantic networks and frames; it is modifiable and extensible, it is also a metadata knowledge base, and it follows a model-based approach.
The purpose of the actuator is to carry out operations according to a group of logic rules: when the learner acquires a new piece of knowledge, the actuator executes it and finally updates and maintains the knowledge base.
The purpose of the scorer is to compute document scores in the knowledge base in real time when the user retrieves, based on the frequency with which a keyword occurs in a document; the scores of all documents depend on the keyword, the operation is performed in real time, and the more frequently a keyword occurs in the knowledge base the higher its score; the vector space model and the Boolean model of information retrieval are used in combination.
The learner adopts a group of learning models in which a complex nonlinear model is composed from a group of linear constraints, using the Gradient Boost framework; GBDT is a very widely used algorithm that can perform classification and regression and works well on many kinds of data. Each iteration aims to reduce the residual of the previous one, and to eliminate the residual a new model is built along the gradient direction in which the residual decreases. Thus, in Gradient Boost, each new model is built so as to move the residual of the previous models along the gradient direction, which differs greatly from traditional Boost methods that re-weight correctly and incorrectly classified samples. The specific algorithm is described in the TreeBoost paper "TreeBoost.MH: A boosting algorithm for multi-label hierarchical text categorization" (2006), by Andrea Esuli, Tiziano Fagni and Fabrizio Sebastiani, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Via G. Moruzzi 1, 56124 Pisa, Italy. The procedure is as follows (a short sketch follows this list):
1) give an initial value; here the processed core words are used as the initial value;
2) build M decision trees (M iterations);
3) apply the logistic transformation to the function estimate F(x);
4) perform K-class classification on the vectors; each sample point xi corresponds to K possible classes yi, so yi, F(xi) and p(xi) are K-dimensional vectors;
5) find the gradient direction along which the residual decreases;
6) for each sample point x, build a decision tree with J leaf nodes along the gradient direction of its decreasing residual;
7) after the decision tree has been built, obtain the gain of each leaf node from the final formula.
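The patent gives no implementation of this learner; as a hedged illustration only, the following uses scikit-learn's gradient-boosted trees with M estimators and J leaf nodes per tree, with placeholder features X and labels y standing in for the unspecified core-word feature extraction.

```python
from sklearn.ensemble import GradientBoostingClassifier

def build_learner(X, y, M=100, J=8):
    """Fit a gradient-boosted tree ensemble on (feature vector, class) pairs."""
    model = GradientBoostingClassifier(n_estimators=M,      # step 2: M trees / iterations
                                       max_leaf_nodes=J,    # step 6: J leaf nodes per tree
                                       learning_rate=0.1)
    model.fit(X, y)   # steps 3-7: logistic loss, gradient fitting, leaf gains
    return model
```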
The scorer computes its score value with the following formula:

score(q, d) = coord(q, d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) )

tf(t in d) is the term frequency, i.e. the number of times the term t occurs in the document d.
idf(t) is the inverse document frequency; the document frequency docFreq is the number of documents that contain the term t, and the smaller docFreq is, the higher idf is; the value is the same within a single query.
coord(q, d) is a scoring factor based on how many of the query terms occur in the document: the more query terms a document hits, the larger coord is, indicating a higher degree of match for that document; by default it is the percentage of query terms that occur.
queryNorm(q) is a query normalisation factor that makes scores from different queries comparable. It does not affect the ranking of documents, because all documents are multiplied by the same factor; the larger this value, the greater its influence on the overall score (a default value is used).
The norm value is encoded into a single byte and stored in the index database when the index is built; when it is read back, it is decoded into a float value. A code sketch of a simplified version of this score follows.
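A simplified sketch of the scoring formula above, assuming the document is a list of tokens and an index dict supplies document counts; boosts and norms are taken as 1 and queryNorm is omitted since it is constant within a query.

```python
import math

def score(query_terms, doc, index):
    """score(q, d) ≈ coord(q, d) · Σ_t tf(t in d) · idf(t)²  (boost = norm = 1)."""
    matched = [t for t in query_terms if t in doc]
    coord = len(matched) / len(query_terms)                      # coord(q, d)
    total = 0.0
    for t in matched:
        tf = math.sqrt(doc.count(t))                             # tf(t in d)
        idf = 1 + math.log(index["doc_count"] / (index["doc_freq"].get(t, 0) + 1))
        total += tf * idf ** 2
    return coord * total

index = {"doc_count": 1000, "doc_freq": {"silicone": 40, "box": 200}}  # assumed index stats
print(score(["silicone", "box"], ["collapsible", "silicone", "lunch", "box"], index))
```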
Through the similarity computation rules, this patent builds a relationship network between texts in order to predict the score of an existing text u on an item i ∈ I: first the nearest-neighbour set Tu of u is found, and the prediction is then made from the scores that the texts in Tu have given to i.
Let A = Rx ∩ Ry denote the set of items scored in common by documents x and y. Because different score values express different degrees of correlation, and because items that are not scored by both documents are not comparable when the similarity of the two documents is compared, the similarity difference between x and y is represented by the absolute value of the difference of their scores on the items in A, denoted Dxy.
Three cases are considered for the score difference:
when |Rx ∩ Ry| = 0, x and y have no commonly scored documents, so there is no similarity between them and the similarity difference is 0;
when Dxy = 0 and |Rx ∩ Ry| ≠ 0, x and y have commonly scored documents and their similarity is determined by their common neighbour nodes;
when Dxy ≠ 0 and |Rx ∩ Ry| ≠ 0, their similarity is determined by their common neighbour nodes together with the distance Dxy.
The knowledge base obtains external information by searching the environment, then acquires knowledge through reasoning processes such as analysis, synthesis, analogy and induction, and stores this knowledge in the knowledge base.
The invention uses a new subject-oriented association-rule mining method together with the learner. The knowledge base is updated within the scope of a naming logic: T denotes the knowledge base, which is a finite set of propositional formulas, p denotes the new knowledge to be added, and T ∘ p denotes adding the new knowledge p to the knowledge base.
Update method: let T be a knowledge base satisfying the conditions and p a new piece of knowledge satisfying the conditions, and let

W(p, T) = { T' ⊆ T : T' ⊭ ¬p, and T' ⊂ S ⊆ T ⇒ S ⊨ ¬p }

i.e. W(p, T) is the family of all maximal subsets of T that are consistent with p. Then:

T ∘ p = { T' ∪ {p} | T' ∈ W(p, T) }

The more closely the pieces of knowledge in the knowledge base are connected, the more complicated it is to detect and eliminate contradictions when new knowledge is added, so the connections between pieces of knowledge must be constrained.
First the constraint graph of T ∪ {p} is constructed. The constraint graph may have several connected components, and adding p can only conflict with formulas in the component that contains p; the other components are unaffected, so only the component containing p needs to be considered. If this component is a tree, p belongs to the updated knowledge base, and p is treated as the root node.
Consider a subtree in the tree rooted at p: let its root be the formula R, connected to the subtrees T1, T2, ..., Tk through the shared variables C1, C2, ..., Ck. For one group of values of C1, C2, ..., Ck, the whole tree can be regarded as composed of the independent parts R, T1, T2, ..., Tk, and the union of the deletion sets of these parts is the deletion set of the whole tree for that group of values of C1, C2, ..., Ck. By traversing all value assignments, the deletion set for each particular assignment is obtained, and the deletion set of the whole tree can thus be obtained from the deletion sets of the subtrees.
Therefore, for a satisfiable knowledge base T and a satisfiable new piece of knowledge p, when the constraint graph of T ∪ {p} is a tree, the required T ∘ p can be computed; the complexity of updating the knowledge base is therefore related to the structure of the knowledge base.
The actuator treats a single-input single-output process with the model

y(k+1) = f[ y(k), y(k−1), ..., y(k−P+1), u(k), u(k−1), ..., u(k−Q) ]

where y is the output, u is the input, k is the discrete time index, P and Q are positive integers, and f[·] is a function.
The amplitude of the actuator input u is bounded: there exist a lower limit um and an upper limit uM such that for any k

um ≤ u(k) ≤ uM

Assuming that the described process object is invertible and that write-back documents exist in the knowledge base, there exists a function g[·], and this patent takes

u(k) = g[ y(k+1), y(k), ..., y(k−P+1), u(k−1), u(k−2), ..., u(k−Q) ]

as the inverse model of the object. Further, if the input is an m-dimensional vector Xc and the output is Uc, the input-output relation is expressed as

Uc = ψc(Xc)

where ψc is the input-output mapping: documents can be input from the knowledge base into the actuator and then written back to the knowledge base by the actuator.
If the output of ψc(·) approaches the output of g(·), ψc can be regarded as the inverse model of the actuator. At time k, suppose the input Xc(k) is

Xc(k) = [ r(k+1), y(k), ..., y(k−p+1), u(k−1), ..., u(k−q) ]^T

where the unknown y(k+1) is replaced by the given reference input r(k+1), and p and q are estimates of P and Q respectively.
The actuator NC in the knowledge base then has the output

uc(k) = ψc[ r(k+1), y(k), ..., y(k−p+1), u(k−1), ..., u(k−q) ]

When documents are input from the knowledge base to the actuator and output from the scorer to the actuator, and training is sufficient to keep the output deviation e(k) = r(k) − y(k) very small, then

Xc(k) = [ r(k+1), r(k), ..., r(k−p+1), u(k−1), ..., u(k−q) ]^T

i.e. y(t) is replaced by r(t). Because the output must be passed from the learner to the actuator, this formula reflects a feed-forward characteristic, and in general the deviation function J defined from the deviation e(k) = r(k) − y(k) of the object output is required to be minimised. To make the computation accurate, the differential of the text scorer with respect to the output knowledge base must also be computed; once it is available, the weight coefficients of NC can be improved by the BP algorithm. On this basis, direct inverse execution, direct adaptive execution and indirect adaptive execution structures are considered.
In the actuator, the invention improves the cycle of taking documents out of the knowledge base, scoring them with the scorer and writing them back; each extraction cycle performs one learning step, which reduces the learning time. With real-time learning, the learning cycle is TL; because TL is determined only by the program time, the general learning cycle TL is much smaller than the cycle Ts for extracting samples from the scorer and the knowledge base.
The invention is further explained below with reference to an example.
The invention combines the access behaviour of users on the e-commerce platform and extracts kernel keywords from the users' product keywords, the top industry search terms and the industry keywords; it optimises the users' search information and keyword settings, expands industry information into a high-quality ranking, and achieves a self-learning effect in the core search library and in each word processing module, thereby providing the most essential keywords for the products.
Fig. 2 shows the architecture of the invention, which comprises a data source module, a data storage module and a statistics and mining module. The data source module stores basic data such as web logs, product information, search information and industry information, and serves as the data source for data analysis and data mining; it comprises a web log unit, a product information unit and an industry information unit. The web log unit stores web log information, including users' visit records and search records; the product information unit stores the various product data; the industry information unit stores the various industry data. The data storage module stores the data produced by each processor and, after cleaning and filtering, generates several intermediate databases and relational databases in the data mart unit. The data storage module consists of a data warehouse unit, a cleaning and filtering unit and a data mart unit, connected in sequence. The data warehouse unit stores the data processed by the ETL processor.
Fig. 3 shows the implementation flow of the invention, which comprises the following steps:
Step 1: the site search logs and product information of the B2B e-commerce platform website are used as the data source for keywords and their related terms.
In detail:
(1) The product name attributes in the web logs, product information, industry keywords and search keywords of the B2B e-commerce platform website are selected as the data source.
(2) The extracted data are processed by ETL to form the keywords and related usage information in user search behaviour and product information, including the keywords in the web logs, the search time, the client IP of the search, the product keywords and so on, and are stored in the data warehouse.
Here ETL refers to extracting data from distributed, heterogeneous data sources, cleaning and transforming them in a temporary intermediate layer, integrating them and finally loading them into the data warehouse or data mart, where they become the basis of on-line analytical processing and data mining.
(3) The product keyword data are processed by the word filter, which removes non-word characters (a small sketch follows these steps):
1. read one record (one line) of data, count the words, and split the line at the spaces;
2. examine each character in turn: a space is not counted, while a non-space character belonging to a word is counted;
3. when the previous character is a space and the following one is a letter, a new word is counted.
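A small sketch of the word-filter step, assuming that "non-word letters" means any non-alphabetic character.

```python
def count_words(line):
    """Split a record on whitespace, drop non-letter characters, and count the words."""
    words = []
    for token in line.split():
        cleaned = "".join(ch for ch in token if ch.isalpha())
        if cleaned:
            words.append(cleaned)
    return len(words), words

print(count_words("Collapsible Silicone Lunch-Box  (Cooker) Food Container"))
# (6, ['Collapsible', 'Silicone', 'LunchBox', 'Cooker', 'Food', 'Container'])
```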
Step 2: the product dictionary is derived from the product information, and the industry-related dictionary contains the industry information; the product association processor performs relevance processing on the product information by industry and classifies the keywords corresponding to each product by industry type.
(1) The product keywords are matched against the industry keywords by word matching, and the class of the keywords corresponding to the product is determined from the words they have in common.
(2) After the product keywords have been matched against the industry words, the matching result is exported to the synonym corpus for processing, and the matching range is expanded using the words that occur both in the keywords and in the corpus.
(3) Keywords that do not appear in the product dictionary are deleted, and the data are returned to the relational database.
The product information data comprise the product ID and the product keywords; the industry keyword database comprises the industry ID and the industry keywords. It is also used to store the intermediate data produced during the statistics and mining process.
Step 3: after the product processor, if a keyword of the product is a rare word it must enter the learning database for processing; otherwise it goes directly to the word segmentation processor.
In detail:
(1) Keywords that do not appear in the product dictionary are deleted, which simplifies the intermediate product information database.
(2) It is judged whether the current keyword is a rare word or a keyword that cannot be matched; if so it is exported to the learning database for processing, otherwise it is exported to the word segmentation processor.
Step 4: the segmenter receives the data from the product association processor together with the combined related dictionaries and judges whether each item is an English word; English words are split by traversing spaces and combined into <product name, keyword> sequences.
(1) Each <product name, keyword> pair is sorted by product ID and stored in the data cache, forming the complete product-related dictionary.
(2) If an item is judged not to be an English word, it is exported to the learning database for learning; if it is an English word, the product information is processed in a traversal loop.
Step 5: affixe processor accepts the data from segmenter, closes input affixe place by a group data set
Reason device, if there is the word that cannot match with dictionary after the process of affixe processor, then exports learning database to, otherwise is combined to form < name of product, keyword > sequence.
Step 6: export result to root process device after affixe processor processes, root process device accepts to process from the data of affixe processor, data are mated according to dictionary again, input to similarity processor more afterwards, calculate the root of maximum similarity, finally return results again, result is combined to form < name of product, keyword > sequence.
Step 7: After the root processor finishes, the result sequence is exported to the singular/plural processor. When a plural form is received, each letter of each word is processed in a loop; if an exception occurs, the word is returned to the root processor, and on success the data are combined to form a <product name, keyword> sequence.
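The singular/plural conversion of Step 7 can be sketched with a few regular English plural rules and a small irregular table; these rules are illustrative assumptions:

    IRREGULAR = {"feet": "foot", "men": "man", "children": "child"}

    def to_singular(word: str) -> str:
        w = word.lower()
        if w in IRREGULAR:
            return IRREGULAR[w]
        # Regular plural patterns, checked from the most specific ending down.
        if w.endswith("ies") and len(w) > 4:
            return w[:-3] + "y"           # batteries -> battery
        if w.endswith(("ches", "shes", "xes", "sses")):
            return w[:-2]                 # boxes -> box, glasses -> glass
        if w.endswith("s") and not w.endswith("ss"):
            return w[:-1]                 # lamps -> lamp
        return w                          # already singular (prototype form)

    # to_singular("Batteries") -> "battery"; to_singular("glass") -> "glass"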
Step 8: The singular/plural processor exports the data sequence to the tense processor. After receiving the data, the tense processor determines which tense a word belongs to according to temporal-logic state samples and calls the corresponding handler; depending on the type, one of the four kinds of processing Always, Sometime, Until and Next is invoked.
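The tense dispatch of Step 8 can be sketched as a handler table keyed by the four types; the handlers below simply reduce regular inflections to a prototype, which is an assumption about their internals:

    def handle_always(word):    # simple/habitual forms: pass through
        return word

    def handle_sometime(word):  # past forms: strip a regular -ed ending
        return word[:-2] if word.endswith("ed") and len(word) > 4 else word

    def handle_until(word):     # perfect/participle forms, treated like -ed here
        return handle_sometime(word)

    def handle_next(word):      # progressive/future forms: strip a regular -ing ending
        return word[:-3] if word.endswith("ing") and len(word) > 5 else word

    TENSE_HANDLERS = {
        "Always": handle_always,
        "Sometime": handle_sometime,
        "Until": handle_until,
        "Next": handle_next,
    }

    def to_prototype(word: str, tense: str) -> str:
        # Dispatch to the handler registered for the detected tense type.
        return TENSE_HANDLERS.get(tense, handle_always)(word)

    # to_prototype("printing", "Next") -> "print"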
Step 9: The output of the tense processor enters the word recombination module. First, the words are checked against the spelling dictionary, lexical distance and minimum edit distance are computed, and similar-key rules are applied to reject misspelled words. Then, through processing by the learning database, the corresponding correct spellings are provided and the data are reassembled into the original data structure and stored in the cache. Finally, according to the different types, indexes are built over the data in the cache and exported to the index database.
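The minimum edit distance used in Step 9 for rejecting and correcting misspelled words can be computed with the standard dynamic-programming recurrence; the distance threshold used below is an assumed value:

    def edit_distance(a: str, b: str) -> int:
        # Levenshtein distance via dynamic programming, row by row.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def correct_spelling(word, dictionary, max_distance=2):
        # Repair a misspelled word by picking the closest dictionary entry
        # within an assumed distance threshold; otherwise reject it.
        best = min(dictionary, key=lambda entry: edit_distance(word, entry))
        return best if edit_distance(word, best) <= max_distance else None

    # correct_spelling("machne", ["machine", "matches"]) -> "machine"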
Step 10: The word recombination module exports the data to the kernel keyword index database, where the data in the cache are created as kernel keyword index text files. If the type output by the word recombination module is an industry core word, an index text file of industry core words is built; if the output type is a search core word, an index text file of search core words is built.
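Index creation in Step 10 can be sketched as writing one inverted-index text file per output type; the file names and the cache layout are illustrative assumptions:

    from collections import defaultdict

    def build_index_files(cache):
        # cache: iterable of (output_type, keyword, product_id) triples, where
        # output_type is "industry" or "search".
        indexes = {"industry": defaultdict(set), "search": defaultdict(set)}
        for output_type, keyword, product_id in cache:
            indexes[output_type][keyword].add(product_id)
        for output_type, index in indexes.items():
            # One kernel-keyword index text file per type (hypothetical names).
            with open(f"{output_type}_core_word_index.txt", "w", encoding="utf-8") as fh:
                for keyword in sorted(index):
                    ids = ",".join(map(str, sorted(index[keyword])))
                    fh.write(f"{keyword}\t{ids}\n")

    # build_index_files([("industry", "lamp", 101), ("search", "led lamp", 101)])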
Step 11: When the affix processor, product correlation processor, root processor and word recombination module export data to the learning database, the data first enter the learner for learning: a set of rule bases is established, rule weights and variable weights are computed, the input/output space of the model is determined from the data, and the degree of matching is measured. The number of unmatched models must be limited; if no corresponding rule can be found, the relevance of that rule is reduced. The learning database exports the rule data to the knowledge base; when the knowledge base receives data from the learning database, it acquires knowledge through a series of reasoning processes and stores this knowledge in the knowledge base.
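The rule-weight bookkeeping of the learner in Step 11 can be sketched as follows; the initial weight, the decay factor and the unmatched-model limit are assumptions, since the step only states that rule weights are computed and that the relevance of an unmatched rule is reduced:

    class Learner:
        def __init__(self, decay=0.9, max_unmatched=100):
            self.rules = {}              # rule -> weight
            self.unmatched = 0
            self.max_unmatched = max_unmatched
            self.decay = decay

        def observe(self, rule_key, matched: bool):
            # Establish the rule on first sight, then adjust its weight:
            # reinforce on a match, decay (reduce relevance) otherwise.
            weight = self.rules.setdefault(rule_key, 1.0)
            if matched:
                self.rules[rule_key] = weight + 1.0
            else:
                self.rules[rule_key] = weight * self.decay
                self.unmatched += 1
            # Limit the number of unmatched models, as required in Step 11.
            if self.unmatched > self.max_unmatched:
                self.rules = {k: w for k, w in self.rules.items() if w >= 1.0}
                self.unmatched = 0

    # learner = Learner(); learner.observe("suffix:-ing", matched=True)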
Step 12: If the current knowledge already exists in the knowledge base, it is re-examined to check whether it meets the update condition; if it does, the knowledge base is updated, otherwise the data are returned to the learning database.
Step 13: After the learning database finishes processing the data, if a single data item yields multiple results, the data are first exported to the scorer. The whole scoring process is performed by the executor: the executor first fetches the documents from the knowledge base, scores them with the scorer, and then writes them back to the knowledge base.
The above is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, several improvements and modifications can also be made without departing from the principles of the invention, and these improvements and modifications should likewise be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A kernel keyword extraction method based on a B2B platform, characterized by comprising the following steps:
(1) Product names set by users on the B2B platform, search words and industry hot words are taken as dictionary sources; the dictionary sources are pre-processed and saved in the data mart to form the product-name core word bank; the dictionary sources are pre-processed as follows:
For user-set product names, the principle of high-frequency use of user-set product names is first applied, and user-set product names with a low number of uses are rejected; the keywords of the remaining user-set product names are then saved in the user-set keyword database;
For search words, stop words including punctuation and special symbols are first filtered out; then, under the principle of high-frequency use of search words, search words used infrequently in the most recent half year are rejected; finally, pre-processing is carried out by the core word segmentation processor, and the resulting search keywords are saved in the search high-frequency dictionary;
For industry hot words, classified by industry, stop words including punctuation and special symbols are first filtered out; then, under the principle of high-frequency use of industry hot words, industry hot words with a low number of uses are rejected; finally, pre-processing is carried out by the core word segmentation processor, and the resulting industry hot keywords are saved in the industry high-frequency dictionary;
(2) For all valid product names on the current site, stop words including punctuation and special symbols are first filtered out; pre-processing is then carried out by the core word segmentation processor, and the resulting product names are saved in the product high-frequency dictionary;
(3) The product names in the product high-frequency dictionary are matched against the product-name core word bank; the matched product names are de-duplicated and output in the order in which they occur in the product name, one record per product name, and saved in the data mart to form the kernel keywords of the product names; the matching rules are:
1. a search keyword occurs in the product name, and this search keyword is a user-set keyword;
2. a search keyword occurs in the product name, and this search keyword is an industry hot keyword;
a search keyword occurring in a product name and satisfying either of the above matching rules is defined as a kernel keyword of that product name.
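As an illustrative sketch of the matching rules of step (3) above (not part of the claim itself), the kernel keywords of a product name could be selected as follows; the three word sets are hypothetical inputs standing for the search high-frequency dictionary, the user-set keyword database and the industry high-frequency dictionary, and only single-word keywords are handled for brevity:

    def kernel_keywords(product_name, search_keywords, user_keywords, industry_keywords):
        # A search keyword occurring in the product name is a kernel keyword if it
        # is also a user-set keyword (rule 1) or an industry hot keyword (rule 2);
        # results keep the order in which the words occur in the product name.
        result = []
        for word in product_name.lower().split():
            if word in result:
                continue                       # de-duplicate
            if word in search_keywords and (word in user_keywords
                                            or word in industry_keywords):
                result.append(word)
        return result

    # kernel_keywords("Stainless Steel Water Bottle",
    #                 search_keywords={"bottle", "steel", "mug"},
    #                 user_keywords={"bottle"},
    #                 industry_keywords={"steel"})
    # -> ["steel", "bottle"]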
2. The kernel keyword extraction method based on a B2B platform according to claim 1, characterized in that: in said step (2), the product high-frequency dictionary is derived from the product information, the industry high-frequency dictionary contains the industry information, and the product information must be correlated by the product correlation processor; the product information comprises the product ID and the product keywords, and the industry information comprises the industry ID and the industry hot keywords;
The product keywords corresponding to each product name are classified by industry type, specifically comprising the following steps:
(21) The product keywords are matched against the industry hot keywords by word matching, and the industry class to which the product belongs is determined from the features they have in common;
(22) According to the determined industry class, the product keywords are exported to the synonym corpus, and the product keywords are expanded according to the words occurring in both the product keywords and the synonym corpus;
(23) Product keywords that do not exist in the dictionary are first rejected; uncommon product keywords and product keywords that cannot be matched are then exported to the learning database, and the remaining product keywords are exported to the core word segmentation processor for pre-processing.
3. The kernel keyword extraction method based on a B2B platform according to claim 1, characterized in that: said core word segmentation processor comprises a word segmentation processor, an affix processor, a root processor, a singular/plural processor, a tense processor, a similarity processor, a word recombination module, a keyword index database and a learning database, wherein:
Said word segmentation processor splits English product names by traversing spaces, carries out word segmentation according to words and phrases, combines the results to form <product name, keyword> sequences, and sorts them by product ID;
Said affix processor, for the data produced by the word segmentation processor, removes the prefix/suffix of each word, converts other word forms or derived words into nouns, and matches the resulting nouns against the dictionary; words that cannot be matched with the dictionary are exported to the learning database, and for words that can be matched the <product name, keyword> sequence is updated;
Said root processor, for the data produced by the affix processor, extracts the root according to the root algorithm and the part of speech of the word, and matches the extracted root against the dictionary; words that cannot be matched with the dictionary are exported to the learning database, and for words that can be matched the <product name, keyword> sequence is updated;
Said singular/plural processor, for the data produced by the root processor, performs singular/plural processing, converts the words to their prototype form, and updates the <product name, keyword> sequence;
Said tense processor, for the data produced by the singular/plural processor, performs tense processing, converts the words to their prototype form, and updates the <product name, keyword> sequence;
Said similarity processor, when a matched word has two or more meanings, computes the word meaning with the maximum similarity;
Said word recombination module, for the data produced by the tense processor, first checks the words against the spelling dictionary, computes the lexical distance and minimum edit distance, applies similar-key rules, and rejects misspelled words; then, through processing by the learning database, the correctly spelled words are provided and combined into data of the correct data structure, which are stored in the cache; finally, according to industry type, indexes are built over the data in the cache and exported to the kernel keyword index database;
Said keyword index database creates the data in the cache as kernel keyword index text files; at the same time, an industry core word index text file is built for the industry core words output by the word recombination module, and a search core word index text file is built for the search core words output by the word recombination module;
Said learning database comprises four essential parts: a learner, a knowledge base, an executor and a scorer; when the affix processor, root processor, product correlation processor and word recombination module export data to the learning database, the data first enter the learner; the learner learns the input data in conjunction with the knowledge in the knowledge base, first establishing a set of rules and then computing rule weights and variable weights, and exports the established rules and computed quantities to the knowledge base; the knowledge base performs a series of reasoning processes on the input data to acquire knowledge, said knowledge referring to a series of rule algorithms; if an acquired algorithm already exists in the knowledge base, it is checked whether the condition for updating the knowledge base is met: if so, the knowledge base is updated, otherwise the data are returned to the learner; the executor executes the knowledge acquired by the knowledge base, and the scorer scores the result of the execution; if the score is qualified, the knowledge meets the condition for updating the knowledge base.
4. The kernel keyword extraction method based on a B2B platform according to claim 3, characterized in that said word segmentation processor splits English product names by traversing spaces, comprising the following steps:
1. the product name is split into words on spaces;
2. stop words including punctuation and special symbols are removed, and the remaining words are numbered 0, 1, 2, ..., N;
3. for the n-th word, the n-th word and the (n+i)-th word are matched: if the n-th word and the (n+i)-th word form a phrase, then n = n+1, until n = N; otherwise the n-th word and the (n+i)-th word are separate words, and i = i+1, until n+i = N; where n = 0, 1, 2, ..., N and i = 1, 2, ...
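Under one reading of claim 4 (an illustrative sketch, not the claimed procedure itself), the phrase check over the numbered words can be written as below; only adjacent word pairs (i = 1) are tested, and the phrase table is a hypothetical stand-in for the phrase dictionary consulted by the word segmentation processor:

    PHRASES = {("stainless", "steel"), ("water", "bottle")}   # hypothetical phrase table

    def segment_name(name: str, stop_chars=",.;:!?()[]"):
        # 1. Split the product name into words on spaces and strip stop characters.
        words = [w.strip(stop_chars).lower() for w in name.split()]
        words = [w for w in words if w]
        # 2. Words are numbered 0, 1, 2, ..., N by their list position.
        tokens, n = [], 0
        while n < len(words):
            # 3. Match the n-th word with the following word: if the pair is a
            # known phrase, emit it as one keyword and advance past both words.
            if n + 1 < len(words) and (words[n], words[n + 1]) in PHRASES:
                tokens.append(words[n] + " " + words[n + 1])
                n += 2
            else:
                tokens.append(words[n])
                n += 1
        return tokens

    # segment_name("Stainless Steel Water Bottle 500ml")
    # -> ["stainless steel", "water bottle", "500ml"]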
CN201410765503.8A 2014-12-11 2014-12-11 A kind of kernel keyword extraction method based on B2B platform Active CN104408173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410765503.8A CN104408173B (en) 2014-12-11 2014-12-11 A kind of kernel keyword extraction method based on B2B platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410765503.8A CN104408173B (en) 2014-12-11 2014-12-11 A kind of kernel keyword extraction method based on B2B platform

Publications (2)

Publication Number Publication Date
CN104408173A true CN104408173A (en) 2015-03-11
CN104408173B CN104408173B (en) 2016-12-07

Family

ID=52645804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410765503.8A Active CN104408173B (en) 2014-12-11 2014-12-11 A kind of kernel keyword extraction method based on B2B platform

Country Status (1)

Country Link
CN (1) CN104408173B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377190A (en) * 2012-04-11 2013-10-30 阿里巴巴集团控股有限公司 Trading platform based supplier information searching method and device
CN103226618A (en) * 2013-05-21 2013-07-31 焦点科技股份有限公司 Related word extracting method and system based on data market mining
CN103745012A (en) * 2014-01-28 2014-04-23 广州一呼百应网络技术有限公司 Method and system for intelligently matching and showing recommended information of web page according to product title
CN103942347A (en) * 2014-05-19 2014-07-23 焦点科技股份有限公司 Word separating method based on multi-dimensional comprehensive lexicon

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978400A (en) * 2015-06-04 2015-10-14 无锡天脉聚源传媒科技有限公司 Method for generating video album name and apparatus
CN104978403A (en) * 2015-06-04 2015-10-14 无锡天脉聚源传媒科技有限公司 Generating method and apparatus for name of video album
CN104978404A (en) * 2015-06-04 2015-10-14 无锡天脉聚源传媒科技有限公司 Video album name generating method and apparatus
CN104978404B (en) * 2015-06-04 2018-07-20 无锡天脉聚源传媒科技有限公司 A kind of generation method and device of video album title
CN107924530A (en) * 2015-08-11 2018-04-17 陈雪琴 The method for producing, ensure, obtaining and selling quality data
WO2017045186A1 (en) * 2015-09-17 2017-03-23 深圳市世强先进科技有限公司 Keyword defining method and system
CN106803197A (en) * 2015-11-26 2017-06-06 滴滴(中国)科技有限公司 Spell single method and apparatus
CN107203542A (en) * 2016-03-17 2017-09-26 阿里巴巴集团控股有限公司 Phrase extracting method and device
CN105893592A (en) * 2016-04-12 2016-08-24 广东欧珀移动通信有限公司 Searching method and searching device
CN106383910A (en) * 2016-10-09 2017-02-08 合网络技术(北京)有限公司 Method for determining weight of search word, method and apparatus for pushing network resources
CN106383910B (en) * 2016-10-09 2020-02-14 合一网络技术(北京)有限公司 Method for determining search term weight, and method and device for pushing network resources
CN108121754A (en) * 2016-11-30 2018-06-05 北京国双科技有限公司 A kind of method and device for obtaining keyword attribute combination
CN108241699A (en) * 2016-12-26 2018-07-03 百度在线网络技术(北京)有限公司 For the method and apparatus of pushed information
CN108241699B (en) * 2016-12-26 2022-03-11 百度在线网络技术(北京)有限公司 Method and device for pushing information
CN106980961A (en) * 2017-03-02 2017-07-25 中科天地互联网科技(苏州)有限公司 A kind of resume selection matching process and system
CN108984554A (en) * 2017-06-01 2018-12-11 北京京东尚科信息技术有限公司 Method and apparatus for determining keyword
CN108984554B (en) * 2017-06-01 2021-06-29 北京京东尚科信息技术有限公司 Method and device for determining keywords
CN107391481A (en) * 2017-06-29 2017-11-24 清远墨墨教育科技有限公司 A kind of vocabulary screening technique for vocabulary test
CN108038100A (en) * 2017-11-30 2018-05-15 四川隧唐科技股份有限公司 engineering keyword extracting method and device
CN109947947A (en) * 2019-03-29 2019-06-28 北京泰迪熊移动科技有限公司 A kind of file classification method, device and computer readable storage medium
CN109947947B (en) * 2019-03-29 2021-11-23 北京泰迪熊移动科技有限公司 Text classification method and device and computer readable storage medium
CN111782760A (en) * 2019-05-09 2020-10-16 北京沃东天骏信息技术有限公司 Core product word recognition method, device and equipment
CN111339763B (en) * 2020-02-26 2022-06-28 四川大学 English mail subject generation method based on multi-level neural network
CN111339763A (en) * 2020-02-26 2020-06-26 四川大学 English mail subject generation method based on multi-level neural network
CN111859972A (en) * 2020-07-28 2020-10-30 平安科技(深圳)有限公司 Entity identification method, entity identification device, computer equipment and computer readable storage medium
CN111859972B (en) * 2020-07-28 2024-03-15 平安科技(深圳)有限公司 Entity identification method, entity identification device, computer equipment and computer readable storage medium
CN112100329A (en) * 2020-08-31 2020-12-18 湖北美和易思教育科技有限公司 Learning mental evaluation method and device based on big data
CN112163421A (en) * 2020-10-09 2021-01-01 厦门大学 Novel keyword extraction method based on N-Gram
CN112818693A (en) * 2021-02-07 2021-05-18 深圳市世强元件网络有限公司 Automatic extraction method and system for electronic component model words
CN113032683A (en) * 2021-04-28 2021-06-25 玉米社(深圳)网络科技有限公司 Method for quickly segmenting words in network popularization
CN113032683B (en) * 2021-04-28 2021-12-24 玉米社(深圳)网络科技有限公司 Method for quickly segmenting words in network popularization
CN114860865A (en) * 2022-05-05 2022-08-05 北京达佳互联信息技术有限公司 Index construction and resource recall method and device, electronic equipment and storage medium
CN115730595A (en) * 2022-09-30 2023-03-03 上海寰通商务科技有限公司 Method, apparatus and medium for identifying pharmaceutical industry target object to be identified
CN115470323A (en) * 2022-10-31 2022-12-13 中建电子商务有限责任公司 Method for improving searching precision of building industry based on word segmentation technology
CN115470323B (en) * 2022-10-31 2023-03-10 中建电子商务有限责任公司 Method for improving searching precision of building industry based on word segmentation technology

Also Published As

Publication number Publication date
CN104408173B (en) 2016-12-07

Similar Documents

Publication Publication Date Title
CN104408173B (en) A kind of kernel keyword extraction method based on B2B platform
CN109271505B (en) Question-answering system implementation method based on question-answer pairs
CN105095204B (en) The acquisition methods and device of synonym
CN103838833A (en) Full-text retrieval system based on semantic analysis of relevant words
CN109783806B (en) Text matching method utilizing semantic parsing structure
US20220004545A1 (en) Method of searching patent documents
US20210350125A1 (en) System for searching natural language documents
US20210397790A1 (en) Method of training a natural language search system, search system and corresponding use
CN114090861A (en) Education field search engine construction method based on knowledge graph
CN112036178A (en) Distribution network entity related semantic search method
Zolotarev et al. Conceptual business process structuring by extracting knowledge from natural language texts
Ahmed et al. Named entity recognition by using maximum entropy
Kilias et al. Idel: In-database entity linking with neural embeddings
CN112862569B (en) Product appearance style evaluation method and system based on image and text multi-modal data
Hossen et al. Bert model-based natural language to nosql query conversion using deep learning approach
Hourali et al. A new approach for automating the ontology learning process using fuzzy theory and ART neural network
Sun et al. Important attribute identification in knowledge graph
CN110275957B (en) Name disambiguation method and device, electronic equipment and computer readable storage medium
Dereje et al. Sentence level Amharic word sense disambiguation
Li et al. Design of Computer‐Aided Translation System Based on Naive Bayesian Algorithm
Li Construction of English Translation Model Based on Improved Fuzzy Semantic Optimal Control of GLR Algorithm
Dai Construction of English and American literature corpus based on machine learning algorithm
AlArfaj et al. An Intelligent Tree Extractive Text Summarization Deep Learning.
US20230162031A1 (en) Method and system for training neural network for generating search string
Zhang Improving NER's Performance with Massive financial corpus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant