CN104408173B - A kernel keyword extraction method based on a B2B platform
- Publication number
- CN104408173B CN104408173B CN201410765503.8A CN201410765503A CN104408173B CN 104408173 B CN104408173 B CN 104408173B CN 201410765503 A CN201410765503 A CN 201410765503A CN 104408173 B CN104408173 B CN 104408173B
- Authority
- CN
- China
- Prior art keywords
- word
- product
- name
- dictionary
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a kernel keyword extraction method based on a B2B platform that extracts kernel keywords from English product names on the basis of English grammar and semantics. The method provides clear advantages in concurrent computation over big data, in converting the various tenses of English words to their base forms, and in rule-based word processing with self-learning.
Description
Technical field
The present invention relates to a kernel keyword extraction method based on a B2B platform.
Background technology
E-commerce has by now accumulated a massive amount of information and a large number of users, including visitors, traders and information providers; highly duplicated information occupies a large amount of server resources.
When a search engine performs a keyword search, the keyword must be submitted to the server; the server searches the mass data for the keyword and returns a group of relevant information as the search result. Highly concurrent searches can place a heavy load on the server. The quality of the keyword strongly affects both search efficiency (search speed) and search quality (the relevance of the results). It is therefore desirable to establish a method that extracts kernel keywords automatically: the keyword (combined with other data) is put through a series of filtering, word segmentation, matching and recombination steps to derive kernel keywords, and the server then searches on the kernel keywords, improving both the efficiency and the quality of the search.
The keywords and the batch of high-quality related terms that a product information supplier sets for its products reflect the product attributes accurately and comprehensively, and are therefore very helpful. In theory, after the supplier-set keywords, related terms and product names are processed with segmentation algorithms, stemming algorithms, word recombination algorithms and the like, valuable words can be extracted and indexed, and kernel keywords can finally be derived.
Existing domestic segmentation methods are mostly simplistic. In particular, for automatic extraction of English kernel keywords they only match consecutive single words; they cannot match consecutive phrases or non-consecutive words, and so easily miss many valuable kernel keywords. For example:
Chinese patent CN200710122439.1 describes a word segmentation system and method that splits a character string using segmentation markers, then identifies consecutive single words in the mechanical segmentation result, and finally extracts core words. However, this processing can lose some core valuable words, and matching the re-split character strings by mechanical segmentation is very inefficient on large data volumes.
Chinese patent CN200910083775.9 describes a word segmentation method and a text retrieval method. It creates a new segmentation system based on database feature items, adds the feature items to the new system, and uses them as a vocabulary to segment the query words submitted by users and generate a segmentation result set. The method segments selected database fields as feature items, exploiting the association between database feature items and the text in the database, and effectively improves the segmentation accuracy of traditional unigram, bigram and preset-vocabulary methods. However, it performs segmentation against a dictionary built from a preset vocabulary, which is ineffective when processing English words, and it does not address stem (base-form) extraction.
Accurate segmentation and accurate stem matching of English words are essential both to the automatic extraction of English kernel keywords from mass data and to improving the efficiency and quality of searching that data.
Summary of the invention
Object of the invention: to overcome the deficiencies of the prior art, the present invention provides a kernel keyword extraction method based on a B2B platform that extracts kernel keywords from English product names on the basis of English grammar and semantics.
Technical scheme: to achieve the above object, the present invention adopts the following technical solution:
A kernel keyword extraction method based on a B2B platform comprises the following steps:
(1) Take the user-set product names, search words and industry hot words on the B2B platform as dictionary sources; preprocess the dictionary sources and save them in the data mart to form the product-name core word bank. The dictionary sources are preprocessed as follows:
For user-set product names, apply the principle that high-frequency user-set product names are used first, and reject user-set product names with low usage counts; then save the user-set keywords of the remaining user-set product names in the user-set keyword bank.
For search words, first filter out stop words, including punctuation and special symbols; then apply the principle that high-frequency search words are used first, and reject search words used infrequently in the last six months; then preprocess them with the core word-segmentation processor to form search keywords, which are saved in the search high-frequency dictionary.
For industry hot words, grouped by industry, first filter out stop words, including punctuation and special symbols; then apply the principle that high-frequency industry hot words are used first, and reject industry hot words with low usage counts; then preprocess them with the core word-segmentation processor to form industry hot keywords, which are saved in the industry high-frequency dictionary.
(2) For all valid product names on the current site, first filter out stop words, including punctuation and special symbols; then preprocess them with the core word-segmentation processor and save the resulting product names in the product high-frequency dictionary.
(3) Match the product names in the product high-frequency dictionary against the product-name core word bank; deduplicate the matched product names, output one record per product name in order of first appearance within the product name, and save the records in the data mart as the product name's kernel keywords. The matching rules are:
1. a search keyword appears in the product name, and that search keyword is a user-set keyword;
2. a search keyword appears in the product name, and that search keyword is an industry hot keyword.
A search keyword that appears in a product name and satisfies either of the above matching rules is defined as a kernel keyword of that product name.
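The two matching rules can be illustrated with a minimal sketch. This is a simplification, not the patented implementation: keywords are treated as single lower-cased tokens, and the function name and parameters are hypothetical.

```python
def extract_kernel_keywords(product_name, search_keywords,
                            user_keywords, industry_keywords):
    """Return the kernel keywords of a product name, in order of first
    appearance, applying matching rules (1) and (2) of step (3):
    a token qualifies if it is a search keyword AND either a user-set
    keyword or an industry hot keyword."""
    found = []
    for token in product_name.lower().split():
        if token in search_keywords and (
                token in user_keywords or token in industry_keywords):
            if token not in found:          # deduplicate, keep first-seen order
                found.append(token)
    return found
```

For example, `extract_kernel_keywords("Collapsible Silicone Lunch Box", {"silicone", "lunch"}, {"silicone"}, {"lunch"})` yields `["silicone", "lunch"]`: both tokens are search keywords, one matched via rule 1 and the other via rule 2.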
In step (2), the product high-frequency dictionary is derived from product information, and the industry high-frequency dictionary contains industry information; the product information must be correlated by the product association processor. Product information comprises product IDs and product keywords; industry information comprises industry IDs and industry hot keywords.
The product keywords corresponding to each product name are classified by industry type in the following steps:
(21) Match the product keywords against the industry hot keywords by word matching, and determine the industry category of the product from the features they have in common.
(22) According to the determined industry category, output the product keywords to the synonym corpus and expand them with the words they have in common with the synonym corpus.
(23) First reject product keywords not found in the dictionary, then output uncommon and unmatchable product keywords to the learning database, and output the remaining product keywords to the core word-segmentation processor for preprocessing.
The core word-segmentation processor comprises a word-segmentation processor, an affix processor, a root processor, a singular/plural processor, a tense processor, a similarity processor, a word recombination module, a keyword index bank and a learning database, where:
The word-segmentation processor splits English product names on spaces, performs segmentation by word and by phrase, combines the results into <product name, keyword> sequences, and sorts them by product ID.
The affix processor takes the data produced by the word-segmentation processor, removes the prefixes/suffixes of each word, converts other word forms into nouns (or derivatives into nouns), and matches the resulting nouns against the dictionary. Words that cannot be matched against the dictionary are output to the learning database; for words that can be matched, the <product name, keyword> sequence is updated.
The root processor takes the data produced by the affix processor, extracts roots according to the root algorithm based on each word's part of speech, and matches the extracted roots against the dictionary. Words that cannot be matched against the dictionary are output to the learning database; for words that can be matched, the <product name, keyword> sequence is updated.
The singular/plural processor takes the data produced by the root processor, performs singular/plural processing to convert words to their base forms, and updates the <product name, keyword> sequence.
The tense processor takes the data produced by the singular/plural processor, performs tense processing to convert words to their base forms, and updates the <product name, keyword> sequence.
The similarity processor, when a matched word has two or more meanings, computes the meaning with the maximum similarity.
The word recombination module takes the data produced by the tense processor. It first rejects misspelled words through word-formation dictionary checks, word-shape distance computed as minimum edit distance, and adjacent-key rules; then, through the learning database, it supplies correctly spelled words, assembles the data into the proper data structure and stores it in a cache; finally it indexes the cached data by industry type and outputs it to the kernel keyword index bank.
The keyword index bank builds a kernel keyword index text file from the cached data. At the same time, it builds an industry core-word index text file for the industry core words output by the word recombination module, and a search core-word index text file for the search core words output by the word recombination module. The high-frequency words in the search core word bank constitute the search high-frequency dictionary described above; the high-frequency words in the industry core word bank constitute the industry high-frequency dictionary described above; and the high-frequency words in the keyword bank constitute the product-name high-frequency dictionary described above.
The learning database comprises four essential parts: a learner, a knowledge base, an executor and a scorer. When the affix processor, root processor, product association processor or word recombination module outputs data to the learning database, the data first enters the learner. The learner learns from the input data using the knowledge in the knowledge base: it first establishes a group of rules, then computes rule weights and variable weights, and outputs the established rules and computed quantities to the knowledge base. The knowledge base subjects the input data to a series of reasoning processes to obtain knowledge, where knowledge means a set of rule algorithms. If an obtained algorithm already exists in the knowledge base, the update condition of the knowledge base is checked: if it is met, the knowledge base is updated; otherwise the data is returned to the learner. The executor executes the knowledge obtained by the knowledge base, and the scorer scores the executor's results; if the score is qualified, the knowledge meets the condition for updating the knowledge base.
The word-segmentation processor splits English product names on spaces in the following steps:
1. Split the product name into words on spaces.
2. Remove stop words, including punctuation and special symbols, and number the remaining words 0, 1, 2, ..., N.
3. For the n-th word, match the n-th word against the (n+i)-th word: if the n-th and (n+i)-th words form a phrase, set n = n+1, until n = N; otherwise the n-th and (n+i)-th words are single words, and i = i+1, until n+i = N; with n = 0, 1, 2, ..., N and i = 1, 2, ....
Beneficial effects: compared with the prior art, the kernel keyword extraction method based on a B2B platform provided by the present invention has the following advantages:
1. A clear advantage in concurrent computation over big data: a distributed in-memory database provides users with high-performance, highly available and scalable data computation services; data is distributed to multiple computing service nodes and computed, managed and maintained directly in memory, with a unified external access interface and an optional redundancy/backup mechanism.
2. A clear advantage in converting the various tenses of English words to their base forms: a series of algorithms converts English words in various tenses to their base forms.
3. A clear advantage in rule-based word processing and self-learning: a method of English spelling correction is provided that addresses the characteristics of English words and common misspellings.
Brief description of the drawings
Fig. 1 is a structural block diagram of the learning database;
Fig. 2 is an architecture diagram of the method of the invention;
Fig. 3 is an implementation flowchart of the method of the invention.
Detailed description of the invention
Each component of the core word-segmentation processor is described in detail below.
The word-segmentation processor splits English product names on spaces in the following steps:
1. Split the product name into words on spaces.
2. Remove stop words, including punctuation and special symbols, and number the remaining words 0, 1, 2, ..., N.
3. For the n-th word, match the n-th word against the (n+i)-th word: if the n-th and (n+i)-th words form a phrase, set n = n+1, until n = N; otherwise the n-th and (n+i)-th words are single words, and i = i+1, until n+i = N; with n = 0, 1, 2, ..., N and i = 1, 2, ....
For example, the product name Collapsible Silicone Lunch Box Cooker Food Container is split on spaces into: Collapsible/Silicone/Lunch/Box/Cooker/Food/Container.
Phrase lookup: for the word Collapsible, first judge whether Collapsible Silicone is a phrase. If Collapsible Silicone is a phrase, end this round and start judging Silicone; if Collapsible Silicone is not a phrase, judge whether Collapsible Lunch is a phrase. Applying this rule yields the following table:
Table 1 Word/phrase split result produced by the word-segmentation processor

| Product ID | Word (keyword) | Type |
| --- | --- | --- |
| 1 | Collapsible | 0 |
| 1 | Silicone | 0 |
| 1 | Lunch Box | 1 |
| 1 | Cooker | 0 |
| 1 | Food Container | 1 |

In the Type column, 1 denotes a phrase and 0 a single word; it is assumed here that Lunch Box and Food Container are phrases.
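The segmentation steps above can be sketched as follows. This is an illustrative simplification, not the patented implementation: the phrase lookup is restricted to adjacent word pairs (i = 1), the phrase dictionary is passed in as a set, and the function name is hypothetical.

```python
def segment(product_name, phrase_dict, stop_chars=",.!?&#"):
    """Split an English product name on spaces, drop stop tokens made of
    punctuation/special symbols, then greedily pair adjacent words that
    form a known two-word phrase. Returns (text, type) pairs where
    type 1 is a phrase and type 0 a single word, as in Table 1."""
    words = [w for w in product_name.split()
             if w and not all(c in stop_chars for c in w)]
    result, n = [], 0
    while n < len(words):
        pair = f"{words[n]} {words[n + 1]}" if n + 1 < len(words) else None
        if pair is not None and pair in phrase_dict:
            result.append((pair, 1))        # type 1: phrase
            n += 2
        else:
            result.append((words[n], 0))    # type 0: single word
            n += 1
    return result
```

Run on the example above with the assumed phrase dictionary {"Lunch Box", "Food Container"}, this reproduces Table 1.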
The affix processor takes the data produced by the word-segmentation processor, removes the prefixes/suffixes of each word, and converts other word forms into nouns, or derivatives into nouns; for example, pronounce is converted into pronunciation and explain into explanation. The resulting nouns are matched against the dictionary: words that cannot be matched against the dictionary are output to the learning database; for words that can be matched, the <product name, keyword> sequence is updated.
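A toy version of this conversion can be sketched as follows. The suffix rules and the irregular map are illustrative assumptions, not the patent's actual affix tables; only the pronounce/explain examples come from the text.

```python
# Hypothetical affix processor: an irregular derivation map plus a few
# generic suffix-stripping rules, checked against a noun dictionary.
IRREGULAR = {"pronounce": "pronunciation", "explain": "explanation"}
SUFFIX_RULES = ["ment", "ness", "ing", "er"]   # assumed, not from the patent

def to_noun(word, dictionary):
    """Reduce a word toward a dictionary noun; return None when no
    dictionary entry can be matched (such words go to the learning DB)."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w in dictionary:
        return w
    for suffix in SUFFIX_RULES:
        if w.endswith(suffix):
            candidate = w[: -len(suffix)]
            if candidate in dictionary:
                return candidate
    return None                                # unmatched -> learning database
```

Here `to_noun("cooking", {"cook"})` returns "cook", while an unmatchable token returns None, mirroring the split between dictionary updates and learning-database output.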
The root processor takes the data produced by the affix processor, extracts roots according to the root algorithm based on each word's part of speech, and matches the extracted roots against the dictionary. Words that cannot be matched against the dictionary are output to the learning database; for words that can be matched, the <product name, keyword> sequence is updated.
The singular/plural processor takes the data produced by the root processor, performs singular/plural processing and converts words to their base forms. The conversion proceeds as follows:
1. A general word forms its plural by adding the suffix -s at the end, pronounced [s] after a voiceless consonant and [z] after a voiced consonant or vowel. The plural of such a word is converted to the base form by removing the final letter s; for example, shoes is converted to shoe.
2. A general word ending in s, z, x, ch or sh forms its plural by adding the suffix -es. The plural of such a word is converted to the base form by removing the final es; for example, buses is converted to bus.
3. A general word ending in a consonant plus y forms its plural by changing the final y to i and adding -es. The plural of such a word is converted to the base form by changing the final ies back to y; for example, candies is converted to candy.
4. A general word ending in o forms its plural by adding -es if it is a special word or abbreviation, and by adding -s otherwise. The plural of such a word is first matched to determine whether it is a special word or abbreviation; if so, the final es is removed to obtain the base form, otherwise the final s is removed; for example, tomatoes is converted to tomato.
5. A special dictionary stores words with special singular/plural transformations, such as piano, photo, roof, affix, fish, men and child. In the cases below, a word is first matched against the special dictionary; if it cannot be matched there, it is converted case by case.
6. A general word ending in the consonant f or in fe forms its plural by changing the f or fe to ves. The plural of such a word is converted to the base form by changing the final ves back to f or fe; for example, knives is converted to knife.
7. A general word ending in is forms its plural by changing the is to es. The plural of such a word is converted to the base form by changing the final es back to is; for example, axes is converted to axis.
8. A general word ending in ix forms its plural by changing the ix to ices. The plural of such a word is converted to the base form by changing the final ices back to ix; for example, appendices is converted to appendix.
9. A semantic analyzer handles special words that end in s but are not plural: if analysis shows the word has no plural form, the base form is retained as-is, otherwise the s is removed; for example, goods is interpreted as goods.
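The numbered rules above can be sketched as a single function. This is a simplified illustration under stated assumptions: the special dictionary below contains only a few sample entries (including tomatoes, so that rule 1 can safely strip a bare -s from words like shoes), and rule 7 (axes -> axis) is omitted because -xes is ambiguous without a part-of-speech check.

```python
# Hypothetical special dictionary (rule 5) plus uncountable words (rule 9).
SPECIAL = {"men": "man", "children": "child", "fish": "fish",
           "tomatoes": "tomato",   # rule 4: o-ending word taking -es
           "goods": "goods"}       # rule 9: no singular form, keep as-is

def singularize(word):
    """Reduce an English plural to its base form, following the
    singular/plural processor's numbered rules (simplified)."""
    w = word.lower()
    if w in SPECIAL:
        return SPECIAL[w]
    if w.endswith("ices"):                       # rule 8: appendices -> appendix
        return w[:-4] + "ix"
    if w.endswith("ves"):                        # rule 6: knives -> knife
        return w[:-3] + "fe"
    if w.endswith("ies"):                        # rule 3: candies -> candy
        return w[:-3] + "y"
    for suffix in ("ses", "zes", "xes", "ches", "shes"):
        if w.endswith(suffix):                   # rule 2: buses -> bus
            return w[:-2]
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                            # rule 1: shoes -> shoe
    return w
```

The rule order matters: longer suffixes are checked before the bare -s of rule 1, so that knives does not degrade to "knive".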
The tense processor takes the data produced by the singular/plural processor, performs tense processing and converts words to their base forms. The invention processes the simple present tense and then handles features that change over time, modeled as a tree with branches of various probabilities, divided into the following cases:
1. The simple past tense is processed by an algorithm handling features whose time point is a change in the past.
2. The simple future tense is processed by an algorithm handling features whose time point is a change in the future.
3. The past future tense is processed by an algorithm handling features whose time point is a change in the future as seen from the past.
The invention defines Always, Sometime and Until as temporal-sequence state operators to control the logic: when the program matches the word Always, the remaining content after the Always word is processed; when the program matches Sometime, the context of Sometime is taken out and processed according to the logic; when the program matches Until, it is handled in the same way as Sometime.
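The conversion of inflected verb forms back to their base form can be sketched as follows. This is a minimal illustration assuming only regular -ed/-ing inflections and a dictionary of known base forms; irregular verbs and the Always/Sometime/Until operators are outside its scope.

```python
def to_base_form(verb, dictionary):
    """Map a regularly inflected English verb to its base form by
    stripping -ed/-ing and testing candidate stems against a dictionary
    of known base forms. Irregular verbs would need a lookup table."""
    w = verb.lower()
    if w in dictionary:
        return w
    for suffix in ("ed", "ing"):
        if not w.endswith(suffix):
            continue
        stem = w[: -len(suffix)]
        candidates = [stem, stem + "e"]          # used -> use
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            candidates.append(stem[:-1])         # stopped -> stop (undouble)
        for candidate in candidates:
            if candidate in dictionary:
                return candidate
    return w                                     # unrecognized: keep as-is
```

With a dictionary {"stop", "use", "run"}, "stopped", "used" and "running" reduce to "stop", "use" and "run" respectively.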
The similarity processor, when a matched word has two or more meanings, computes the meaning with the maximum similarity. For example, for the matched word park we cannot tell whether it refers to a public park or a parking lot; the similarity processor determines the word's meaning.
The invention uses a high-dimensional processing rule based on a vector-space word similarity algorithm: a weight matrix is added and feature vectors are extracted, reducing the complexity of the similarity computation and improving efficiency.
Particularly as follows: represent characteristic item with T, refer to the basic language list occurring in text D and text D content can being represented
Position, such text D just can use the incompatible expression of collection of characteristic item T, i.e. D (T1,T2,…,Tk,…,Tn);Such as one text
Have tetra-characteristic items of a, b, c and d, then the text just can be expressed as D (a, b, c, d).For the text containing n characteristic item
For, it will usually give certain weight to each characteristic item and represent its significance level, i.e. D=D (T1w1,T2w2,…,
Tkwk,…,Tnwn), it is abbreviated as D=DW=D (w1,w2,…,wk,…,wn), D is referred to as the vector representation of text D, wkRepresent Tk
Weight.In the example above, it is assumed that the weight of tetra-characteristic items of a, b, c and d is respectively 30,20,20 and 10, then give literary composition
This vector representation is D=D (30,20,20,10).In vector space model, two text D1And D2Between content degree of association
Sim(D1,D2) represent with the cosine of the angle between vector, formula is:
The larger the value of Sim(D1, D2), the greater the similarity between D1 and D2.
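As a worked sketch of the cosine measure just described (plain Python, treating each text as its term-weight vector):

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)

# The worked example from the text: D = (30, 20, 20, 10) for features a, b, c, d.
d1 = [30, 20, 20, 10]
d2 = [30, 20, 20, 10]
print(cosine_similarity(d1, d2))  # identical vectors -> 1.0
```

Two identical vectors give similarity 1.0; orthogonal vectors (no shared weighted features) give 0.0.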
The word recombination module, for the data produced after the tense processor, first performs the word-formation dictionary check, the morphological-distance and smallest-edit-distance calculation and the similar-key-rule processing, rejects misspelled words, provides the correct spellings, and builds an unambiguous misspelling dictionary. During spell checking, the misspelling dictionary is searched; if a match is found, the word is misspelled and error correction is performed on it. The method of building the misspelling dictionary is as follows.
For all words in the training sample, the total number of occurrences is counted; from the number of occurrences and the frequency of each word in the corpus, the prior probability is computed. For a word that does not occur in the corpus, smoothing is applied and the probability is taken as 1/N, where N is the total number of occurrences of all words in the training sample.
The conditional probability is taken as 1/M, where M is the total number of possible candidate words; for example, for the word "light", each guessed word has conditional probability 1/290, where 290 is the number of possible guesses at edit distance 1.
The 26 letters are represented as a matrix, and the distance between letters on the keyboard is computed by the algorithm.
The conditional probability p(D|h) — the probability of the observed input given that word h was intended — is computed using the concept of edit distance, enumerating all possible edits at edit distance 1.
According to Bayes' rule, the posterior probability is independent of the generation probability p(D) of the input, so p(h|D) ∝ P(h) × p(D|h), from which the most probable spelling is computed.
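The prior/conditional-probability scheme above amounts to a Bayesian edit-distance-1 corrector; a minimal sketch follows (the toy corpus is illustrative, not the patent's training sample):

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All candidates at edit distance 1: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Pick argmax P(h)·p(D|h): with a uniform p(D|h) over edit-1 candidates,
    this reduces to the most frequent known candidate."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts] or [word]
    return max(candidates, key=lambda w: counts[w])

corpus_counts = Counter("the light is on the right side of the night sky".split())
print(correct("lignt", corpus_counts))  # -> light
```

Here "lignt" has exactly one known edit-1 neighbour ("light"), so the prior decides among a single candidate; with several candidates the corpus frequencies break the tie.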
When processing English words, there is considerable redundancy, with the following statistical characteristics:
1. the initial letter carries important information about an English word;
2. in most misspellings, the probability that a unique letter in the word is changed is small;
3. the unique consonant sequence is more characteristic than the unique vowel sequence;
4. doubled letters have a relatively high probability of being misspelled; typical errors include transposition (e.g. transposition → tranpsosition) and insertion (e.g. insertion → insertrion).
Based on the above statistical characteristics, the fault-tolerant function is constructed as:
initial letter + unique vowel sequence + unique consonant sequence
Description:
Let the letter set Σ = {'a','b',…,'z','A','B',…,'Z'};
an English word is written L1L2…Lm, where Li (1 ≤ i ≤ m) ∈ Σ and m is the word length;
the vowels are V = {'a','e','i','o','u','A','E','I','O','U'};
the consonants are C = Σ − V;
the first letter of the word is fLetter = L1;
the unique sequence of vowel letters in the word is V_seq and the number of vowel letters is Vm; the unique sequence of consonant letters is C_seq and the number of consonants is Cm.
Fault-tolerant function = fLetter + V_seq + C_seq. If the fault-tolerant value of word Wi in the dictionary is Si, and the fault-tolerant value of the word being proofread in preprocessing is Sp, then when Sp = Si the word corresponding to Si in the dictionary is the correction term.
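The fault-tolerant value can be sketched directly from the definition, following the fLetter + V_seq + C_seq order of the formula (the example words are illustrative):

```python
VOWELS = set("aeiouAEIOU")

def unique_seq(letters):
    """Keep the first occurrence of each letter, preserving order."""
    seen, out = set(), []
    for ch in letters:
        if ch not in seen:
            seen.add(ch)
            out.append(ch)
    return "".join(out)

def fault_tolerant_value(word):
    """fLetter + V_seq + C_seq, with the sequences built over the letters
    after the initial (the initial itself is carried separately)."""
    first = word[0]
    rest = word[1:]
    v_seq = unique_seq(ch for ch in rest if ch in VOWELS)
    c_seq = unique_seq(ch for ch in rest if ch not in VOWELS)
    return first + v_seq + c_seq

# A doubled-letter misspelling collapses to the same value as the correct word:
print(fault_tolerant_value("necessary"), fault_tolerant_value("neccessary"))
```

Both spellings map to the same signature, so the dictionary word whose value matches the misspelled input is found as the correction term.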
The average number of scans with this method: to generate the fault-tolerant value of a misspelled word of length m, every letter other than the initial must be scanned, i.e. m − 1 scans.
The average number of comparisons with this method: generating the fault-tolerant value requires building the unique vowel original sequence and the unique consonant original sequence. Excluding the initial letter, the number of vowel letters in the word is Vm and the number of consonants is Cm, where Vm + Cm = m − 1. The average number of comparisons is:

CT = (1 + 2 + … + (Vm − 1)) + (1 + 2 + … + (Cm − 1))
   = [(m − 1)² − (m − 1)]/2 − Vm·Cm
   ≤ [(m − 1)² + (m − 1)]/2

If the average length of an English word is taken as m = 7, the number of comparisons is at most 21.
The training-corpus construction method used by the present invention is: a large original corpus is input; the vocabulary of each correct sentence is first evaluated automatically and given a score; the correct vocabulary is then sorted by quality score, and, on the basis of the quality score, the coverage problem is also considered; a set is dynamically chosen from the original corpus, and the chosen set is output as the training corpus.
The whole framework is divided into two parts: a quality-evaluation part and a coverage-based corpus extraction part.
For the quality-evaluation part, the present invention chooses a relatively small, high-quality training set from the existing corpus. Quality is considered first; a high-quality pair is defined as one satisfying the following conditions:
both the source sentence and the target sentence are fluent sentences;
the intertranslation between the source sentence and the target sentence is accurate.
For the quality evaluation, Q(f, e) denotes the quality of the text pair (f, e):

Q(f, e) = Σi=1..k wi·Pi(f, e)

where k is the number of features integrated by the model, e denotes the source sentence, f the target sentence, and wi the weight of each corresponding feature; the weights can be obtained automatically on a manually constructed training set. For k = 5, P1 through P5 are, in order, Pdic(f, e), PLM(e), PLM(f), PTM(f|e), PTM(e|f).
The coverage measure compares three indicators: word coverage, n-gram coverage and translation coverage.
The first pair in the candidate training corpus is taken as the first element of the subset of the selected corpus, and the scan then proceeds backwards: if the current sentence pair contributes a phrase translation that is new to the selected corpus, the sentence pair is preferentially added to the corpus subset.
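The quality-then-coverage selection can be sketched greedily; here coverage is simplified to unseen words only (word coverage, one of the three indicators named above), and the scored pairs are illustrative:

```python
def select_corpus(scored_pairs, max_size):
    """Greedy selection: rank (score, source, target) pairs by quality score,
    then keep a pair only while it still contributes unseen words."""
    ranked = sorted(scored_pairs, key=lambda p: -p[0])
    chosen, covered = [], set()
    for score, src, tgt in ranked:
        words = set(src.split()) | set(tgt.split())
        if words - covered:            # brings at least one new word
            chosen.append((src, tgt))
            covered |= words
        if len(chosen) >= max_size:
            break
    return chosen
```

A pair that duplicates an already-covered vocabulary is skipped even if its quality score is high, which is the coverage consideration the text describes.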
The learning database is divided into four essential parts: learner, knowledge base, executor and scorer; the relation between the parts is as shown in Figure 1.
The purpose of the learner used by the present invention is to evaluate the learning step with the algorithm and recommend hypotheses. It is composed of a series of rules; executing an algorithm may require calling many rules, and the learner takes knowledge out of the knowledge base, obtains it through the rules, and finally executes it through the executor.
The purpose of the knowledge base used by the present invention is to hold the representation forms of knowledge, storing feature vectors, rule-based algorithms, production rules, processing functions, semantic networks and frames. It follows alterability and expandability, is also a metadata knowledge base, and is a model-based method.
The purpose of the executor used by the present invention is to perform operations through a group of logic: when the learner obtains a new piece of knowledge, it is executed by the executor, which finally updates and maintains the knowledge base.
The purpose of the scorer used by the present invention is to compute the document scores in the knowledge base in real time when a user retrieves: the frequency with which a keyword occurs in a document is computed, the score of every document is related to the keyword, and the operation is performed in real time; the more frequently a keyword occurs in the knowledge base, the higher the score of that keyword. The vector space model and the Boolean model of information retrieval are applied in combination.
The learner: the present invention uses a group of learning models; the model approximates a complex nonlinear model with a group of linear constraints, using the Gradient Boost framework. GBDT is a very widely applied algorithm that can be used for classification and regression and performs well on many data sets. Each iteration is intended to reduce the residual of the previous iteration, and to eliminate the residual, a new model is built in the gradient direction along which the residual decreases. Thus, in Gradient Boost, each new model is built so that the residual of the previous models decreases along the gradient direction, which differs greatly from traditional Boost, which re-weights correct and incorrect samples. For the specific algorithm, see the TreeBoost paper "TreeBoost.MH: A boosting algorithm for multi-label hierarchical text categorization" (2006), by Andrea Esuli, Tiziano Fagni, Fabrizio Sebastiani, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Via G. Moruzzi 1, 56124 Pisa, Italy. The specific flow is described in the following steps:
1) a given initial value is set; the processed core word is set as the initial value;
2) M decision trees are built (M iterations);
3) a Logistic transformation is applied to the function estimate F(x);
4) the operations are performed for each of the K classes; each sample point xi has K possible classes yi, so yi, F(xi) and p(xi) are K-dimensional vectors;
5) the gradient direction in which the residual decreases is obtained;
6) for each sample point x, a decision tree consisting of J leaf nodes is obtained along the gradient direction in which its residual decreases;
7) after the decision tree has been built, the gain of each leaf node is obtained from the final formula.
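As a rough illustration of the residual-fitting idea behind Gradient Boost — not the TreeBoost.MH algorithm itself — here is a minimal regression sketch on one-dimensional data, with single-split stumps as the weak learners:

```python
def fit_stump(xs, residuals):
    """Best single threshold split minimizing squared error (a tiny tree)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x, t=t, lm=lmean, rm=rmean: lm if x <= t else rm

def gradient_boost(xs, ys, rounds=20, lr=0.5):
    """F starts at the mean; each round fits a stump to the current residuals
    (the negative gradient of squared loss) and adds it, scaled by lr."""
    f0 = sum(ys) / len(ys)
    stumps, preds = [], [f0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)
```

Each round shrinks the remaining residual along its gradient direction, which is the step-5/step-6 behaviour described above in miniature.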
The scorer: the score value of the scorer is calculated with the following formula:

score(q, d) = coord(q, d) · queryNorm(q) · Σt in q (tf(t in d) · idf(t)² · t.getBoost() · norm(t, d))

tf(t in d) denotes the term frequency, i.e. the number of times term t occurs in document d.
idf(t) denotes the inverse document frequency, related to the document frequency docFreq, the number of documents in which term t occurs: the smaller docFreq, the higher idf; its value is identical within the same query.
coord(q, d) denotes a scoring factor based on the number of query terms occurring in the document: the more query terms from the query string are hit in a document, the larger the value coord computes, indicating a higher match for that document. By default it is the percentage of query terms that occur.
queryNorm(q) is the query normalization factor, which makes different queries comparable. This factor does not affect the document ranking, because all documents are multiplied by the same factor; it only affects the magnitude of the overall score.
The norm value is encoded into one byte by the present invention and saved in the index database when the index is built; when it is taken out, the norm in the index is decoded back into a float value.
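The scoring formula can be sketched as follows; the idf, queryNorm and length-norm choices here are simplified assumptions for illustration, not the scorer's exact defaults:

```python
import math

def score(query_terms, doc, docs, boosts=None):
    """Sketch of coord · queryNorm · Σ tf·idf²·boost·norm over the query terms.
    doc and docs are token lists; boosts maps term -> t.getBoost() (default 1)."""
    boosts = boosts or {}
    matched = [t for t in query_terms if t in doc]
    coord = len(matched) / len(query_terms)              # fraction of terms hit
    idf = {}
    for t in query_terms:
        doc_freq = sum(1 for d in docs if t in d)
        idf[t] = 1.0 + math.log(len(docs) / (doc_freq + 1))  # rarer -> higher
    query_norm = 1.0 / math.sqrt(sum(idf[t] ** 2 for t in query_terms))
    norm = 1.0 / math.sqrt(len(doc))                     # length normalisation
    total = sum(doc.count(t) * idf[t] ** 2 * boosts.get(t, 1.0) * norm
                for t in matched)
    return coord * query_norm * total
```

A document that repeats the query term scores above one that mentions it once, and a document hitting no query term scores exactly zero (coord = 0).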
This patent establishes, through the similarity-computation rule, the relationship network between texts. To predict the score of an existing text u on an item i ∈ I, the nearest-neighbour set Tu of u is found first, and the prediction is then made from the score values that the texts in Tu have given to i.
A = Rx ∩ Ry denotes the set of items scored in common by documents x and y. Because different score values represent different degrees of association, and because items not scored by both documents are not comparable when comparing the similarity difference of two documents, the mean of the absolute values of the score differences over the items in A is taken to represent the similarity difference between them, written Dxy.
The present invention considers three cases for the score difference:
when |Rx ∩ Ry| = 0, x and y have no commonly scored document, so there is no similarity between them and the similarity difference is 0;
when Dxy = 0 and |Rx ∩ Ry| ≠ 0, x and y have commonly scored documents, and their similarity is determined by their common neighbour nodes;
when Dxy ≠ 0 and |Rx ∩ Ry| ≠ 0, their similarity has common neighbour nodes and an identical distance.
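The three cases can be sketched over rating dictionaries; returning None for the no-overlap case (rather than 0) is an illustrative choice that keeps it distinct from the identical-ratings case:

```python
def similarity_difference(ratings_x, ratings_y):
    """D_xy over the common scoring item set Rx ∩ Ry:
    - no common items  -> None (no basis for similarity, case 1)
    - identical scores -> 0 (case 2)
    - otherwise the mean absolute score difference (case 3)."""
    common = set(ratings_x) & set(ratings_y)
    if not common:                       # |Rx ∩ Ry| = 0
        return None
    diffs = [abs(ratings_x[i] - ratings_y[i]) for i in common]
    return sum(diffs) / len(diffs)
```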
The knowledge base obtains external information by searching the environment, obtains knowledge through thinking processes such as analysis, synthesis, analogy and induction, and stores this knowledge in the knowledge base.
The present invention performs association-rule mining and learning oriented to a new subject. The knowledge base is updated within the scope of naming logic. Here T denotes the knowledge base, a finite set of propositional formulas; p denotes the new knowledge to be added; and T ∘ p denotes adding the new knowledge data p to the knowledge base.
Update method: let T be a knowledge base satisfying the condition and p a new piece of knowledge satisfying the condition, and let W(p, T) be the set of all maximal sets of formulas in T that are consistent with p. Then:

T ∘ p = { T' ∪ {p} | T' ∈ W(p, T) }

In a knowledge base the pieces of knowledge are closely connected, and judging and eliminating contradictions when adding new knowledge is complicated, so the connections between pieces of knowledge need to be restricted.
First the constraint graph of T ∪ {p} is constructed. The constraint graph may have several connected components; the addition of p can only conflict with formulas in the component containing p and has no effect on the others, so only the component containing p needs to be considered. If the structure of this component is a tree, then p belongs to the updated knowledge base, so p is treated as the root node.
For a subtree of the tree rooted at p, this patent takes the root to be a formula R connected to each subtree through the shared variables C1, C2, C3, …, Ck. For one fixed assignment of values to C1, C2, …, Ck, the whole tree can be regarded as composed of the independent parts R, T1, T2, …, Tk, and the union of the deletion sets of these parts is the deletion set of the whole tree for that assignment of C1, C2, …, Ck. Traversing all values of these deletion sets yields the deletion set for each particular assignment, so the deletion sets of the subtrees yield the deletion set of the whole tree.
Therefore, for a satisfiable knowledge base T and satisfiable new knowledge p, the required T ∘ p can be computed when the constraint graph of T ∪ {p} is a tree; hence the update complexity of the knowledge base is related to the structure of the knowledge base.
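A minimal sketch of the update T ∘ p, assuming for illustration that the knowledge base holds simple literals (q / ~q) rather than arbitrary propositional formulas (a brute-force enumeration, not the tree-structured algorithm above):

```python
from itertools import combinations

def consistent(literals):
    """A set of literals is consistent iff it never contains both q and ~q."""
    return not any(("~" + l if not l.startswith("~") else l[1:]) in literals
                   for l in literals)

def update(kb, p):
    """T ∘ p = { T' ∪ {p} | T' ∈ W(p, T) }, where W(p, T) is the set of
    inclusion-maximal subsets of T consistent with the new knowledge p."""
    subsets = [set(c) for size in range(len(kb) + 1)
               for c in combinations(sorted(kb), size)
               if consistent(set(c) | {p})]
    maximal = [s for s in subsets if not any(s < t for t in subsets)]
    return [frozenset(s | {p}) for s in maximal]
```

Adding "c" to {a, b, ~c} forces the contradicting ~c out, which is exactly the contradiction elimination the update method describes.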
The executor: for a single-input single-output process, the following model exists:

y(k+1) = f[y(k), y(k−1), …, y(k−P+1), u(k), u(k−1), …, u(k−Q)]

where y is the output, u is the input, k is the discrete time index, P and Q are positive integers, and f[·] is a function.
The input u of the executor of the object is limited in amplitude, having both a lower limit um and an upper limit uM; for any k:

um ≤ u(k) ≤ uM

Assuming the process object described is invertible and a write-back document exists in the knowledge base, a function g[·] can exist; this patent sets:

u(k) = g[y(k+1), y(k), …, y(k−P+1), u(k−1), u(k−2), …, u(k−Q)]

as the inverse model of the described object. Further, with the input an m-dimensional vector Xc and the output Uc, the output–input relation is expressed as:

Uc = ψc(Xc)

where ψc is the input–output mapping; that is, a document can be input into the executor from the knowledge base and then written back into the knowledge base by the executor.
If the output of ψc(·) approaches the output of g(·), it can be regarded as the inverse model of the executor. At time k, assume the input xc(k) is:

xc(k) = [r(k+1), y(k), …, y(k−p+1), u(k−1), …, u(k−q)]T

the unknown y(k+1) being replaced by the given input r(k+1); p and q are the estimated values of P and Q respectively.
When the executor NC in the knowledge base has output:

uc(k) = ψc[r(k+1), y(k), …, y(k−p+1), u(k−1), …, u(k−q)]

input is taken from the knowledge base to the executor, and when the scorer outputs to the executor, the training result suffices to make the output deviation e(k) = r(k) − y(k) remain a very small value; then:

xc(k) = [r(k+1), r(k), …, r(k−p+1), u(k−1), …, u(k−q)]T

that is, y(t) is replaced by r(t). Since the output of the learner needs to be reflected to the executor, this patent gives this formula a feed-forward characteristic, generally requiring that a deviation function J defined on the object-output deviation e(k) = r(k) − y(k) be minimized. To improve the calculation accuracy, the differential of the text scorer with respect to the output knowledge base must also be calculated; once S is obtained, the weight coefficients of NC can be improved by the BP algorithm. On this basis, direct inverse execution, direct adaptive execution and indirect adaptive execution structures are considered.
In the executor, the present invention improves the procedure of taking a document out of the knowledge base, scoring it with the scorer and writing it back; moreover, each extraction cycle performs one round of learning, which reduces the learning time. With real-time learning the learning cycle is TL; since the learning cycle TL is determined only by the program time, in this patent the learning cycle TL is generally much smaller than the cycle Ts of extracting samples from the scorer and the knowledge base.
The present invention is further described below in conjunction with examples.
The present invention combines the access behaviour of users on the e-commerce platform, extracting kernel keywords from users' product keywords, industry hot search words and industry keywords; it optimizes users' search information and keyword settings, expands high-quality rankings through optimization approaches such as trade information, achieves a self-learning effect in the core search library and the word-processing modules, and provides the most crucial keywords for user products.
Figure 2 shows the architecture diagram of the present invention, comprising a data source module, a data storage module and a statistics mining module. The data source module is used to save basic data such as network logs, product information, search information and trade information, as the data source for data analysis and data mining; it comprises a web-log unit, a product-information unit and a trade-information unit. The web-log unit saves web-log information, recording users' accesses and users' search records; the product-information unit saves the various information of products; and the trade-information unit saves the various information of industries. The data storage module saves the data processed by each processor and, after cleaning and filtering, the data-mart unit generates several intermediate databases and relational databases. The data storage module is composed of a data-warehouse unit, a cleaning filter and a data-mart unit, connected in sequence. The data-warehouse unit saves the data processed by the ETL processor.
Figure 3 shows the implementation flow chart of the present invention, specifically including the following steps:
Step 1: the site search logs and product information of the B2B e-commerce platform website are taken as the data source for the keywords and their related words.
In detail, this is divided into the following steps:
(1) The product-name attributes, industry keywords and search keywords in the web logs and product information of the B2B e-commerce platform website are selected as the data source.
(2) The extracted data undergo ETL processing to form the keywords in users' search behaviour and product information together with their related usage information, including the keywords in the network log, the search time, the client IP of the search, the product keywords and so on, and are stored in the data warehouse.
Here ETL refers to extracting distributed, heterogeneous data sources into a temporary intermediate layer, then cleaning, converting and integrating them, and finally loading them into the data warehouse or data mart, forming the basis of online analytical processing and data mining.
(3) The product keyword data are processed by the word filter to remove non-word characters.
1. A line of data records is taken, the number of words is counted, and the line is split on each word space.
2. Each character is examined in turn: spaces are not counted; non-space characters forming a word are counted.
3. When the previous character is a space and the following one is a letter, one word is counted.
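The three sub-steps of the word filter can be sketched as follows (the sample record is illustrative):

```python
def filter_words(line):
    """Split a log record on spaces and keep only the letter characters of
    each token, dropping tokens with no letters (non-word characters removed)."""
    words = []
    for token in line.split(" "):
        cleaned = "".join(ch for ch in token if ch.isalpha())
        if cleaned:
            words.append(cleaned)
    return words, len(words)

words, count = filter_words("stainless steel  pipe-fitting 304!")
print(words)  # ['stainless', 'steel', 'pipefitting']
```

Purely numeric or symbolic tokens (such as "304!") are discarded entirely, matching the non-word-letter removal described.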
Step 2: the product dictionary is derived from the product information, and the related industry dictionary contains the trade information. The product-related processor processes the product information by industry and classifies the keywords corresponding to each product by industry type.
(1) The product keywords are matched against the industry keywords by the word-matching method, and according to the features they have in common, the class to which the keywords corresponding to the product belong is determined.
(2) After the product keywords have been matched against the industry words, the matched result is output to the synonym corpus for processing, and the matching range is expanded with the corpus according to the words that the keywords have in common.
(3) Keywords not present in the product dictionary are deleted, and the data are returned to the relational database.
The product-information data comprise: product ID and product keywords; the industry keyword database comprises: industry ID and industry keywords. In addition, it is also used to save the intermediate transition data produced in the statistics mining process.
Step 3: after the product processor, if the keyword of the product is an uncommon word, it needs to enter the learning database for processing; otherwise it goes directly to the word-segmentation processor.
In detail, this is divided into the following steps:
(1) Keywords not present in the product dictionary are deleted, thereby simplifying the product-information intermediate database.
(2) It is judged whether the current keyword is an uncommon word or a keyword that cannot be matched; if so, it is output to the learning database for processing; otherwise it is output to the word-segmentation processor.
Step 4: the word segmenter receives the data from the product-related processor and the comprehensive related dictionary and judges whether they are English words; English words are split by traversing the spaces and combined to form <product name, keyword> sequences.
(1) Each <product name, keyword> sequence pair is sorted by product ID and stored in the data cache, forming the complete product-related dictionary.
(2) When a word is judged to be non-English, it is output to the learning database for learning; if it is an English word, the traversal loop continues and one item of product information is processed.
Step 5: the affix processor accepts the data from the word segmenter, a group of data sets being input to the affix processor. If, after affix processing, a word cannot be matched against the dictionary, it is output to the learning database; otherwise it is combined to form a <product name, keyword> sequence.
Step 6: after the affix processor has finished, the result is output to the root processor, which accepts and processes the data from the affix processor, matches the data further against the dictionary, then inputs them to the similarity processor to compute the root with the maximum similarity, and finally returns the result and combines it into a <product name, keyword> sequence.
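Steps 5 and 6 can be sketched together as dictionary-checked suffix stripping; the dictionary and suffix list here are illustrative placeholders, not the patent's actual dictionaries:

```python
# Hypothetical root dictionary; a real system would use the product dictionary.
DICTIONARY = {"connect", "machine", "pack", "seal"}

# Suffixes tried longest-first, a deliberately small illustrative list.
SUFFIXES = ["ation", "ment", "ness", "ing", "ers", "er", "ed", "s"]

def strip_affixes(word):
    """Strip a known suffix so the remainder matches the dictionary; words
    with no dictionary match would be routed to the learning database."""
    if word in DICTIONARY:
        return word
    for suf in SUFFIXES:
        if word.endswith(suf) and word[:-len(suf)] in DICTIONARY:
            return word[:-len(suf)]
    return None   # no match -> send to learning database

print(strip_affixes("sealing"))  # seal
```

A word like "widget", absent from the root dictionary under every suffix, returns None, which corresponds to the learning-database branch of steps 5 and 6.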
Step 7: after the root processor has finished, the result sequence is output to the singular/plural processor; upon accepting a piece of data, it loops over each letter of each word and processes it, returns the word to the root processor if an exception occurs, and on success combines the data into a <product name, keyword> sequence.
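The singular/plural conversion of step 7 can be sketched as follows (the irregular table and suffix rules shown are a small illustrative subset, not a complete inventory):

```python
def singularize(word):
    """Reduce common English plural forms to the prototype; irregular plurals
    come from a lookup table (only a few shown here for illustration)."""
    irregular = {"feet": "foot", "men": "man", "children": "child"}
    if word in irregular:
        return irregular[word]
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"                  # batteries -> battery
    if word.endswith(("ches", "shes", "xes", "ses")):
        return word[:-2]                        # boxes -> box
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                        # valves -> valve
    return word                                 # glass stays glass
```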
Step 8: the singular/plural processor outputs the data sequence to the tense processor. After receiving the data, the tense processor decides which tense the data belong to according to the temporal-sequence status sample temporal logic, calls different results accordingly, and invokes the four kinds of processing Always, Sometime, Until and Next according to the type.
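The word-level part of step 8 — converting inflected forms back to the prototype — can be sketched as follows; the Always/Sometime/Until/Next dispatch is omitted, and the dictionary is a placeholder:

```python
def normalize_tense(word, dictionary):
    """Undo regular -ing/-ed inflection, checking each candidate stem against
    the dictionary; unresolved words would go to the learning database."""
    if word in dictionary:
        return word
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            candidates = [stem, stem + "e"]      # making -> mak -> make
            if len(stem) > 1 and stem[-1] == stem[-2]:
                candidates.append(stem[:-1])     # shipped -> shipp -> ship
            for c in candidates:
                if c in dictionary:
                    return c
    return None   # irregular or unknown -> learning database
```

Irregular verbs would need a lookup table, analogous to the irregular-plural table in the singular/plural sketch.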
Step 9: the output of the tense processor enters the word recombination module, which first performs the word-formation dictionary check, the morphological-distance and smallest-edit-distance calculation and the similar-key-rule processing, rejecting misspelled words; it then processes them through the learning database, provides the corresponding correct spellings, recombines them into the original data structure and deposits them into the cache; then, according to the different types, the data in the cache are indexed and output to the index database.
Step 10: the word recombination module outputs data to the kernel-keyword index database, where the cached data are created as kernel-keyword index text files. If the type output from word recombination is an industry core word, an index text file of industry core words is built; if the type output from word recombination is a search core word, an index text file of search core words is built.
Step 11: the outputs of the affix processor, product-related processor, root processor and word recombination module go to the learning database. After the learning database receives these data, the data first enter the learner for learning: a group of rule bases is established first, then the rule weights and variable weights are computed to determine the input–output space of the model, then the matching-degree measure; the number of unmatched models is limited according to the data, and if no corresponding rule can be found, the degree of association of that rule is reduced. The learning database outputs the rule data to the knowledge base; when the knowledge base receives the data from the learning database, it obtains knowledge through a series of thinking processes and stores this knowledge in the knowledge base.
Step 12: if the current knowledge data already exist in the knowledge base, it is re-examined whether the update condition is satisfied; if the update condition is satisfied the update is performed, otherwise the data are returned to the learning database.
Step 13: after the learning database has accepted and processed the data, if one piece of data has several results, the data are first output to the scorer; the whole scoring process is performed by the executor, which first takes the document out of the knowledge base, scores it with the scorer, and then writes it back into the knowledge base.
The above is only the preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can also be made without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (3)
1. A kernel keyword extraction method based on a B2B platform, characterised by comprising the steps of:
(1) taking the user-set product names, search words and industry hot words on the B2B platform as the dictionary source; the dictionary source is preprocessed and then saved in the data mart, constituting the product-name core word bank; the method of preprocessing the dictionary source is:
for user-set product names, the principle of high-frequency use of user-set product names is applied first, rejecting the user-set product names used fewer times; the user-set keywords of the corresponding user-set product names are then saved in the user-set keyword database;
for search words, stop words including punctuation and special symbols are first filtered out; the principle of high-frequency use of search words is then applied, rejecting the search words used less frequently in the most recent half year; preprocessing is then performed by the core word-segmentation processor, forming search keywords that are saved in the search high-frequency dictionary;
for industry hot words, classified by industry, stop words including punctuation and special symbols are first filtered out; the principle of high-frequency use of industry hot words is then applied, rejecting the industry hot words used fewer times; preprocessing is then performed by the core word-segmentation processor, forming industry hot keywords that are saved in the industry high-frequency dictionary;
(2) for all valid product names in the current site, non-word characters including punctuation and special symbols are first filtered out; preprocessing is then performed by the core word-segmentation processor, and the resulting product names are saved in the product high-frequency dictionary;
(3) the product names in the product high-frequency dictionary are matched against the product-name core word bank; the product names obtained by matching are deduplicated, one record per product name, output in the order of first occurrence in the product name, and saved in the data mart, constituting the kernel keywords of the product names; the matching rules are:
1. a search keyword occurs in the product name, and that search keyword is a user-set keyword;
2. a search keyword occurs in the product name, and that search keyword is an industry hot keyword;
a search keyword occurring in a product name that satisfies either of the above matching rules is defined as a kernel keyword of that product name;
Described core word segmentation processing device includes word segmentation processing device, affixe processor, root process device, DANFU number processor, tense
Processor, similarity processor, word recombination module, keyword index storehouse and learning database, wherein:
Described word segmentation processing device, to English name of product, is split by traversal space, carries out according to word and phrase
Word segmentation processing, combination forms<name of product, key word>sequence, and is ranked up according to product IDs;
Described affixe processor, the data produced after word segmentation processing device is processed, remove that each word is front/rear to be sewed, by its of word
His form is converted into noun, or derivative is converted into noun, is mated with dictionary by the noun obtained;For cannot be with word
The word that allusion quotation matches, by corresponding word output to learning database;For the word that can match with dictionary, more it is newly formed
<name of product, key word>sequence;
Described root process device, the data produced after affixe processor is processed, enter according to the part of speech of word according to root algorithm
Row root extracts, then is mated with dictionary by the root of extraction;For the word that cannot match with dictionary, will be single accordingly
Word exports to learning database;For the word that can match with dictionary, more it is newly formed<name of product, key word>sequence;
Described DANFU number processor, the data produced after root process device is processed, carry out single complex processing, word is converted to
Prototype, is more newly formed<name of product, key word>sequence;
Described tense processor, the data produced after DANFU number processor is processed, carry out tense process, word is converted to former
Type, is more newly formed<name of product, key word>sequence;
When a matched word has two or more meanings, the similarity processor computes which meaning has the highest similarity to the context and selects it;
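The patent does not specify the similarity measure; a simplified Lesk-style overlap between each candidate sense and the other words of the product name is one plausible reading. The sense inventory below is purely illustrative.

```python
# Word-sense selection sketch: score each sense of an ambiguous word
# by its overlap with the surrounding product-name words and keep the
# highest-scoring sense.

SENSES = {
    "pump": [
        ("machine for moving fluid", {"water", "fluid", "valve", "pressure"}),
        ("type of shoe", {"shoe", "heel", "leather", "fashion"}),
    ],
}

def best_sense(word, context_words, senses=SENSES):
    candidates = senses.get(word, [])
    if not candidates:
        return None
    # similarity = size of the overlap between context and sense signature
    return max(candidates, key=lambda s: len(s[1] & set(context_words)))[0]
```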
The word recombination module takes the data produced by the tense processor. It first applies word-formation dictionary checks, minimum-edit-distance calculation of word distance, and adjacent-key rules to reject misspelled words; it then uses the learning database to supply the correct spellings, assembles the words into data of the proper structure, and stores them in a cache; finally, the cached data are indexed by industry type and output to the core-keyword index repository;
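The minimum-edit-distance check named above is the classic Levenshtein distance; a sketch follows. The distance threshold and the dictionary are illustrative choices, not values given by the patent.

```python
# Reject/repair misspelled words by minimum edit distance: a token
# within a small distance of a dictionary word is treated as a
# spelling error of that word.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def correct_spelling(word, dictionary, max_dist=1):
    """Return the closest dictionary word within max_dist, else None."""
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else None
```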
The keyword index repository builds the cached data into a core-keyword index text file; in addition, it builds an industry core-word index text file for the industry core words output by the word recombination module, and a search core-word index text file for the search core words output by the word recombination module;
The learning database comprises four essential parts: a learner, a knowledge base, an executor, and a scorer. When data produced by the affix processor, the stemming processor, the product-association processor, and the word recombination module are output to the learning database, the data first enter the learner. Drawing on the knowledge in the knowledge base, the learner learns from the input data: it first establishes a set of rules, then computes rule weights and variable weights, and outputs the established rules and computed quantities to the knowledge base. The knowledge base applies a series of reasoning steps to the input data to obtain knowledge, where knowledge refers to a set of rule-based algorithms. If an obtained algorithm already exists in the knowledge base, the system checks whether the conditions for updating the knowledge base are satisfied; if they are, the knowledge base is updated, otherwise the data are returned to the learner. The executor executes the knowledge obtained by the knowledge base, and the scorer scores the executor's results; if the score is acceptable, the knowledge satisfies the condition for updating the knowledge base.
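The learner–executor–scorer loop can be sketched at a high level as follows. All names, the callback interfaces, and the passing threshold are assumptions made for the illustration; the patent does not specify them.

```python
# High-level sketch of the learning-database cycle: the learner
# proposes a rule, the executor applies it, the scorer grades the
# result, and the knowledge base is updated only when the score
# passes a threshold; otherwise the data return to the learner.

def learning_cycle(sample, propose_rule, execute, score,
                   knowledge_base, pass_mark=0.8):
    rule = propose_rule(sample)            # learner establishes a rule
    result = execute(rule, sample)         # executor performs the knowledge
    if score(result) >= pass_mark:         # scorer marks the result
        knowledge_base[rule] = result      # condition met: update knowledge base
        return True
    return False                           # data returned to the learner
```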
The kernel keyword extraction method based on a B2B platform according to claim 1, characterized in that: in step (2), the product high-frequency dictionary is derived from product information, and the industry high-frequency dictionary contains industry information; the product information must be correlated by the product-association processor. Product information includes product IDs and product keywords; industry information includes industry IDs and industry hot keywords;
The product keywords are classified according to the industry type corresponding to each product name, which specifically includes the following steps:
(21) match the product keywords against the industry hot keywords by word matching, and determine the industry category of the product from the features they have in common;
(22) according to the determined industry category, output the product keywords into the synonym corpus, and expand the product keywords according to the words that the product keywords and the synonym corpus have in common;
(23) first reject the product keywords that are absent from the dictionary, then output the uncommon and unmatchable product keywords to the learning database, and pass the remaining product keywords to the core word-segmentation processor for preprocessing.
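Step (21) can be sketched as a keyword-overlap vote: the product is assigned to the industry whose hot-keyword set shares the most words with the product's keywords. The industry data below are illustrative.

```python
# Classify a product into an industry by the overlap between its
# keywords and each industry's hot keywords (step 21 of the method).

INDUSTRY_HOT_WORDS = {
    "machinery": {"pump", "valve", "motor", "bearing"},
    "apparel": {"shirt", "cotton", "fabric", "zipper"},
}

def classify_industry(product_keywords, industries=INDUSTRY_HOT_WORDS):
    """Return the industry whose hot-keyword set has the largest
    intersection with the product's keywords."""
    kw = set(product_keywords)
    return max(industries, key=lambda ind: len(industries[ind] & kw))
```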
The kernel keyword extraction method based on a B2B platform according to claim 1, characterized in that the word-segmentation processor splits the English product name by traversing its spaces, comprising the following steps:
1. split the product name into words at the spaces;
2. remove stop words, including punctuation and special symbols, and number the remaining words 0, 1, 2, ..., N;
3. for the n-th word, match the n-th word against the (n+i)-th word: if the n-th and (n+i)-th words form a phrase, then n = n+1, until n = N; otherwise the n-th and (n+i)-th words are separate words and i = i+1, until n+i = N; n = 0, 1, 2, ..., N; i = 1, 2, ....
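The three steps above can be sketched as follows, with a greedy adjacent-pair merge standing in for the phrase-matching loop; the stop-word list and phrase dictionary are illustrative assumptions.

```python
# Space-traversal segmentation sketch: split on spaces, drop stop
# words and punctuation, then merge adjacent words that form a known
# phrase into a single keyword.

import re

STOP_WORDS = {"for", "of", "the", "and", "with"}
PHRASES = {("stainless", "steel"), ("water", "pump")}

def segment(product_name, phrases=PHRASES):
    # steps 1-2: split at spaces, strip punctuation, remove stop words
    tokens = [re.sub(r"[^\w]", "", t).lower() for t in product_name.split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    # step 3: merge the n-th and (n+1)-th words when they form a phrase
    out, n = [], 0
    while n < len(tokens):
        if n + 1 < len(tokens) and (tokens[n], tokens[n + 1]) in phrases:
            out.append(tokens[n] + " " + tokens[n + 1])
            n += 2
        else:
            out.append(tokens[n])
            n += 1
    return out
```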
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410765503.8A CN104408173B (en) | 2014-12-11 | 2014-12-11 | A kind of kernel keyword extraction method based on B2B platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104408173A CN104408173A (en) | 2015-03-11 |
CN104408173B true CN104408173B (en) | 2016-12-07 |
Family
ID=52645804
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410765503.8A Active CN104408173B (en) | 2014-12-11 | 2014-12-11 | A kind of kernel keyword extraction method based on B2B platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104408173B (en) |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104978404B (en) * | 2015-06-04 | 2018-07-20 | 无锡天脉聚源传媒科技有限公司 | A kind of generation method and device of video album title |
CN104978400A (en) * | 2015-06-04 | 2015-10-14 | 无锡天脉聚源传媒科技有限公司 | Method for generating video album name and apparatus |
CN104978403B (en) * | 2015-06-04 | 2018-08-24 | 无锡天脉聚源传媒科技有限公司 | A kind of generation method and device of video album title |
CA2976001A1 (en) * | 2015-08-11 | 2017-02-16 | Vissie CHEN | Method for producing, ensuring, accessing and selling quality data |
WO2017045186A1 (en) * | 2015-09-17 | 2017-03-23 | 深圳市世强先进科技有限公司 | Keyword defining method and system |
CN106803197B (en) * | 2015-11-26 | 2021-09-03 | 北京嘀嘀无限科技发展有限公司 | Order splicing method and equipment |
CN107203542A (en) * | 2016-03-17 | 2017-09-26 | 阿里巴巴集团控股有限公司 | Phrase extracting method and device |
CN105893592B (en) * | 2016-04-12 | 2019-06-21 | Oppo广东移动通信有限公司 | A kind of searching method and device |
CN106383910B (en) * | 2016-10-09 | 2020-02-14 | 合一网络技术(北京)有限公司 | Method for determining search term weight, and method and device for pushing network resources |
CN108121754B (en) * | 2016-11-30 | 2020-11-24 | 北京国双科技有限公司 | Method and device for acquiring keyword attribute combination |
CN108241699B (en) * | 2016-12-26 | 2022-03-11 | 百度在线网络技术(北京)有限公司 | Method and device for pushing information |
CN106980961A (en) * | 2017-03-02 | 2017-07-25 | 中科天地互联网科技(苏州)有限公司 | A kind of resume selection matching process and system |
CN108984554B (en) * | 2017-06-01 | 2021-06-29 | 北京京东尚科信息技术有限公司 | Method and device for determining keywords |
CN107391481A (en) * | 2017-06-29 | 2017-11-24 | 清远墨墨教育科技有限公司 | A kind of vocabulary screening technique for vocabulary test |
CN108038100A (en) * | 2017-11-30 | 2018-05-15 | 四川隧唐科技股份有限公司 | engineering keyword extracting method and device |
CN109947947B (en) * | 2019-03-29 | 2021-11-23 | 北京泰迪熊移动科技有限公司 | Text classification method and device and computer readable storage medium |
CN111782760A (en) * | 2019-05-09 | 2020-10-16 | 北京沃东天骏信息技术有限公司 | Core product word recognition method, device and equipment |
CN111339763B (en) * | 2020-02-26 | 2022-06-28 | 四川大学 | English mail subject generation method based on multi-level neural network |
CN111859972B (en) * | 2020-07-28 | 2024-03-15 | 平安科技(深圳)有限公司 | Entity identification method, entity identification device, computer equipment and computer readable storage medium |
CN112163421B (en) * | 2020-10-09 | 2022-05-17 | 厦门大学 | Keyword extraction method based on N-Gram |
CN112818693A (en) * | 2021-02-07 | 2021-05-18 | 深圳市世强元件网络有限公司 | Automatic extraction method and system for electronic component model words |
CN113032683B (en) * | 2021-04-28 | 2021-12-24 | 玉米社(深圳)网络科技有限公司 | Method for quickly segmenting words in network popularization |
CN114860865A (en) * | 2022-05-05 | 2022-08-05 | 北京达佳互联信息技术有限公司 | Index construction and resource recall method and device, electronic equipment and storage medium |
CN117272991A (en) * | 2022-09-30 | 2023-12-22 | 上海寰通商务科技有限公司 | Method, device and medium for identifying target object in pharmaceutical industry to be identified |
CN115470323B (en) * | 2022-10-31 | 2023-03-10 | 中建电子商务有限责任公司 | Method for improving searching precision of building industry based on word segmentation technology |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103226618A (en) * | 2013-05-21 | 2013-07-31 | 焦点科技股份有限公司 | Related word extracting method and system based on data market mining |
CN103377190A (en) * | 2012-04-11 | 2013-10-30 | 阿里巴巴集团控股有限公司 | Trading platform based supplier information searching method and device |
CN103745012A (en) * | 2014-01-28 | 2014-04-23 | 广州一呼百应网络技术有限公司 | Method and system for intelligently matching and showing recommended information of web page according to product title |
CN103942347A (en) * | 2014-05-19 | 2014-07-23 | 焦点科技股份有限公司 | Word separating method based on multi-dimensional comprehensive lexicon |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104408173B (en) | A kind of kernel keyword extraction method based on B2B platform | |
CN111611361B (en) | Intelligent reading, understanding, question answering system of extraction type machine | |
CN110633409B (en) | Automobile news event extraction method integrating rules and deep learning | |
CN107992597B (en) | Text structuring method for power grid fault case | |
CN109271505B (en) | Question-answering system implementation method based on question-answer pairs | |
Guo et al. | Improving multilingual semantic interoperation in cross-organizational enterprise systems through concept disambiguation | |
CN110377715A (en) | Reasoning type accurate intelligent answering method based on legal knowledge map | |
Ru et al. | Using semantic similarity to reduce wrong labels in distant supervision for relation extraction | |
CN109783806B (en) | Text matching method utilizing semantic parsing structure | |
Korobkin et al. | Methods of statistical and semantic patent analysis | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
CN112036178A (en) | Distribution network entity related semantic search method | |
CN116822625A (en) | Divergent-type associated fan equipment operation and detection knowledge graph construction and retrieval method | |
CN107818081A (en) | Sentence similarity appraisal procedure based on deep semantic model and semantic character labeling | |
Ahmed et al. | Named entity recognition by using maximum entropy | |
Kilias et al. | Idel: In-database entity linking with neural embeddings | |
CN114841353A (en) | Quantum language model modeling system fusing syntactic information and application thereof | |
CN112862569B (en) | Product appearance style evaluation method and system based on image and text multi-modal data | |
Alian et al. | Paraphrasing identification techniques in English and Arabic texts | |
Ahkouk et al. | Comparative study of existing approaches on the Task of Natural Language to Database Language | |
Nethravathi et al. | Structuring natural language to query language: a review | |
Zhang et al. | An approach for named entity disambiguation with knowledge graph | |
Hossen et al. | Bert model-based natural language to nosql query conversion using deep learning approach | |
Angermann et al. | Taxonomy Matching Using Background Knowledge | |
Liu et al. | Keywords extraction method for technological demands of small and medium-sized enterprises based on LDA |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |