CN107273513A - Keyword recognition method based on machine learning - Google Patents

Keyword recognition method based on machine learning

Info

Publication number
CN107273513A
Authority
CN
China
Prior art keywords
word
subtree
node
lead
hash tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710474708.4A
Other languages
Chinese (zh)
Inventor
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Original Assignee
BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority to CN201710474708.4A
Publication of CN107273513A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a keyword recognition method based on machine learning. The method comprises: building a data retrieval structure using a hash-tree dictionary, and eliminating the ambiguous fields that arise during word segmentation using a statistical method. The method improves the handling of ambiguous fields in word segmentation and achieves better time complexity and segmentation accuracy.

Description

Keyword recognition method based on machine learning
Technical field
The present invention relates to natural language processing, and more particularly to a keyword recognition method based on machine learning.
Background
As network technology and the Internet mature, the traditional single-keyword approach can no longer meet the demand for obtaining content from today's massive volume of information, and how to design a question answering system has become an important problem for web search. Judging from existing question answering systems, Chinese-language products lag behind technically because of the complexity of Chinese word segmentation and the limits of semantic recognition. For example, existing segmentation methods must first set an initial value for the matching word length: if the word length is too long, the time complexity of the algorithm rises; if it is too short, segmentation accuracy falls. The handling of ambiguous fields also fails to meet users' practical needs.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a keyword recognition method based on machine learning, comprising:
building a data retrieval structure using a hash-tree dictionary; and
eliminating the ambiguous fields that arise during word segmentation using a statistical method.
Preferably, the hash-tree dictionary is used to store character strings and to support fast string lookup. The hash tree consists of two parts, a lead-character index and hash-tree nodes; during a single scan of the sentence being segmented, matching proceeds character by character from the root node along the tree chain.
Preferably, given the number of partitions A of the Chinese character encoding standard and the number of characters B per partition, B data units are stored in a hash-tree node and the Chinese word segmentation dictionary is loaded. The data retrieval structure is built as follows:
The lead-character index is located directly from the character's region-position code using the following formula:
$$\mathrm{Pos} = (c_1 - 176) \times B + (c_2 - 161)$$
where Pos is the position of the character in the lead-character index, $c_1$ is the unsigned value of the first byte of the lead character, and $c_2$ is the unsigned value of its second byte.
Preferably, each lead-character index node contains the following data:
Attributes: whether the single character itself matches as a word; whether a subtree exists; and the length of the longest word that has the indexed character as its lead character;
Subtree size: when a subtree exists, the number of two-character phrases whose lead character is the indexed character; otherwise 0;
Subtree pointer: when a subtree exists, the pointer points to the subtree; otherwise the pointer is null.
Each unit of the lead-character index is the root node of the hash tree for the corresponding character.
Preferably, each hash-tree node contains: a single key character; a flag indicating whether a subtree exists; whether the string from the root node to the current node's key character matches as a word; the length of the longest word that has the string from the root node to the current node's key character as a prefix; the subtree size, which, when a subtree exists, is the number of words that have the string from the root node to the current key character as a prefix, and is otherwise 0; and a subtree pointer, which points to the subtree when one exists and is otherwise null.
Preferably, the method further comprises:
treating the n-th word in a sentence as a state transition of a Markov process over the n-1 words immediately preceding it; performing part-of-speech tagging by using the co-occurrence probabilities between these words as the transitions between states; extracting the various probability parameters by training on a corpus; computing, from those parameters, the probability of each candidate string corresponding to a given word; and then selecting the appropriate string as output according to a predefined criterion.
Compared with the prior art, the present invention has the following advantage:
The present invention proposes a keyword recognition method based on machine learning that improves the handling of ambiguous fields in word segmentation and achieves better time complexity and segmentation accuracy.
Brief description of the drawings
Fig. 1 is a flow chart of the keyword recognition method based on machine learning according to an embodiment of the present invention.
Detailed description of the embodiments
A detailed description of one or more embodiments of the invention is provided below, together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.
One aspect of the present invention provides a keyword recognition method based on machine learning. Fig. 1 is a flow chart of the keyword recognition method based on machine learning according to an embodiment of the present invention.
The present invention builds a data retrieval structure using a hash-tree dictionary and uses a statistical method to eliminate the ambiguous fields that arise during word segmentation. The hash-tree dictionary is used to store character strings and to support fast string lookup. The hash tree consists of two parts, a lead-character index and hash-tree nodes; during a single scan of the sentence being segmented, matching proceeds character by character from the root node along the tree chain. Given the number of partitions A of the Chinese character encoding standard and the number of characters B per partition, B data units are stored in a hash-tree node and the Chinese word segmentation dictionary is loaded. The data structure is as follows:
The lead-character index is located directly from the character's region-position code using the following formula:
$$\mathrm{Pos} = (c_1 - 176) \times B + (c_2 - 161)$$
where Pos is the position of the character in the lead-character index, $c_1$ is the unsigned value of the first byte of the lead character, and $c_2$ is the unsigned value of its second byte.
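As a concrete illustration of the position formula, the following minimal Python sketch computes Pos for a single character. It assumes that the "Chinese character code standard" is GB2312 and that B = 94 (the number of positions per GB2312 row); the source states neither value, and the function name `lead_index` is hypothetical.

```python
def lead_index(ch: str, b: int = 94) -> int:
    """Return Pos = (c1 - 176) * B + (c2 - 161) for one Chinese character.

    Assumptions (not stated in the source): the character is encoded as
    two GB2312 bytes, and B = 94 characters per partition.
    """
    raw = ch.encode("gb2312")          # raises UnicodeEncodeError outside GB2312
    if len(raw) != 2 or raw[0] < 176:
        raise ValueError(f"{ch!r} is not a single GB2312 Chinese character")
    c1, c2 = raw                       # bytes unpack to unsigned ints in 0..255
    return (c1 - 176) * b + (c2 - 161)
```

Under these assumptions the first character of the GB2312 Chinese character area, 啊 (bytes 0xB0 0xA1), maps to position 0, so the lead-in index can be a flat array addressed without any search.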
Each lead-character index node contains the following data. Attributes: whether the single character itself matches as a word; whether a subtree exists; and the length of the longest word that has the indexed character as its lead character. Subtree size: when a subtree exists, the number of two-character phrases whose lead character is the indexed character; otherwise 0. Subtree pointer: when a subtree exists, the pointer points to the subtree; otherwise the pointer is null. Each unit of the lead-character index is the root node of the hash tree for the corresponding character.
Each hash-tree node contains: a single key character; a flag indicating whether a subtree exists; whether the string from the root node to the current node's key character matches as a word; the length of the longest word that has the string from the root node to the current node's key character as a prefix; the subtree size, which, when a subtree exists, is the number of words that have the string from the root node to the current key character as a prefix, and is otherwise 0; and a subtree pointer, which points to the subtree when one exists and is otherwise null.
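The node layout described above can be sketched roughly as follows. This is an illustrative simplification, not the patent's storage format: the lead-in index is modeled as a Python dict rather than an array addressed by Pos, the subtree size is taken to be the number of child nodes, and the names `HashTreeNode`, `add_word`, and `contains` are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HashTreeNode:
    key: str                      # single key character
    is_word: bool = False         # root..this node spells a dictionary word
    max_len: int = 0              # longest word that has root..this node as prefix
    keys: List[str] = field(default_factory=list)               # sorted child keys
    children: List["HashTreeNode"] = field(default_factory=list)  # parallel to keys

    @property
    def subtree_size(self) -> int:
        return len(self.children)  # 0 when no subtree exists

    def find_child(self, ch: str) -> Optional["HashTreeNode"]:
        i = bisect_left(self.keys, ch)          # binary search over sorted keys
        if i < len(self.keys) and self.keys[i] == ch:
            return self.children[i]
        return None


def add_word(index: Dict[str, HashTreeNode], word: str) -> None:
    """Insert a word; `index` maps each lead character to its root node."""
    node = index.setdefault(word[0], HashTreeNode(word[0]))
    node.max_len = max(node.max_len, len(word))
    for ch in word[1:]:
        child = node.find_child(ch)
        if child is None:
            child = HashTreeNode(ch)
            i = bisect_left(node.keys, ch)      # keep children sorted by key
            node.keys.insert(i, ch)
            node.children.insert(i, child)
        child.max_len = max(child.max_len, len(word))
        node = child
    node.is_word = True


def contains(index: Dict[str, HashTreeNode], word: str) -> bool:
    """Character-by-character lookup from the lead character's root node."""
    if not word:
        return False
    node = index.get(word[0])
    for ch in word[1:]:
        if node is None:
            return False
        node = node.find_child(ch)
    return node is not None and node.is_word
```

Keeping each node's children sorted is what allows the binary search mentioned in the matching procedure; the stored longest-word length lets a scan stop early once no longer word can match.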
The n-th word in a sentence is treated as a state transition of a Markov process over the n-1 words immediately preceding it. Part-of-speech tagging is performed by using the co-occurrence probabilities between these words as the transitions between states. The various probability parameters are extracted by training on a corpus, the probability of each candidate string corresponding to a given word is computed from those parameters, and the appropriate string is then selected as output according to a predefined criterion.
On the basis of the hash-tree dictionary, the Chinese sentence is first cut into short clauses according to a punctuation table. Then, while matching-based segmentation is performed, the match information of the character strings encountered during matching is saved. The match information and a character-by-character scan are used to determine whether ambiguous fields exist, and the intermediate result of this pre-segmentation is finally handed to the segmentation process. Because every character goes through a matching pass as the lead character of a word, all ambiguous fields can be found. After processing by this improved segmentation algorithm, the pre-segmentation result contains every segmentation path of each ambiguous field. Using the word-frequency information in the training corpus, a statistical model computes the probability of the words in all segmentation paths, and the word with the highest probability is the optimal word.
The details are as follows:
Step 1: Perform preliminary segmentation preprocessing on the search statement: split the search statement into multiple pre-cut substrings according to the punctuation table and save them.
Step 2: Take a pre-cut substring $S = C_0 C_1 \ldots C_{n-1}$, where n is the length of the substring, and initialize the variables temp = n, p = 0, q = 1000000.
Step 3: Take the j-th character $C_j$ of S and obtain the position of $C_j$ in the lead-character index.
Step 4: Read the data of the current node: whether it matches as a word, whether a subtree exists, and the longest word length; if a subtree exists, also read the subtree pointer and the subtree size.
Step 5: Set i = j + 1. If the longest word length is less than N, go to step 11.
Step 6: Take the i-th character $C_i$ of S and read the data of each hash-tree node under the subtree pointer: the key character, the flag indicating whether it matches as a word, the longest word length, the subtree pointer, and the subtree size. Use binary search to match $C_i$ within the interval that starts at the first key character and spans the subtree size. If the match fails, $C_{i-1} C_i$ does not form a word.
In that case, if i - 1 > q, save the relevant information of $C_p C_{p+1} \ldots C_{i-1}$ into the ambiguous-field set, shift S right by i - p characters, set n = temp - (i - p), and go to step 3;
otherwise, if i - 1 ≤ q, save $C_p C_{p+1} \ldots C_{i-1}$ into the word information, shift S right by i - j characters at p, set n = temp - (i - j + 1), and go to step 3.
Step 7: Read the data of each hash-tree node under the subtree pointer, including the key character, whether it matches as a word, the longest word length, the subtree pointer, and the subtree size.
Step 8: If the longest word length is n and the match is a word, save the matched word $C_{i+1-n} C_{i+2-n} \ldots C_i$.
Step 9: If this is the first successful match of the word, set p = i + 1 - n, q = i, j = p + 1, and go to step 3; otherwise, set j = i and go to step 3.
Step 10: Set i = i + 1 and go to step 6.
Step 11: Shift S left by one character, set n = n - 1, and go to step 3.
Step 12: If the ambiguous-field set is not empty, perform disambiguation; otherwise, segmentation ends.
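The idea that every character is tried as the lead character of a word, so that all ambiguous fields surface in one pass, can be illustrated with a much simpler sketch. The code below is not the twelve-step procedure: it uses a plain Python set instead of the hash tree, enumerates matches by brute force, and assumes that an "ambiguous field" means two dictionary words that partially overlap. The function names are hypothetical.

```python
from typing import List, Set, Tuple


def all_matches(text: str, words: Set[str]) -> List[Tuple[int, int]]:
    """Every (start, end) span of `text` that is a dictionary word."""
    longest = max(map(len, words), default=0)
    return [
        (s, e)
        for s in range(len(text))                  # each character tried as a lead character
        for e in range(s + 1, min(len(text), s + longest) + 1)
        if text[s:e] in words
    ]


def ambiguous_fields(text: str, words: Set[str]) -> List[str]:
    """Substrings covered by two dictionary words that partially overlap."""
    spans = all_matches(text, words)
    found: List[str] = []
    for s1, e1 in spans:
        for s2, e2 in spans:
            if s1 < s2 < e1 < e2:                  # the two words cross each other
                candidate = text[s1:e2]
                if candidate not in found:
                    found.append(candidate)
    return found
```

For example, with a dictionary containing 研究, 研究生, 生命, and 起源, the clause 研究生命起源 yields the ambiguous field 研究生命, because 研究生 and 生命 compete for the character 生. A field like this is what would then be passed to the disambiguation stage.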
The disambiguation processing further comprises:
Setting three parameters: A, B, and π.
A is the part-of-speech state-transition matrix; the 41 × 41 transition matrix between the 41 parts of speech is converted into table form and stored. The element $a_{ij}$ of A is:
$$a_{ij} = N(T_i, T_j) / N(T_i)$$
where $N(T_i, T_j)$ is the number of times tag $T_j$ immediately follows tag $T_i$ in the training data, and $N(T_i)$ is the number of times tag $T_i$ occurs.
B is the symbol probability distribution matrix, which stores the probabilities of the different parts of speech for each word. For the probabilities in the part-of-speech table, the distribution $b_{jk}$ is obtained from:
$$b_{jk} = N(W_k, T_j) / N(T_j)$$
where $N(W_k, T_j)$ is the number of times word $W_k$ is tagged $T_j$ in the training data.
The parameter π represents the initial state probability distribution.
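The count ratios above can be sketched as a small estimation routine over a tagged corpus. The corpus format, the function name `estimate_hmm`, and the way π is estimated (the share of sentences that begin with each tag) are assumptions made for illustration; the source does not say how π is obtained. The emission estimate divides by the count of the same tag $T_j$ that appears in the numerator.

```python
from collections import Counter
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # one sentence as (word, tag) pairs


def estimate_hmm(corpus: List[Tagged]):
    """Estimate A, B, and pi by relative frequency from tagged sentences."""
    tag_n: Counter = Counter()    # N(T_i)
    pair_n: Counter = Counter()   # N(T_i, T_j): T_j immediately follows T_i
    emit_n: Counter = Counter()   # N(W_k, T_j)
    start_n: Counter = Counter()  # sentences that begin with each tag (assumption)
    for sent in corpus:
        if not sent:
            continue
        start_n[sent[0][1]] += 1
        for word, tag in sent:
            tag_n[tag] += 1
            emit_n[(word, tag)] += 1
        for (_, t1), (_, t2) in zip(sent, sent[1:]):
            pair_n[(t1, t2)] += 1
    A: Dict[Tuple[str, str], float] = {
        (t1, t2): n / tag_n[t1] for (t1, t2), n in pair_n.items()
    }
    B: Dict[Tuple[str, str], float] = {
        (tag, word): n / tag_n[tag] for (word, tag), n in emit_n.items()
    }
    total = sum(start_n.values())
    pi: Dict[str, float] = {tag: n / total for tag, n in start_n.items()}
    return A, B, pi
```

The dictionaries are sparse: any transition or emission that never occurs in training is simply absent and can be read as probability 0. A practical system would normally add smoothing, which the source does not discuss.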
Given the parameter set λ = (A, B, π), the part of speech of each word is determined for a given sentence. Let $W = w_1 w_2 \cdots w_m$ be a sentence, where $w_i$ is a word in the sentence. Let $Q = q_1 q_2 \cdots q_m$ be a possible part-of-speech tag sequence for W, where $q_i$ is a tagging result for word $w_i$. The problem then becomes finding, under the model λ, the part-of-speech sequence that best explains the sentence W. The invention uses the HMM evaluation criterion that starts from the single most probable state $q_t$; this criterion maximizes the expected number of correct states.
For each possible pre-segmentation path, first initialize:
$$\delta_1(i) = \pi_i \, b_i(o_1)$$
Then compute by induction and save the back-pointers:
$$\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i) \, a_{ij} \right] \times b_j(o_t), \quad 2 \le t \le T, \; 1 \le j \le N$$
where t is the time step, $a_{ij}$ is the state transition probability (the part-of-speech transition), j is a part-of-speech state, and $\delta_t(j)$ is the maximum probability that the state at time t is j with outputs $o_1, o_2, \ldots, o_t$. This variable stores the probability of the most likely path reaching each node. The optimal path through the whole lattice is computed with a dynamic programming algorithm.
Finally, terminate and recover the path by backtracking:
$$P = \max_{1 \le i \le N} \left[ \delta_T(i) \right]$$
This yields the optimal state sequence and its probability.
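The initialization, induction, and termination steps correspond to the standard Viterbi recursion, which can be sketched as follows. The sparse-dictionary representation of A, B, and π, the treatment of missing entries as probability 0, and the function name `viterbi` are illustrative assumptions rather than details from the source.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    A: Dict[Tuple[str, str], float],   # A[(i, j)]: transition i -> j
    B: Dict[Tuple[str, str], float],   # B[(j, word)]: state j emits word
    pi: Dict[str, float],              # initial state distribution
) -> Tuple[float, List[str]]:
    """Return (P, best state sequence) for the observation sequence."""
    if not obs:
        raise ValueError("obs must not be empty")
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta = [{s: pi.get(s, 0.0) * B.get((s, obs[0]), 0.0) for s in states}]
    back: List[Dict[str, str]] = [{}]
    # Induction: delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o_t)
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for j in states:
            best_i = max(states, key=lambda i: delta[t - 1][i] * A.get((i, j), 0.0))
            delta[t][j] = (
                delta[t - 1][best_i] * A.get((best_i, j), 0.0) * B.get((j, obs[t]), 0.0)
            )
            back[t][j] = best_i        # remember the predecessor for backtracking
    # Termination: P = max_i delta_T(i), then follow the back-pointers
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return delta[-1][last], path
```

Running this once per candidate segmentation path gives one P value per path, so the paths can be compared on the same scale. For long sentences, an implementation would usually work with log probabilities to avoid numerical underflow.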
A P value is computed for each possible pre-segmentation path; the P values of all pre-segmentation paths are compared, and the path with the largest probability P is selected as the output.
In summary, the present invention proposes a keyword recognition method based on machine learning that improves the handling of ambiguous fields in word segmentation and achieves better time complexity and segmentation accuracy.
Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. Thus, the invention is not limited to any specific combination of hardware and software.
It should be understood that the above embodiments are only used to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent substitution, or improvement made without departing from the spirit and scope of the invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundary of the claims, or within the equivalents of such scope and boundary.

Claims (6)

1. A keyword recognition method based on machine learning, characterized by comprising:
building a data retrieval structure using a hash-tree dictionary; and
eliminating the ambiguous fields that arise during word segmentation using a statistical method.
2. The method according to claim 1, characterized in that the hash-tree dictionary is used to store character strings and to support fast string lookup; the hash tree consists of two parts, a lead-character index and hash-tree nodes, and during a single scan of the sentence being segmented, matching proceeds character by character from the root node along the tree chain.
3. The method according to claim 1, characterized in that, given the number of partitions A of the Chinese character encoding standard and the number of characters B per partition, B data units are stored in a hash-tree node and the Chinese word segmentation dictionary is loaded, and the data retrieval structure is built as follows:
The lead-character index is located directly from the character's region-position code using the following formula:
$$\mathrm{Pos} = (c_1 - 176) \times B + (c_2 - 161)$$
where Pos is the position of the character in the lead-character index, $c_1$ is the unsigned value of the first byte of the lead character, and $c_2$ is the unsigned value of its second byte.
4. The method according to claim 2 or 3, characterized in that each lead-character index node contains the following data:
Attributes: whether the single character itself matches as a word; whether a subtree exists; and the length of the longest word that has the indexed character as its lead character;
Subtree size: when a subtree exists, the number of two-character phrases whose lead character is the indexed character; otherwise 0;
Subtree pointer: when a subtree exists, the pointer points to the subtree; otherwise the pointer is null;
Each unit of the lead-character index is the root node of the hash tree for the corresponding character.
5. The method according to claim 2 or 3, characterized in that each hash-tree node contains: a single key character; a flag indicating whether a subtree exists; whether the string from the root node to the current node's key character matches as a word; the length of the longest word that has the string from the root node to the current node's key character as a prefix; the subtree size, which, when a subtree exists, is the number of words that have the string from the root node to the current key character as a prefix, and is otherwise 0; and a subtree pointer, which points to the subtree when one exists and is otherwise null.
6. The method according to claim 1, characterized in that the method further comprises:
treating the n-th word in a sentence as a state transition of a Markov process over the n-1 words immediately preceding it; performing part-of-speech tagging by using the co-occurrence probabilities between these words as the transitions between states; extracting the various probability parameters by training on a corpus; computing, from those parameters, the probability of each candidate string corresponding to a given word; and then selecting the appropriate string as output according to a predefined criterion.
CN201710474708.4A 2017-06-21 2017-06-21 Keyword recognition method based on machine learning Pending CN107273513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710474708.4A CN107273513A (en) 2017-06-21 2017-06-21 Keyword recognition method based on machine learning

Publications (1)

Publication Number Publication Date
CN107273513A (en) 2017-10-20

Family

ID=60068382

Country Status (1)

Country Link
CN (1) CN107273513A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555097A (en) * 2018-05-31 2019-12-10 罗伯特·博世有限公司 Slot filling with joint pointer and attention in spoken language understanding

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929902A (en) * 2012-07-05 2013-02-13 江苏新瑞峰信息科技有限公司 Character splitting method and device based on Chinese retrieval
CN103365974A (en) * 2013-06-28 2013-10-23 百度在线网络技术(北京)有限公司 Semantic disambiguation method and system based on related words topic
CN103823859A (en) * 2014-02-21 2014-05-28 安徽博约信息科技有限责任公司 Name recognition algorithm based on combination of decision-making tree rules and multiple statistic models
CN105975480A (en) * 2016-04-20 2016-09-28 广州精点计算机科技有限公司 Instruction identification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孟旭升: "改进的中文分词算法在自动答疑系统中的应用研究", 《万方学位论文数据库》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171020