CN107273513A

CN107273513A - Keyword recognition method based on machine learning

Info

Publication number: CN107273513A
Application number: CN201710474708.4A
Authority: CN
Inventors: 张鹏
Original assignee: BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Current assignee: BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority date: 2017-06-21
Filing date: 2017-06-21
Publication date: 2017-10-20

Abstract

The invention provides a keyword recognition method based on machine learning. The method comprises: building a data retrieval structure using a hash tree dictionary, and eliminating the ambiguous fields that arise during word segmentation by a statistical method. The method improves the handling of ambiguous fields in word segmentation and achieves better time complexity and segmentation accuracy.

Description

Keyword recognition method based on machine learning

Technical field

The present invention relates to natural language processing, and in particular to a keyword recognition method based on machine learning.

Background technology

As network technology and the Internet mature, the traditional single-keyword mode can no longer meet the demand for retrieving content from today's massive volume of information, and how to design a question answering system has become an important problem for web search. In existing question answering systems for Chinese, the complexity of word segmentation and the limits of semantic recognition have left the resulting products technically behind. For example, existing segmentation methods must first set an initial matching word length: if the word length is too long, the time complexity of the algorithm rises; if it is too short, segmentation accuracy falls. The handling of ambiguous fields also fails to meet the needs of real users.

Summary of the invention

To solve the above problems of the prior art, the present invention proposes a keyword recognition method based on machine learning, comprising:

Building a data retrieval structure using a hash tree dictionary;

Eliminating the ambiguous fields that arise during word segmentation by a statistical method.

Preferably, the hash tree dictionary is used to store character strings and provide fast string lookup. The hash tree consists of two parts, a first-character index and hash tree nodes; during a single scan of the sentence being segmented, matching proceeds character by character along the tree chain starting from the root node.

Preferably, based on the number of zones A of the Chinese character encoding standard and the number of characters per zone B, the hash tree stores B data units in the hash tree nodes and loads the Chinese word segmentation dictionary. The data retrieval structure is built as follows:

The first-character index is located directly from the zone-position code of the character using the following formula:

Pos=(c₁-176)×B+(c₂-161)

where Pos is the position of the character in the first-character index, c₁ is the unsigned value of the first byte of the character, and c₂ is the unsigned value of its second byte.
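The formula can be tried out with a minimal sketch. It assumes the Chinese character encoding standard is GB2312, where each zone holds 94 characters (so B = 94), the first Chinese-character zone has lead byte 176 (0xB0) and trail bytes start at 161 (0xA1). The function names are hypothetical and not part of the original text.

```python
PER_ZONE = 94  # assumed value of B: GB2312 stores 94 characters per zone


def pos_from_bytes(c1: int, c2: int, per_zone: int = PER_ZONE) -> int:
    """Pos = (c1 - 176) * B + (c2 - 161) for unsigned byte values c1, c2."""
    return (c1 - 176) * per_zone + (c2 - 161)


def first_char_pos(ch: str) -> int:
    """Position of one Chinese character in the first-character index."""
    # Fails for characters that are not two-byte GB2312 (e.g. ASCII letters).
    c1, c2 = ch.encode("gb2312")
    return pos_from_bytes(c1, c2)


print(first_char_pos("啊"))          # first character of the first zone
print(pos_from_bytes(0xB1, 0xA1))    # first character of the second zone
```

Because the position is computed arithmetically from the two bytes, locating the root node for a first character needs no search at all; only the levels below the root require comparisons.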

Preferably, the first-character index node includes the following data:

Attributes: whether the single character itself matches as a word, whether a subtree exists, and the length of the longest word that has the indexed character as its first character;

Subtree size: when a subtree exists, the number of two-character words whose first character is the indexed character; otherwise 0;

Subtree pointer: when a subtree exists, the pointer points to the subtree; otherwise the pointer is null;

Each unit of the first-character index is the root node of the hash tree of the corresponding character.

Preferably, the hash tree node includes: a single keyword; a flag indicating whether a subtree exists; whether the keywords from the root node to the current node match as a word; the length of the longest word that has the keywords from the root node to the current node as its prefix; the subtree size, which when a subtree exists is the number of words that have the keywords from the root node to the current node as their prefix, and is otherwise 0; and a subtree pointer, which points to the subtree when a subtree exists and is otherwise null.

Preferably, the method further comprises:

Treating the n-th word in a sentence as a state transition of a Markov process over the n-1 words immediately preceding it; performing part-of-speech tagging with the co-occurrence probabilities between these words as the transitions between states; extracting the various probability parameters by training on a corpus; computing, from the probability parameters, the probability of each possible string corresponding to a given word; and then selecting an appropriate string as output according to a predefined criterion.

Compared with the prior art, the present invention has the following advantages:

The present invention proposes a keyword recognition method based on machine learning that improves the handling of ambiguous fields in word segmentation and achieves better time complexity and segmentation accuracy.

Brief description of the drawings

Fig. 1 is a flow chart of the keyword recognition method based on machine learning according to an embodiment of the present invention.

Embodiment

A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details.

One aspect of the invention provides a keyword recognition method based on machine learning. Fig. 1 is a flow chart of the keyword recognition method based on machine learning according to an embodiment of the invention.

The invention builds a data retrieval structure using a hash tree dictionary and eliminates the ambiguous fields that arise during word segmentation by a statistical method. The hash tree dictionary stores character strings and provides fast string lookup. The hash tree consists of two parts, a first-character index and hash tree nodes; during a single scan of the sentence being segmented, matching proceeds character by character along the tree chain starting from the root node. Based on the number of zones A of the Chinese character encoding standard and the number of characters per zone B, B data units are stored in the hash tree nodes and the Chinese word segmentation dictionary is loaded. The data structure is as follows, with the position Pos of a character in the first-character index computed from the unsigned values c₁ and c₂ of its first and second bytes:

Pos=(c₁-176)×B+(c₂-161)

The first-character index node includes the following data. Attributes: whether the single character itself matches as a word, whether a subtree exists, and the length of the longest word that has the indexed character as its first character. Subtree size: when a subtree exists, the number of two-character words whose first character is the indexed character; otherwise 0. Subtree pointer: when a subtree exists, the pointer points to the subtree; otherwise the pointer is null. Each unit of the first-character index is the root node of the hash tree of the corresponding character.

The hash tree node includes: a single keyword; a flag indicating whether a subtree exists; whether the keywords from the root node to the current node match as a word; the length of the longest word that has the keywords from the root node to the current node as its prefix; the subtree size, which when a subtree exists is the number of words that have the keywords from the root node to the current node as their prefix, and is otherwise 0; and a subtree pointer, which points to the subtree when a subtree exists and is otherwise null.
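The node layout described above can be illustrated with a small sketch. It is not the patented implementation: the names `Node`, `insert` and `is_word` are hypothetical, a Python dict stands in for the array addressed by Pos, and the children of each node are kept sorted so that they can be searched by bisection.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    key: str                 # the single character stored at this node
    is_word: bool = False    # root..this node spells a dictionary word
    max_len: int = 0         # length of the longest word with this prefix
    keys: list = field(default_factory=list)      # sorted child characters
    children: list = field(default_factory=list)  # child nodes, same order

    @property
    def subtree_size(self) -> int:
        return len(self.children)  # 0 when there is no subtree

    def find_child(self, ch: str):
        i = bisect_left(self.keys, ch)  # binary search among the children
        if i < len(self.keys) and self.keys[i] == ch:
            return self.children[i]
        return None


def insert(index: dict, word: str) -> None:
    """Add a word; `index` maps a first character to its root node."""
    node = index.setdefault(word[0], Node(word[0]))
    node.max_len = max(node.max_len, len(word))
    for ch in word[1:]:
        i = bisect_left(node.keys, ch)
        if i == len(node.keys) or node.keys[i] != ch:
            node.keys.insert(i, ch)          # keep children sorted
            node.children.insert(i, Node(ch))
        node = node.children[i]
        node.max_len = max(node.max_len, len(word))
    node.is_word = True


def is_word(index: dict, text: str) -> bool:
    """Walk the tree character by character from the root node."""
    node = index.get(text[0]) if text else None
    for ch in text[1:]:
        if node is None:
            return False
        node = node.find_child(ch)
    return node is not None and node.is_word
```

For example, after inserting 中国, 中国人 and 中间, the root node for 中 has a subtree size of 2 and a longest word length of 3. Storing the longest word length at every node lets a scanner stop descending as soon as no longer word can exist.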

On the basis of the hash tree dictionary, the Chinese sentence is first cut into short clauses according to a punctuation table. Then, while matching-based segmentation is in progress, the match information of the strings encountered during matching is saved; whether ambiguous fields exist is determined from this match information by scanning character by character; and finally the intermediate result of pre-segmentation is handed to the segmentation process. Because every character goes through the process of being matched as the first character of a word, all ambiguous fields can be found. After processing by the improved segmentation algorithm, the pre-segmentation result contains all segmentation paths of each ambiguous field. Based on the word frequency information in the training corpus, a statistical model is used to compute the probability of the words in all segmentation paths, and the word with the maximum probability is the optimal word.
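The idea of keeping every segmentation path of an ambiguous field and choosing among them by corpus frequency can be sketched as follows. This is an illustration under stated assumptions: the statistical model is reduced to a product of relative word frequencies, single characters are always allowed as fallback units, unseen units get a floor count of 0.5, and the function names are hypothetical.

```python
from math import prod


def all_paths(text: str, words: set) -> list:
    """Every way to cut `text` into dictionary words or single characters."""
    if not text:
        return [[]]
    paths = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if end == 1 or piece in words:  # a lone character is always allowed
            paths += [[piece] + rest for rest in all_paths(text[end:], words)]
    return paths


def best_path(text: str, freq: dict) -> list:
    """Pick the path with the largest product of relative frequencies."""
    total = sum(freq.values())

    def score(path):
        # Units missing from the corpus get a floor count (an assumption).
        return prod(freq.get(w, 0.5) / total for w in path)

    return max(all_paths(text, set(freq)), key=score)


freq = {"研究": 40, "研究生": 10, "生命": 40, "命": 10}  # made-up counts
print(all_paths("研究生命", set(freq)))
print(best_path("研究生命", freq))
```

The enumeration grows exponentially with the length of the text, which is acceptable here only because it is meant for short ambiguous fields rather than whole sentences.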

The procedure is described in detail below:

Step 1: Perform preliminary segmentation preprocessing on the search query: the query is first pre-cut into multiple substrings according to the punctuation table, and the substrings are saved;

Step 2: Take a pre-cut substring $S = C_0 C_1 \ldots C_{n-1}$, where n is the length of the substring, and initialize the variables temp = n, p = 0, q = 1000000;

Step 3: Take the j-th character $C_j$ of S and obtain the position of $C_j$ in the first-character index;

Step 4: Read the data of the current node: whether it matches as a word, whether a subtree exists, and the longest word length; if a subtree exists, read the subtree pointer and the subtree size;

Step 5: Set i = j+1; if the longest word length < n, go to step 11;

Step 6: Take the i-th character $C_i$ of S and read the data of each hash tree node under the subtree pointer: the keyword, the attribute indicating whether it matches as a word, the longest word length, the subtree pointer and the subtree size. Use binary search to match $C_i$ within the interval that starts at the keywords and spans the subtree size. If matching fails, $C_{i-1}C_i$ does not form a word;

In that case, if i-1 > q, save the information about $C_p C_{p+1} \ldots C_{i-1}$ into the ambiguous field set, shift S right by i-p characters, set n = temp-(i-p), and go to step 3;

Otherwise, if i-1 ≤ q, save $C_p C_{p+1} \ldots C_{i-1}$ into the word information, shift S right from p by i-j characters, set n = temp-(i-j+1), and go to step 3;

Step 7: Read the data of each hash tree node under the subtree pointer, including the keyword, whether it matches as a word, the longest word length, the subtree pointer and the subtree size;

Step 8: If the longest word length is n and the node matches as a word, save the matched word $C_{i+1-n} C_{i+2-n} \ldots C_i$;

Step 9: If this is the first successful match of the word, set p = i+1-n, q = i, j = p+1, and go to step 3; otherwise, set j = i and go to step 3;

Step 10: Set i = i+1 and go to step 6;

Step 11: Shift S left by one character, set n = n-1, and go to step 3;

Step 12: If the ambiguous field set is not empty, perform disambiguation; otherwise, segmentation ends.
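The overall effect of the steps above (pre-cut at punctuation, try every character as the first character of a word, and collect spans whose matches collide) can be illustrated with a deliberately simplified sketch. It does not reproduce the bookkeeping with p, q and temp; the punctuation table, the `max_len` cap and all function names are hypothetical, and a plain set stands in for the hash tree dictionary.

```python
import re

PUNCT = "，。！？；：、,.!?;:"  # hypothetical punctuation table


def pre_cut(query: str) -> list:
    """Step 1: split the query into substrings at punctuation marks."""
    return [s for s in re.split("[" + re.escape(PUNCT) + "]", query) if s]


def matches(s: str, words: set, max_len: int = 4) -> list:
    """All (start, end) spans of multi-character dictionary words in s."""
    return [(j, j + k)
            for j in range(len(s))            # every character as first char
            for k in range(2, max_len + 1)
            if j + k <= len(s) and s[j:j + k] in words]


def ambiguous_fields(s: str, words: set) -> list:
    """Merge overlapping or nested matches; a run of 2+ words is ambiguous."""
    fields, cur, count = [], None, 0
    for start, end in sorted(matches(s, words)):
        if cur and start < cur[1]:            # collides with the current run
            cur, count = (cur[0], max(cur[1], end)), count + 1
        else:
            if cur and count > 1:
                fields.append(s[cur[0]:cur[1]])
            cur, count = (start, end), 1
    if cur and count > 1:
        fields.append(s[cur[0]:cur[1]])
    return fields


words = {"研究", "研究生", "生命", "起源"}
for sub in pre_cut("研究生命起源，很有意思。"):
    print(sub, ambiguous_fields(sub, words))
```

In this example 研究, 研究生 and 生命 collide, so 研究生命 is reported as an ambiguous field, while 起源 matches cleanly and is left alone.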

The above disambiguation processing further comprises:

Setting three parameters: the part-of-speech state matrix A, the matrix B, and π.

The part-of-speech state matrix A is the state transition matrix (41 × 41) between the 41 parts of speech, stored in table form. The element $a_{ij}$ of matrix A is:

$$a_{ij} = \frac{N(T_i, T_j)}{N(T_i)}$$

where $N(T_i, T_j)$ is the number of times the part-of-speech tag $T_j$ occurs immediately after $T_i$ in training, and $N(T_i)$ is the number of times the tag $T_i$ occurs.

The symbol probability distribution matrix B stores the probabilities of the different parts of speech for each word. From the probabilities in the part-of-speech table, the part-of-speech probability distribution $b_{jk}$ is obtained by:

$$b_{jk} = \frac{N(W_k, T_j)}{N(T_j)}$$

where $N(W_k, T_j)$ is the number of times the word $W_k$ is tagged $T_j$ in training, and $N(T_j)$ is the number of times the tag $T_j$ occurs.

The parameter π represents the initial state probability distribution.
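The count-based estimates can be sketched on a toy tagged corpus. The sketch makes two assumptions that the text does not spell out: emission counts are normalized by the count of the same tag $T_j$ (the usual hidden Markov model convention), and π is taken as the relative frequency of sentence-initial tags. The function name, the tag labels and the corpus are made up for illustration.

```python
from collections import Counter


def estimate(tagged_sentences):
    """Return (A, B, pi) as dicts from non-empty lists of (word, tag) pairs."""
    tag_n, trans_n, emit_n, first_n = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        first_n[tags[0]] += 1                     # sentence-initial tag
        tag_n.update(tags)                        # N(T_i)
        trans_n.update(zip(tags, tags[1:]))       # N(T_i, T_j)
        emit_n.update((w, t) for w, t in sent)    # N(W_k, T_j)
    A = {(ti, tj): n / tag_n[ti] for (ti, tj), n in trans_n.items()}
    B = {(w, t): n / tag_n[t] for (w, t), n in emit_n.items()}
    pi = {t: n / len(tagged_sentences) for t, n in first_n.items()}
    return A, B, pi


corpus = [
    [("我", "r"), ("爱", "v"), ("北京", "ns")],
    [("我", "r"), ("去", "v"), ("北京", "ns")],
    [("北京", "ns"), ("爱", "v"), ("我", "r")],
]
A, B, pi = estimate(corpus)
print(A[("r", "v")], B[("爱", "v")], pi["r"])
```

Because $N(T_i)$ also counts tags at the end of a sentence, which have no successor, the rows of A estimated this way need not sum exactly to 1; a production system would typically add smoothing as well.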

The input parameter set λ = (A, B, π) is used to determine the part of speech of each word in a given sentence. Let $W = w_1 w_2 \ldots w_m$ be a sentence, where $w_i$ is a word in the sentence. Let $Q = q_1 q_2 \ldots q_m$ be one possible part-of-speech tag sequence for the sentence W, where $q_i$ is the tagging result for the word $w_i$. The problem is thus converted into finding a part-of-speech sequence under the model λ that best explains the sentence W. The present invention uses the HMM evaluation criterion of choosing the individually most probable state $q_t$; this criterion maximizes the expected number of correct states.

For each possible pre-segmentation path, first initialize:

$$\delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$$

where $b_i(o_1)$ is the probability of the first observation $o_1$ in state i.

Then carry out the induction step, saving back-pointers for backtracking:

$$\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right] \times b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le N$$

where t is the time step, $a_{ij}$ is the state transition probability (the part-of-speech transition), j is the part-of-speech state, and $\delta_t(j)$ is the maximum probability that the state at time t is j and the output is $o_1, o_2, \ldots, o_t$; this variable stores the probability of the most probable path reaching each node. The optimal path through the whole lattice is computed with a dynamic programming algorithm.

Finally, terminate and produce the path (with backtracking):

$$P = \max_{1 \le i \le N} \left[ \delta_T(i) \right]$$

At this point the optimal state sequence and the optimal weight have been obtained.

A P value is computed for each possible pre-segmentation path; the P values of all pre-segmentation paths are compared, and the path with the maximum probability P is selected as the output.
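The initialization, induction and termination steps can be sketched as a short dynamic program. The sketch uses the standard Viterbi recursion, in which the value at time t is built from the values at time t-1; states and observations are plain integer indices, and the function name and the numbers in the usage example are made up.

```python
def viterbi(obs, pi, A, B):
    """Return (P, path): the best path probability and its state sequence.

    pi[i], A[i][j] and B[j][o] are the initial, transition and symbol
    probabilities; obs is a non-empty list of observation indices.
    """
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]   # initialization
    back = []                                          # back-pointers per step
    for o in obs[1:]:                                  # induction, t = 2..T
        prev = [max(range(n), key=lambda i: delta[i] * A[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        back.append(prev)
    last = max(range(n), key=lambda i: delta[i])       # termination
    path = [last]
    for prev in reversed(back):                        # backtracking
        path.append(prev[path[-1]])
    return delta[last], path[::-1]


pi = [0.8, 0.2]
A = [[0.3, 0.7], [0.6, 0.4]]
B = [[0.9, 0.1], [0.2, 0.8]]
P, path = viterbi([0, 1, 0], pi, A, B)
print(P, path)
```

To choose among several pre-segmentation paths, one would run this once per candidate and keep the candidate with the largest P. For long sentences the products become very small, so sums of log-probabilities are the usual substitute.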

In summary, the present invention proposes a keyword recognition method based on machine learning that improves the handling of ambiguous fields in word segmentation and achieves better time complexity and segmentation accuracy.

Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The invention is therefore not restricted to any specific combination of hardware and software.

It should be understood that the above embodiments of the invention serve only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent substitution, improvement and the like made without departing from the spirit and scope of the invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the appended claims, or the equivalents of such scope and boundaries.

Claims

1. A keyword recognition method based on machine learning, characterized by comprising:

Building a data retrieval structure using a hash tree dictionary,

2. The method according to claim 1, characterized in that the hash tree dictionary is used to store character strings and provide fast string lookup; the hash tree consists of two parts, a first-character index and hash tree nodes, and during a single scan of the sentence being segmented, matching proceeds character by character along the tree chain starting from the root node.

3. The method according to claim 1, characterized in that, based on the number of zones A of the Chinese character encoding standard and the number of characters per zone B, the hash tree stores B data units in the hash tree nodes and loads the Chinese word segmentation dictionary, and the data retrieval structure is built as follows:

Pos=(c₁-176)×B+(c₂-161)

where Pos is the position of the character in the first-character index, c₁ is the unsigned value of the first byte of the character, and c₂ is the unsigned value of its second byte.

4. The method according to claim 2 or 3, characterized in that the first-character index node includes the following data:

5. The method according to claim 2 or 3, characterized in that the hash tree node includes: a single keyword; a flag indicating whether a subtree exists; whether the keywords from the root node to the current node match as a word; the length of the longest word that has the keywords from the root node to the current node as its prefix; the subtree size, which when a subtree exists is the number of words that have the keywords from the root node to the current node as their prefix, and is otherwise 0; and a subtree pointer, which points to the subtree when a subtree exists and is otherwise null.

6. The method according to claim 1, characterized in that the method further comprises:

Treating the n-th word in a sentence as a state transition of a Markov process over the n-1 words immediately preceding it; performing part-of-speech tagging with the co-occurrence probabilities between these words as the transitions between states; extracting the various probability parameters by training on a corpus; computing, from the probability parameters, the probability of each possible string corresponding to a given word; and then selecting an appropriate string as output according to a predefined criterion.