CN104008187B

CN104008187B - Semi-structured text matching method based on the minimum edit distance

Info

Publication number: CN104008187B
Application number: CN201410257734.8A
Authority: CN
Inventors: 赵宇; 高升; 郭军
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2014-06-11
Filing date: 2014-06-11
Publication date: 2017-02-01
Anticipated expiration: 2034-06-11
Also published as: CN104008187A

Abstract

The invention belongs to the field of natural language processing and discloses a semi-structured text matching method based on the minimum edit distance. The method comprises the following steps: first, preprocessing the data; second, conducting offline training to determine the thresholds of a log-likelihood ratio algorithm and a left-right entropy algorithm; third, extracting online, by combining the two algorithms, a multi-word expression set from the unstructured text of each semi-structured text to be evaluated; fourth, obtaining a representation of the text from the extracted multi-word expression set and the structured strings in the original text; fifth, calculating the matching degree between the input semi-structured text and each semi-structured text to be matched using the matching method based on the minimum edit distance; sixth, measuring the matching degree of the texts by the similarity of their feature sets and outputting the ranked result (Top-N). The method improves the matching accuracy of semi-structured texts and has great practical value.

Description

A semi-structured text matching method based on the minimum edit distance

Technical field

The invention belongs to the field of natural language processing, and in particular relates to a semi-structured text matching method based on the minimum edit distance.

Background technology

The massive growth of information on the network has made information retrieval an important channel for acquiring information. Keyword-based information retrieval has been widely studied and applied. In many cases, however, keyword-based retrieval cannot meet the growing variety of information needs. Take job search as an example: there are many job descriptions on the network, and entering a keyword returns job postings that satisfy the condition, but a few keywords are often not enough to obtain the required information accurately. A more effective job search works directly from the job seeker's resume: the resume is input, its content is matched against the full text of the job data, and the job postings that match the resume are returned. In the opposite direction, a company searching for talent faces the same situation: it inputs a job requirement, which is matched against the resume data. Besides job and talent search, matchmaking services, rentals and similar applications all involve matching information between a supply side and a demand side. Whichever side the information comes from, it is not described by a few simple keywords but by a text. The descriptive text may include structured data, such as the height, age and educational background in a resume, and it also contains semi-structured or even unstructured data, such as the education and work experience, personal interests, specialties and self-evaluation in a resume, all of which may appear in semi-structured or unstructured form. In job hunting, these unstructured parts often play an important role.

In such applications, therefore, the key problem is text matching. Research on text similarity, which is related to this problem, has produced abundant results, the most typical being text similarity computation based on the vector space model (VSM). Other examples include VSM-based document clustering, VSM-based text classification and VSM-based information retrieval. In these studies, the selection of text features, the calculation of feature values and the similarity measure are the key factors.

Traditional methods of computing text similarity mainly include the cosine-distance text similarity algorithm based on the vector space model (VSM), text similarity algorithms based on term co-occurrence, and text similarity algorithms based on events. In text matching applications such as the job search described above, however, multi-word expressions are used as the features of resume and job texts. The granularity of multi-word expressions is generally coarse and the number of features is small, so the data is very sparse. Traditional VSM-based similarity methods are therefore not well suited to ranking job search results, because the computed similarity is very likely to be 0. In addition, from the user's point of view, the similarity with a job text is computed actively from the resume text: the resume is the active side and the job is the passive side, so the meaning of this similarity differs from that of a similarity computed between two texts of equal standing.
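The sparsity problem can be illustrated with a minimal sketch. It assumes binary (0/1) feature weights, which the text does not specify, and the feature strings are invented for illustration. Because the two feature sets share no identical string, the cosine similarity is exactly 0 even though the features are clearly related.

```python
import math


def cosine(a: set, b: set) -> float:
    """Cosine similarity of two feature sets under binary (0/1) weights.

    Binary weighting is an assumption made for this sketch.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical multi-word features, invented for illustration.
resume = {"java development", "database design"}
job = {"java developer", "database designer"}

print(cosine(resume, job))  # 0.0: no feature matches exactly
```

A measure that gives partial credit to near-identical strings avoids this collapse to 0, which is the motivation for an edit-distance-based similarity.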

Summary of the invention

The present invention provides a semi-structured text matching method based on the minimum edit distance, comprising the following steps:

First, preprocess the data: divide each training and test semi-structured text into two parts, structured text and unstructured text, and perform word segmentation on the unstructured text.

Second, determine the thresholds of the log-likelihood ratio algorithm (LLR) and the left-right entropy algorithm (LRE) from the training data, as follows:

1) First, multi-word expression candidates are extracted from the unstructured text using the log-likelihood ratio (LLR) formula

$$\mathrm{llr}(x, y) = 2\left(a \log\frac{a\,n}{(a+b)(a+c)} + b \log\frac{b\,n}{(a+b)(b+d)} + c \log\frac{c\,n}{(c+d)(a+c)} + d \log\frac{d\,n}{(c+d)(b+d)}\right)$$

which gives the score between adjacent units, where a is the frequency with which x and y occur together; b is the frequency with which the word immediately to the right of x is not y; c is the frequency with which the word immediately to the left of y is not x; and d is the frequency with which two adjacent words are neither x nor y, i.e. d = n - a - b - c (n is the total number of words in the corpus). A binary tree is built step by step in order of score. Given an LLR threshold, if the score of a node exceeds the threshold, every non-leaf node of the binary subtree rooted at that node is a multi-word expression candidate;
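The LLR score can be computed directly from the counts a, b, c and n defined above. The following is a minimal sketch rather than the patented implementation; the natural logarithm and the convention that a zero count contributes 0 to the sum are assumptions, since the text states neither.

```python
import math


def llr(a: int, b: int, c: int, n: int) -> float:
    """Log-likelihood ratio score of an adjacent pair (x, y).

    a: frequency of x immediately followed by y
    b: frequency of x followed by a word other than y
    c: frequency of y preceded by a word other than x
    n: total count, so that d = n - a - b - c
    """
    d = n - a - b - c

    def term(k: int, row: int, col: int) -> float:
        # k * log(k * n / (row * col)); a zero count contributes 0 (assumed convention).
        return 0.0 if k == 0 else k * math.log(k * n / (row * col))

    return 2.0 * (
        term(a, a + b, a + c)
        + term(b, a + b, b + d)
        + term(c, c + d, a + c)
        + term(d, c + d, b + d)
    )
```

When x and y are statistically independent (for example a = b = c = d), every logarithm is log 1 and the score is 0; the score grows as the pair co-occurs more often than chance would predict.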

2) The multi-word expression candidates obtained by the LLR algorithm are further filtered using left and right entropy, where xy denotes a candidate unit and a and b are, respectively, the characters immediately to the left and right of the candidate unit xy. Given a left-right entropy threshold, candidates whose entropy exceeds the threshold are determined to be multi-word expressions;

3) The optimal thresholds of the two algorithms are determined simultaneously from the precision and recall of the multi-word expressions extracted from the training set.
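One way to determine both thresholds at once is a grid search that keeps the pair with the highest F-measure on the training set. The sketch below is an assumption about how the search could be organized: the grid search itself, the function names and the toy data are not taken from the source, which only states that precision and recall (combined as an F-measure) drive the choice.

```python
from itertools import product


def f_measure(extracted: set, gold: set) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def select_thresholds(candidates: dict, gold: set, llr_grid, lre_grid):
    """Return (llr_threshold, lre_threshold, f_measure) with the best F-measure.

    candidates maps each expression to a hypothetical pair
    (llr_score, entropy_score) computed beforehand.
    """
    best = (None, None, -1.0)
    for t_llr, t_lre in product(llr_grid, lre_grid):
        extracted = {
            expr
            for expr, (score, entropy) in candidates.items()
            if score > t_llr and entropy > t_lre
        }
        quality = f_measure(extracted, gold)
        if quality > best[2]:
            best = (t_llr, t_lre, quality)
    return best


# Toy training data, invented for illustration.
candidates = {
    "machine learning": (30.0, 1.2),
    "of the": (25.0, 0.1),
    "data mining": (12.0, 0.9),
    "the a": (2.0, 0.8),
}
gold = {"machine learning", "data mining"}
print(select_thresholds(candidates, gold, [1.0, 10.0, 20.0], [0.05, 0.5]))
```

Searching the two thresholds jointly rather than one after the other matters because a loose LLR threshold can be compensated by a strict entropy threshold, and vice versa.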

Third, using the LLR and LRE algorithms with the thresholds obtained by offline training, extract online a multi-word expression set from the unstructured text of each test semi-structured text.

Fourth, combine the multi-word expression set extracted in the previous step with the original structured strings in the text to obtain a representation of the text based on the multi-word expression set; that is, the text is characterized by the multi-word expression set together with the structured strings.

Fifth, using the matching degree computation model based on the minimum edit distance (MED) proposed by the inventors, calculate the matching degree between the input semi-structured text and each semi-structured text to be matched, as follows:
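The resulting representation is simply a set of strings. A minimal sketch, in which the function name and the sample values are hypothetical:

```python
def represent(structured_strings, multiword_expressions) -> frozenset:
    """Feature set of a semi-structured text: the union of its structured
    strings and the multi-word expressions extracted from its unstructured part."""
    return frozenset(structured_strings) | frozenset(multiword_expressions)


# Hypothetical resume fields and extracted expressions.
features = represent(
    ["bachelor", "25"],
    ["java development", "database design"],
)
print(sorted(features))
```

Using a set means a string that appears both as a structured value and as an extracted expression is counted once.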

1) The MED-based similarity f(s₁, s₂) between two strings, where s₁ and s₂ are two symbol strings;

2) Given two sets of strings r and j, r = {r₁, r₂, ..., r_i, ..., r_n} and j = {j₁, j₂, ..., j_k, ..., j_m}, where r_i (1 ≤ i ≤ n) and j_k (1 ≤ k ≤ m) are symbol strings, the set similarity is defined as

$$p(r, j) = \begin{cases} \dfrac{|r \cap j| + \sum\limits_{\forall r_i \in r - j} \max\{ f(r_i, j_k),\ \forall j_k \in j - r \}}{|r|}, & |r| \neq 0,\ |r - j| \cdot |j - r| \neq 0 \\ 1, & |r - j| = 0 \\ \dfrac{|j|}{|r|}, & |j - r| = 0,\ |r| \neq 0 \end{cases}$$

where |r| = n and |j| = m are the numbers of keywords in keyword sets r and j, respectively; |r ∩ j| is the number of strings common to set r and set j; r - j is the set of all strings that belong to r but not to j; and j - r is the set of all strings that belong to j but not to r.
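The set similarity p(r, j) can be sketched as follows. Two points are assumptions, because the formula for f is not reproduced in this text: the edit distance uses unit costs for insertion, deletion and substitution (Levenshtein distance), and f normalizes it as 1 - med(s₁, s₂) / max(|s₁|, |s₂|). The three branches of p follow the definition above.

```python
def med(s1: str, s2: str) -> int:
    """Minimum edit distance with unit insert/delete/substitute costs (assumed)."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute or match
            ))
        previous = current
    return previous[-1]


def f(s1: str, s2: str) -> float:
    """String similarity in [0, 1]; this normalization is an assumption."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - med(s1, s2) / longest


def p(r: set, j: set) -> float:
    """Asymmetric set similarity of r (active side) against j (passive side)."""
    r_only, j_only = r - j, j - r
    if not r_only:          # |r - j| = 0: every string of r is found in j
        return 1.0
    if not j_only:          # |j - r| = 0 and |r| != 0
        return len(j) / len(r)
    partial = sum(max(f(a, b) for b in j_only) for a in r_only)
    return (len(r & j) + partial) / len(r)


print(p({"java", "sql"}, {"java", "mysql"}))
```

Exact matches count fully, and each unmatched string of r contributes its best partial match among the unmatched strings of j. Because the sum is divided by |r|, p(r, j) and p(j, r) generally differ, which reflects the active and passive roles of the two texts.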

Sixth, sort the calculated matching degrees in descending order and output the ranked result (Top-N); the higher a text ranks, the stronger its match, and the lower it ranks, the weaker its match.

The beneficial effect of the present invention is that, compared with the cosine-distance method of the traditional vector space model, experimental results verify that the present invention achieves higher precision and recall in semi-structured text matching and is highly practical.
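The ranking step can be sketched with the standard library. The function name `top_n`, the candidate identifiers and the stand-in scoring function are hypothetical; in the method described here, the score would be the matching degree p(r, j).

```python
import heapq


def top_n(query: set, candidates: dict, score, n: int):
    """Return the n best (candidate_id, score) pairs, highest score first.

    candidates maps a candidate id to its feature set; score(query, features)
    is any matching-degree function.
    """
    scored = ((cid, score(query, features)) for cid, features in candidates.items())
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Stand-in score for illustration only: share of query features found exactly.
def overlap(r: set, j: set) -> float:
    return len(r & j) / len(r)


jobs = {
    "job1": {"java"},
    "job2": {"java", "sql", "linux"},
    "job3": {"java", "sql"},
}
print(top_n({"java", "sql", "linux"}, jobs, overlap, 2))
```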

Brief description of the drawings

Fig. 1 is the flow chart of the present invention.

Detailed description of the embodiments

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawing.

Fig. 1 is the flow chart of the present invention, in which the dashed lines indicate the flow of the training part and the solid lines indicate the flow of the testing part. The method comprises the following steps:

First step: preprocessing.

Step s1: divide the semi-structured text into two parts, structured text and unstructured text, and perform word segmentation on the unstructured text.

Second step: determine the thresholds of the log-likelihood ratio algorithm (LLR) and the left-right entropy algorithm (LRE) from the training data.

Step s2: extract multi-word expression candidates using the log-likelihood ratio algorithm;

Step s3: determine the final multi-word expressions using the left-right entropy algorithm;

Step s4: determine the thresholds of the LLR and LRE algorithms according to the F-measure.

Third step: extract the feature representation of the test text.

Step s5: using the LLR and LRE algorithms with the thresholds obtained in training, extract the multi-word expression set from the unstructured part of the test text;

Step s6: take the union of the structured text set of the test text and the extracted multi-word expression set as the feature representation of the original semi-structured text.

Fourth step: calculate the matching degree and output the result.

Step s7: calculate the matching degree of two semi-structured texts using the matching degree computation model based on the minimum edit distance proposed by the inventors;

Step s8: sort the matching degrees in descending order and output the result.

Each step will be specifically described below:

Step s1 divides each training and test semi-structured text into two parts, structured text and unstructured text, and performs word segmentation on the unstructured text.
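Step s1 can be sketched as follows, assuming each text arrives as a dictionary of fields. The field names, the `STRUCTURED_FIELDS` set and the regular-expression tokenizer are all hypothetical; for Chinese text, the regular expression would be replaced by a proper word segmenter.

```python
import re

# Hypothetical names of the fields treated as structured data.
STRUCTURED_FIELDS = {"age", "height", "education"}


def preprocess(record: dict):
    """Split a record into its structured fields and a token list for the rest."""
    structured = {k: v for k, v in record.items() if k in STRUCTURED_FIELDS}
    unstructured = " ".join(
        str(v) for k, v in record.items() if k not in STRUCTURED_FIELDS
    )
    # Placeholder tokenizer; stands in for real word segmentation.
    tokens = re.findall(r"\w+", unstructured.lower())
    return structured, tokens


resume = {
    "age": "25",
    "education": "bachelor",
    "experience": "Java development, 3 years",
}
print(preprocess(resume))
```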

Step s2 extracts multi-word expression candidates from the unstructured text using the log-likelihood ratio (LLR) formula

$$\mathrm{llr}(x, y) = 2\left(a \log\frac{a\,n}{(a+b)(a+c)} + b \log\frac{b\,n}{(a+b)(b+d)} + c \log\frac{c\,n}{(c+d)(a+c)} + d \log\frac{d\,n}{(c+d)(b+d)}\right)$$

which gives the score between adjacent units, where a is the frequency with which x and y occur together; b is the frequency with which the word immediately to the right of x is not y; c is the frequency with which the word immediately to the left of y is not x; and d is the frequency with which two adjacent words are neither x nor y, i.e. d = n - a - b - c (n is the total number of words in the corpus). A binary tree is built step by step in order of score. Given an LLR threshold, if the score of a node exceeds the threshold, every non-leaf node of the binary subtree rooted at that node is a multi-word expression candidate.
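The text does not spell out how the binary tree is built. One plausible reading, sketched below as an assumption, is a greedy bottom-up merge: the adjacent pair of units with the highest score is joined into a new unit, and this repeats until a single root remains. The `Node` class, the function names and the toy scoring table are hypothetical; in practice `score` would be the LLR of the two adjacent units.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    text: str
    score: Optional[float] = None      # None for leaves (single words)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(tokens: List[str], score: Callable[[str, str], float]) -> Node:
    """Greedily merge the best-scoring adjacent pair until one root remains.

    Requires a non-empty token list.
    """
    nodes = [Node(t) for t in tokens]
    while len(nodes) > 1:
        i = max(range(len(nodes) - 1),
                key=lambda k: score(nodes[k].text, nodes[k + 1].text))
        left, right = nodes[i], nodes[i + 1]
        merged = Node(f"{left.text} {right.text}",
                      score(left.text, right.text), left, right)
        nodes[i:i + 2] = [merged]
    return nodes[0]


def internal_nodes(node: Node) -> List[str]:
    """Texts of all non-leaf nodes in the subtree rooted at node."""
    if node.left is None:
        return []
    return [node.text] + internal_nodes(node.left) + internal_nodes(node.right)


def candidates(root: Node, threshold: float) -> List[str]:
    """Non-leaf nodes of every subtree whose root score exceeds the threshold."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.left is None:
            continue
        if node.score > threshold:
            found.extend(internal_nodes(node))
        else:
            stack.extend([node.right, node.left])
    return found


# Toy scores, invented for illustration; unknown pairs score 1.0.
table = {("natural", "language"): 9.0, ("natural language", "processing"): 7.0}
toy_score = lambda l, r: table.get((l, r), 1.0)
root = build_tree(["natural", "language", "processing", "is", "fun"], toy_score)
print(candidates(root, 5.0))
```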

Step s3 further filters, using left and right entropy, the multi-word expression candidates obtained by the LLR algorithm, where xy denotes a candidate unit and a and b are, respectively, the characters immediately to the left and right of the candidate unit xy. Given a left-right entropy threshold, candidates whose entropy exceeds the threshold are determined to be multi-word expressions.
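The entropy formulas are not reproduced in this text. The sketch below uses the commonly used definition, which is an assumption here: the left (right) entropy of a candidate is the Shannon entropy of the distribution of tokens that appear immediately to its left (right). It works on tokens rather than single characters, and uses the natural logarithm.

```python
import math
from collections import Counter


def left_right_entropy(tokens, candidate):
    """Return (left_entropy, right_entropy) of candidate, a tuple of tokens."""
    k = len(candidate)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - k + 1):
        if tuple(tokens[i:i + k]) == tuple(candidate):
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + k < len(tokens):
                right[tokens[i + k]] += 1

    def entropy(counts: Counter) -> float:
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total) for c in counts.values())

    return entropy(left), entropy(right)


corpus = ["a", "x", "y", "b", "c", "x", "y", "b", "a", "x", "y", "d"]
print(left_right_entropy(corpus, ("x", "y")))
```

A candidate that always appears in the same context has entropy 0 and is likely a fragment of a longer unit; a candidate with varied neighbours on both sides has high entropy and is more likely a self-contained expression. How the two values are compared with the threshold (for example, requiring both to exceed it) is not stated in the text.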

Step s4 determines the optimal threshold of each of the two algorithms simultaneously from the precision and recall of the multi-word expressions extracted from the training set.

Step s5 uses the LLR and LRE algorithms with the thresholds obtained by offline training to extract online a multi-word expression set from the unstructured text of each test semi-structured text.

Step s6 combines the multi-word expression set extracted in the previous step with the original structured strings in the text to obtain a representation of the text based on the multi-word expression set; that is, the text is characterized by the multi-word expression set together with the structured strings.

Step s7 uses the matching degree computation model based on the minimum edit distance proposed by the inventors to calculate the matching degree between the input semi-structured text and each semi-structured text to be matched, as follows:

$$p(r, j) = \begin{cases} \dfrac{|r \cap j| + \sum\limits_{\forall r_i \in r - j} \max\{ f(r_i, j_k),\ \forall j_k \in j - r \}}{|r|}, & |r| \neq 0,\ |r - j| \cdot |j - r| \neq 0 \\ 1, & |r - j| = 0 \\ \dfrac{|j|}{|r|}, & |j - r| = 0,\ |r| \neq 0 \end{cases}$$
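A small worked example may help to read the first branch of the formula. The feature strings and the two values of f are invented for illustration, and f is assumed to be symmetric; none of these numbers come from the source.

```latex
% Assumed, purely illustrative values: f(sql, mysql) = 0.6, f(sql, linux) = 0.2
\begin{align*}
r &= \{\text{java}, \text{sql}\}, \qquad j = \{\text{java}, \text{mysql}, \text{linux}\} \\
r \cap j &= \{\text{java}\}, \qquad r - j = \{\text{sql}\}, \qquad j - r = \{\text{mysql}, \text{linux}\} \\
p(r, j) &= \frac{|r \cap j| + \max\{f(\text{sql}, \text{mysql}),\ f(\text{sql}, \text{linux})\}}{|r|}
         = \frac{1 + \max\{0.6,\ 0.2\}}{2} = 0.8 \\
p(j, r) &= \frac{|j \cap r| + f(\text{mysql}, \text{sql}) + f(\text{linux}, \text{sql})}{|j|}
         = \frac{1 + 0.6 + 0.2}{3} = 0.6
\end{align*}
```

The two directions give different values, which reflects the asymmetry between the active text (for example, the resume) and the passive text (the job description).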

Step s8 sorts the calculated matching degrees in descending order and outputs the ranked result; the higher a text ranks, the stronger its match, and the lower it ranks, the weaker its match.

The specific embodiments of the proposed semi-structured text matching method based on the minimum edit distance have been set forth above with reference to the accompanying drawing. From the description of the above embodiments, a person of ordinary skill in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the part of the technical solution of the present invention that in essence contributes to the prior art can be embodied in the form of a computer software product. The software product is stored in a storage medium and includes a number of instructions that cause one or more computer devices to execute the method described in each embodiment of the present invention.

In accordance with the idea of the present invention, the specific embodiments and the scope of application may vary. In summary, the content of this description should not be construed as limiting the present invention.

The embodiments of the invention described above do not limit the scope of protection of the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A semi-structured text matching method based on the minimum edit distance, characterized by comprising the following steps:

(1) preprocessing the data: dividing the semi-structured text into two parts, structured text and unstructured text, and performing word segmentation on the unstructured text;

(2) offline training: extracting the multi-word expressions in the unstructured text part of the training data using the log-likelihood ratio algorithm and the left-right entropy algorithm, so as to determine the thresholds of the log-likelihood ratio algorithm and the left-right entropy algorithm;

(3) using the log-likelihood ratio algorithm and the left-right entropy algorithm with the thresholds obtained by offline training, extracting online a multi-word expression set from the unstructured text of each semi-structured text to be tested;

(4) combining the multi-word expression set extracted in the previous step with the original structured strings in the text to obtain a representation of the text based on the multi-word expression set, the text being characterized by the multi-word expression set and the structured strings;

(5) calculating the similarity between the input semi-structured text and each semi-structured text to be matched using the semi-structured text matching method based on the minimum edit distance;

(6) measuring the similarity of the texts by the similarity of their feature sets, sorting the calculated similarities in descending order and outputting the ranked result, wherein the higher a text ranks, the stronger its similarity, and the lower it ranks, the weaker its similarity.

2. The semi-structured text matching method based on the minimum edit distance according to claim 1, wherein the threshold of the log-likelihood ratio algorithm is trained in step (2), characterized in that the log-likelihood ratio formula

$$\mathrm{llr}(x, y) = 2\left(a \log\frac{a\,n}{(a+b)(a+c)} + b \log\frac{b\,n}{(a+b)(b+d)} + c \log\frac{c\,n}{(c+d)(a+c)} + d \log\frac{d\,n}{(c+d)(b+d)}\right)$$

is used to calculate the score between adjacent units, where a is the frequency with which x and y occur together; b is the frequency with which the word immediately to the right of x is not y; c is the frequency with which the word immediately to the left of y is not x; d is the frequency with which two adjacent words are neither x nor y; and n is the total number of words in the corpus; a binary tree is built step by step in order of score; and, given a threshold for the log-likelihood ratio algorithm, if the score of a node exceeds the threshold, every non-leaf node of the binary subtree rooted at that node is a multi-word expression candidate.

3. The semi-structured text matching method based on the minimum edit distance according to claim 1, wherein the threshold of the left-right entropy algorithm is trained in step (2), characterized in that the following left-right entropy formulas are used:

Left entropy

Right entropy

where xy denotes a candidate unit and a and b are, respectively, the characters immediately to the left and right of the candidate unit xy; given a left-right entropy threshold, candidates whose entropy exceeds the threshold are determined to be multi-word expressions.

4. The semi-structured text matching method based on the minimum edit distance according to claim 1, wherein the thresholds of the log-likelihood ratio algorithm and the left-right entropy algorithm are determined in step (2), characterized in that the optimal thresholds of the two algorithms are determined simultaneously from the precision and recall of the multi-word expressions extracted from the training set.

5. The semi-structured text matching method based on the minimum edit distance according to claim 1, wherein the similarity model of two text feature sets is defined in step (5) on the basis of the minimum edit distance, characterized in that:

1) the MED-based similarity f(s₁, s₂) between two strings, where s₁ and s₂ are two symbol strings, med(s₁, s₂) denotes the minimum edit distance between s₁ and s₂, and |s₁| and |s₂| denote the lengths of the two strings, respectively;

2) given two sets of strings r and j, r = {r₁, r₂, ..., r_i, ..., r_n} and j = {j₁, j₂, ..., j_k, ..., j_m}, where r_i (1 ≤ i ≤ n) and j_k (1 ≤ k ≤ m) are symbol strings, the set similarity is defined as

$$p(r, j) = \begin{cases} \dfrac{|r \cap j| + \sum\limits_{\forall r_i \in r - j} \max\{ f(r_i, j_k),\ \forall j_k \in j - r \}}{|r|}, & |r| \neq 0,\ |r - j| \cdot |j - r| \neq 0 \\ 1, & |r - j| = 0 \\ \dfrac{|j|}{|r|}, & |j - r| = 0,\ |r| \neq 0 \end{cases}$$

where |r| = n and |j| = m are the numbers of keywords in keyword sets r and j, respectively; |r ∩ j| is the number of strings common to set r and set j; r - j is the set of all strings that belong to r but not to j; and j - r is the set of all strings that belong to j but not to r.