CN104008187B - Semi-structured text matching method based on the minimum edit distance - Google Patents


Info

Publication number
CN104008187B
Authority
CN
China
Prior art keywords
semi
text
structured text
algorithm
structured
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410257734.8A
Other languages
Chinese (zh)
Other versions
CN104008187A (en)
Inventor
赵宇
高升
郭军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201410257734.8A
Publication of CN104008187A
Application granted
Publication of CN104008187B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML

Abstract

The invention belongs to the field of natural language processing and discloses a semi-structured text matching method based on the minimum edit distance. The method comprises the following steps: first, preprocessing the data; second, performing offline training to determine the thresholds of a log-likelihood ratio algorithm and a left-right entropy algorithm; third, extracting online, by combining the two algorithms, a multi-word expression set from the unstructured part of each semi-structured text to be evaluated; fourth, obtaining a representation of the text from the extracted multi-word expression set and the structured strings in the original text; fifth, calculating the matching degree between the input semi-structured text and each semi-structured text to be matched, using a matching method based on the minimum edit distance; sixth, measuring the matching degree of the texts by the similarity of their feature sets and outputting the ranked result (Top-N). The method can improve the matching accuracy of semi-structured texts and has great practical value.

Description

A semi-structured text matching method based on the minimum edit distance
Technical field
The invention belongs to the field of natural language processing, and in particular relates to a semi-structured text matching method based on the minimum edit distance.
Background art
The massive growth of online information has made information retrieval an important channel for acquiring information. Keyword-based retrieval has been widely studied and applied. In many cases, however, keywords cannot meet the growing variety of information access needs. Take job search as an example: the web holds many job descriptions, and entering a keyword returns some postings that satisfy the condition, but a few keywords are often not enough to obtain the needed information accurately. A more effective job search works directly from the job seeker's resume: the resume is input, its content is matched against the full text of the job data, and the postings that match the resume are returned. The same holds in the opposite direction for a company searching for talent, where a job requirement is input and matched against resume data. Besides job and talent search, matchmaking services, rentals and similar applications all involve matching information between a supply side and a demand side. Whichever side the information comes from, it is described not by a few keywords but by a text. The description may contain structured data, such as height, age and educational background in a resume, as well as semi-structured or even unstructured data, such as education and work experience, personal interests, specialties and self-evaluation. These unstructured parts often play an important role in job hunting.
In such applications, the key problem is therefore text matching. Related research on text similarity has produced rich results, the most typical being text similarity computation based on the vector space model (VSM). In other work, such as VSM-based document clustering, VSM-based text classification and VSM-based information retrieval, the choice of text features, the computation of feature values and the similarity measure are the key factors.
Traditional methods for computing text similarity mainly include the cosine-distance algorithm based on the vector space model (VSM), algorithms based on term co-occurrence, and event-based algorithms. In text matching applications such as the job search described above, however, multi-word expressions are used as the features of resume and job texts. Multi-word expressions are generally coarse-grained and few in number, so the data is very sparse, and traditional VSM-based similarity methods are not well suited to ranking job search results: the computed similarity is very likely to be 0. Moreover, from the user's point of view, the resume text is actively compared against the job text; the resume plays the active role and the job the passive one, so this similarity has a different physical meaning from the similarity computed between two texts of equal standing.
Summary of the invention
The invention provides a semi-structured text matching method based on the minimum edit distance, comprising the following steps:
First, preprocess the data: split each semi-structured text in the training and test sets into two parts, structured text and unstructured text, and perform word segmentation on the unstructured text.
Second, determine the thresholds of the log-likelihood ratio algorithm (LLR) and the left-right entropy algorithm (LRE) from the training data, as follows:
1) First, extract multi-word expression candidates from the unstructured text using the log-likelihood ratio (LLR) formula
$$\mathrm{llr}(x, y) = 2\left( a \log\frac{a\,n}{(a+b)(a+c)} + b \log\frac{b\,n}{(a+b)(b+d)} + c \log\frac{c\,n}{(c+d)(a+c)} + d \log\frac{d\,n}{(c+d)(b+d)} \right)$$
to compute a score between adjacent units, where a is the frequency with which x and y occur together; b is the frequency with which the word adjacent to the right of x is not y; c is the frequency with which the word adjacent to the left of y is not x; and d is the frequency with which two adjacent words are neither x nor y, i.e. d = n - a - b - c (n is the total number of words in the corpus). A binary tree is built step by step in order of score, and an LLR threshold is given; if the score of a node exceeds the threshold, every non-leaf node of the binary tree rooted at that node is a multi-word expression candidate;
2) The multi-word expression candidates obtained by the LLR algorithm are further filtered using the left and right entropy of each candidate, where xy denotes a candidate unit and a and b are the characters adjacent to the candidate unit xy on the left and on the right respectively. A left-right entropy threshold is given, and candidates above the threshold are accepted as multi-word expressions;
3) The optimal thresholds of the two algorithms are determined jointly from the precision and recall of the multi-word expressions extracted on the training set.
Third, using the LLR and LRE algorithms with the thresholds obtained by offline training, extract online a multi-word expression set from the unstructured text of each semi-structured test text.
Fourth, combine the multi-word expression set extracted in the previous step with the original structured strings in the text to obtain a representation of the text; that is, the text is characterized by its multi-word expression set and its structured strings.
Fifth, using the matching degree model based on the minimum edit distance (MED) proposed by the inventors, calculate the matching degree between the input semi-structured text and each semi-structured text to be matched, as follows:
1) the MED-based similarity f(s1, s2) between two strings, where s1 and s2 are two symbol strings;
2) for two sets of strings R = {r1, r2, ..., ri, ..., rn} and J = {j1, j2, ..., jk, ..., jm}, where ri (1 ≤ i ≤ n) and jk (1 ≤ k ≤ m) are symbol strings, the set similarity is defined as
$$p(R, J) = \begin{cases} \dfrac{|R \cap J| + \sum_{r \in R - J} \max_{j \in J - R} f(r, j)}{|R|}, & |R| \neq 0,\ |R - J| \cdot |J - R| \neq 0 \\ 1, & |R - J| = 0 \\ \dfrac{|J|}{|R|}, & |J - R| = 0,\ |R| \neq 0 \end{cases}$$
where |R| = n and |J| = m are the numbers of keywords in the keyword sets R and J; |R ∩ J| is the number of strings common to R and J; R - J is the set of strings that belong to R but not to J; and J - R is the set of strings that belong to J but not to R.
Sixth, sort the calculated matching degrees in descending order and output the ranked result (Top-N); the higher a text ranks, the stronger the match, and the lower it ranks, the weaker the match.
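The two scoring functions used in the second step can be sketched as follows. This is a minimal illustration, not the inventors' implementation. The `llr` function follows the formula given in the text; treating a zero cell as contributing 0 is an assumption. The text does not reproduce the left-right entropy formula, so `neighbor_entropy` uses the conventional Shannon entropy of the neighboring-character distribution, which is also an assumption. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def llr(a, b, c, n):
    """Log-likelihood ratio score for an adjacent pair (x, y).

    a: frequency of x immediately followed by y
    b: frequency of x followed by something other than y
    c: frequency of y preceded by something other than x
    n: total count in the corpus; d = n - a - b - c as in the text
    """
    d = n - a - b - c

    def term(k, row, col):
        # Assumption: an empty cell contributes 0 (the limit of k*log(k) as k -> 0).
        return k * math.log(k * n / (row * col)) if k > 0 else 0.0

    return 2.0 * (
        term(a, a + b, a + c)
        + term(b, a + b, b + d)
        + term(c, c + d, a + c)
        + term(d, c + d, b + d)
    )


def neighbor_entropy(neighbors):
    """Shannon entropy of the characters seen on one side of a candidate.

    Assumed definition: -sum(p * log p) over the observed neighbors.
    Call it once with the left neighbors and once with the right neighbors.
    """
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((k / total) * math.log(k / total) for k in counts.values())


if __name__ == "__main__":
    print(llr(10, 0, 0, 20))                 # x and y always occur together
    print(neighbor_entropy(["a", "b", "a"]))  # varied context gives higher entropy
```

A candidate whose neighbors vary widely on both sides has high entropy, which is the usual signal that it forms a free-standing unit. How the left and right values are combined before thresholding is not stated in the text and is left open here.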
The beneficial effect of the invention is that, compared with the cosine-distance method of the traditional vector space model, experimental results verify that the invention achieves higher precision and recall in semi-structured text matching and is highly practical.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention.
Detailed description of the embodiments
Specific embodiments of the invention are described in detail below with reference to the accompanying drawing.
Fig. 1 is the flow chart of the invention, in which dashed lines indicate the flow of the training part and solid lines indicate the flow of the test part. The method comprises the following steps:
First step: preprocessing.
Step S1: split the semi-structured text into two parts, structured text and unstructured text, and perform word segmentation on the unstructured text.
Second step: determine the thresholds of the log-likelihood ratio algorithm (LLR) and the left-right entropy algorithm (LRE) from the training data.
Step S2: extract multi-word expression candidates with the log-likelihood ratio algorithm;
Step S3: determine the final multi-word expressions with the left-right entropy algorithm;
Step S4: determine the thresholds of the LLR and LRE algorithms according to the F-measure.
Third step: extract the feature representation of the test text.
Step S5: using the LLR and LRE algorithms with the thresholds obtained in training, extract the multi-word expression set from the unstructured part of the test text;
Step S6: use the union of the structured text set of the test text and the extracted multi-word expression set as the feature representation of the original semi-structured text.
Fourth step: calculate the matching degree and output the result.
Step S7: calculate the matching degree of two semi-structured texts using the matching degree model based on the minimum edit distance proposed by the inventors;
Step S8: sort the matching degrees in descending order and output the result.
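Steps S2 to S4 amount to a search over two thresholds scored by the F-measure on the training set. The sketch below is one simple way to picture that search, under stated assumptions: each candidate already carries an LLR score and an entropy score, a candidate is kept when both exceed their thresholds, and a plain grid search is used. The text specifies only that the thresholds are chosen by F-measure, so the grid, the data layout and every name here are hypothetical.

```python
from itertools import product


def f_measure(predicted, gold):
    """F1 score of a predicted set of expressions against a gold set."""
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def pick_thresholds(candidates, gold, llr_grid, lre_grid):
    """Return (best_f, llr_threshold, lre_threshold) over a grid.

    candidates: dict mapping expression -> (llr_score, entropy_score)
    gold: set of expressions annotated as correct in the training data
    """
    best = (-1.0, None, None)
    for t_llr, t_lre in product(llr_grid, lre_grid):
        # Assumed decision rule: keep a candidate that passes both filters.
        kept = {e for e, (s, h) in candidates.items() if s > t_llr and h > t_lre}
        score = f_measure(kept, gold)
        if score > best[0]:
            best = (score, t_llr, t_lre)
    return best


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    candidates = {
        "machine learning": (30.0, 1.2),
        "of the": (50.0, 0.1),
        "data mining": (12.0, 0.9),
        "the data": (3.0, 0.8),
    }
    gold = {"machine learning", "data mining"}
    print(pick_thresholds(candidates, gold, [5.0, 20.0], [0.5, 1.0]))
```

Searching both thresholds jointly, rather than one after the other, matches the statement that the two optimal thresholds are determined at the same time.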
Each step is described in detail below:
Step S1 splits each semi-structured text in the training and test sets into two parts, structured text and unstructured text, and performs word segmentation on the unstructured text.
Step S2 extracts multi-word expression candidates from the unstructured text using the log-likelihood ratio (LLR) formula
$$\mathrm{llr}(x, y) = 2\left( a \log\frac{a\,n}{(a+b)(a+c)} + b \log\frac{b\,n}{(a+b)(b+d)} + c \log\frac{c\,n}{(c+d)(a+c)} + d \log\frac{d\,n}{(c+d)(b+d)} \right)$$
to compute a score between adjacent units, where a is the frequency with which x and y occur together; b is the frequency with which the word adjacent to the right of x is not y; c is the frequency with which the word adjacent to the left of y is not x; and d is the frequency with which two adjacent words are neither x nor y, i.e. d = n - a - b - c (n is the total number of words in the corpus). A binary tree is built step by step in order of score, and an LLR threshold is given; if the score of a node exceeds the threshold, every non-leaf node of the binary tree rooted at that node is a multi-word expression candidate.
Step S3 further filters the multi-word expression candidates obtained by the LLR algorithm using the left and right entropy of each candidate, where xy denotes a candidate unit and a and b are the characters adjacent to the candidate unit xy on the left and on the right respectively. A left-right entropy threshold is given, and candidates above the threshold are accepted as multi-word expressions.
Step S4 determines the optimal thresholds of the two algorithms jointly from the precision and recall of the multi-word expressions extracted on the training set.
Step S5 uses the LLR and LRE algorithms with the thresholds obtained by offline training to extract online a multi-word expression set from the unstructured text of each semi-structured test text.
Step S6 combines the multi-word expression set extracted in the previous step with the original structured strings in the text to obtain a representation of the text; that is, the text is characterized by its multi-word expression set and its structured strings.
Step S7 uses the matching degree model based on the minimum edit distance proposed by the inventors to calculate the matching degree between the input semi-structured text and each semi-structured text to be matched, as follows:
1) the MED-based similarity f(s1, s2) between two strings, where s1 and s2 are two symbol strings;
2) for two sets of strings R = {r1, r2, ..., ri, ..., rn} and J = {j1, j2, ..., jk, ..., jm}, where ri (1 ≤ i ≤ n) and jk (1 ≤ k ≤ m) are symbol strings, the set similarity is defined as
$$p(R, J) = \begin{cases} \dfrac{|R \cap J| + \sum_{r \in R - J} \max_{j \in J - R} f(r, j)}{|R|}, & |R| \neq 0,\ |R - J| \cdot |J - R| \neq 0 \\ 1, & |R - J| = 0 \\ \dfrac{|J|}{|R|}, & |J - R| = 0,\ |R| \neq 0 \end{cases}$$
where |R| = n and |J| = m are the numbers of keywords in the keyword sets R and J; |R ∩ J| is the number of strings common to R and J; R - J is the set of strings that belong to R but not to J; and J - R is the set of strings that belong to J but not to R.
Step S8 sorts the calculated matching degrees in descending order and outputs the ranked result; the higher a text ranks, the stronger the match, and the lower it ranks, the weaker the match.
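Steps S7 and S8 can be illustrated with a short sketch of the set similarity and the Top-N ranking. The piecewise definition of the set similarity follows the text. The string similarity `f` is not spelled out in the text beyond depending on the minimum edit distance and the two string lengths, so the normalization used here, 1 - MED / max(|s1|, |s2|), is an assumption, as are all function names and the toy data.

```python
def med(s1, s2):
    """Minimum edit (Levenshtein) distance with unit insert/delete/substitute costs."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # delete c1
                            curr[j - 1] + 1,             # insert c2
                            prev[j - 1] + (c1 != c2)))   # substitute or match
        prev = curr
    return prev[-1]


def f(s1, s2):
    """String similarity in [0, 1]. ASSUMED normalization, not taken from the text."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - med(s1, s2) / longest


def p(R, J):
    """Set similarity of feature set R against feature set J (piecewise definition)."""
    R, J = set(R), set(J)
    only_r, only_j = R - J, J - R
    if not only_r:            # |R - J| = 0: everything in R is found in J
        return 1.0
    if not only_j:            # |J - R| = 0: J is fully contained in R
        return len(J) / len(R)
    # Each unmatched string of R is credited with its best partial match in J - R.
    partial = sum(max(f(r, j) for j in only_j) for r in only_r)
    return (len(R & J) + partial) / len(R)


def rank(query, candidates, top_n=3):
    """Score every candidate feature set against the query and return the Top-N."""
    scored = [(p(query, feats), name) for name, feats in candidates.items()]
    return sorted(scored, key=lambda t: t[0], reverse=True)[:top_n]


if __name__ == "__main__":
    resume = {"java", "sql"}
    postings = {"job_a": {"java", "sql", "python"}, "job_b": {"cooking"}}
    print(rank(resume, postings))
```

Note that `p` is asymmetric: it divides by the size of the first set, which fits the remark in the background section that the querying text plays the active role and the matched text the passive one.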
The specific embodiments of the proposed semi-structured text matching method based on the minimum edit distance have been described above with reference to the accompanying drawing. From the description of these embodiments, a person of ordinary skill in the art can clearly understand that the invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a computer software product that is stored in a storage medium and includes instructions for causing one or more computer devices to execute the methods described in the embodiments of the invention.
In accordance with the idea of the invention, changes may be made in the specific embodiments and in the scope of application. In summary, the content of this description should not be construed as limiting the invention.
The embodiments of the invention described above do not limit the scope of protection of the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (5)

1. A semi-structured text matching method based on the minimum edit distance, characterized by comprising the following steps:
(1) preprocessing the data: splitting the semi-structured text into two parts, structured text and unstructured text, and performing word segmentation on the unstructured text;
(2) offline training: extracting the multi-word expressions in the unstructured part of the training data with the log-likelihood ratio algorithm and the left-right entropy algorithm, so as to determine the thresholds of the log-likelihood ratio algorithm and the left-right entropy algorithm;
(3) using the log-likelihood ratio algorithm and the left-right entropy algorithm with the thresholds obtained by offline training, extracting online a multi-word expression set from the unstructured text of each semi-structured text to be tested;
(4) combining the multi-word expression set extracted in the previous step with the original structured strings in the text to obtain a representation of the text based on the multi-word expression set, the text being characterized by its multi-word expression set and its structured strings;
(5) calculating the similarity between the input semi-structured text and each semi-structured text to be matched, using the semi-structured text matching method based on the minimum edit distance;
(6) measuring the similarity of the texts by the similarity of their feature sets, sorting the calculated similarities in descending order, and outputting the ranked result, wherein a higher rank indicates a stronger similarity and a lower rank indicates a weaker similarity.
2. The semi-structured text matching method based on the minimum edit distance according to claim 1, wherein the threshold of the log-likelihood ratio algorithm is trained in step (2), characterized in that the log-likelihood ratio formula
$$\mathrm{llr}(x, y) = 2\left( a \log\frac{a\,n}{(a+b)(a+c)} + b \log\frac{b\,n}{(a+b)(b+d)} + c \log\frac{c\,n}{(c+d)(a+c)} + d \log\frac{d\,n}{(c+d)(b+d)} \right)$$
is used to calculate the score between adjacent units, where a denotes the frequency with which x and y occur together; b denotes the frequency with which the word adjacent to the right of x is not y; c denotes the frequency with which the word adjacent to the left of y is not x; d denotes the frequency with which two adjacent words are neither x nor y; and n is the total number of words in the corpus; a binary tree is built step by step in order of score and a threshold of the log-likelihood ratio algorithm is given, and if the score of a node exceeds the threshold, every non-leaf node of the binary tree rooted at that node is a multi-word expression candidate.
3. The semi-structured text matching method based on the minimum edit distance according to claim 1, wherein the threshold of the left-right entropy algorithm is trained in step (2), characterized in that the left entropy and the right entropy of each candidate are calculated with the left-right entropy formulas, where xy denotes a candidate unit and a and b are the characters adjacent to the candidate unit xy on the left and on the right respectively; a threshold of the left-right entropy algorithm is given, and candidates above the threshold are accepted as multi-word expressions.
4. The semi-structured text matching method based on the minimum edit distance according to claim 1, wherein the thresholds of the log-likelihood ratio algorithm and the left-right entropy algorithm are determined in step (2), characterized in that the optimal thresholds of the two algorithms are determined jointly from the precision and recall of the multi-word expressions extracted on the training set.
5. The semi-structured text matching method based on the minimum edit distance according to claim 1, wherein the similarity model of two text feature sets defined on the minimum edit distance in step (5) is characterized by:
1) the MED-based similarity f(s1, s2) between two strings, where s1 and s2 are two symbol strings, med(s1, s2) denotes the minimum edit distance between s1 and s2, and |s1| and |s2| denote the lengths of the two strings;
2) for two sets of strings R = {r1, r2, ..., ri, ..., rn} and J = {j1, j2, ..., jk, ..., jm}, where ri (1 ≤ i ≤ n) and jk (1 ≤ k ≤ m) are symbol strings, the set similarity is defined as
$$p(R, J) = \begin{cases} \dfrac{|R \cap J| + \sum_{r \in R - J} \max_{j \in J - R} f(r, j)}{|R|}, & |R| \neq 0,\ |R - J| \cdot |J - R| \neq 0 \\ 1, & |R - J| = 0 \\ \dfrac{|J|}{|R|}, & |J - R| = 0,\ |R| \neq 0 \end{cases}$$
where |R| = n and |J| = m are the numbers of keywords in the keyword sets R and J; |R ∩ J| is the number of strings common to R and J; R - J is the set of strings that belong to R but not to J; and J - R is the set of strings that belong to J but not to R.
CN201410257734.8A 2014-06-11 2014-06-11 Semi-structured text matching method based on the minimum edit distance Expired - Fee Related CN104008187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410257734.8A CN104008187B (en) 2014-06-11 2014-06-11 Semi-structured text matching method based on the minimum edit distance

Publications (2)

Publication Number Publication Date
CN104008187A (en) 2014-08-27
CN104008187B (en) 2017-02-01

Family

ID=51368844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410257734.8A Expired - Fee Related CN104008187B (en) 2014-06-11 2014-06-11 Semi-structured text matching method based on the minimum edit distance

Country Status (1)

Country Link
CN (1) CN104008187B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980961A (en) * 2017-03-02 2017-07-25 中科天地互联网科技(苏州)有限公司 A kind of resume selection matching process and system
CN107256245B (en) * 2017-06-02 2020-05-05 河海大学 Offline model improvement and selection method for spam message classification
CN110019665A (en) * 2017-09-30 2019-07-16 北京国双科技有限公司 Text searching method and device
CN110162750B (en) * 2019-01-24 2023-07-07 腾讯科技(深圳)有限公司 Text similarity detection method, electronic device and computer readable storage medium
CN109920431B (en) 2019-03-05 2021-12-07 百度在线网络技术(北京)有限公司 Method and apparatus for outputting information
CN110781204B (en) * 2019-09-09 2024-02-20 腾讯大地通途(北京)科技有限公司 Identification information determining method, device, equipment and storage medium of target object
CN111414765B (en) * 2020-03-20 2023-07-25 北京百度网讯科技有限公司 Sentence consistency determination method and device, electronic equipment and readable storage medium
CN113076734B (en) * 2021-04-15 2023-01-20 云南电网有限责任公司电力科学研究院 Similarity detection method and device for project texts

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101634983A (en) * 2008-07-21 2010-01-27 华为技术有限公司 Method and device for text classification
CN103294817A (en) * 2013-06-13 2013-09-11 华东师范大学 Text feature extraction method based on categorical distribution probability

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033964B (en) * 2011-01-13 2012-05-09 北京邮电大学 Text classification method based on block partition and position weight

Also Published As

Publication number Publication date
CN104008187A (en) 2014-08-27

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170201