CN109145289A

CN109145289A - Lao-Chinese bilingual sentence similarity calculation method based on an improved relation vector model

Info

Publication number: CN109145289A
Application number: CN201810808788.7A
Authority: CN
Inventors: 周兰江; 李思卓; 周枫
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2018-07-19
Filing date: 2018-07-19
Publication date: 2019-01-04

Abstract

The present invention relates to a Lao-Chinese bilingual sentence alignment method that combines similarity calculation with graph matching, and belongs to the technical field of natural language processing and machine learning. The invention first calculates a similarity value between a Lao sentence and a Chinese sentence according to a constructed Lao-Chinese bilingual dictionary. It then takes the length information of the bilingual sentences into account by calculating the length ratio of the Lao and Chinese sentences, and combines the two values into a Lao-Chinese sentence similarity value, which makes the similarity calculation more reliable. Lao and Chinese sentences with higher similarity can thus be aligned during the alignment procedure, which simplifies sentence alignment. The invention can effectively mine parallel sentence pairs from a bilingual corpus. Combining the Lao-Chinese bilingual sentence similarity calculation with the optimal matching algorithm for bipartite graphs can effectively improve the accuracy of sentence alignment, so the invention has research significance.

Description

Lao-Chinese bilingual sentence similarity calculation method based on an improved relation vector model

Technical field

The present invention relates to a Lao-Chinese bilingual sentence similarity calculation method based on an improved relation vector model, and belongs to the technical field of natural language processing and machine learning.

Background technique

Sentence similarity calculation is an important and widely applied research topic in natural language processing. In a question answering system, a similarity method is needed to compare the question asked by the user with the questions in the system knowledge base, find the best-matching question, and return the best answer. In automatic summarization, sentence similarity is used to exclude sentences with similar meaning and so avoid redundancy in the summary. In the cross-language setting, Chinese-Lao bilingual sentence similarity calculation can be applied to searching Chinese-Lao hot news and to sharing Chinese-Lao educational resources, and it promotes Chinese-Lao cultural exchange and the development of both sides in all respects.

Summary of the invention

The technical problem to be solved by the present invention is to provide a Lao-Chinese bilingual sentence similarity calculation method based on an improved relation vector model. The method can effectively improve the accuracy of Lao-Chinese bilingual sentence similarity calculation and can also be used to expand Lao corpora, so the invention has research significance.

The technical solution adopted by the present invention is a Lao-Chinese bilingual sentence similarity calculation method based on an improved relation vector model, characterized by comprising the following steps:

Step1: first, perform word segmentation and part-of-speech tagging on the Chinese sentence T_i and the Lao sentence T_j in the corpus, and screen out the keywords of the Chinese sentence and the Lao sentence;

Step1.1: first, segment the Chinese sentence T_i and the Lao sentence T_j separately with a word segmentation system to obtain the segmented Chinese and Lao sentences;

Step1.2: after segmentation, perform part-of-speech tagging and screen out the main components of each sentence, namely the words whose part of speech is noun, pronoun, verb, adjective, or adverb, and take them as the keywords of the Chinese sentence and the Lao sentence; doing so preserves the semantic integrity of the sentence as far as possible;

Step2: convert the keywords of the Chinese sentence T_i and the Lao sentence T_j obtained in Step1 into a third-party language, English, to form the keyword vector representations of T_i and T_j;

Step2.1: Definition 1 (keyword vector representation): given a Chinese sentence T_i, the vector formed by the keywords m_i obtained after segmentation by the word segmentation system is called the keyword vector representation of T_i, written T_iv = {m₁, m₂, …, m_n};

Step3: after the keyword vector representations of the Chinese sentence T_i and the Lao sentence T_j have been formed, consider the keyword vector with the shorter length. Here it is assumed that Len(T_i) ≤ Len(T_j), that is, the keyword vector of the Chinese sentence is no longer than that of the Lao sentence. Calculate the initial weight value vector TB_i = {b₁, b₂, …, b_n} of the Chinese sentence T_i, and process each keyword m_i in T_i to calculate the Lao-Chinese bilingual sentence similarity value;

Step3.1: since the keyword vector representations and weight value vectors of the Chinese sentence T_i and the Lao sentence T_j are involved here, Definitions 2, 3, and 4 are given. Definition 2: given the keyword vector representation T_iv = {m₁, m₂, …, m_n} of a Chinese sentence T_i, the keyword m_i-1 immediately before keyword m_i in the vector is called the preceding keyword of m_i, and the keyword m_i+1 immediately after m_i is called the following keyword of m_i. Definition 3: given the keyword vector representation T_iv = {m₁, m₂, …, m_n} of a Chinese sentence T_i with vector length Len(T_i) = n, each keyword m_i is assigned an initial weight value [expression not reproduced in the source text]; the weight values of all keywords form a vector called the initial weight value vector of T_i, written TB_i = {b₁, b₂, …, b_n}. Definition 4: given the keyword vector representations of two sentences, a Chinese sentence T_i and a Lao sentence T_j, for any keyword m_i in T_iv, if m_i also appears in T_j, then m_i is said to exist in T_j. The vector formed by all keywords of T_i that exist in T_j is called the existence vector of T_i based on T_j, written E_i,j = {e₁, e₂, …, e_p}; the vector formed by the weight values corresponding to the keywords in the existence vector is called the existence value vector of T_i based on T_j, written TE_i,j = {v₁, v₂, …, v_p}. Step3.2 and Step3.3 are then carried out respectively;

Step3.2: improve the precision of the third-party language by correspondingly increasing the weight of keywords that are near-synonyms, then carry out Step3.4;

Step3.3: improve the precision of keyword position by increasing the number of preceding and following keywords that are examined, then carry out Step3.4;

Step3.4: from the initial weight value vector TB_i = {b₁, b₂, …, b_n} of the Chinese sentence T_i and the existence value vector TE_i,j = {v₁, v₂, …, v_p} of T_i based on the Lao sentence T_j, the Lao-Chinese bilingual sentence similarity value is calculated as shown in formula (1):

[formula (1) is not reproduced in the source text]

Specifically, the steps of Step3.2 are as follows:

Step3.2.1: assuming Len(T_i) ≤ Len(T_j), calculate the initial weight value vector TB_i = {b₁, b₂, …, b_n} of T_i;

Step3.2.2: for each keyword m_i in the Chinese sentence T_i, if m_i exists in the Lao sentence T_j or a synonym of it exists there, consider the preceding keywords of m_i in T_i and in T_j. If the two preceding keywords are the same word or synonyms, the weight corresponding to m_i in TB_i is increased α-fold; if the two preceding keywords are near-synonyms, the weight corresponding to m_i in TB_i is increased β-fold (1 < β < α). The following keyword of m_i is processed in the same way, which finally yields E_i,j = {e₁, e₂, ..., e_p} and TE_i,j = {v₁, v₂, ..., v_p}.

Specifically, the steps of Step3.3 are as follows:

Step3.3.1: assuming Len(T_i) ≤ Len(T_j), calculate the initial weight value vector TB_i = {b₁, b₂, …, b_n} of the Chinese sentence T_i;

Step3.3.2: for each keyword m_i in T_i, if m_i exists in the Lao sentence T_j or a synonym of it exists there, consider the preceding [count expression not reproduced in the source text] keywords of m_i in T_i and in T_j, where the count is rounded down and γ is the number of keywords in T_j. If these preceding keywords are the same words or synonyms, the weight corresponding to m_i in TB_i is increased α-fold; if they are near-synonyms, the weight corresponding to m_i in TB_i is increased β-fold (1 < β < α). The same number of following keywords of m_i is processed in the same way, which finally yields E_i,j = {e₁, e₂, ..., e_p} and TE_i,j = {v₁, v₂, ..., v_p}.

The beneficial effects of the present invention are:

1. The Lao-Chinese bilingual sentence similarity calculation method of the present invention builds on the vector space model and uses a third-party language to propose a relation vector model that considers both the structural and the semantic information of bilingual sentences. It effectively improves on the traditional vector space model and improves the accuracy of Lao-Chinese bilingual sentence similarity calculation to a certain extent.

2. The model used by the method considers the collocation relations between the keywords that make up a sentence as well as the synonym information of those keywords, and improves precision with respect to both the third-party language and keyword position. It reflects the structural and semantic information of sentences well and improves the accuracy of Lao-Chinese bilingual sentence similarity calculation.

3. The method realizes cross-language sentence similarity calculation. It can be applied to searching Chinese-Lao hot news, for example to find two headlines with similar meaning, and to excluding sentences with similar meaning when generating automatic summaries of Chinese-Lao online hot news, which avoids redundancy in the summary sentences and promotes Chinese-Lao cultural exchange and the development of both sides.

Brief description of the drawings

Fig. 1 is the overall flow chart of the present invention.

Fig. 2 illustrates the improvement in third-party language precision in the present invention.

Fig. 3 illustrates the improvement in keyword position precision in the present invention.

Specific embodiment

To describe the present invention in more detail and make it easier for those skilled in the art to understand, the invention is further described below with reference to the accompanying drawings and an embodiment. The embodiment in this section serves to illustrate the invention and aid understanding, and is not intended to limit it.

Embodiment 1: as shown in Figs. 1-3, a Lao-Chinese bilingual sentence similarity calculation method based on an improved relation vector model includes the following steps:

Step1: first, perform word segmentation and part-of-speech tagging on the Chinese sentence and the Lao sentence in the corpus, and screen out the keywords of the Chinese sentence and the Lao sentence.

Step1.1: first, segment the Chinese sentence T_i and the Lao sentence T_j separately with a word segmentation system to obtain the segmented Chinese and Lao sentences.

Step1.2: after segmentation, perform part-of-speech tagging and screen out the main components of each sentence, namely the words whose part of speech is noun, pronoun, verb, adjective, or adverb, and take them as the keywords of the Chinese sentence and the Lao sentence. Doing so preserves the semantic integrity of the sentence as far as possible.
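The screening in Step1.2 can be sketched as a simple filter over tagged tokens. This is a minimal illustration, not the patent's implementation: the input is assumed to be already segmented and tagged, and the tag names in `KEYWORD_POS` and the function `select_keywords` are hypothetical (a real tagger uses its own tag set).

```python
# Hypothetical coarse part-of-speech labels; a real tagger's tag set will differ.
KEYWORD_POS = {"noun", "pronoun", "verb", "adjective", "adverb"}


def select_keywords(tagged_tokens):
    """Keep the words whose tag is one of the five keyword classes, in sentence order."""
    return [word for word, pos in tagged_tokens if pos in KEYWORD_POS]


# Toy pre-tagged Chinese sentence; the particle is dropped, the other words are kept.
tagged = [("我", "pronoun"), ("喜欢", "verb"), ("的", "particle"), ("书", "noun")]
keywords = select_keywords(tagged)
print(keywords)
```

Keeping the original order matters because the later steps compare the neighbours of each keyword.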

Step2: from the segmentation result of Step1, extract the keywords of the Chinese sentence T_i and the Lao sentence T_j and convert them into a third-party language, English, to form the keyword vector representations of T_i and T_j.

Step2.1: since the keyword vectors of the Chinese sentence T_i and the Lao sentence T_j are involved here, Definition 1 is given. Definition 1: given a Chinese sentence T_i, the vector formed by the keywords m_i obtained after segmentation by the word segmentation system is called the keyword vector representation of T_i, written T_iv = {m₁, m₂, …, m_n}.
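As a rough sketch of Step2, the conversion to the third-party language can be pictured as a dictionary lookup applied to each keyword. The dictionaries below are toy entries chosen for illustration, and the name `to_keyword_vector` and the fallback of keeping an unknown keyword unchanged are assumptions, since the text does not say how the conversion is performed.

```python
def to_keyword_vector(keywords, to_english):
    """Map each keyword to English; keywords without an entry are kept unchanged (assumed)."""
    return [to_english.get(word, word) for word in keywords]


# Toy bilingual dictionaries, for illustration only.
zh_en = {"我": "I", "喜欢": "like", "书": "book"}
lo_en = {"ຂ້ອຍ": "I", "ມັກ": "like", "ປຶ້ມ": "book"}

t_i = to_keyword_vector(["我", "喜欢", "书"], zh_en)      # Chinese keyword vector
t_j = to_keyword_vector(["ຂ້ອຍ", "ມັກ", "ປຶ້ມ"], lo_en)   # Lao keyword vector
print(t_i, t_j)
```

Once both sentences are expressed in the same language, their keyword vectors can be compared directly.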

Step3: after the keyword vectors of the Chinese sentence T_i and the Lao sentence T_j have been formed, consider the keyword vector with the shorter length. Here it is assumed that Len(T_i) ≤ Len(T_j), that is, the keyword vector of the Chinese sentence is no longer than that of the Lao sentence. Calculate the initial weight value vector TB_i = {b₁, b₂, …, b_n} of the Chinese sentence T_i. Each keyword m_i in T_i is then processed in turn to calculate the Lao-Chinese bilingual sentence similarity value; Figs. 2 and 3 help in understanding the improvements made by the proposed method. The relation vector model considers not only whether a keyword of one sentence appears in the other sentence, but also the influence of the two words closest to that keyword (the preceding keyword and the following keyword). In this way the structural relations among all keywords in the sentence are reflected, which makes the analysis more comprehensive and accurate. The present invention makes several improvements to this model to raise the accuracy of Lao-Chinese bilingual sentence similarity calculation.

Step3.1: since the keyword vector representations and weight value vectors of the Chinese sentence T_i and the Lao sentence T_j are involved here, Definitions 2, 3, and 4 are given.

Definition 2: given the keyword vector representation T_iv = {m₁, m₂, …, m_n} of a Chinese sentence T_i, the keyword m_i-1 immediately before keyword m_i in the vector is called the preceding keyword of m_i, and the keyword m_i+1 immediately after m_i is called the following keyword of m_i.

Definition 3: given the keyword vector representation T_iv = {m₁, m₂, …, m_n} of a Chinese sentence T_i with vector length Len(T_i) = n, each keyword m_i is assigned an initial weight value [expression not reproduced in the source text]. The weight values of all keywords form a vector called the initial weight value vector of T_i, written TB_i = {b₁, b₂, …, b_n}.

Definition 4: given the keyword vector representations of two sentences, a Chinese sentence T_i and a Lao sentence T_j, for any keyword m_i in T_iv, if m_i also appears in T_j, then m_i is said to exist in T_j. The vector formed by all keywords of T_i that exist in T_j is called the existence vector of T_i based on T_j, written E_i,j = {e₁, e₂, …, e_p}. The vector formed by the weight values corresponding to the keywords in the existence vector is called the existence value vector of T_i based on T_j, written TE_i,j = {v₁, v₂, …, v_p}.
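Definitions 2 to 4 can be illustrated with a few small helpers. This is a sketch under stated assumptions: keyword vectors are plain Python lists of English keywords, the function names are hypothetical, and the initial weights are set to a placeholder value of 1.0 because the expression for the initial weight is not given in this text.

```python
def preceding_keyword(vec, i):
    """Definition 2: the keyword just before position i, or None at the start."""
    return vec[i - 1] if i > 0 else None


def following_keyword(vec, i):
    """Definition 2: the keyword just after position i, or None at the end."""
    return vec[i + 1] if i + 1 < len(vec) else None


def existence_vectors(t_i, t_j, tb_i):
    """Definition 4: keywords of t_i that appear in t_j (E) and their weights (TE)."""
    present = set(t_j)
    e = [m for m in t_i if m in present]
    te = [w for m, w in zip(t_i, tb_i) if m in present]
    return e, te


t_i = ["I", "like", "book"]
t_j = ["I", "read", "book", "often"]
tb_i = [1.0, 1.0, 1.0]  # placeholder initial weights (assumption, see lead-in)
e, te = existence_vectors(t_i, t_j, tb_i)
print(e, te)
```

Here "like" does not appear in `t_j`, so it contributes nothing to either vector.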

Step3.2: because the invention calculates sentence similarity by converting keywords into a third-party language, it is inevitably affected by that language, in particular by near-synonyms encountered during conversion. The precision of the third-party language therefore needs to be improved, which the invention achieves by correspondingly increasing the weight of keywords that are near-synonyms. Fig. 2 helps in understanding this improvement in third-party language precision.

Step3.2.1: assuming Len(T_i) ≤ Len(T_j), calculate the initial weight value vector TB_i = {b₁, b₂, …, b_n} of T_i.

Step3.2.2: for each keyword m_i in the Chinese sentence T_i, if m_i exists in the Lao sentence T_j or a synonym of it exists there, consider the preceding keywords of m_i in T_i and in T_j. If the two preceding keywords are the same word or synonyms, the weight corresponding to m_i in TB_i is increased α-fold; if the two preceding keywords are near-synonyms, the weight corresponding to m_i in TB_i is increased β-fold (1 < β < α). The following keyword of m_i is processed in the same way, which finally yields E_i,j = {e₁, e₂, ..., e_p} and TE_i,j = {v₁, v₂, ..., v_p}.
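One possible reading of Step3.2.2 is sketched below. Several points are assumptions rather than statements from the text: "increased α-fold" is read as multiplication by α, synonym and near-synonym lookups are modelled as sets of unordered word pairs, the first matching position in T_j is used, the boosted weights are the ones reported in TE, and the initial weights are placeholders. All function names are hypothetical.

```python
def relation(a, b, synonyms, near_synonyms):
    """Classify a keyword pair as 'same' (identical or synonym), 'near', or None."""
    if a is None or b is None:
        return None
    pair = frozenset((a, b))
    if a == b or pair in synonyms:
        return "same"
    if pair in near_synonyms:
        return "near"
    return None


def neighbour(vec, idx):
    """Return vec[idx], or None when idx falls outside the vector."""
    return vec[idx] if 0 <= idx < len(vec) else None


def existence_with_boost(t_i, t_j, tb, alpha, beta,
                         synonyms=frozenset(), near_synonyms=frozenset()):
    """Return (E, TE): keywords of t_i found in t_j and their adjusted weights."""
    if not 1 < beta < alpha:
        raise ValueError("expected 1 < beta < alpha")
    e, te = [], []
    for i, m in enumerate(t_i):
        # First position in t_j holding the same word or a synonym of m.
        j = next((k for k, w in enumerate(t_j)
                  if relation(m, w, synonyms, near_synonyms) == "same"), None)
        if j is None:
            continue
        weight = tb[i]
        for step in (-1, 1):  # preceding keyword, then following keyword
            rel = relation(neighbour(t_i, i + step), neighbour(t_j, j + step),
                           synonyms, near_synonyms)
            if rel == "same":
                weight *= alpha  # assumption: "increased α-fold" means multiply
            elif rel == "near":
                weight *= beta
        e.append(m)
        te.append(weight)
    return e, te


t_i = ["I", "like", "book"]
t_j = ["I", "enjoy", "book", "much"]
near = {frozenset(("like", "enjoy"))}  # toy near-synonym pair
e, te = existence_with_boost(t_i, t_j, [1.0, 1.0, 1.0], alpha=2.0, beta=1.5,
                             near_synonyms=near)
print(e, te)
```

In this toy run "I" and "book" both exist in `t_j`, and each has one neighbour pair ("like" and "enjoy") that is a near-synonym, so each weight is scaled by β once.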

Step3.3: similarly, Chinese and Lao sentences share a broadly similar subject + predicate + object structure, but there are subtle differences that can shift keyword positions. The single preceding keyword and single following keyword therefore cannot fully determine whether a keyword's weight should be increased, which causes a loss of precision due to keyword position. The invention improves keyword position precision by increasing the number of preceding and following keywords that are examined. Fig. 3 helps in understanding this improvement in keyword position precision.

Step3.3.1: assuming Len(T_i) ≤ Len(T_j), calculate the initial weight value vector TB_i = {b₁, b₂, …, b_n} of the Chinese sentence T_i.

Step3.3.2: for each keyword m_i in T_i, if m_i exists in the Lao sentence T_j or a synonym of it exists there, consider the preceding [count expression not reproduced in the source text] keywords of m_i in T_i and in T_j, where the count is rounded down and γ is the number of keywords in T_j. If these preceding keywords are the same words or synonyms, the weight corresponding to m_i in TB_i is increased α-fold; if they are near-synonyms, the weight corresponding to m_i in TB_i is increased β-fold (1 < β < α). The same number of following keywords of m_i is processed in the same way, which finally yields E_i,j = {e₁, e₂, ..., e_p} and TE_i,j = {v₁, v₂, ..., v_p}.
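The wider comparison in Step3.3.2 might look like the sketch below, which compares a window of preceding keywords instead of a single one. The window size is passed in as a parameter `k` because the expression that derives it from γ is not reproduced here. The position-by-position comparison, the rule that one near-synonym pair downgrades the whole window to "near", and the handling of windows that run off the start of a sentence are all assumptions, and the function names are hypothetical.

```python
def pair_relation(a, b, near_synonyms):
    """Classify one keyword pair as 'same', 'near', or None."""
    if a == b:
        return "same"
    if frozenset((a, b)) in near_synonyms:
        return "near"
    return None


def window_relation(t_i, i, t_j, j, k, near_synonyms=frozenset()):
    """Compare the k keywords before position i in t_i with the k before j in t_j."""
    if k < 1 or i < k or j < k:
        return None  # assumed handling when a full window is not available
    rels = [pair_relation(a, b, near_synonyms)
            for a, b in zip(t_i[i - k:i], t_j[j - k:j])]
    if all(r == "same" for r in rels):
        return "same"   # would trigger the α boost
    if all(r is not None for r in rels):
        return "near"   # would trigger the β boost
    return None


t_i = ["I", "really", "like", "book"]
t_j = ["I", "truly", "like", "book"]
near = {frozenset(("really", "truly"))}  # toy near-synonym pair

print(window_relation(t_i, 2, t_j, 2, 2, near))
print(window_relation(t_i, 3, t_j, 3, 1))
```

The following keywords would be handled symmetrically by slicing after positions `i` and `j`.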

During the experiments it was found that γ = Len(T_i) affects the accuracy of the final similarity, that is, considering the preceding and following [count expression not reproduced in the source text] (rounded down) keywords introduces error. Two situations arise. In the first, when the number of keywords is small, considering only the single preceding and following keyword has little effect on accuracy and the calculation remains accurate; but as the number of keywords grows, the error caused by the grammatical differences between Chinese and Lao also grows, the single preceding and following keyword can no longer guarantee an accurate calculation, and accuracy declines. In the second, when the number of keywords is small there is again little effect on accuracy; but as the number of keywords grows, considering that many preceding and following keywords causes keywords to be counted repeatedly, so the accuracy comes out too high. Comprehensive analysis therefore shows that Lao-Chinese bilingual sentence similarity calculation is more accurate when the number of keywords is between 5 and 7.

The present invention makes it possible to carry out Chinese-Lao bilingual sentence similarity calculation effectively even when little Lao corpus data is available, and it can also be used to expand Lao corpora, so the invention has research significance.

The embodiment of the present invention has been explained in detail above with reference to the accompanying drawings, but the invention is not limited to this embodiment. Within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the concept of the invention.

Claims

1. A Lao-Chinese bilingual sentence similarity calculation method based on an improved relation vector model, characterized by comprising the following steps:

Step1: first, perform word segmentation and part-of-speech tagging on the Chinese sentence T_i and the Lao sentence T_j in the corpus, and screen out the keywords of the Chinese sentence and the Lao sentence;

Step1.1: first, segment the Chinese sentence T_i and the Lao sentence T_j separately with a word segmentation system to obtain the segmented Chinese and Lao sentences;

Step1.2: after segmentation, perform part-of-speech tagging and screen out the main components of each sentence, namely the words whose part of speech is noun, pronoun, verb, adjective, or adverb, and take them as the keywords of the Chinese sentence and the Lao sentence; doing so preserves the semantic integrity of the sentence as far as possible;

Step2: convert the keywords of the Chinese sentence T_i and the Lao sentence T_j obtained in Step1 into a third-party language, English, to form the keyword vector representations of T_i and T_j;

Step2.1: Definition 1 (keyword vector representation): given a Chinese sentence T_i, the vector formed by the keywords m_i obtained after segmentation by the word segmentation system is called the keyword vector representation of T_i, written T_iv = {m₁, m₂, …, m_n};

Step3: after the keyword vector representations of the Chinese sentence T_i and the Lao sentence T_j have been formed, consider the keyword vector with the shorter length; here it is assumed that Len(T_i) ≤ Len(T_j), that is, the keyword vector of the Chinese sentence is no longer than that of the Lao sentence; calculate the initial weight value vector TB_i = {b₁, b₂, …, b_n} of the Chinese sentence T_i, and process each keyword m_i in T_i to calculate the Lao-Chinese bilingual sentence similarity value;

Step3.1: since the keyword vector representations and weight value vectors of the Chinese sentence T_i and the Lao sentence T_j are involved here, Definitions 2, 3, and 4 are given. Definition 2: given the keyword vector representation T_iv = {m₁, m₂, …, m_n} of a Chinese sentence T_i, the keyword m_i-1 immediately before keyword m_i in the vector is called the preceding keyword of m_i, and the keyword m_i+1 immediately after m_i is called the following keyword of m_i; Definition 3: given the keyword vector representation T_iv = {m₁, m₂, …, m_n} of a Chinese sentence T_i with vector length Len(T_i) = n, each keyword m_i is assigned an initial weight value [expression not reproduced in the source text], and the weight values of all keywords form a vector called the initial weight value vector of T_i, written TB_i = {b₁, b₂, …, b_n}; Definition 4: given the keyword vector representations of two sentences, a Chinese sentence T_i and a Lao sentence T_j, for any keyword m_i in T_iv, if m_i also appears in T_j, then m_i is said to exist in T_j; the vector formed by all keywords of T_i that exist in T_j is called the existence vector of T_i based on T_j, written E_i,j = {e₁, e₂, …, e_p}, and the vector formed by the weight values corresponding to the keywords in the existence vector is called the existence value vector of T_i based on T_j, written TE_i,j = {v₁, v₂, …, v_p}; Step3.2 and Step3.3 are then carried out respectively;

Step3.3: improve the precision of keyword position by increasing the number of preceding and following keywords that are examined, then carry out Step3.4;

Step3.4: from the initial weight value vector TB_i = {b₁, b₂, …, b_n} of the Chinese sentence T_i and the existence value vector TE_i,j = {v₁, v₂, …, v_p} of T_i based on the Lao sentence T_j, the Lao-Chinese bilingual sentence similarity value is calculated as shown in formula (1):

[formula (1) is not reproduced in the source text]

2. The Lao-Chinese bilingual sentence similarity calculation method based on an improved relation vector model according to claim 1, characterized in that the steps of Step3.2 are as follows:

Step3.2.1: assuming Len(T_i) ≤ Len(T_j), calculate the initial weight value vector TB_i = {b₁, b₂, …, b_n} of T_i;

Step3.2.2: for each keyword m_i in the Chinese sentence T_i, if m_i exists in the Lao sentence T_j or a synonym of it exists there, consider the preceding keywords of m_i in T_i and in T_j; if the two preceding keywords are the same word or synonyms, the weight corresponding to m_i in TB_i is increased α-fold; if the two preceding keywords are near-synonyms, the weight corresponding to m_i in TB_i is increased β-fold (1 < β < α); the following keyword of m_i is processed in the same way, which finally yields E_i,j = {e₁, e₂, ..., e_p} and TE_i,j = {v₁, v₂, ..., v_p}.

3. The Lao-Chinese bilingual sentence similarity calculation method based on an improved relation vector model according to claim 1, characterized in that the steps of Step3.3 are as follows:

Step3.3.2: for each keyword m_i in T_i, if m_i exists in the Lao sentence T_j or a synonym of it exists there, consider the preceding [count expression not reproduced in the source text] keywords of m_i in T_i and in T_j, where the count is rounded down and γ is the number of keywords in T_j; if these preceding keywords are the same words or synonyms, the weight corresponding to m_i in TB_i is increased α-fold; if they are near-synonyms, the weight corresponding to m_i in TB_i is increased β-fold (1 < β < α); the same number of following keywords of m_i is processed in the same way, which finally yields E_i,j = {e₁, e₂, ..., e_p} and TE_i,j = {v₁, v₂, ..., v_p}.