CN112632970A

CN112632970A - Similarity scoring algorithm combining subject synonyms and word vectors

Info

Publication number: CN112632970A
Application number: CN202011475757.8A
Authority: CN
Inventors: 付鹏斌; 杨广越; 杨惠荣; 施建国
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2020-12-15
Filing date: 2020-12-15
Publication date: 2021-04-09

Abstract

The invention relates to a similarity scoring algorithm that combines subject synonyms and word vectors and is used for automatic scoring of subjective questions in geography. First, with the geography subject as background, a geographic dictionary is built by extracting subject knowledge, and the dictionary is introduced into a Word2vec model to train word vectors and construct a geographic corpus. Second, because the synonym forest identifies subject synonyms inaccurately, a geographic synonym thesaurus is built. Finally, a part-of-speech-based keyword extraction and weight assignment algorithm is proposed, subject knowledge is merged into the text similarity calculation, a credibility value for sentence similarity is established from the word similarity, and the similarity scoring algorithm is realized. Experimental results show that the method is broadly consistent with teachers' scoring trends, with a scoring accuracy of 88.82%.

Description

Similarity scoring algorithm combining subject synonyms and word vectors

Technical Field

The present invention relates to the field of natural language processing and machine learning.

Background

Automatic scoring of subjective questions is a key technology for marking such questions in practice. Existing scoring methods require large amounts of expert-labeled data. They include: a traditional machine learning method that builds a random forest classifier from features such as semantic similarity and term weight to predict examinee scores; a short-answer scoring algorithm based on a deep autoencoder, which builds a scoring model when the target answer is not clearly defined; and automatic scoring with a neural network composed of a CNN and an LSTM. To automate online examinations, researchers have also proposed an automatic subjective-question marking model based on multi-feature sentence similarity. Such matching-based marking methods require complex features such as word form, semantics and syntax to be designed and computed by hand, their matching accuracy is low, they do not incorporate subject knowledge, and their marking accuracy differs considerably from that of machine learning and deep learning methods, so the results are unsatisfactory.

Disclosure of Invention

To address these problems, the invention builds a geographic synonym thesaurus to ensure accurate calculation for subject synonyms, fuses geographic knowledge into the corpus so that the vector-space representation of words better matches the geography subject, and realizes a similarity scoring algorithm by combining subject synonyms with word vectors. Experiments on real examinee data from Beijing and Shaanxi Province gave satisfactory results and verified the effectiveness of the algorithm.

The similarity scoring algorithm combining the subject synonyms and the word vectors comprises the following steps:

Step one: all texts from the high school geography knowledge list and from the "five-year college entrance examination, three-year simulation" college entrance examination geography volume, together with some middle school examination questions, are collected as the geographic knowledge corpus. Word segmentation and part-of-speech tagging with the Language Technology Platform (LTP) yield 18,140 words in total; sample data are shown in FIG. 1. Some words are tagged incorrectly: for example, "due west" and "due south", tagged as adjectives, should be tagged as direction nouns, and "surface", "meridian" and similar words, tagged as nouns, should be tagged as geographic proper nouns. To address this, the part-of-speech categories and meanings of modern Chinese are analyzed and, with reference to the LTP part-of-speech tag set, words with segmentation or tagging errors in the processed corpus are corrected manually. After correction, each word and its part of speech are separated by a space and written to the geographic dictionary, which contains 13,955 words in total;

Step two: Chinese Wikipedia is adopted as the initial corpus, word vectors are trained with the Word2vec CBOW (Continuous Bag-of-Words) model, and a geographic corpus is constructed;

Step three: a word vector model trained by Word2vec provides only the relatedness between words, so its semantic similarity calculation is not very accurate. The synonym forest (Tongyici Cilin) is commonly used to improve the accuracy of word semantic similarity, but it is not suited to identifying synonyms within a discipline. For example, in the synonym forest the synonyms of "fragile" include "weak", "long" and "long", which are interchangeable in everyday speech; in geography, however, the usual expression is "ecologically fragile", and replacing it with "ecological weak", "ecological long" or "long" leaves the meaning unclear and the expression ambiguous, which leads to scoring errors. To address the inaccurate identification of subject synonyms, the invention constructs a geographic synonym thesaurus by the following steps:

a. Read all words in the geographic dictionary and write them into the geographic synonym thesaurus, one word per line;

b. For each word in the geographic dictionary, query all of its synonyms in the synonym forest and append them after the word in the geographic synonym thesaurus, separated by spaces;

c. For each word in the geographic dictionary, query all of its similar words in the geographic corpus and append those with a similarity greater than 0.6 after the word in the geographic synonym thesaurus, separated by spaces;

d. Taking the first word in each line of the geographic synonym thesaurus as the target word and the following words as candidates, manually screen and supplement the candidates in each line using knowledge bases such as Baidu Encyclopedia.

At present the geographic synonym thesaurus contains 1,843 groups of subject synonyms. Because it was built manually, without the full background knowledge of geography experts and with some subjectivity, some subject synonyms may not be accurate enough; nevertheless, it identifies subject synonyms more accurately than the synonym forest.
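Steps a to c above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `build_candidate_rows`, the dictionary-based stand-in for the synonym forest, the `similar_words` callable standing in for a query against the geographic corpus, and all toy data are assumptions. Only the 0.6 threshold and the one-row-per-word, space-separated layout come from the text. Step d (manual screening) is not automated.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_candidate_rows(
    dictionary_words: Iterable[str],
    forest_synonyms: Dict[str, List[str]],
    similar_words: Callable[[str], List[Tuple[str, float]]],
    threshold: float = 0.6,
) -> List[str]:
    """Return one space-separated row per dictionary word (steps a to c)."""
    rows: List[str] = []
    for word in dictionary_words:  # step a: one target word per line
        candidates: List[str] = []
        # step b: synonyms listed in the synonym forest
        for synonym in forest_synonyms.get(word, []):
            if synonym != word and synonym not in candidates:
                candidates.append(synonym)
        # step c: corpus neighbours with similarity above the threshold
        for other, score in similar_words(word):
            if score > threshold and other != word and other not in candidates:
                candidates.append(other)
        rows.append(" ".join([word] + candidates))
    return rows


# Toy data, invented for illustration only.
toy_forest = {"fragile": ["weak"]}
toy_neighbours = {"fragile": [("vulnerable", 0.72), ("stable", 0.41)]}

rows = build_candidate_rows(
    ["fragile", "meridian"],
    toy_forest,
    lambda w: toy_neighbours.get(w, []),
)
print(rows)
```

Skipping duplicates and the target word itself is a choice made for this sketch; the text does not say how repeated candidates are handled before manual screening.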

Step four: part-of-speech-based keyword extraction and weight assignment. By analyzing college entrance examination geography papers, some middle school test questions and the features of their answers, the invention proposes a part-of-speech-based keyword extraction and weight assignment method, in which keywords are classified by part of speech and each class is given a weight;

Step five: word similarity calculation based on subject synonyms. To compute the semantic similarity of keywords during scoring, the concept of subject synonyms is introduced, and word similarity is calculated comprehensively from the subject synonyms in the geographic synonym thesaurus, the synonyms in the synonym forest and the similar words in the geographic corpus;

Step six: sentence similarity calculation based on word vectors. A sentence consists of a word or a group of syntactically related words and carries complex features such as word form, word order, sentence length and semantics; changing any of this information may change the meaning of the sentence, so sentences are difficult for a computer to understand. Drawing on sentence similarity methods from the literature, the invention takes the word vectors from the geographic corpus, weights them using the keyword extraction and weight assignment algorithm of step four to construct sentence vectors, and computes the cosine similarity of the two sentence vectors to obtain the sentence similarity;

Step seven: the similarity scoring algorithm combining subject synonyms and word vectors. The geographic synonym thesaurus ensures accurate word similarity, but sentence similarity is currently less accurate; in particular, a nonzero similarity can be calculated for two unrelated sentences, which makes scoring inaccurate. A comprehensive similarity is calculated from the word similarity and the sentence similarity between the examinee's answer and the standard answer, scoring is performed automatically, and a score coefficient is set according to the comprehensive similarity so as to match the way teachers mark.

Drawings

FIG. 1 is a process of constructing a geographic corpus;

FIG. 2 is a comparison of the scoring effect of experiment one of the present invention;

FIG. 3 is scoring-effect graph 1 of experiment two of the present invention;

FIG. 4 is scoring-effect graph 2 of experiment two of the present invention;

FIG. 5 is scoring-accuracy graph 1 of experiment two of the present invention;

FIG. 6 is scoring-accuracy graph 2 of experiment two of the present invention.

Detailed Description

The invention is further described with reference to the following figures and detailed description.

The process of the method comprises the following steps:

(1) geographic dictionary

All texts from the high school geography knowledge list and from the "five-year college entrance examination, three-year simulation" college entrance examination geography volume, together with some middle school examination questions, are collected as the geographic knowledge corpus. Word segmentation and part-of-speech tagging with the Language Technology Platform (LTP) yield 18,140 words in total; sample data are shown in FIG. 1. Some words are tagged incorrectly: for example, "due west" and "due south", tagged as adjectives, should be tagged as direction nouns, and "surface", "meridian" and similar words, tagged as nouns, should be tagged as geographic proper nouns. To address this, the part-of-speech categories and meanings of modern Chinese are analyzed and, with reference to the LTP part-of-speech tag set, words with segmentation or tagging errors in the processed corpus are corrected manually. After correction, each word and its part of speech are separated by a space and written to the geographic dictionary, which contains 13,955 words in total.

(2) Geographic corpus

Chinese Wikipedia is adopted as the initial corpus, word vectors are trained with the Word2vec CBOW (Continuous Bag-of-Words) model, and a geographic corpus is constructed.

Because LTP integrates a dictionary strategy, introducing the geographic dictionary during word segmentation allows the words in the dictionary to be identified more accurately. Word-vector relations were compared before and after introducing the geographic dictionary; in that comparison, "cold front" is a geography term, while "promotion" is a general-language word rather than a geography term.

(3) Geographic synonym library

A word vector model trained by Word2vec provides only the relatedness between words, so its semantic similarity calculation is not very accurate. The synonym forest is commonly used to improve the accuracy of word semantic similarity, but it is not suited to identifying synonyms within a discipline. For example, in the synonym forest the synonyms of "fragile" include "weak", "long" and "long", which are interchangeable in everyday speech; in geography, however, the usual expression is "ecologically fragile", and replacing it with "ecological weak", "ecological long" or "long" leaves the meaning unclear and the expression ambiguous, which leads to scoring errors. To address the inaccurate identification of subject synonyms, a geographic synonym thesaurus is constructed, following steps a to d set out under step three above.

(4) Part-of-speech-based keyword extraction and weight assignment

By analyzing college entrance examination geography papers, some middle school test questions and the features of their answers, the invention proposes a part-of-speech-based keyword extraction and weight assignment method: keywords are classified by part of speech, and each class of keywords is given the following weights:

The answer text is segmented and part-of-speech tagged using LTP and the geographic dictionary. Let the number of class-A words in the text be a, each with weight w_a; the number of class-B words be b, each with weight w_b; and the number of class-C words be c, each with weight w_c. Taking the weight of a class-B keyword as the reference x gives the keyword weight equation shown in formula 1:

From formula 1, the class-A keyword weight is derived as shown in formula 2:

The class-B keyword weight is shown in formula 3:

The class-C keyword weight is shown in formula 4:

The keyword extraction and weight assignment algorithm is as follows:

a. Input: a target text S;

b. Based on the geographic dictionary, use LTP to segment the text S and tag parts of speech; referring to the keyword weight assignment, extract keywords by part of speech to obtain the sequence Seq(x, t) = {(x₁, t₁), (x₂, t₂), (x₃, t₃), ..., (x_n, t_n)}, where x denotes a word, t denotes the part of speech tagged to the word x, and n denotes the number of words;

c. Traverse Seq(x, t) and, referring to the keyword weight assignment, count the number of class-A words a, the number of class-B words b and the number of class-C words c;

d. Calculate the class-A weight w_a by formula 2, the class-B weight w_b by formula 3 and the class-C weight w_c by formula 4;

e. Traverse Seq(x, t), determine the class to which each part of speech t belongs by referring to the keyword weight assignment, and give the word x the corresponding weight w, obtaining the sequence Seq(x, t, w) = {(x₁, t₁, w₁), (x₂, t₂, w₂), (x₃, t₃, w₃), ..., (x_n, t_n, w_n)};

f. Output: Seq(x, t, w).
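The flow of steps a to f can be sketched in Python as below. The class table and formulas 1 to 4 are not reproduced in this text, so three things here are assumptions made purely for illustration: the mapping `POS_CLASS` from part-of-speech tags to classes A, B and C, the ratios in `RATIO_TO_B` of each class weight to the class-B reference x, and the normalization a·w_a + b·w_b + c·w_c = 1 used to solve for x. The input is taken to be already segmented and tagged, so LTP itself is not called.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical class table and ratios (not taken from the patent).
POS_CLASS: Dict[str, str] = {"n": "A", "v": "B", "a": "C"}
RATIO_TO_B: Dict[str, float] = {"A": 2.0, "B": 1.0, "C": 0.5}


def assign_weights(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str, float]]:
    """Turn (word, pos) pairs into Seq(x, t, w) triples."""
    # step b: keep only words whose part of speech belongs to a keyword class
    seq = [(x, t) for x, t in tagged if t in POS_CLASS]
    # step c: count the words in each class (a, b, c)
    counts = Counter(POS_CLASS[t] for _, t in seq)
    # step d: solve (a*r_A + b*r_B + c*r_C) * x = 1 for the reference x
    # (assumed normalization; see the lead-in)
    total = sum(counts[cls] * RATIO_TO_B[cls] for cls in counts)
    if total == 0:
        return []
    x = 1.0 / total
    class_weight = {cls: RATIO_TO_B[cls] * x for cls in counts}
    # step e: attach the class weight to every keyword
    return [(word, t, class_weight[POS_CLASS[t]]) for word, t in seq]


# Toy input: (word, pos) pairs as a segmenter and tagger might produce them.
toy_tagged = [("cold_front", "n"), ("cause", "v"), ("strong", "a"), ("the", "u")]
seq_w = assign_weights(toy_tagged)
for item in seq_w:
    print(item)
```

Under the assumed normalization the weights of all keywords in one text sum to 1, which keeps a weighted word similarity built on them within a bounded range.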

(5) Word similarity calculation based on subject synonyms

To compute the semantic similarity of keywords during scoring, the invention introduces the concept of subject synonyms and calculates word similarity comprehensively from the subject synonyms in the geographic synonym thesaurus, the synonyms in the synonym forest and the similar words in the geographic corpus. The procedure is as follows:

1) Input: Word₁ and Word₂;

2) Initialization: the geographic synonym thesaurus dlSym, the list sym of entries coded "=" in the synonym forest, and the geographic corpus dlmodel;

3) Traverse dlSym and query the synonym list dlSymList of Word₁; if the length of dlSymList is greater than 0, go to 4), otherwise go to 5);

4) Traverse dlSymList and query whether Word₂ is present; if so, wordSim(Word₁, Word₂) = 1, go to 9); otherwise go to 5);

5) Traverse sym and query the synonym list symList of Word₁; if the length of symList is greater than 0, go to 6), otherwise go to 7);

6) Traverse symList and query whether Word₂ is present; if so, wordSim(Word₁, Word₂) = 0.8, go to 9); otherwise go to 7);

7) Query whether Word₁ exists in dlmodel; if so, go to 8); otherwise wordSim(Word₁, Word₂) = 0, go to 9);

8) Query whether Word₂ exists in dlmodel; if so, compute the similarity dlmodelSym(Word₁, Word₂) of Word₁ and Word₂ with dlmodel and set wordSim(Word₁, Word₂) = dlmodelSym(Word₁, Word₂) × 0.6, go to 9); otherwise wordSim(Word₁, Word₂) = 0, go to 9);

9) Output the word similarity: wordSim(Word₁, Word₂).
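The cascade in steps 1) to 9) maps naturally onto a short function. The sketch below is illustrative only: the dictionaries standing in for dlSym and sym, the vocabulary set and the `model_sim` callable standing in for dlmodel, and the toy data are all assumptions; the constants 1, 0.8 and 0.6 are the ones given in the steps.

```python
from typing import Callable, Dict, Set


def word_sim(
    word1: str,
    word2: str,
    dl_sym: Dict[str, Set[str]],      # geographic synonym thesaurus (dlSym)
    sym: Dict[str, Set[str]],         # "=" entries of the synonym forest (sym)
    vocab: Set[str],                  # words known to the corpus model (dlmodel)
    model_sim: Callable[[str, str], float],  # dlmodelSym
) -> float:
    # steps 3) and 4): subject synonyms score 1
    dl_sym_list = dl_sym.get(word1, set())
    if dl_sym_list and word2 in dl_sym_list:
        return 1.0
    # steps 5) and 6): general synonyms score 0.8
    sym_list = sym.get(word1, set())
    if sym_list and word2 in sym_list:
        return 0.8
    # steps 7) and 8): fall back to the corpus similarity scaled by 0.6
    if word1 in vocab and word2 in vocab:
        return model_sim(word1, word2) * 0.6
    return 0.0


# Toy resources, invented for illustration only.
toy_dl_sym = {"fragile": {"vulnerable"}}
toy_sym = {"fragile": {"weak"}}
toy_vocab = {"fragile", "stable"}


def toy_model_sim(a: str, b: str) -> float:
    return 0.5


for other in ["vulnerable", "weak", "stable", "unknown"]:
    print(other, word_sim("fragile", other, toy_dl_sym, toy_sym, toy_vocab, toy_model_sim))
```

The steps do not special-case two identical words, and the sketch follows them literally; whether a word's own synonym list contains the word itself is not stated in the text.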

(6) Sentence similarity calculation based on word vectors

Drawing on an existing sentence similarity calculation method, the invention takes the word vectors from the geographic corpus, weights them using the keyword extraction and weight assignment algorithm to construct sentence vectors, and computes the cosine similarity of the two sentence vectors to obtain the sentence similarity. The algorithm is as follows:

a. Input: sentences S₁ and S₂;

b. Initialization: the geographic corpus dlmodel;

c. Extract keywords and assign weights for S₁ and S₂ by the method of section (4) to obtain the sequences Seq₁ and Seq₂;

d. Traverse Seq₁ and Seq₂ and look up each word in dlmodel; if the word exists, convert it into a word vector with dlmodel (the word vector dimension is 192), otherwise mark it as a zero vector;

e. Denote the word vector of the i-th word in a sequence Seq as wv_i and its weight as w_i; apply formula 5 to Seq₁ and Seq₂ separately to obtain the sentence vectors V₁ and V₂, where n denotes the number of words;

f. Using the cosine similarity algorithm, calculate the cosine similarity cosSim(V₁, V₂) of V₁ and V₂ by formula 6, and set the sentence similarity sentenceSim(S₁, S₂) of S₁ and S₂ equal to cosSim(V₁, V₂);

g. Output the sentence similarity: sentenceSim(S₁, S₂).
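Steps c to g can be sketched as follows. Formula 5 is not reproduced in this text, so the sentence vector is assumed here to be the weighted sum of the word vectors; any positive per-sentence scale factor (such as 1/n) would leave the cosine similarity unchanged. The function names, the 3-dimensional toy vectors (the text uses 192 dimensions) and the dictionary standing in for dlmodel are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

WeightedSeq = List[Tuple[str, float]]  # (word, weight) pairs from section (4)


def sentence_vector(seq: WeightedSeq, vectors: Dict[str, Sequence[float]], dim: int) -> List[float]:
    """Weighted sum of word vectors (assumed form of formula 5)."""
    v = [0.0] * dim
    for word, weight in seq:
        wv = vectors.get(word)
        if wv is None:
            continue  # step d: an unknown word is a zero vector and adds nothing
        for k in range(dim):
            v[k] += weight * wv[k]
    return v


def cos_sim(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Cosine similarity; defined as 0 here when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def sentence_sim(seq1: WeightedSeq, seq2: WeightedSeq,
                 vectors: Dict[str, Sequence[float]], dim: int) -> float:
    return cos_sim(sentence_vector(seq1, vectors, dim),
                   sentence_vector(seq2, vectors, dim))


# Toy vectors, invented for illustration only.
toy_vectors = {"cold_front": [1.0, 0.0, 0.0], "rain": [0.0, 1.0, 0.0]}
s1 = [("cold_front", 0.6), ("rain", 0.4)]
s2 = [("rain", 1.0)]
print(sentence_sim(s1, s1, toy_vectors, 3))
print(sentence_sim(s1, s2, toy_vectors, 3))
```

Returning 0 for an all-zero sentence vector is a choice made for this sketch; the text does not say how that case is handled.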

(7) Similarity scoring algorithm combining subject synonyms and word vectors

The geographic synonym thesaurus ensures accurate word similarity, but sentence similarity is currently less accurate; in particular, a nonzero similarity can be calculated for two unrelated sentences, which makes scoring inaccurate.

A comprehensive similarity is calculated from the word similarity and the sentence similarity between the examinee's answer and the standard answer, scoring is performed automatically, and a score coefficient λ is set according to the comprehensive similarity so as to match the way teachers mark. The correspondence between the similarity and the score coefficient λ is as follows:

The similarity scoring algorithm combining subject synonyms and word vectors is as follows:

(1) Input: the standard answer text B, its corresponding score, and the examinee answer text S;

(2) Extract keywords and assign weights for B and S by the method of section (4) to obtain the sequences

SeqB(bx, bt, bw) = {(bx₁, bt₁, bw₁), (bx₂, bt₂, bw₂), ..., (bx_n, bt_n, bw_n)},

SeqS(sx, st, sw) = {(sx₁, st₁, sw₁), (sx₂, st₂, sw₂), ..., (sx_n, st_n, sw_n)};

(3) Traverse SeqB(bx, bt, bw) and, for the i-th word, find in SeqS(sx, st, sw) the set of words whose part of speech is the same as bt_i: SWList(sx) = {sx₁, sx₂, ..., sx_n};

(4) Using the algorithm of section (5), calculate the similarity between bx_i and each word in SWList(sx) in turn, and multiply the maximum of the results by the weight bw_i to obtain the weighted word similarity s_i; the word similarity of B and S is then wordSim(B, S) = Σ s_i;

(5) Apply the sentence similarity algorithm of section (6) to B and S to obtain the sentence similarity sentenceSim(B, S);

The correspondence between the sentence similarity credibility value α and the word similarity is as follows:

(6) Referring to this correspondence, determine the sentence similarity credibility value α from wordSim(B, S), and calculate the comprehensive similarity similarity(B, S) of B and S using expression 3;

similarity(B, S) = wordSim(B, S) + sentenceSim(B, S) × α (3)

(7) Referring to the correspondence between the similarity and the score coefficient, determine the score coefficient λ from similarity(B, S), and calculate the examinee's score studentScore according to formula 4;

studentScore = score × λ (4)

(8) Output the examinee's score: studentScore.
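Steps (3) to (8) can be sketched in Python as below. The two correspondence tables (word similarity to α, and comprehensive similarity to λ) are not reproduced in this text, so `ALPHA_TABLE` and `LAMBDA_TABLE` contain hypothetical thresholds chosen only to make the example run. The helper names, the `word_sim` callable and the precomputed sentence similarity passed in as a number are likewise assumptions; expression 3 and formula 4 are used as written.

```python
from typing import Callable, List, Tuple

Keyword = Tuple[str, str, float]  # (word, part of speech, weight)

# Hypothetical tables: (lower bound, coefficient), highest bound first.
ALPHA_TABLE = [(0.8, 0.2), (0.5, 0.1)]
LAMBDA_TABLE = [(0.9, 1.0), (0.6, 0.75), (0.3, 0.5)]


def lookup(value: float, table: List[Tuple[float, float]], default: float = 0.0) -> float:
    """Return the coefficient of the first band whose lower bound is reached."""
    for lower_bound, coefficient in table:
        if value >= lower_bound:
            return coefficient
    return default


def word_level_similarity(seq_b: List[Keyword], seq_s: List[Keyword],
                          word_sim: Callable[[str, str], float]) -> float:
    total = 0.0
    for bx, bt, bw in seq_b:
        # step (3): examinee words with the same part of speech as bx
        sw_list = [sx for sx, st, _ in seq_s if st == bt]
        if not sw_list:
            continue
        # step (4): best match, weighted by bw
        total += max(word_sim(bx, sx) for sx in sw_list) * bw
    return total


def student_score(score: float, seq_b: List[Keyword], seq_s: List[Keyword],
                  word_sim: Callable[[str, str], float], sentence_sim: float) -> float:
    w_sim = word_level_similarity(seq_b, seq_s, word_sim)
    alpha = lookup(w_sim, ALPHA_TABLE)         # step (6), hypothetical table
    similarity = w_sim + sentence_sim * alpha  # expression (3)
    lam = lookup(similarity, LAMBDA_TABLE)     # step (7), hypothetical table
    return score * lam                         # formula (4)


# Toy data, invented for illustration only.
def exact_match(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


standard = [("cold_front", "n", 0.6), ("rain", "n", 0.4)]
full_answer = [("cold_front", "n", 0.6), ("rain", "n", 0.4)]
partial_answer = [("rain", "n", 1.0)]

print(student_score(4.0, standard, full_answer, exact_match, 0.9))
print(student_score(4.0, standard, partial_answer, exact_match, 0.5))
```

Because α is looked up from the word similarity, a low word similarity suppresses the sentence similarity term, which is how the credibility value guards against unrelated sentences that still receive a nonzero cosine similarity.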

Claims

1. A similarity scoring algorithm combining subject synonyms and word vectors, characterized by comprising the following steps:

step one, part-of-speech-based keyword extraction and weight assignment;

step two, word similarity calculation based on subject synonyms;

step three, sentence similarity calculation based on word vectors;

step four, the similarity scoring algorithm combining subject synonyms and word vectors.

2. The similarity scoring algorithm combining subject synonyms and word vectors according to claim 1, characterized in that the part-of-speech-based keyword extraction and weight assignment in step one are as follows:

the keyword extraction algorithm is as follows:

a. Input: a target text S;

b. Based on the geographic dictionary, use LTP to segment the text S and tag parts of speech; referring to the weight assignment, extract keywords by part of speech to obtain the sequence Seq(x, t) = {(x₁, t₁), (x₂, t₂), (x₃, t₃), ..., (x_n, t_n)}, where x denotes a word, t denotes the part of speech tagged to the word x, and n denotes the number of words;

c. Traverse Seq(x, t) and, referring to the weight assignment, count the number of class-A words a, the number of class-B words b and the number of class-C words c;

d. Calculate the class-A weight w_a, the class-B weight w_b and the class-C weight w_c;

e. Traverse Seq(x, t), determine the class to which each part of speech t belongs by referring to the weight assignment, and give the word x the corresponding weight w, obtaining the sequence Seq(x, t, w) = {(x₁, t₁, w₁), (x₂, t₂, w₂), (x₃, t₃, w₃), ..., (x_n, t_n, w_n)};

f. Output: Seq(x, t, w).

3. The similarity scoring algorithm combining subject synonyms and word vectors according to claim 1, characterized in that the word similarity based on subject synonyms in step two is calculated as follows:

1) Input: Word₁ and Word₂;

9) Output the word similarity: wordSim(Word₁, Word₂).

4. The similarity scoring algorithm combining subject synonyms and word vectors according to claim 1, characterized in that the sentence similarity based on word vectors in step three is calculated as follows:

a. Input: sentences S₁ and S₂;

b. Initialization: the geographic corpus dlmodel;

c. Extract keywords and assign weights for S₁ and S₂ by the method of step one to obtain the sequences Seq₁ and Seq₂;

e. Denote the word vector of the i-th word in a sequence Seq as wv_i and its weight as w_i; apply equation 1 to Seq₁ and Seq₂ separately to obtain the sentence vectors V₁ and V₂;

f. Using the cosine similarity algorithm, calculate the cosine similarity cosSim(V₁, V₂) of V₁ and V₂ by equation 2, and set the sentence similarity sentenceSim(S₁, S₂) of S₁ and S₂ equal to cosSim(V₁, V₂);

g. Output the sentence similarity: sentenceSim(S₁, S₂).

5. The similarity scoring algorithm combining subject synonyms and word vectors according to claim 1, characterized in that the similarity scoring algorithm combining subject synonyms and word vectors in step four is as follows:

similarity(B, S) = wordSim(B, S) + sentenceSim(B, S) × α (3)

the correspondence between the similarity and the score coefficient λ is as follows:

studentScore = score × λ (4)

(8) Output the examinee's score: studentScore.