CN112632970A - Similarity scoring algorithm combining subject synonyms and word vectors - Google Patents

Publication number
CN112632970A
Authority
CN
China
Prior art keywords
word
similarity
algorithm
geographic
scoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011475757.8A
Other languages
Chinese (zh)
Inventor
付鹏斌
杨广越
杨惠荣
施建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202011475757.8A
Publication of CN112632970A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention relates to a similarity scoring algorithm combining subject synonyms and word vectors, used for the automatic scoring of subjective questions in geography. First, taking the geography subject as the background, a geographic dictionary is built by extracting subject knowledge, and the dictionary is introduced when training word vectors with the Word2vec model, yielding a geographic corpus. Second, because the Synonym Forest (Tongyici Cilin) identifies subject synonyms inaccurately, a geographic synonym thesaurus is built. Finally, a part-of-speech-based keyword extraction and weight assignment algorithm is proposed, subject knowledge is incorporated into the text similarity calculation, a credibility value for the sentence similarity is set according to the word similarity, and the similarity scoring algorithm is realized. Experimental results show that the method is broadly consistent with the scoring trend of teachers, with a scoring accuracy of 88.82%.

Description

Similarity scoring algorithm combining subject synonyms and word vectors
Technical Field
The present invention relates to the field of natural language processing and machine learning.
Background
Automatic scoring of subjective questions is a key technology for practical subjective-question grading. Existing scoring methods require large amounts of expert-labeled data. They include: traditional machine learning methods that build a random forest classifier from features such as semantic similarity and term weight to predict examinee scores; a short-answer scoring algorithm based on a deep autoencoder, which builds a scoring model when the target answer is not clearly defined; and automatic scoring with a neural network composed of a CNN and an LSTM. To automate online examinations, researchers have also proposed an automatic subjective-question marking model based on multi-feature sentence similarity. Such matching-based marking methods require complex features such as word form, semantics and syntax to be designed and computed by hand, their matching accuracy is low, and they do not incorporate subject knowledge. Their marking accuracy falls well short of the results obtained with machine learning and deep learning, so the effect is not ideal.
Disclosure of Invention
To address these problems, the invention ensures the accuracy of subject synonym calculation by constructing a geographic synonym thesaurus, and fuses geographic knowledge into the corpus so that the representation of words in the vector space better matches the background of the geography subject. It realizes a similarity scoring algorithm that combines subject synonyms and word vectors. Experiments on real examinee data from Beijing and Shaanxi Province give good results and verify the effectiveness of the algorithm.
The similarity scoring algorithm combining the subject synonyms and the word vectors comprises the following steps:
step one, collecting all the text and some secondary school examination questions from the High School Geography Knowledge List and from Five-Year College Entrance Examination, Three-Year Simulation: College Entrance Examination Geography as the geographic knowledge corpus, and performing word segmentation and part-of-speech tagging with the Language Technology Platform (LTP), which yields 18,140 words in total. Sample data is shown in FIG. 1. Some words are tagged incorrectly: for example, "due west" and "due south" are tagged as adjectives but should be tagged as direction nouns, and "surface" and "meridian" are tagged as nouns but should be tagged as geographic proper nouns. To address this, the method analyzes the part-of-speech categories and meanings of modern Chinese and, with reference to the LTP part-of-speech tag set, manually corrects the words whose segmentation or part-of-speech tag is wrong in the processing result of the geographic knowledge corpus. After manual correction, each word and its part of speech are separated by a space and written into the geographic dictionary, which contains 13,955 words in total;
step two, taking Chinese Wikipedia as the initial corpus, training word vectors with the CBOW (Continuous Bag-of-Words) model of Word2vec, and constructing the geographic corpus;
step three, constructing the geographic synonym thesaurus. A word vector model trained with Word2vec only provides the relatedness between words, so its accuracy for semantic similarity is not high. The Synonym Forest is commonly used to improve the accuracy of word semantic similarity, but it is not suited to identifying synonyms within a discipline. For example, in the Synonym Forest the synonyms of "fragile" include "weak" and "long", which are interchangeable in everyday usage; in geography, however, the usual expression is "ecologically fragile", and replacing it with "ecologically weak" or "ecologically long" gives an unclear, ambiguous expression that leads to scoring errors. To address the inaccurate identification of subject synonyms, the invention constructs a geographic synonym thesaurus in the following steps:
a. reading all words in the geographic dictionary and writing them into the geographic synonym thesaurus, one word per line;
b. for each word in the geographic dictionary, querying all of its synonyms in the Synonym Forest and appending them after the word in the geographic synonym thesaurus, separated by spaces;
c. for each word in the geographic dictionary, querying all of its similar words in the geographic corpus and appending those with a similarity greater than 0.6 after the word in the geographic synonym thesaurus, separated by spaces;
d. taking the first word of each line in the geographic synonym thesaurus as the target word and the following words as candidate words, and manually screening and supplementing the candidate words of each line with reference to knowledge bases such as Baidu Baike.
At present the geographic synonym thesaurus contains 1,843 groups of subject synonyms. Because the thesaurus was built manually, without the full knowledge background of geography subject experts and with a degree of subjectivity, some subject synonyms may not be accurate enough; nevertheless, it identifies subject synonyms more accurately than the Synonym Forest.
step four, extracting keywords based on part of speech and assigning weights. By analyzing college entrance examination geography papers, some secondary school test questions and the features of their answers, the invention proposes a part-of-speech-based keyword extraction and weight assignment method: keywords are classified by part of speech, and each class of keywords is given a weight;
step five, calculating word similarity based on subject synonyms. To calculate the semantic similarity of keywords during scoring, the concept of subject synonyms is introduced, and word similarity is calculated by combining the subject synonyms in the geographic synonym thesaurus, the synonyms in the Synonym Forest and the similar words in the geographic corpus;
step six, calculating sentence similarity based on word vectors. A sentence consists of a word or a group of syntactically related words and carries complex features such as word form, word order, sentence length and semantics; changing any piece of information in a sentence may change its meaning, which makes sentences hard for a computer to understand. Drawing on sentence similarity methods from the literature, the invention takes the word vectors in the geographic corpus, weights them with the keyword extraction and weight assignment algorithm of step four to build sentence vectors, and calculates the cosine similarity of the two sentence vectors to obtain the sentence similarity;
step seven, scoring with the similarity scoring algorithm combining subject synonyms and word vectors. Constructing the geographic synonym thesaurus ensures the accuracy of the word similarity, but the accuracy of the sentence similarity is currently not high; in particular, a non-trivial similarity value can be obtained even for two unrelated sentences, which makes scoring inaccurate. The comprehensive similarity is therefore calculated from the word similarity and the sentence similarity between the examinee answer and the standard answer, scoring is performed automatically, and a scoring coefficient is set according to the comprehensive similarity so as to match the way teachers mark.
Drawings
FIG. 1 shows the process of constructing the geographic corpus;
FIG. 2 is a comparison of the scoring effect in experiment one of the present invention;
FIG. 3 is scoring effect graph 1 of experiment two of the present invention;
FIG. 4 is scoring effect graph 2 of experiment two of the present invention;
FIG. 5 is scoring accuracy graph 1 of experiment two of the present invention;
FIG. 6 is scoring accuracy graph 2 of experiment two of the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
The process of the method comprises the following steps:
(1) geographic dictionary
The method collects all the text and some secondary school examination questions from the High School Geography Knowledge List and from Five-Year College Entrance Examination, Three-Year Simulation: College Entrance Examination Geography as the geographic knowledge corpus, and performs word segmentation and part-of-speech tagging with the Language Technology Platform (LTP), which yields 18,140 words in total. Sample data is shown in FIG. 1. Some words are tagged incorrectly: for example, "due west" and "due south" are tagged as adjectives but should be tagged as direction nouns, and "surface" and "meridian" are tagged as nouns but should be tagged as geographic proper nouns. To address this, the method analyzes the part-of-speech categories and meanings of modern Chinese and, with reference to the LTP part-of-speech tag set, manually corrects the words whose segmentation or part-of-speech tag is wrong in the processing result of the geographic knowledge corpus. After manual correction, each word and its part of speech are separated by a space and written into the geographic dictionary, which contains 13,955 words in total.
(2) Geographic corpus
The method takes Chinese Wikipedia as the initial corpus, trains word vectors with the CBOW (Continuous Bag-of-Words) model of Word2vec, and constructs the geographic corpus.
Because LTP supports a dictionary strategy, introducing the geographic dictionary during word segmentation allows the words in the dictionary to be identified more accurately. The relationships among word vectors were compared before and after the geographic dictionary was introduced, using "cold front", which is a term of the geography discipline, and "promotion", which is an everyday word and not a geographic term.
(3) Geographic synonym thesaurus
A word vector model trained with Word2vec only provides the relatedness between words, so its accuracy for semantic similarity is not high. The Synonym Forest is commonly used to improve the accuracy of word semantic similarity, but it is not suited to identifying synonyms within a discipline. For example, in the Synonym Forest the synonyms of "fragile" include "weak" and "long", which are interchangeable in everyday usage; in geography, however, the usual expression is "ecologically fragile", and replacing it with "ecologically weak" or "ecologically long" gives an unclear, ambiguous expression that leads to scoring errors. To address the inaccurate identification of subject synonyms, a geographic synonym thesaurus is constructed in the following steps:
a. reading all words in the geographic dictionary and writing them into the geographic synonym thesaurus, one word per line;
b. for each word in the geographic dictionary, querying all of its synonyms in the Synonym Forest and appending them after the word in the geographic synonym thesaurus, separated by spaces;
c. for each word in the geographic dictionary, querying all of its similar words in the geographic corpus and appending those with a similarity greater than 0.6 after the word in the geographic synonym thesaurus, separated by spaces;
d. taking the first word of each line in the geographic synonym thesaurus as the target word and the following words as candidate words, and manually screening and supplementing the candidate words of each line with reference to knowledge bases such as Baidu Baike.
At present the geographic synonym thesaurus contains 1,843 groups of subject synonyms. Because the thesaurus was built manually, without the full knowledge background of geography subject experts and with a degree of subjectivity, some subject synonyms may not be accurate enough; nevertheless, it identifies subject synonyms more accurately than the Synonym Forest.
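Steps a to c above, which produce the candidate lines that are then screened by hand in step d, can be sketched as follows. This is a minimal illustration and not the inventors' implementation: the function name, the dictionary-based stand-in for the Synonym Forest, and the `similar_words` callback standing in for a query against the geographic corpus are all assumptions, as is the removal of duplicate candidates.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_candidate_thesaurus(
    dictionary_words: Iterable[str],
    cilin_synonyms: Dict[str, List[str]],
    similar_words: Callable[[str], List[Tuple[str, float]]],
    threshold: float = 0.6,
) -> List[str]:
    """Return one space-separated line per dictionary word: target word first,
    then candidate synonyms awaiting manual screening (step d)."""
    lines = []
    for word in dictionary_words:                    # step a: one word per line
        candidates: List[str] = []
        for syn in cilin_synonyms.get(word, []):     # step b: Synonym Forest synonyms
            if syn != word and syn not in candidates:
                candidates.append(syn)
        for other, sim in similar_words(word):       # step c: corpus neighbours
            # Keep only neighbours whose similarity is greater than 0.6.
            if sim > threshold and other != word and other not in candidates:
                candidates.append(other)
        lines.append(" ".join([word] + candidates))
    return lines


# Toy data, purely illustrative.
toy_cilin = {"fragile": ["weak"]}
toy_neighbours = {"fragile": [("vulnerable", 0.72), ("weak", 0.65), ("cold", 0.30)]}
toy_lines = build_candidate_thesaurus(
    ["fragile", "meridian"], toy_cilin, lambda w: toy_neighbours.get(w, [])
)
```

Automatic candidate generation followed by manual review keeps recall high while leaving the final judgment on subject synonyms to a person, which is the division of labour the steps describe.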
(4) Part-of-speech-based keyword extraction and weight assignment
Keywords are extracted based on part of speech and assigned weights. By analyzing college entrance examination geography papers, some secondary school test questions and the features of their answers, the invention proposes a part-of-speech-based keyword extraction and weight assignment method: keywords are classified by part of speech, and each class of keywords is given the following weight:
Figure BDA0002835257510000041
The answer text is segmented and tagged with parts of speech using LTP and the geographic dictionary. Let the number of class A words in the text be a, each with weight w_a; the number of class B words be b, each with weight w_b; and the number of class C words be c, each with weight w_c. Taking the weight of the class B keywords as the reference x gives the keyword weight equation shown as formula 1:
Figure BDA0002835257510000042
Deriving from formula 1, the weight of the class A keywords is given by formula 2:
Figure BDA0002835257510000043
the calculation of the weight of the B-type keyword is shown as formula 3:
Figure BDA0002835257510000051
the class C keyword weight calculation is shown in equation 4:
Figure BDA0002835257510000052
The keyword extraction and weight assignment algorithm is as follows:
a. Input: the target text S;
b. Based on the geographic dictionary, segment the text S and tag parts of speech with LTP, then, with reference to the keyword weight assignment, extract the keywords by part of speech to obtain the sequence Seq(x, t) = {(x_1, t_1), (x_2, t_2), (x_3, t_3), ..., (x_n, t_n)}, where x denotes a word, t denotes the part of speech tagged for the word x, and n denotes the number of words;
c. Traverse Seq(x, t) and, with reference to the keyword weight assignment, count the number of class A words a, the number of class B words b and the number of class C words c;
d. Calculate the class A part-of-speech weight w_a by formula 2, the class B part-of-speech weight w_b by formula 3, and the class C part-of-speech weight w_c by formula 4;
e. Traverse Seq(x, t), determine the class to which each part of speech t belongs with reference to the keyword weight assignment, and give the word x the corresponding weight w, obtaining the sequence Seq(x, t, w) = {(x_1, t_1, w_1), (x_2, t_2, w_2), (x_3, t_3, w_3), ..., (x_n, t_n, w_n)};
f. Output: Seq(x, t, w).
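The flow of steps a to f can be sketched in code, with one important caveat: the class table and formulas 1 to 4 are images that are not reproduced in this text, so the concrete values below are placeholders. The part-of-speech-to-class mapping `POS_CLASS`, the class ratios `CLASS_RATIO` (each class weight expressed as a multiple of the class B reference x), and the constraint that all keyword weights sum to 1 are assumptions made only to obtain a runnable illustration. Segmentation and tagging by LTP are taken as already done, so the input is a list of (word, part of speech) pairs.

```python
from typing import Dict, List, Tuple

# HYPOTHETICAL values: the patent's class table is not reproduced in the text.
POS_CLASS: Dict[str, str] = {"n": "A", "v": "B", "a": "C"}
CLASS_RATIO: Dict[str, float] = {"A": 2.0, "B": 1.0, "C": 0.5}  # multiples of x


def assign_weights(
    tagged: List[Tuple[str, str]],
    pos_class: Dict[str, str] = POS_CLASS,
    ratio: Dict[str, float] = CLASS_RATIO,
) -> List[Tuple[str, str, float]]:
    """Return Seq(x, t, w) for the keywords found in a tagged text."""
    # Step b: keep only words whose part of speech belongs to a keyword class.
    seq = [(x, t) for x, t in tagged if t in pos_class]
    # Step c: count the words of each class (a, b, c).
    counts = {cls: 0 for cls in ratio}
    for _x, t in seq:
        counts[pos_class[t]] += 1
    # Step d (ASSUMED form): solve a*w_a + b*w_b + c*w_c = 1 for the reference x.
    total = sum(counts[cls] * ratio[cls] for cls in counts)
    if total == 0:
        return []
    x_ref = 1.0 / total
    class_weight = {cls: ratio[cls] * x_ref for cls in counts}
    # Step e: attach the class weight to every keyword.
    return [(x, t, class_weight[pos_class[t]]) for x, t in seq]


weighted = assign_weights(
    [("monsoon", "n"), ("affects", "v"), ("strong", "a"), ("the", "u")]
)
```

With one keyword in each class, the three weights keep the assumed 2 : 1 : 0.5 ratio and sum to 1; the untagged function word is dropped at step b.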
(5) Term similarity calculation based on subject synonyms
To calculate the semantic similarity of keywords during scoring, the invention introduces the concept of subject synonyms and calculates word similarity by combining the subject synonyms in the geographic synonym thesaurus, the synonyms in the Synonym Forest and the similar words in the geographic corpus. The procedure is as follows:
1) Input: Word_1 and Word_2;
2) Initialization: the geographic synonym thesaurus dlSym, the list sym of entries coded "=" in the Synonym Forest, and the geographic corpus dlmodel;
3) Traverse dlSym and query the synonym list dlSymList of Word_1; if the length of dlSymList is greater than 0, go to 4); otherwise go to 5);
4) Traverse dlSymList and query Word_2; if it is present, wordSim(Word_1, Word_2) = 1, go to 9); otherwise go to 5);
5) Traverse sym and query the synonym list symList of Word_1; if the length of symList is greater than 0, go to 6); otherwise go to 7);
6) Traverse symList and query Word_2; if it is present, wordSim(Word_1, Word_2) = 0.8, go to 9); otherwise go to 7);
7) Query whether Word_1 exists in dlmodel; if it does, go to 8); otherwise wordSim(Word_1, Word_2) = 0, go to 9);
8) Query whether Word_2 exists in dlmodel; if it does, compute the similarity dlmodelSym(Word_1, Word_2) of Word_1 and Word_2 with dlmodel and set wordSim(Word_1, Word_2) = dlmodelSym(Word_1, Word_2) × 0.6, go to 9); otherwise wordSim(Word_1, Word_2) = 0, go to 9);
9) Output the word similarity: wordSim(Word_1, Word_2).
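The three-tier lookup in steps 1) to 9) can be sketched as a single function. The data structures are assumptions chosen for illustration: dlSym and sym are modelled as dictionaries from a word to its set of synonyms, and dlmodel is modelled as a vocabulary set plus a similarity callback (`model_sim`), standing in for a trained word vector model.

```python
from typing import Callable, Dict, Set


def word_sim(
    word1: str,
    word2: str,
    dl_sym: Dict[str, Set[str]],              # geographic synonym thesaurus
    sym: Dict[str, Set[str]],                 # Synonym Forest entries coded "="
    vocab: Set[str],                          # words known to the geographic corpus
    model_sim: Callable[[str, str], float],   # similarity from the word vectors
) -> float:
    """Word similarity following steps 3) to 8) of the procedure."""
    # Steps 3)-4): subject synonyms give the full score of 1.
    dl_sym_list = dl_sym.get(word1, set())
    if dl_sym_list and word2 in dl_sym_list:
        return 1.0
    # Steps 5)-6): general synonyms give 0.8.
    sym_list = sym.get(word1, set())
    if sym_list and word2 in sym_list:
        return 0.8
    # Steps 7)-8): otherwise fall back to word vector similarity scaled by 0.6,
    # or 0 when either word is missing from the corpus.
    if word1 in vocab and word2 in vocab:
        return model_sim(word1, word2) * 0.6
    return 0.0


# Toy resources, purely illustrative.
toy_dl_sym = {"fragile": {"vulnerable"}}
toy_sym = {"fragile": {"weak"}}
toy_vocab = {"fragile", "cold"}
toy_model_sim = lambda a, b: 0.5
```

The descending constants 1, 0.8 and 0.6 encode how much each source is trusted: the hand-screened thesaurus most, the general Synonym Forest next, and raw vector similarity least. The steps as written do not special-case identical words, so that behaviour depends on what the thesaurus lists.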
(6) Statement similarity calculation based on word vectors
Drawing on an existing sentence similarity calculation method, the invention takes the word vectors provided in the geographic corpus, weights them through the keyword extraction and weight assignment algorithm to construct sentence vectors, and calculates the cosine similarity of the two sentence vectors to obtain the sentence similarity. The specific algorithm is as follows:
a. Input: sentences S_1 and S_2;
b. Initialization: the geographic corpus dlmodel;
c. Extract keywords from S_1 and S_2 and assign weights by the method of section (4), obtaining the sequences Seq_1 and Seq_2;
d. Traverse Seq_1 and Seq_2 and check whether each word exists in dlmodel; if it does, convert the word into a word vector with dlmodel (the word vector dimension is 192); otherwise represent the word as the zero vector;
e. Denote the word vector of the i-th word in a sequence Seq as wv_i and its weight as w_i, and calculate the sentence vectors V_1 and V_2 of Seq_1 and Seq_2 respectively by formula 5, where n denotes the number of words.
Figure BDA0002835257510000061
f. Using the cosine similarity algorithm, calculate the cosine similarity cosSim(V_1, V_2) of V_1 and V_2 by formula 6, and let the sentence similarity sentenceSim(S_1, S_2) of S_1 and S_2 equal cosSim(V_1, V_2);
Figure BDA0002835257510000062
g. Output the sentence similarity: sentenceSim(S_1, S_2).
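Steps d to g can be sketched as below. Formulas 5 and 6 are images that are not reproduced in this text, so the sketch assumes that formula 5 is the weighted sum of the word vectors (any constant normalization, such as dividing by n, would not change the cosine value) and uses the standard definition of cosine similarity for formula 6. The `vectors` dictionary stands in for dlmodel, and returning 0 when a sentence vector is all zeros is a choice made for this illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Keyword = Tuple[str, str, float]  # (word, part of speech, weight)


def sentence_vector(
    seq: List[Keyword], vectors: Dict[str, Sequence[float]], dim: int
) -> List[float]:
    """Weighted sum of word vectors (ASSUMED form of formula 5)."""
    v = [0.0] * dim
    for word, _pos, weight in seq:
        wv = vectors.get(word)
        if wv is None:
            continue  # step d: an unknown word counts as the zero vector
        for k in range(dim):
            v[k] += weight * wv[k]
    return v


def cos_sim(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Standard cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)


def sentence_sim(
    seq1: List[Keyword],
    seq2: List[Keyword],
    vectors: Dict[str, Sequence[float]],
    dim: int = 192,
) -> float:
    """sentenceSim(S_1, S_2) = cosSim(V_1, V_2)."""
    return cos_sim(
        sentence_vector(seq1, vectors, dim), sentence_vector(seq2, vectors, dim)
    )


# Two-dimensional toy vectors keep the arithmetic easy to follow.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
toy_seq1 = [("a", "n", 0.5), ("b", "v", 0.5)]
toy_seq2 = [("a", "n", 1.0)]
```

In the toy example V_1 = (0.5, 0.5) and V_2 = (1, 0), so the sentence similarity is 0.5 / sqrt(0.5), about 0.707.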
(7) Similarity scoring algorithm combining subject synonyms and word vectors
Constructing the geographic synonym thesaurus ensures the accuracy of the word similarity, but the accuracy of the sentence similarity is currently not high; in particular, a non-trivial similarity value can be obtained even for two unrelated sentences, which makes scoring inaccurate.
Figure BDA0002835257510000063
Figure BDA0002835257510000071
The comprehensive similarity is calculated from the word similarity and the sentence similarity between the examinee answer and the standard answer, and scoring is performed automatically. So as to match the way teachers mark, a scoring coefficient λ is set according to the comprehensive similarity. The correspondence between the similarity and the scoring coefficient λ is as follows:
Figure BDA0002835257510000072
The similarity scoring algorithm combining subject synonyms and word vectors proceeds as follows:
(1) Input: the standard answer text B with its corresponding score, denoted score, and the examinee answer text S;
(2) Extract keywords from B and S and assign weights by the method of section (4), obtaining the sequences
SeqB(bx, bt, bw) = {(bx_1, bt_1, bw_1), (bx_2, bt_2, bw_2), ..., (bx_n, bt_n, bw_n)},
SeqS(sx, st, sw) = {(sx_1, st_1, sw_1), (sx_2, st_2, sw_2), ..., (sx_n, st_n, sw_n)};
(3) Traverse SeqB(bx, bt, bw) and, for its i-th word, find in SeqS(sx, st, sw) the set of words whose part of speech is the same as bt_i, SWList(sx) = {sx_1, sx_2, ..., sx_n};
(4) Calculate the similarity between bx_i and each word in SWList(sx) in turn with the word similarity algorithm of section (5), and multiply the maximum of the results by the weight bw_i to obtain the weighted word similarity s_i; the word similarity of B and S is then wordSim(B, S) = Σ s_i;
(5) Calculate the sentence similarity sentenceSim(B, S) of B and S with the sentence similarity algorithm of section (6);
The correspondence between the sentence similarity credibility value α and the word similarity is as follows:
Figure BDA0002835257510000073
(6) With reference to the above correspondence, determine the sentence similarity credibility value α from wordSim(B, S), and calculate the comprehensive similarity similarity(B, S) of B and S by the following expression (3);
similarity(B,S)=wordSim(B,S)+sentenceSim(B,S)×α (3)
(7) According to the correspondence between the similarity and the scoring coefficient, determine the scoring coefficient λ from similarity(B, S), and calculate the examinee score studentScore by the following expression (4);
studentScore=score×λ (4)
(8) outputting the score of the examinee: studentScore.
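Steps (3) to (8) can be sketched as follows. The correspondence tables for α and λ are images that are not reproduced in this text, so `ALPHA_TABLE` and `LAMBDA_TABLE` below are hypothetical step tables used only to make the example run, and `lookup` is a helper introduced for this illustration. The word similarity function and the sentence similarity value are passed in, so the sketch does not depend on any particular implementation of sections (5) and (6).

```python
from typing import Callable, List, Sequence, Tuple

Keyword = Tuple[str, str, float]            # (word, part of speech, weight)
StepTable = Sequence[Tuple[float, float]]   # (lower bound, value), highest bound first

# HYPOTHETICAL tables: the patent's actual alpha and lambda tables are not in the text.
ALPHA_TABLE: StepTable = [(0.8, 0.2), (0.5, 0.1)]
LAMBDA_TABLE: StepTable = [(0.9, 1.0), (0.7, 0.8), (0.5, 0.6), (0.3, 0.3)]


def lookup(value: float, table: StepTable, default: float = 0.0) -> float:
    """Return the value of the first row whose lower bound is reached."""
    for lower, result in table:
        if value >= lower:
            return result
    return default


def word_sim_total(
    seq_b: List[Keyword], seq_s: List[Keyword], word_sim: Callable[[str, str], float]
) -> float:
    """Steps (3)-(4): wordSim(B, S) as the sum of weighted best matches."""
    total = 0.0
    for bx, bt, bw in seq_b:
        sw_list = [sx for sx, st, _sw in seq_s if st == bt]  # same part of speech
        if sw_list:
            total += max(word_sim(bx, sx) for sx in sw_list) * bw
    return total


def student_score(
    score: float,
    seq_b: List[Keyword],
    seq_s: List[Keyword],
    word_sim: Callable[[str, str], float],
    sentence_sim: float,
    alpha_table: StepTable = ALPHA_TABLE,
    lambda_table: StepTable = LAMBDA_TABLE,
) -> float:
    w = word_sim_total(seq_b, seq_s, word_sim)
    alpha = lookup(w, alpha_table)              # step (6): credibility value
    similarity = w + sentence_sim * alpha       # expression (3)
    lam = lookup(similarity, lambda_table)      # step (7): scoring coefficient
    return score * lam                          # expression (4)


# Toy data, purely illustrative.
toy_b = [("fragile", "a", 0.4), ("ecology", "n", 0.6)]
toy_s = [("weak", "a", 0.5), ("ecology", "n", 0.5)]
toy_word_sim = lambda x, y: 1.0 if x == y else 0.8
```

Because α is chosen from the word similarity, the sentence similarity contributes only when the keywords already match reasonably well, which limits the effect of the spurious similarity that unrelated sentences can receive.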

Claims (5)

1. The similarity scoring algorithm combining the subject synonyms and the word vectors is characterized by comprising the following steps of:
step one, extracting keywords based on part of speech and assigning weights;
step two, calculating word similarity based on subject synonyms;
step three, calculating sentence similarity based on word vectors;
step four, scoring with the similarity scoring algorithm combining subject synonyms and word vectors.
2. The similarity scoring algorithm combining subject synonyms and word vectors as claimed in claim 1, wherein the part-of-speech-based keyword extraction and weight assignment in step one is as follows:
Figure FDA0002835257500000011
the keyword extraction algorithm is as follows:
a. Input: the target text S;
b. Based on the geographic dictionary, segment the text S and tag parts of speech with LTP, then, with reference to the weight assignment, extract the keywords by part of speech to obtain the sequence Seq(x, t) = {(x_1, t_1), (x_2, t_2), (x_3, t_3), ..., (x_n, t_n)}, where x denotes a word, t denotes the part of speech tagged for the word x, and n denotes the number of words;
c. Traverse Seq(x, t) and, with reference to the weight assignment, count the number of class A words a, the number of class B words b and the number of class C words c;
d. Calculate the class A part-of-speech weight w_a, the class B part-of-speech weight w_b and the class C part-of-speech weight w_c;
e. Traverse Seq(x, t), determine the class to which each part of speech t belongs with reference to the weight assignment, and give the word x the corresponding weight w, obtaining the sequence Seq(x, t, w) = {(x_1, t_1, w_1), (x_2, t_2, w_2), (x_3, t_3, w_3), ..., (x_n, t_n, w_n)};
f. Output: Seq(x, t, w).
3. The similarity scoring algorithm combining subject synonyms and word vectors as claimed in claim 1, wherein the word similarity based on subject synonyms in step two is calculated as follows:
1) Input: Word_1 and Word_2;
2) Initialization: the geographic synonym thesaurus dlSym, the list sym of entries coded "=" in the Synonym Forest, and the geographic corpus dlmodel;
3) Traverse dlSym and query the synonym list dlSymList of Word_1; if the length of dlSymList is greater than 0, go to 4); otherwise go to 5);
4) Traverse dlSymList and query Word_2; if it is present, wordSim(Word_1, Word_2) = 1, go to 9); otherwise go to 5);
5) Traverse sym and query the synonym list symList of Word_1; if the length of symList is greater than 0, go to 6); otherwise go to 7);
6) Traverse symList and query Word_2; if it is present, wordSim(Word_1, Word_2) = 0.8, go to 9); otherwise go to 7);
7) Query whether Word_1 exists in dlmodel; if it does, go to 8); otherwise wordSim(Word_1, Word_2) = 0, go to 9);
8) Query whether Word_2 exists in dlmodel; if it does, compute the similarity dlmodelSym(Word_1, Word_2) of Word_1 and Word_2 with dlmodel and set wordSim(Word_1, Word_2) = dlmodelSym(Word_1, Word_2) × 0.6, go to 9); otherwise wordSim(Word_1, Word_2) = 0, go to 9);
9) Output the word similarity: wordSim(Word_1, Word_2).
4. The similarity scoring algorithm combining subject synonyms and word vectors as claimed in claim 1, wherein the sentence similarity based on word vectors in step three is calculated as follows:
a. Input: sentences S_1 and S_2;
b. Initialization: the geographic corpus dlmodel;
c. Extract keywords from S_1 and S_2 and assign weights by the method of step one, obtaining the sequences Seq_1 and Seq_2;
d. Traverse Seq_1 and Seq_2 and check whether each word exists in dlmodel; if it does, convert the word into a word vector with dlmodel (the word vector dimension is 192); otherwise represent the word as the zero vector;
e. Denote the word vector of the i-th word in a sequence Seq as wv_i and its weight as w_i, and calculate the sentence vectors V_1 and V_2 of Seq_1 and Seq_2 respectively by formula 1;
Figure FDA0002835257500000021
f. Using the cosine similarity algorithm, calculate the cosine similarity cosSim(V_1, V_2) of V_1 and V_2 by formula 2, and let the sentence similarity sentenceSim(S_1, S_2) of S_1 and S_2 equal cosSim(V_1, V_2);
Figure FDA0002835257500000031
g. Output the sentence similarity: sentenceSim(S_1, S_2).
5. The similarity scoring algorithm combining subject synonyms and word vectors as claimed in claim 1, wherein the similarity scoring in step four is performed as follows:
(1) Input: the standard answer text B with its corresponding score, denoted score, and the examinee answer text S;
(2) Extract keywords from B and S and assign weights by the method of step one, obtaining the sequences
SeqB(bx, bt, bw) = {(bx_1, bt_1, bw_1), (bx_2, bt_2, bw_2), ..., (bx_n, bt_n, bw_n)},
SeqS(sx, st, sw) = {(sx_1, st_1, sw_1), (sx_2, st_2, sw_2), ..., (sx_n, st_n, sw_n)};
(3) Traverse SeqB(bx, bt, bw) and, for its i-th word, find in SeqS(sx, st, sw) the set of words whose part of speech is the same as bt_i, SWList(sx) = {sx_1, sx_2, ..., sx_n};
(4) Calculate the similarity between bx_i and each word in SWList(sx) in turn with the algorithm of step two, and multiply the maximum of the results by the weight bw_i to obtain the weighted word similarity s_i; the word similarity of B and S is then wordSim(B, S) = Σ s_i;
(5) Calculate the sentence similarity sentenceSim(B, S) of B and S with the sentence similarity algorithm of step three;
The correspondence between the sentence similarity credibility value α and the word similarity is as follows:
Figure FDA0002835257500000032
(6) With reference to the above correspondence, determine the sentence similarity credibility value α from wordSim(B, S), and calculate the comprehensive similarity similarity(B, S) of B and S by expression 3;
similarity(B,S)=wordSim(B,S)+sentenceSim(B,S)×α (3)
the correspondence between the similarity and the score coefficient lambda is as follows:
Figure FDA0002835257500000041
(7) According to the correspondence between the similarity and the scoring coefficient, determine the scoring coefficient λ from similarity(B, S), and calculate the examinee score studentScore by expression 4;
studentScore=score×λ (4)
(8) outputting the score of the examinee: studentScore.
CN202011475757.8A 2020-12-15 2020-12-15 Similarity scoring algorithm combining subject synonyms and word vectors Pending CN112632970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011475757.8A CN112632970A (en) 2020-12-15 2020-12-15 Similarity scoring algorithm combining subject synonyms and word vectors

Publications (1)

Publication Number Publication Date
CN112632970A true CN112632970A (en) 2021-04-09

Family

ID=75313262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011475757.8A Pending CN112632970A (en) 2020-12-15 2020-12-15 Similarity scoring algorithm combining subject synonyms and word vectors

Country Status (1)

Country Link
CN (1) CN112632970A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960786A (en) * 2019-03-27 2019-07-02 北京信息科技大学 Chinese Measurement of word similarity based on convergence strategy
CN110175585A (en) * 2019-05-30 2019-08-27 北京林业大学 It is a kind of letter answer correct system and method automatically
CN110442760A (en) * 2019-07-24 2019-11-12 银江股份有限公司 A kind of the synonym method for digging and device of question and answer searching system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
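Zhang Junfei: "Calculating Chinese sentence similarity with improved TF-IDF combined with the cosine theorem", Modern Computer (Professional Edition), no. 32, 15 November 2017 (2017-11-15) *
Zhang Junsheng; Shi Chongde; Xu Hongjiao; Gao Yingfan; He Yanqing: "An automatic marking method for subjective questions based on short text similarity calculation", Library and Information Service, no. 19, 5 October 2014 (2014-10-05) *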

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113934814A (en) * 2021-08-01 2022-01-14 北京工业大学 Automatic scoring method for subjective questions of ancient poetry
CN113779196A (en) * 2021-09-07 2021-12-10 大连大学 Customs synonym recognition method fusing multi-level information
CN113779196B (en) * 2021-09-07 2024-02-13 大连大学 Customs synonym identification method integrating multi-level information

Similar Documents

Publication Publication Date Title
CN108399163B (en) Text similarity measurement method combining word aggregation and word combination semantic features
Berant et al. Semantic parsing via paraphrasing
CN104137102B (en) Non- true type inquiry response system and method
CN104794169B (en) A kind of subject terminology extraction method and system based on sequence labelling model
CN107967318A (en) A kind of Chinese short text subjective item automatic scoring method and system using LSTM neutral nets
Gao et al. Automated pyramid summarization evaluation
Üstün et al. Characters or morphemes: How to represent words?
CN111914532A (en) Chinese composition scoring method
Fu et al. Learning semantic hierarchies: A continuous vector space approach
CN112632970A (en) Similarity scoring algorithm combining subject synonyms and word vectors
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN111858896A (en) Knowledge base question-answering method based on deep learning
Lagakis et al. Automated essay scoring: A review of the field
Xu et al. Implicitly incorporating morphological information into word embedding
CN115510863A (en) Question matching task oriented data enhancement method
Chang et al. Automated Chinese essay scoring based on multilevel linguistic features
CN113934814A (en) Automatic scoring method for subjective questions of ancient poetry
Hao et al. SCESS: a WFSA-based automated simplified chinese essay scoring system with incremental latent semantic analysis
CN111079582A (en) Image recognition English composition running question judgment method
Lahbari et al. A rule-based method for Arabic question classification
Torres et al. Using machine learning methods to avoid the pitfall of cognates and false friends in Spanish-Portuguese word pairs
CN114462389A (en) Automatic test paper subjective question scoring method
Lin et al. Design and implementation of intelligent scoring system for handwritten short answer based on deep learning
CN112085985A (en) Automatic student answer scoring method for English examination translation questions
Gillard et al. Question Answering Evaluation Survey.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination