CN111241239B - Method for detecting repeated questions, related device and readable storage medium - Google Patents

Method for detecting repeated questions, related device and readable storage medium

Info

Publication number
CN111241239B
Authority
CN
China
Prior art keywords
topic
pair
question
data
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010013765.4A
Other languages
Chinese (zh)
Other versions
CN111241239A (en
Inventor
李旭浩
沙晶
付瑞吉
王士进
魏思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202010013765.4A priority Critical patent/CN111241239B/en
Publication of CN111241239A publication Critical patent/CN111241239A/en
Application granted granted Critical
Publication of CN111241239B publication Critical patent/CN111241239B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method for detecting repeated questions, a related device, and a readable storage medium. After a question pair to be subjected to repeated question detection is obtained, a word similarity result, a semantic similarity result, and an answer distribution similarity result of the two questions in the question pair are determined based on multiple kinds of detection data for each question, namely question surface data, analysis data, and student response data. Whether the two questions in the question pair are repeated questions is then determined based on these three similarity results. Because the question pair is examined from multiple angles rather than from a single angle, detection accuracy can be improved, which serves the purpose of avoiding, as far as possible, recommending repeated questions to the user.

Description

Method for detecting repeated questions, related device and readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method for detecting repeated questions, a related device, and a readable storage medium.
Background
With the development of internet technology, online education is appearing more and more frequently in people's field of view. A successful online education system cannot be separated from well-constructed educational resources. In the existing mode, constructing the educational resource library requires a large investment of manpower, and the knowledge point coverage of the library is increased by continuously adding and processing new questions.
At present, an online education system recommends questions from the resource library to a user based on a personalized recommendation algorithm. However, as the resource library expands, the number of questions reaches the millions or even tens of millions. Because each question is processed at a different time, in a different place, and by a different person, repeated questions inevitably appear in the resource library. This directly results in repeated questions being recommended to the user and in a poor user experience.
Therefore, it is desirable to provide a method for detecting duplicate topics to avoid recommending duplicate topics to the user as much as possible.
Disclosure of Invention
In view of the foregoing, the present application provides a method for detecting repeated questions, a related device, and a readable storage medium. The specific scheme is as follows:
a method for detecting repeated questions, comprising:
obtaining a question pair to be subjected to repeated question detection;
determining detection data of each question in the question pair, wherein the detection data comprises question surface data, analysis data and student answering data;
determining a word similarity result, a semantic similarity result and an answer distribution similarity result of two questions in the question pair according to the detection data of each question in the question pair;
and determining whether the two questions in the question pair are repeated questions or not based on the word similarity result, the semantic similarity result and the answer distribution similarity result.
Optionally, the determining detection data of each topic in the topic pair includes:
acquiring original data of each question in the question pair;
determining detection data of each topic in the topic pair based on the original data of each topic in the topic pair;
wherein, the determining the detection data of each topic in the topic pair based on the original data of each topic in the topic pair comprises:
taking the original data of each topic in the topic pair as the detection data of each topic in the topic pair;
or, alternatively,
and carrying out standardization processing on the original data of each topic in the topic pair, wherein the processed data is used as the detection data of each topic in the topic pair.
Optionally, determining a result of word similarity between two topics in the topic pair according to the detection data of each topic in the topic pair, including:
performing word segmentation processing on the topic data in the detection data of each topic in the topic pair to obtain a topic word segmentation result of each topic in the topic pair;
and obtaining a word similarity result of the two topics in the topic pair based on the topic segmentation result of the two topics in the topic pair.
Optionally, determining a semantic similarity result of two topics in the topic pair according to the detection data of each topic in the topic pair, including:
performing semantic word segmentation on a combination of topic data and analytic data in the detection data of each topic in the topic pair to obtain a first semantic word segmentation result of each topic in the topic pair;
and obtaining a semantic similarity result of the two questions in the question pair based on the first semantic word segmentation result of the two questions in the question pair.
Optionally, determining a result of similarity of response distribution of two topics in the topic pair according to the detection data of each topic in the topic pair, including:
performing semantic word segmentation on student response data in the detection data of each question in the question pair to obtain a second semantic word segmentation result of each question in the question pair;
and obtaining answer distribution similarity results of the two questions in the question pair based on the question face segmentation results and the second semantic segmentation results of the two questions in the question pair.
Optionally, the obtaining a result of similarity of response distribution of the two topics in the topic pair based on the topic segmentation result and the second semantic segmentation result of the two topics in the topic pair includes:
obtaining a wrong answer distribution similarity result and a correct answer distribution similarity result of each question in the question pair based on the question face segmentation result and the second semantic segmentation result of each question in the question pair;
and taking the wrong answer distribution similarity result and the correct answer distribution similarity result of the two questions in the question pair as answer distribution similarity results of the two questions in the question pair.
Optionally, the determining whether two topics in the topic pair are repeated questions based on the word similarity result, the semantic similarity result, and the answer distribution similarity result includes:
inputting the word similarity result, the semantic similarity result and the answer distribution similarity result into a classification model, and determining whether two questions in the question pair are repeated questions;
the classification model is obtained by taking a word similarity result, a semantic similarity result and a response distribution similarity result of each training topic in the training topic pair as training samples and taking labeling information for identifying whether two topics in the training topic pair are repeated topics as sample labels for training.
An apparatus for detecting repeated questions, the apparatus comprising:
the acquisition unit is used for acquiring a question pair to be subjected to repeated question detection;
the detection data determining unit is used for determining detection data of each question in the question pair, and the detection data comprises question surface data, analysis data and student response data;
the similarity determining unit is used for determining a word similarity result, a semantic similarity result and a response distribution similarity result of two questions in the question pair according to the detection data of each question in the question pair;
and the repeated question determining unit is used for determining whether the two questions in the question pair are repeated questions or not based on the word similarity result, the semantic similarity result and the answer distribution similarity result.
Optionally, the obtaining unit includes:
the original data acquisition unit is used for acquiring the original data of each topic in the topic pair;
the detection data determining unit is used for determining the detection data of each topic in the topic pair based on the original data of each topic in the topic pair;
wherein the detection data determining unit includes:
a first detection data determining unit, configured to use original data of each topic in the topic pair as detection data of each topic in the topic pair;
or, alternatively,
and the second detection data determining unit is used for carrying out standardization processing on the original data of each topic in the topic pair, and the processed data is used as the detection data of each topic in the topic pair.
Optionally, the similarity determining unit includes:
the topic segmentation unit is used for performing segmentation processing on topic data in the detection data of each topic in the topic pair to obtain a topic segmentation result of each topic in the topic pair;
and the word similarity result determining unit is used for obtaining a word similarity result of the two topics in the topic pair based on the topic segmentation result of the two topics in the topic pair.
Optionally, the similarity determining unit includes:
the first semantic word segmentation unit is used for performing semantic word segmentation on a combination of the topic data and the analysis data in the detection data of each topic in the topic pair to obtain a first semantic word segmentation result of each topic in the topic pair;
and the semantic similarity result determining unit is used for obtaining a semantic similarity result of the two topics in the topic pair based on the first semantic word segmentation result of the two topics in the topic pair.
Optionally, the similarity determining unit includes:
the second semantic word segmentation unit is used for performing semantic word segmentation on student response data in the detection data of each question in the question pair to obtain a second semantic word segmentation result of each question in the question pair;
and the answer distribution similarity determining unit is used for obtaining answer distribution similarity results of the two questions in the question pair based on the question surface word segmentation results and the second semantic word segmentation results of the two questions in the question pair.
Optionally, the answer distribution similarity determining unit is specifically configured to:
obtaining a wrong answer distribution similarity result and a correct answer distribution similarity result of each question in the question pair based on the question face segmentation result and the second semantic segmentation result of each question in the question pair;
and taking the wrong answer distribution similarity result and the correct answer distribution similarity result of the two questions in the question pair as answer distribution similarity results of the two questions in the question pair.
Optionally, the repeated question determining unit is specifically configured to:
inputting the word similarity result, the semantic similarity result and the answer distribution similarity result into a classification model, and determining whether two questions in the question pair are repeated questions;
the classification model is obtained by taking a word similarity result, a semantic similarity result and a response distribution similarity result of each training topic in the training topic pair as training samples and taking labeling information for identifying whether two topics in the training topic pair are repeated topics as sample labels for training.
A repeated question detection system comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the method for detecting repeated questions described above.
A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method for detecting repeated questions described above.
By means of the above technical scheme, the application discloses a method for detecting repeated questions, a related device, and a readable storage medium. After a question pair to be subjected to repeated question detection is obtained, a word similarity result, a semantic similarity result, and an answer distribution similarity result of the two questions in the question pair are determined based on multiple kinds of detection data for each question, namely question surface data, analysis data, and student response data. Whether the two questions in the question pair are repeated questions is then determined based on these three similarity results. Because the question pair is examined from multiple angles rather than from a single angle, detection accuracy can be improved, which serves the purpose of avoiding, as far as possible, recommending repeated questions to the user.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic flowchart of a method for detecting repeated questions disclosed in an embodiment of the present application;
FIG. 2 is a schematic diagram of a repeated question detection model disclosed in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an apparatus for detecting repeated questions disclosed in an embodiment of the present application;
FIG. 4 is a block diagram of the hardware structure of a system for detecting repeated questions disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Next, the method for detecting repeated questions provided by the present application is described through the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for detecting a repeated topic disclosed in an embodiment of the present application, where the method includes:
S101: Obtain a question pair to be subjected to repeated question detection.
In the present application, the question pair to be subjected to repeated question detection may be any question pair in the educational resource library of an online education system. The questions may be of various types, such as fill-in-the-blank, multiple-choice, and free-response questions, and may belong to various subjects, such as mathematics, physics, and chemistry. One or more question pairs may be subjected to repeated question detection.
S102: Determine detection data of each question in the question pair.
In the prior art, repeated question detection uses only the question surface data of each question in the pair as the detection data, so it detects only repeated questions that are obvious from the question surface, that is, those a person can recognize at first glance. Intuitively, questions whose surfaces are identical or mostly identical have a strong visual impact, and the first impression they give is that they are very likely repeated questions. In fact, front-line feedback data shows that this type of repeated question accounts for a large share and is very likely to draw user complaints.
However, a question generally comprises three parts: the question surface, the analysis, and the student responses. In some cases the surfaces of two questions are not similar, yet the knowledge points and answering points they examine are highly similar. The two questions should then be judged to be repeated questions, but a judgment based only on question surface data concludes that they are not.
Therefore, in the present application, the detection data of each topic in the topic pair includes topic surface data, analysis data, and student response data of the topic. Wherein the student response data may include a plurality of different response data.
S103: Determine a word similarity result, a semantic similarity result, and an answer distribution similarity result of the two questions in the question pair according to the detection data of each question in the question pair.
In the application, after the topic data, the analytic data and the student response data of each topic in the topic pair are determined, the word similarity result, the semantic similarity result and the response distribution similarity result of two topics in the topic pair can be determined according to the topic data, the analytic data and the student response data of each topic in the topic pair.
In the method, the topic data, the analytic data and the student response data of each topic in the topic pair can be divided into different groups according to the requirements of three different levels, namely the word level, the semantic level and the response content level, and the similarity result of the corresponding level is obtained based on the detection data in the different groups.
In the present application, the question surface data of each question in the pair may form one group, the question surface data and analysis data of each question may form a second group, and the question surface data and student response data of each question may form a third group. Based on this division, the word similarity result of the two questions can be determined from the question surface data of each question, the semantic similarity result from the question surface data and analysis data of each question, and the answer distribution similarity result from the question surface data and student response data of each question. Specific implementations of determining each similarity result are described in detail in the following embodiments.
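The grouping described above can be sketched as a small data structure. This is only an illustrative sketch: the class name `DetectionData`, its field names, and the helper `group_detection_data` are assumptions made for this example and do not appear in the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DetectionData:
    """Detection data for one question (all names here are illustrative)."""
    surface: str                                         # question surface data
    analysis: str                                        # analysis data
    responses: List[str] = field(default_factory=list)   # student response data


def group_detection_data(q: DetectionData) -> Dict[str, dict]:
    """Split one question's detection data into the three level-specific groups."""
    return {
        # Word level: question surface only.
        "word": {"surface": q.surface},
        # Semantic level: question surface combined with the analysis.
        "semantic": {"surface": q.surface, "analysis": q.analysis},
        # Response level: question surface combined with student responses.
        "response": {"surface": q.surface, "responses": list(q.responses)},
    }


question = DetectionData(
    surface="Solve x^2 = 4.",
    analysis="x = 2 or x = -2.",
    responses=["2", "-2", "4"],
)
groups = group_detection_data(question)
```

Each group would then be passed to the detection step for its level, so that the three similarity results are computed from different views of the same question.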
S104: Determine whether the two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result, and the answer distribution similarity result.
In this application, after the word similarity result, the semantic similarity result, and the answer distribution similarity result of the two questions in the question pair are determined according to the question surface data, analysis data, and student response data of each question, whether the two questions are repeated questions can be determined based on these three results.
It should be noted that there may be various specific implementations for determining, based on the word similarity result, the semantic similarity result, and the answer distribution similarity result, whether the two questions in the question pair are repeated questions. In one implementation, a weight may be determined for each of the three similarity results, and whether the two questions are repeated questions is determined from the three results together with their weights. In another implementation, different neural network models can be used to determine whether the two questions are repeated questions based on the three similarity results; specific implementations are described in detail in the following embodiments.
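The weighted implementation can be illustrated with a minimal sketch. The weight values, the decision threshold, and the function names below are assumptions chosen for the example; the application does not specify them, nor does it state that the weighted score is compared against a fixed threshold.

```python
def weighted_score(word_sim, semantic_sim, response_sim, weights=(0.4, 0.4, 0.2)):
    """Combine three similarity results (each assumed to lie in [0, 1]).

    The weights are hypothetical; they are normalized so the score stays in [0, 1].
    """
    w_word, w_sem, w_resp = weights
    total = w_word + w_sem + w_resp
    return (w_word * word_sim + w_sem * semantic_sim + w_resp * response_sim) / total


def is_repeated(word_sim, semantic_sim, response_sim,
                weights=(0.4, 0.4, 0.2), threshold=0.8):
    """Judge a pair as repeated when the weighted score reaches the threshold."""
    return weighted_score(word_sim, semantic_sim, response_sim, weights) >= threshold


# Surfaces differ, but semantics and responses are close: not flagged at 0.8.
print(is_repeated(0.2, 0.9, 0.9))
# All three results are high: flagged.
print(is_repeated(0.9, 0.95, 0.8))
```

In practice the weights and threshold would have to be tuned on labeled question pairs, which is one motivation for the model-based alternative mentioned above.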
The embodiment discloses a method for detecting repeated questions, which comprises the steps of obtaining a question pair to be subjected to repeated question detection, and determining a word similarity result, a semantic similarity result and a response distribution similarity result of two questions in the question pair based on multiple detection data of each question in the question pair, namely, question data, analysis data and student response data; whether the two questions in the question pair are the duplicate questions or not is determined based on the determined word similarity result, the semantic similarity result and the answer distribution similarity result, whether the two questions in the question pair are the duplicate questions or not can be detected from multiple angles, and compared with the method for detecting whether the two questions in the question pair are the duplicate questions or not from a single angle, the accuracy for detecting whether the two questions in the question pair are the duplicate questions or not can be improved, and the purpose of avoiding recommending the duplicate questions for the user as much as possible is achieved.
In this application, a specific implementation manner of determining detection data of each topic in the topic pair is also disclosed, and the method may include the following steps:
S201: Acquire the original data of each question in the question pair.
In the present application, the original data of each topic in the topic pair includes original topic data, original analytic data, and original student response data of each topic in the topic pair.
S202: Determine the detection data of each question in the question pair based on the original data of each question in the question pair.
As an implementation manner, the original data of each topic in the topic pair can be used as the detection data of each topic in the topic pair in the present application.
However, the original data of each topic in the topic pair often has various problems, for example, the formula representation is not unique, picture data which cannot be identified exists, and the like, and these problems will directly affect the accuracy of subsequent repeated topic detection.
In order to solve the above problem, as another possible embodiment, the original data of each topic in the topic pair may be normalized, and the processed data may be used as the detection data of each topic in the topic pair. The standardization processing comprises picture identification processing, formula regularization processing and the like.
For the convenience of understanding, the mathematical topic is taken as an example in the application, and a specific implementation manner of performing the normalization processing on the raw data of the mathematical topic is given as follows:
First, whether picture-formula data exists is judged from the HTML data of each module in the original data of the mathematical question. If it does, an OCR (optical character recognition) system is used to recognize and extract the formulas in the pictures, and each picture is replaced with the corresponding LaTeX formula, so as to ensure the completeness of the formula data.
Then, the formula data in the original data of the mathematical question is regularized: all symbols in the formula data are converted into LaTeX symbol representations, synonymous LaTeX representations are normalized, LaTeX representations are simplified, and all special-effect representations, such as bold, color, and arrow representations, are removed.
In this way, standardized question surface data is obtained by processing the original question surface data of the mathematical question, standardized analysis data is obtained by processing the original analysis data, and standardized student response data is obtained by processing the original student response data.
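The formula regularization step can be sketched with regular expressions. This is a simplified illustration rather than the procedure used in the application: the lookup tables `UNICODE_TO_LATEX` and `SYNONYMS` are hypothetical and tiny, only bold and color effects are stripped, and nested braces inside a styled argument are not handled.

```python
import re

# Hypothetical lookup tables; a real system would need far larger ones.
UNICODE_TO_LATEX = {"≤": r"\leq", "≥": r"\geq", "≠": r"\neq", "×": r"\times"}
SYNONYMS = {r"\le": r"\leq", r"\ge": r"\geq", r"\ne": r"\neq",
            r"\dfrac": r"\frac", r"\tfrac": r"\frac"}


def normalize_formula(latex: str) -> str:
    """Return a regularized copy of a LaTeX formula string."""
    # 1. Convert plain Unicode symbols into LaTeX commands.
    for char, command in UNICODE_TO_LATEX.items():
        latex = latex.replace(char, f" {command} ")
    # 2. Strip color effects: \textcolor{red}{x} -> x, \color{red} -> "".
    latex = re.sub(r"\\textcolor\{[^{}]*\}\{([^{}]*)\}", r"\1", latex)
    latex = re.sub(r"\\color\{[^{}]*\}", "", latex)
    # 3. Strip bold effects: \mathbf{x}, \boldsymbol{x}, \textbf{x} -> x.
    latex = re.sub(r"\\(?:mathbf|boldsymbol|textbf)\{([^{}]*)\}", r"\1", latex)
    # 4. Map synonymous commands onto one canonical command.
    latex = re.sub(r"\\[A-Za-z]+",
                   lambda m: SYNONYMS.get(m.group(0), m.group(0)), latex)
    # 5. Collapse whitespace so equivalent formulas compare equal as strings.
    return re.sub(r"\s+", " ", latex).strip()


print(normalize_formula(r"\mathbf{a} \le \dfrac{1}{2}"))
print(normalize_formula(r"\textcolor{red}{x}≥0"))
```

After this kind of processing, two formulas that differ only in styling or in the choice between synonymous commands become identical strings, which is what the later similarity steps rely on.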
It should be noted that, in the present application, the detection data of each topic in the topic pair may be input into the similarity detection model to obtain the word similarity result, the semantic similarity result, and the answer distribution similarity result of the two topics in the topic pair. The similarity detection model comprises a word segmentation system, a semantic word segmentation system, a word level detection system, a semantic level detection system and a student answer detection system.
Based on the similarity detection model, the implementation manner of determining the word similarity result, the semantic similarity result and the answer distribution similarity result of two topics in the topic pair according to the detection data of each topic in the topic pair is disclosed in the application, and the detailed description is specifically provided through the following contents.
In this application, an implementation manner for determining a result of similarity between words of two topics in a topic pair according to detection data of each topic in the topic pair is disclosed, and the implementation manner may include:
S301: Perform word segmentation on the question surface data in the detection data of each question in the question pair to obtain a question surface word segmentation result of each question in the question pair.
In the application, the topic data in the detection data of each topic in the topic pair can be input into the word segmentation system, so that a topic word segmentation result of each topic in the topic pair is obtained.
It should be noted that the word segmentation system can be an existing Chinese word segmentation system such as jieba, SnowNLP, or Stanford CoreNLP, or an existing English word segmentation system such as NLTK, spaCy, or Stanford CoreNLP. However, questions that contain Chinese, English letters, English symbols, special mathematical symbols, and the like include many character types, so an existing Chinese or English word segmentation system cannot be applied to them directly.
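To illustrate why mixed content needs special handling, the following sketch tokenizes a string that mixes Chinese characters, Latin letters, numbers, LaTeX commands, and symbols. It is an assumption made for illustration, not the segmenter used in the application: the pattern and the name `tokenize_mixed` are hypothetical, and Chinese text is split into single characters instead of being segmented into words as a dedicated Chinese segmenter would do.

```python
import re

# Alternatives are tried left to right, so LaTeX commands win over bare letters.
TOKEN_PATTERN = re.compile(
    r"\\[A-Za-z]+"        # LaTeX command such as \frac
    r"|[A-Za-z]+"         # run of Latin letters
    r"|\d+(?:\.\d+)?"     # integer or decimal number
    r"|[\u4e00-\u9fff]"   # a single CJK character
    r"|\S"                # any other non-space symbol
)


def tokenize_mixed(text: str) -> list:
    """Split mixed Chinese / Latin / LaTeX text into tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize_mixed(r"求 \frac{1}{2}x + 3.5 的值"))
```

A production system would more plausibly route the Chinese spans to a Chinese segmenter and keep rules like these only for formulas and symbols.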
S302: Obtain a word similarity result of the two questions in the question pair based on the question surface word segmentation results of the two questions in the question pair.
In the method, the topic segmentation results of the two topics in the topic pair can be input into the word level detection system, and the word similarity result of the two topics in the topic pair is obtained.
The word level detection system can construct vectors corresponding to the question surface word segmentation results of the two questions in the question pair, and calculate the similarity of these vectors as the word similarity result of the two questions. The similarity of the two vectors can be calculated in various ways. For example, the cosine similarity of the two vectors can be calculated, or the distribution and proportion of n-grams in the two vectors can be counted and the similarity obtained by comparing the differences in their n-grams.
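Both similarity options mentioned above can be sketched in a few lines. The sketch assumes bag-of-words count vectors for the cosine similarity and a multiset Jaccard ratio for the n-gram comparison; the application does not state how the vectors are built or how n-gram differences are scored, so both choices are illustrative.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of the token-count vectors of two token lists."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ngrams(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(tokens_a, tokens_b, n=2):
    """Shared n-grams divided by all n-grams (multiset Jaccard ratio)."""
    ga, gb = ngrams(tokens_a, n), ngrams(tokens_b, n)
    total = sum((ga | gb).values())
    if total == 0:
        return 0.0
    return sum((ga & gb).values()) / total


a = ["solve", "x", "+", "1", "=", "3"]
b = ["solve", "x", "+", "2", "=", "3"]
print(cosine_similarity(a, b), ngram_overlap(a, b))
```

Cosine similarity ignores token order, while the n-gram ratio is sensitive to it, so the two measures can disagree on questions that reuse the same words in a different arrangement.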
In this application, a specific implementation manner for determining a semantic similarity result of two topics in a topic pair according to detection data of each topic in the topic pair is also disclosed, and the implementation manner may include the following steps:
S401: Performing semantic word segmentation on the combination of the topic data and the analysis data in the detection data of each topic in the topic pair to obtain a first semantic word segmentation result of each topic in the topic pair.
The analyses of repeated questions have extremely similar formats and methods; however, analyses are varied and difficult to compare with one another, so the analysis data cannot be used on its own. Therefore, in the application, the combination of the topic data and the analysis data in the detection data of each topic in the topic pair can be input into the semantic word segmentation system to obtain the first semantic word segmentation result of each topic in the topic pair.
It should be noted that the semantic word segmentation system is an improvement of the word segmentation system described in S301. Compared with the word segmentation system, the semantic word segmentation system additionally abstracts and normalizes the word segmentation result of the word segmentation system, and can eliminate word segmentation ambiguity to a certain extent. For example, "delta" and "triangle", which are distinct tokens in the word segmentation system, are both expressed as "triangle" in the semantic word segmentation system.
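The normalization step described above can be pictured as a lookup that maps surface variants to one canonical token. The sketch below is an assumption-laden illustration: the table `CANONICAL` and the function `normalize_tokens` are hypothetical, and only the "delta"/"triangle" pairing comes from the description; the other table entries are added for illustration.

```python
# Hypothetical synonym table; only the delta -> triangle pairing is taken
# from the description, the remaining entries are illustrative assumptions.
CANONICAL = {
    "delta": "triangle",
    "△": "triangle",
    "三角形": "triangle",
}


def normalize_tokens(tokens):
    """Replace each token by its canonical form when one is known."""
    return [CANONICAL.get(tok.lower(), tok) for tok in tokens]


if __name__ == "__main__":
    print(normalize_tokens(["Delta", "ABC"]))
    print(normalize_tokens(["triangle", "ABC"]))
```

After normalization, the two token lists above are identical, so downstream comparison no longer depends on which surface form the topic used.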
S402: and obtaining a semantic similarity result of the two questions in the question pair based on the first semantic segmentation result of the two questions in the question pair.
In the method, the first semantic word segmentation result of the two topics in the topic pair can be input into the semantic level detection system to obtain the semantic similarity result of the two topics in the topic pair.
In the application, the semantic level detection system can construct the coding information corresponding to the first semantic segmentation result of the two topics in the topic pair, and map the coding information corresponding to the first semantic segmentation result of the two topics to obtain the semantic similarity result of the two topics in the topic pair.
The coding information corresponding to the first semantic segmentation results of the two topics in the topic pair can be constructed as follows: first, initial low-dimensional vectors corresponding to the first semantic word segmentation results of the two topics in the topic pair are obtained by means such as word2vec, GloVe, or fastText; then, the initial low-dimensional vectors are passed through a plurality of LSTM layers to obtain the coding information corresponding to the first semantic word segmentation results, where the coding information can represent the deep semantics and syntax of the first semantic word segmentation results.
In this application, a specific implementation manner for determining a result of similarity of response distribution of two topics in a topic pair according to detection data of each topic in the topic pair is also disclosed, and the method includes the following steps:
S501: Performing semantic word segmentation on the student response data in the detection data of each question in the question pair to obtain a second semantic word segmentation result of each question in the question pair.
In the application, the student response data in the detection data of each topic in the topic pair can be input into the semantic word segmentation system to obtain a second semantic word segmentation result of each topic in the topic pair. The semantic word segmentation system can refer to the related descriptions in S301 and S401, and the description is omitted here.
S502: and obtaining answer distribution similarity results of the two questions in the question pair based on the question face segmentation results and the second semantic segmentation results of the two questions in the question pair.
For repeated topics, if the specific application background is set aside, a response to one topic can be regarded as a response to the other topic. Alternatively, across the multiple responses to the two topics, the distribution of error types or error logic among the wrong responses is similar, and, where there are multiple correct responses, the distribution of correct response types or correct response logic is similar as well.
As an implementable mode, the topic segmentation result and the second semantic segmentation result of the two topics in the topic pair can be input into the student answering detection system, and the answering distribution similarity result of the two topics in the topic pair is obtained.
Student responses are diverse and the manner of answering is strongly influenced by the individual, so noise may mask important information during encoding. It is therefore not advisable to determine the response distribution similarity of the two questions in the question pair by directly comparing the students' responses to the two questions. Moreover, the student responses to each topic in the topic pair can include correct responses and incorrect responses, and for repeated topics both the correct responses and the incorrect responses have similar distributions.
Therefore, in the application, the student answering detection system can obtain the wrong answering distribution similarity result and the correct answering distribution similarity result of each topic in the topic pair based on the topic surface word segmentation result and the second semantic word segmentation result of each topic in the topic pair; and taking the wrong answer distribution similarity result and the correct answer distribution similarity result of the two questions in the question pair as answer distribution similarity results of the two questions in the question pair.
For ease of understanding, assume that the topic pair contains a topic A and a topic B, that the correct student response data of topic A is $A_1 \sim A_k$ and the wrong student response data of topic A is $A_{k+1} \sim A_n$, and that the correct student response data of topic B is $B_1 \sim B_j$ and the wrong student response data of topic B is $B_{j+1} \sim B_m$.
For each correct student response data $A_1 \sim A_k$ of topic A, the vector code of the correct student response data is obtained through a multi-layer $\mathrm{LSTM}_{right}$; attention is calculated between this vector code and each word segmentation code in the topic word segmentation result of topic A to obtain the attention calculation result of the correct student response data; and the attention calculation result is normalized to obtain the distribution of the attention calculation result of the correct student response data. The distributions of the attention calculation results of all the correct student response data are averaged to obtain the topic information distribution $dr_A$ of topic A corresponding to the correct student response data of topic A.
For each wrong student response data $A_{k+1} \sim A_n$ of topic A, the vector code of the wrong student response data is obtained through a multi-layer $\mathrm{LSTM}_{wrong}$; attention is calculated between this vector code and each word segmentation code in the topic word segmentation result of topic A to obtain the attention calculation result of the wrong student response data; and the attention calculation result is normalized to obtain the distribution of the attention calculation result of the wrong student response data. The distributions of the attention calculation results of all the wrong student response data are averaged to obtain the topic information distribution $dw_A$ of topic A corresponding to the wrong student response data of topic A.
For each correct student response data $B_1 \sim B_j$ of topic B, the vector code of the correct student response data is obtained through the multi-layer $\mathrm{LSTM}_{right}$; attention is calculated between this vector code and each word segmentation code in the topic word segmentation result of topic A to obtain the attention calculation result of the correct student response data; and the attention calculation result is normalized to obtain the distribution of the attention calculation result of the correct student response data. The distributions of the attention calculation results of all the correct student response data are averaged to obtain the topic information distribution $dr_B$ of topic A corresponding to the correct student response data of topic B.
For each wrong student response data $B_{j+1} \sim B_m$ of topic B, the vector code of the wrong student response data is obtained through the multi-layer $\mathrm{LSTM}_{wrong}$; attention is calculated between this vector code and each word segmentation code in the topic word segmentation result of topic A to obtain the attention calculation result of the wrong student response data; and the attention calculation result is normalized to obtain the distribution of the attention calculation result of the wrong student response data. The distributions of the attention calculation results of all the wrong student response data are averaged to obtain the topic information distribution $dw_B$ of topic A corresponding to the wrong student response data of topic B.
The Kullback-Leibler distance between $dr_A$ and $dr_B$ is calculated to obtain the correct answer distribution similarity result of topic A, and the Kullback-Leibler distance between $dw_A$ and $dw_B$ is calculated to obtain the wrong answer distribution similarity result of topic A.
Similarly, the distribution similarity result of correct answers at the topic B and the distribution similarity result of wrong answers at the topic B can be obtained.
It should be noted that, for repeated topics, the effective information of the topics is the same. Because the topic information used in correct responses is similar, the topic information distribution corresponding to the correct responses can be obtained from the interaction between the correct responses and the topic, and whether the two topics are repeated is judged by comparing the difference between the topic information distributions obtained when the correct responses of the two topics each interact with the same topic. A similar determination can be made using the wrong responses.
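The comparison of topic information distributions can be sketched as follows. This is a minimal illustration, not the student answering detection system of the application: the response and topic token codes are taken as already computed vectors (standing in for the LSTM outputs), the attention score is assumed to be a dot product, the normalization is assumed to be a softmax, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw attention scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_distribution(response_vec, topic_token_vecs):
    """Dot-product attention of one response code over the topic token codes."""
    scores = [sum(r * t for r, t in zip(response_vec, tok))
              for tok in topic_token_vecs]
    return softmax(scores)


def mean_distribution(distributions):
    """Average several distributions position by position."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q); eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Toy codes: three token codes of topic A, and correct responses to A and B.
    topic_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    correct_a = [[0.9, 0.1], [0.8, 0.3]]
    correct_b = [[0.7, 0.2], [1.0, 0.0]]
    dr_a = mean_distribution([attention_distribution(r, topic_a) for r in correct_a])
    dr_b = mean_distribution([attention_distribution(r, topic_a) for r in correct_b])
    print(kl_divergence(dr_a, dr_b))
```

A small divergence indicates that the correct responses to both topics rely on similar parts of topic A; the same functions can be applied to the wrong responses to compare $dw_A$ and $dw_B$.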
In the application, a specific implementation manner for determining whether two topics in a topic pair are repeated topics based on a word similarity result, a semantic similarity result and a response distribution similarity result is also disclosed, and the manner can be as follows:
and inputting the word similarity result, the semantic similarity result and the answer distribution similarity result into a classification model, and determining whether two questions in the question pair are repeated questions or not.
It should be noted that the classification model is obtained by training, with the word similarity result, the semantic similarity result, and the answer distribution similarity result of each training question in the training question pair as training samples, and with the labeling information for identifying whether two questions in the training question pair are duplicates as sample labels.
The classification model may be a multilayer perceptron. The output of the classification model can be the probability that the two questions in the question pair are repeated questions. When determining whether the two questions in the question pair are repeated questions based on the classification model, it can be judged whether the probability output by the classification model is greater than a preset threshold: if the probability is greater than the preset threshold, the two questions in the question pair are determined to be repeated questions; if the probability is less than or equal to the preset threshold, the two questions in the question pair are determined not to be repeated questions.
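The threshold decision can be sketched as below. The example assumes a perceptron with a single hidden ReLU layer and a sigmoid output; the weights, the feature ordering, and the default threshold of 0.5 are illustrative assumptions and are not specified by the application. Only the rule "greater than the threshold means repeated" is taken from the description.

```python
import math


def mlp_probability(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer perceptron with a sigmoid output."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


def is_repeated(probability, threshold=0.5):
    """Repeated only if the probability is strictly greater than the threshold."""
    return probability > threshold


if __name__ == "__main__":
    # Assumed feature order: word, semantic, and answer distribution results.
    features = [0.9, 0.8, 0.7]
    prob = mlp_probability(features, [[1.0, 1.0, 1.0]], [0.0], [1.0], -1.0)
    print(prob, is_repeated(prob))
```

In practice the weights would come from training on labeled question pairs, and the threshold would be chosen on validation data.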
It should be further noted that, as shown in fig. 2, in the present application the word segmentation system, the semantic word segmentation system, the word level detection system, the semantic level detection system, and the student response detection system may be combined into a similarity detection model, and the similarity detection model and the classification model may be combined into a repeated topic detection model. When repeated topics are detected based on the repeated topic detection model, the word segmentation system, the semantic word segmentation system, the word level detection system, the semantic level detection system, and the student response detection system in the model perform the step of determining the word similarity result, the semantic similarity result, and the response distribution similarity result of the two topics in the topic pair according to the detection data of each topic in the topic pair, and the classification model performs the step of determining whether the two topics in the topic pair are repeated topics based on the word similarity result, the semantic similarity result, and the response distribution similarity result.
The following describes the repeated question detection apparatus disclosed in the embodiments of the present application. The repeated question detection apparatus described below and the repeated question detection method described above may be referred to in correspondence with each other.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a repeated question detection apparatus disclosed in the embodiment of the present application. As shown in fig. 3, the repeated question detection apparatus may include:
an obtaining unit 21, configured to obtain a question pair to be subjected to question duplication detection;
a detection data determining unit 22, configured to determine detection data of each question in the question pair, where the detection data includes question surface data, analysis data, and student response data;
a similarity determining unit 23, configured to determine a word similarity result, a semantic similarity result, and a response distribution similarity result of two topics in the topic pair according to the detection data of each topic in the topic pair;
and the repeated question determining unit 24 is configured to determine whether two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result, and the answer distribution similarity result.
Optionally, the obtaining unit includes:
the original data acquisition unit is used for acquiring the original data of each topic in the topic pair;
the detection data determining unit is used for determining the detection data of each topic in the topic pair based on the original data of each topic in the topic pair;
wherein the detection data determining unit includes:
a first detection data determining unit, configured to use original data of each topic in the topic pair as detection data of each topic in the topic pair;
or,
and the second detection data determining unit is used for carrying out standardization processing on the original data of each topic in the topic pair, and the processed data is used as the detection data of each topic in the topic pair.
Optionally, the similarity determining unit includes:
the topic segmentation unit is used for performing segmentation processing on topic data in the detection data of each topic in the topic pair to obtain a topic segmentation result of each topic in the topic pair;
and the word similarity result determining unit is used for obtaining a word similarity result of the two topics in the topic pair based on the topic segmentation result of the two topics in the topic pair.
Optionally, the similarity determining unit includes:
the first semantic word segmentation unit is used for performing semantic word segmentation on a combination of the topic data and the analysis data in the detection data of each topic in the topic pair to obtain a first semantic word segmentation result of each topic in the topic pair;
and the semantic similarity result determining unit is used for obtaining a semantic similarity result of the two questions in the question pair based on a first semantic segmentation result of the two questions in the question pair.
Optionally, the similarity determining unit includes:
the second semantic word segmentation unit is used for performing semantic word segmentation on student response data in the detection data of each question in the question pair to obtain a second semantic word segmentation result of each question in the question pair;
and the answer distribution similarity determining unit is used for obtaining answer distribution similarity results of the two questions in the question pair based on the question surface word segmentation results and the second semantic word segmentation results of the two questions in the question pair.
Optionally, the answer distribution similarity determining unit is specifically configured to:
obtaining a wrong answer distribution similarity result and a correct answer distribution similarity result of each question in the question pair based on the question face segmentation result and the second semantic segmentation result of each question in the question pair;
and taking the wrong answer distribution similarity result and the correct answer distribution similarity result of the two questions in the question pair as answer distribution similarity results of the two questions in the question pair.
Optionally, the repeated question determining unit is specifically configured to:
inputting the word similarity result, the semantic similarity result and the answer distribution similarity result into a classification model, and determining whether two questions in the question pair are repeated questions;
the classification model is obtained by taking a word similarity result, a semantic similarity result and a response distribution similarity result of each training topic in the training topic pair as training samples and taking labeling information for identifying whether two topics in the training topic pair are repeated topics as sample labels for training.
Fig. 4 is a block diagram of a hardware structure of a system for detecting a duplicate problem disclosed in an embodiment of the present application, and referring to fig. 4, the hardware structure of the system for detecting a duplicate problem may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
obtaining a question pair to be subjected to repeated question detection;
determining detection data of each question in the question pair, wherein the detection data comprises question surface data, analysis data and student answering data;
determining a word similarity result, a semantic similarity result and a response distribution similarity result of two questions in the question pair according to the detection data of each question in the question pair;
and determining whether the two questions in the question pair are repeated questions or not based on the word similarity result, the semantic similarity result and the answer distribution similarity result.
Optionally, for the refined functions and extended functions of the program, reference may be made to the description above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
obtaining a question pair to be subjected to repeated question detection;
determining detection data of each question in the question pair, wherein the detection data comprises question surface data, analysis data and student answering data;
determining a word similarity result, a semantic similarity result and a response distribution similarity result of two questions in the question pair according to the detection data of each question in the question pair;
and determining whether the two questions in the question pair are repeated questions or not based on the word similarity result, the semantic similarity result and the answer distribution similarity result.
Optionally, for the refined functions and extended functions of the program, reference may be made to the description above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for detecting repeated questions, comprising:
obtaining a question pair to be subjected to repeated question detection;
determining detection data of each question in the question pair, wherein the detection data comprises question surface data, analysis data and student answering data;
determining a word similarity result, a semantic similarity result and a response distribution similarity result of two questions in the question pair according to the detection data of each question in the question pair;
determining whether two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result and the answer distribution similarity result;
determining a word similarity result, a semantic similarity result and a response distribution similarity result of two topics in the topic pair according to the detection data of each topic in the topic pair, wherein the determining comprises the following steps:
determining a word similarity result of the two questions in the question pair according to the question surface data of each question in the question pair, determining a semantic similarity result of the two questions in the question pair according to the question surface data and the analysis data of each question in the question pair, and determining a response distribution similarity result of the two questions in the question pair according to the question surface data and the student answering data of each question in the question pair.
2. The method of claim 1, wherein the determining detection data for each topic in the topic pair comprises:
acquiring original data of each question in the question pair;
determining detection data of each topic in the topic pair based on the original data of each topic in the topic pair;
wherein, the determining the detection data of each topic in the topic pair based on the original data of each topic in the topic pair comprises:
taking the original data of each topic in the topic pair as the detection data of each topic in the topic pair;
or,
and carrying out standardization processing on the original data of each topic in the topic pair, wherein the processed data is used as the detection data of each topic in the topic pair.
3. The method of claim 1, wherein determining a word similarity result for two topics in the topic pair based on the detected data for each topic in the topic pair comprises:
performing word segmentation processing on the topic data in the detection data of each topic in the topic pair to obtain a topic word segmentation result of each topic in the topic pair;
and obtaining a word similarity result of the two topics in the topic pair based on the topic segmentation results of the two topics in the topic pair.
4. The method of claim 1, wherein determining semantic similarity results for two topics in the topic pair based on the detection data for each topic in the topic pair comprises:
performing semantic word segmentation on a combination of topic data and analytic data in the detection data of each topic in the topic pair to obtain a first semantic word segmentation result of each topic in the topic pair;
and obtaining a semantic similarity result of the two questions in the question pair based on the first semantic segmentation result of the two questions in the question pair.
5. The method of claim 3, wherein determining a distribution similarity result for answers to two topics in the topic pair based on the detected data for each topic in the topic pair comprises:
performing semantic word segmentation on student response data in the detection data of each question in the question pair to obtain a second semantic word segmentation result of each question in the question pair;
and obtaining answer distribution similarity results of the two questions in the question pair based on the question face segmentation results and the second semantic segmentation results of the two questions in the question pair.
6. The method of claim 5, wherein obtaining a similarity result of response distribution of two topics in the topic pair based on a topic segmentation result and a second semantic segmentation result of the two topics in the topic pair comprises:
obtaining a wrong answer distribution similarity result and a correct answer distribution similarity result of each question in the question pair based on the question face segmentation result and the second semantic segmentation result of each question in the question pair;
and taking the wrong answer distribution similarity result and the correct answer distribution similarity result of the two questions in the question pair as answer distribution similarity results of the two questions in the question pair.
7. The method of claim 1, wherein the determining whether two topics in the topic pair are duplicates based on the word similarity result, the semantic similarity result, and the answer distribution similarity result comprises:
inputting the word similarity result, the semantic similarity result and the answer distribution similarity result into a classification model, and determining whether two questions in the question pair are repeated questions;
the classification model is obtained by taking a word similarity result, a semantic similarity result and a response distribution similarity result of each training topic in the training topic pair as training samples and taking labeling information for identifying whether two topics in the training topic pair are repeated topics as sample labels for training.
8. An apparatus for detecting repeated questions, the apparatus comprising:
the acquisition unit is used for acquiring a question pair to be subjected to repeated question detection;
the detection data determining unit is used for determining detection data of each question in the question pairs, and the detection data comprises question surface data, analysis data and student answering data;
the similarity determining unit is used for determining a word similarity result, a semantic similarity result and a response distribution similarity result of two questions in the question pair according to the detection data of each question in the question pair;
a repeated question determining unit, configured to determine whether two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result, and the answer distribution similarity result;
the similarity determining unit is specifically configured to determine a word similarity result of the two questions in the question pair according to the question surface data of each question in the question pair, determine a semantic similarity result of the two questions in the question pair according to the question surface data and the analysis data of each question in the question pair, and determine a response distribution similarity result of the two questions in the question pair according to the question surface data and the student answering data of each question in the question pair.
9. A repeated question detection system comprising a memory and a processor;
the memory is used for storing programs;
the processor, configured to execute the program, and implement the steps of the method according to any one of claims 1 to 7.
10. A readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method for detecting repeated questions according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010013765.4A CN111241239B (en) 2020-01-07 2020-01-07 Method for detecting repeated questions, related device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010013765.4A CN111241239B (en) 2020-01-07 2020-01-07 Method for detecting repeated questions, related device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111241239A CN111241239A (en) 2020-06-05
CN111241239B true CN111241239B (en) 2022-12-02

Family

ID=70875940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010013765.4A Active CN111241239B (en) 2020-01-07 2020-01-07 Method for detecting repeated questions, related device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111241239B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898343B (en) * 2020-08-03 2023-07-14 北京师范大学 Similar topic identification method and system based on phrase structure tree
CN113051886B (en) * 2021-03-25 2023-12-01 科大讯飞股份有限公司 Test question duplicate checking method, device, storage medium and equipment
CN116680422A (en) * 2023-07-31 2023-09-01 山东山大鸥玛软件股份有限公司 Multi-mode question bank resource duplicate checking method, system, device and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320772A (en) * 2015-11-02 2016-02-10 武汉大学 Associated paper query method for patent duplicate checking
CN105824798A (en) * 2016-03-03 2016-08-03 云南电网有限责任公司教育培训评价中心 Examination question bank de-duplication method based on examination question keyword similarity
WO2016188283A1 (en) * 2015-05-26 2016-12-01 阿里巴巴集团控股有限公司 Repeated data identification method and device
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Method for calculating similarity between short texts
CN106599317A (en) * 2016-12-30 2017-04-26 上海智臻智能网络科技股份有限公司 Test data processing method and device for question-answering system and terminal
CN106651696A (en) * 2016-11-16 2017-05-10 福建天泉教育科技有限公司 Approximate question push method and system
CN107977347A (en) * 2017-12-04 2018-05-01 海南云江科技有限公司 Question de-duplication method and computing device
CN109948121A (en) * 2017-12-20 2019-06-28 北京京东尚科信息技术有限公司 Article similarity mining method, system, equipment and storage medium
CN110134777A (en) * 2019-05-29 2019-08-16 三角兽(北京)科技有限公司 Question de-duplication method, device, electronic equipment and computer readable storage medium
CN110362681A (en) * 2019-06-19 2019-10-22 平安科技(深圳)有限公司 Method, device and storage medium for identifying duplicate questions in a question answering system

Non-Patent Citations (2)

Title
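"Research on duplicate detection in question banks based on multi-instance learning"; 汤世平 (Tang Shiping); Transactions of Beijing Institute of Technology (《北京理工大学学报》); 2005-12-31; pp. 1071-1074 *
"A homework duplicate-checking algorithm based on a semantic vector space model"; 黄菊 (Huang Ju); 《电子科学技术》; 2016-11-10 (No. 06); pp. 118-121 *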

Also Published As

Publication number Publication date
CN111241239A (en) 2020-06-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant