CN111241239B

CN111241239B - Method for detecting repeated questions, related device and readable storage medium

Info

Publication number: CN111241239B
Application number: CN202010013765.4A
Authority: CN
Inventors: 李旭浩; 沙晶; 付瑞吉; 王士进; 魏思
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2020-01-07
Filing date: 2020-01-07
Publication date: 2022-12-02
Anticipated expiration: 2040-01-07
Also published as: CN111241239A

Abstract

The application discloses a method for detecting repeated questions, a related device, and a readable storage medium. After a question pair to be subjected to repeated question detection is obtained, a word similarity result, a semantic similarity result, and an answer distribution similarity result of the two questions in the question pair are determined based on multiple kinds of detection data of each question in the question pair, namely question surface data, analysis data, and student response data. Whether the two questions in the question pair are repeated questions is then determined based on the word similarity result, the semantic similarity result, and the answer distribution similarity result. Because repetition is detected from multiple angles rather than from a single angle, the accuracy of detecting whether the two questions are repeated questions can be improved, which serves the purpose of avoiding, as far as possible, recommending repeated questions to the user.

Description

Method for detecting repeated questions, related device and readable storage medium

Technical Field

The present application relates to the field of natural language processing technologies, and in particular, to a method for detecting repeated questions, a related device, and a readable storage medium.

Background

With the development of internet technology, online education is appearing more and more frequently in people's field of view. A successful online education system is inseparable from well-constructed educational resources. Under the existing mode, building an educational resource library requires a large investment of manpower, and the knowledge point coverage of the resource library is increased by continuously adding and processing new questions.

At present, an online education system recommends questions in the resource library to a user based on a personalized recommendation algorithm. However, as the resource library expands, the number of questions reaches millions or even tens of millions, and because each question is processed at a different time, in a different place, and by a different person, repeated questions inevitably appear in the resource library. This directly results in repeated questions being recommended to the user, and the user experience is poor.

Therefore, it is desirable to provide a method for detecting repeated questions, so as to avoid recommending repeated questions to the user as far as possible.

Disclosure of Invention

In view of the foregoing, the present application provides a method for detecting repeated questions, a related device, and a readable storage medium. The specific scheme is as follows:

A method for detecting repeated questions, comprising:

obtaining a question pair to be subjected to repeated question detection;

determining detection data of each question in the question pair, wherein the detection data comprises question surface data, analysis data and student response data;

determining a word similarity result, a semantic similarity result and an answer distribution similarity result of two questions in the question pair according to the detection data of each question in the question pair;

and determining whether the two questions in the question pair are repeated questions or not based on the word similarity result, the semantic similarity result and the answer distribution similarity result.

Optionally, the determining detection data of each question in the question pair includes:

acquiring original data of each question in the question pair;

determining detection data of each question in the question pair based on the original data of each question in the question pair;

wherein the determining the detection data of each question in the question pair based on the original data of each question in the question pair comprises:

taking the original data of each question in the question pair as the detection data of each question in the question pair;

or,

performing standardization processing on the original data of each question in the question pair, and taking the processed data as the detection data of each question in the question pair.

Optionally, determining a word similarity result of the two questions in the question pair according to the detection data of each question in the question pair includes:

performing word segmentation on the question surface data in the detection data of each question in the question pair to obtain a question surface word segmentation result of each question in the question pair;

and obtaining a word similarity result of the two questions in the question pair based on the question surface word segmentation results of the two questions in the question pair.

Optionally, determining a semantic similarity result of the two questions in the question pair according to the detection data of each question in the question pair includes:

performing semantic word segmentation on a combination of the question surface data and the analysis data in the detection data of each question in the question pair to obtain a first semantic word segmentation result of each question in the question pair;

and obtaining a semantic similarity result of the two questions in the question pair based on the first semantic word segmentation results of the two questions in the question pair.

Optionally, determining an answer distribution similarity result of the two questions in the question pair according to the detection data of each question in the question pair includes:

performing semantic word segmentation on the student response data in the detection data of each question in the question pair to obtain a second semantic word segmentation result of each question in the question pair;

and obtaining an answer distribution similarity result of the two questions in the question pair based on the question surface word segmentation results and the second semantic word segmentation results of the two questions in the question pair.

Optionally, the obtaining an answer distribution similarity result of the two questions in the question pair based on the question surface word segmentation results and the second semantic word segmentation results of the two questions in the question pair includes:

obtaining a wrong answer distribution similarity result and a correct answer distribution similarity result of each question in the question pair based on the question surface word segmentation result and the second semantic word segmentation result of each question in the question pair;

and taking the wrong answer distribution similarity result and the correct answer distribution similarity result of the two questions in the question pair as the answer distribution similarity result of the two questions in the question pair.

Optionally, the determining whether the two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result, and the answer distribution similarity result includes:

inputting the word similarity result, the semantic similarity result and the answer distribution similarity result into a classification model, and determining whether the two questions in the question pair are repeated questions;

wherein the classification model is obtained by training with the word similarity result, the semantic similarity result and the answer distribution similarity result of the two training questions in a training question pair as training samples, and with labeling information identifying whether the two questions in the training question pair are repeated questions as sample labels.

An apparatus for detecting repeated questions, the apparatus comprising:

the acquisition unit is used for acquiring a question pair to be subjected to repeated question detection;

the detection data determining unit is used for determining detection data of each question in the question pair, and the detection data comprises question surface data, analysis data and student response data;

the similarity determining unit is used for determining a word similarity result, a semantic similarity result and a response distribution similarity result of two questions in the question pair according to the detection data of each question in the question pair;

and the repeated question determining unit is used for determining whether the two questions in the question pair are repeated questions or not based on the word similarity result, the semantic similarity result and the answer distribution similarity result.

Optionally, the obtaining unit includes:

the original data acquisition unit is used for acquiring the original data of each question in the question pair;

the detection data determining unit is used for determining the detection data of each question in the question pair based on the original data of each question in the question pair;

wherein the detection data determining unit includes:

a first detection data determining unit, configured to use the original data of each question in the question pair as the detection data of each question in the question pair;

or,

a second detection data determining unit, configured to perform standardization processing on the original data of each question in the question pair, the processed data being used as the detection data of each question in the question pair.

Optionally, the similarity determining unit includes:

the question surface word segmentation unit is used for performing word segmentation on the question surface data in the detection data of each question in the question pair to obtain a question surface word segmentation result of each question in the question pair;

and the word similarity result determining unit is used for obtaining a word similarity result of the two questions in the question pair based on the question surface word segmentation results of the two questions in the question pair.

Optionally, the similarity determining unit includes:

the first semantic word segmentation unit is used for performing semantic word segmentation on a combination of the question surface data and the analysis data in the detection data of each question in the question pair to obtain a first semantic word segmentation result of each question in the question pair;

and the semantic similarity result determining unit is used for obtaining a semantic similarity result of the two questions in the question pair based on the first semantic word segmentation results of the two questions in the question pair.

Optionally, the similarity determining unit includes:

the second semantic word segmentation unit is used for performing semantic word segmentation on student response data in the detection data of each question in the question pair to obtain a second semantic word segmentation result of each question in the question pair;

and the answer distribution similarity determining unit is used for obtaining answer distribution similarity results of the two questions in the question pair based on the question surface word segmentation results and the second semantic word segmentation results of the two questions in the question pair.

Optionally, the answer distribution similarity determining unit is specifically configured to:

Optionally, the repeated question determining unit is specifically configured to:

A system for detecting repeated questions, comprising a memory and a processor;

the memory is used for storing a program;

the processor is configured to execute the program to implement the steps of the method for detecting repeated questions as described above.

A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method for detecting repeated questions as described above.

By means of the above technical scheme, the application discloses a method for detecting repeated questions, a related device, and a readable storage medium. After a question pair to be subjected to repeated question detection is obtained, a word similarity result, a semantic similarity result, and an answer distribution similarity result of the two questions in the question pair are determined based on multiple kinds of detection data of each question in the question pair, namely question surface data, analysis data, and student response data. Whether the two questions in the question pair are repeated questions is then determined based on the word similarity result, the semantic similarity result, and the answer distribution similarity result. Because repetition is detected from multiple angles rather than from a single angle, the accuracy of detecting whether the two questions are repeated questions can be improved, which serves the purpose of avoiding, as far as possible, recommending repeated questions to the user.

Drawings

Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a schematic flowchart of a method for detecting repeated questions according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a repeated question detection model according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an apparatus for detecting repeated questions disclosed in an embodiment of the present application;

FIG. 4 is a block diagram of a hardware structure of a system for detecting repeated questions disclosed in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Next, the method for detecting repeated questions provided by the present application is described through the following embodiments.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a method for detecting repeated questions disclosed in an embodiment of the present application, where the method includes:

S101: Obtain a question pair to be subjected to repeated question detection.

In the present application, the question pair to be subjected to repeated question detection may be any question pair in the educational resource library of an online education system. In terms of question type, it may be a pair of fill-in-the-blank questions, multiple-choice questions, short-answer questions, and the like; in terms of subject, it may be a pair of questions in mathematics, physics, chemistry, and the like. There may be one or more question pairs to be subjected to repeated question detection.

S102: Determine detection data of each question in the question pair.

In the prior art, when repeated questions are detected, only the question surface data of each question in the question pair is used as the detection data, so only repeated questions that are obvious from the question surface, that is, those a person would recognize at first glance, can be detected. Intuitively, questions whose surfaces are identical or mostly identical have a strong visual impact, and the first impression is that they are very likely repeated questions. In fact, judging from front-line feedback data, this type of repeated question accounts for a large proportion and is very likely to draw user complaints.

However, a question generally comprises three parts: the question surface, the analysis, and the student responses. In some cases, although the question surfaces of two questions are not similar, the knowledge points and answer points they examine are highly similar. In such cases the two questions should be determined to be repeated questions, but a determination based only on question surface data yields the result that they are not.

Therefore, in the present application, the detection data of each question in the question pair includes the question surface data, analysis data, and student response data of the question, where the student response data may include a plurality of different responses.

S103: Determine a word similarity result, a semantic similarity result, and an answer distribution similarity result of the two questions in the question pair according to the detection data of each question in the question pair.

In the present application, after the question surface data, analysis data, and student response data of each question in the question pair are determined, the word similarity result, semantic similarity result, and answer distribution similarity result of the two questions in the question pair can be determined according to these data.

The question surface data, analysis data, and student response data of each question in the question pair can be divided into different groups according to the requirements of three levels, namely the word level, the semantic level, and the response content level, and the similarity result of each level is obtained based on the detection data in the corresponding group.

In the present application, the question surface data of each question may form one group, the question surface data and analysis data of each question may form another group, and the question surface data and student response data of each question may form a third group. Based on this grouping, the word similarity result of the two questions in the question pair is determined according to the question surface data of each question, the semantic similarity result is determined according to the question surface data and analysis data of each question, and the answer distribution similarity result is determined according to the question surface data and student response data of each question. Specific implementations of determining each similarity result are described in detail in the following embodiments.
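The grouping described above can be sketched as follows. This is a minimal illustration only: the class name `DetectionData`, its field names, and the three helper functions are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DetectionData:
    """Detection data of one question (all names here are illustrative)."""
    surface: str                                         # question surface data
    analysis: str                                        # analysis data
    responses: List[str] = field(default_factory=list)  # student response data


def word_level_group(q: DetectionData) -> str:
    # Word level: question surface data only.
    return q.surface


def semantic_level_group(q: DetectionData) -> str:
    # Semantic level: question surface data combined with analysis data.
    return q.surface + "\n" + q.analysis


def response_level_group(q: DetectionData) -> Tuple[str, List[str]]:
    # Response content level: question surface data plus student responses.
    return q.surface, list(q.responses)
```

Each group would then be passed to the detection step of the corresponding level.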

S104: Determine whether the two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result, and the answer distribution similarity result.

In the present application, after the word similarity result, semantic similarity result, and answer distribution similarity result of the two questions in the question pair are determined according to the question surface data, analysis data, and student response data of each question, whether the two questions are repeated questions can be determined based on these three results.

It should be noted that there may be various specific implementations for determining whether the two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result, and the answer distribution similarity result. As one possible implementation, the weight that each of the three results carries in the determination may be set, and whether the two questions are repeated questions is determined based on the three results and their respective weights. As another possible implementation, various neural network models can be used to determine whether the two questions are repeated questions based on the three results; a specific implementation is described in detail in the following embodiments.

This embodiment discloses a method for detecting repeated questions. After a question pair to be subjected to repeated question detection is obtained, a word similarity result, a semantic similarity result, and an answer distribution similarity result of the two questions in the question pair are determined based on multiple kinds of detection data of each question, namely question surface data, analysis data, and student response data. Whether the two questions are repeated questions is then determined based on these three results. Because repetition is detected from multiple angles rather than from a single angle, the accuracy of detecting whether the two questions are repeated questions can be improved, which serves the purpose of avoiding, as far as possible, recommending repeated questions to the user.
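The weight-based implementation can be illustrated with a short sketch. It assumes each similarity result is a single score in the range 0 to 1; the weights `(0.4, 0.4, 0.2)` and the threshold `0.8` are placeholder values chosen for illustration and are not given in the application.

```python
from typing import Tuple


def is_repeated(
    word_sim: float,
    semantic_sim: float,
    answer_sim: float,
    weights: Tuple[float, float, float] = (0.4, 0.4, 0.2),  # assumed values
    threshold: float = 0.8,                                  # assumed value
) -> bool:
    """Combine three similarity scores by a weighted average and threshold it."""
    w_word, w_sem, w_ans = weights
    total = w_word + w_sem + w_ans
    score = (w_word * word_sim + w_sem * semantic_sim + w_ans * answer_sim) / total
    return score >= threshold
```

In practice the weights and threshold would have to be tuned on labeled question pairs; a trained classification model is the alternative the application mentions.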

The present application also discloses a specific implementation of determining the detection data of each question in the question pair, which may include the following steps:

S201: Acquire the original data of each question in the question pair.

In the present application, the original data of each question in the question pair includes the original question surface data, original analysis data, and original student response data of the question.

S202: Determine the detection data of each question in the question pair based on the original data of each question in the question pair.

As one possible implementation, the original data of each question in the question pair can be used directly as the detection data of the question.

However, the original data of a question often has various problems, for example, the representation of a formula is not unique, or picture data exists that cannot be recognized. These problems directly affect the accuracy of subsequent repeated question detection.

To solve the above problems, as another possible implementation, the original data of each question in the question pair may be standardized, and the processed data used as the detection data of the question. The standardization processing includes picture recognition processing, formula regularization processing, and the like.

For ease of understanding, a mathematics question is taken as an example, and a specific implementation of standardizing the original data of a mathematics question is given as follows:

First, whether picture formula data exists is judged according to the HTML data of each module in the original data of the mathematics question. If so, an OCR (optical character recognition) system is used to recognize and extract the formulas in the pictures, and the corresponding pictures are replaced with LaTeX formulas, so as to ensure the completeness of the formula data.

Then, the formula data in the original data of the mathematics question is regularized: all symbols in the formula data are converted into LaTeX symbol representations, synonymous LaTeX representations are normalized, LaTeX representations are simplified, and all special-effect representations such as bold, color, and arrow representations are removed.

In this way, standardized question surface data of the mathematics question is obtained by processing its original question surface data, standardized analysis data is obtained by processing its original analysis data, and standardized student response data is obtained by processing its original student response data.
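The formula regularization step can be sketched with regular expressions. This is a minimal illustration under stated assumptions: the symbol table, the synonym table, and the list of style commands are small hypothetical subsets chosen for the example, and the application does not specify which commands are handled or how.

```python
import re

# Unicode symbols mapped to LaTeX commands (illustrative subset, assumed).
UNICODE_TO_LATEX = {"≤": r"\leq ", "≥": r"\geq ", "×": r"\times ", "π": r"\pi "}

# Synonymous commands mapped to one canonical spelling (illustrative subset).
SYNONYMS = {r"\le": r"\leq", r"\ge": r"\geq", r"\ne": r"\neq"}

# Wrappers that only carry special effects: bold and arrow decorations.
STYLE_WRAPPER = re.compile(
    r"\\(?:mathbf|textbf|boldsymbol|overrightarrow)\s*\{([^{}]*)\}"
)
TEXTCOLOR = re.compile(r"\\textcolor\s*\{[^{}]*\}\s*\{([^{}]*)\}")
COLOR_SWITCH = re.compile(r"\\color\s*\{[^{}]*\}")


def normalize_formula(latex: str) -> str:
    """Return a canonical LaTeX string intended for comparison only."""
    for symbol, command in UNICODE_TO_LATEX.items():
        latex = latex.replace(symbol, command)

    # Unwrap style commands repeatedly so that nested wrappers are handled.
    previous = None
    while previous != latex:
        previous = latex
        latex = TEXTCOLOR.sub(r"\1", latex)
        latex = STYLE_WRAPPER.sub(r"\1", latex)
        latex = COLOR_SWITCH.sub("", latex)

    # Replace a synonym only when the command name ends there,
    # so that \le does not match inside \left or \leq.
    for old, new in SYNONYMS.items():
        pattern = re.compile(re.escape(old) + r"(?![A-Za-z])")
        latex = pattern.sub(lambda _match, new=new: new, latex)

    # Collapse runs of whitespace.
    return " ".join(latex.split())
```

The wrapper patterns only match arguments without inner braces, so deeply nested style commands around braced content would need a real LaTeX parser.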

It should be noted that, in the present application, the detection data of each question in the question pair may be input into a similarity detection model to obtain the word similarity result, semantic similarity result, and answer distribution similarity result of the two questions in the question pair. The similarity detection model comprises a word segmentation system, a semantic word segmentation system, a word level detection system, a semantic level detection system, and a student response detection system.

Based on the similarity detection model, the present application discloses implementations of determining the word similarity result, semantic similarity result, and answer distribution similarity result of the two questions in the question pair according to the detection data of each question, which are described in detail below.

The present application discloses an implementation of determining the word similarity result of the two questions in the question pair according to the detection data of each question in the question pair, which may include the following steps:

S301: Perform word segmentation on the question surface data in the detection data of each question in the question pair to obtain a question surface word segmentation result of each question in the question pair.

In the present application, the question surface data in the detection data of each question in the question pair can be input into the word segmentation system to obtain the question surface word segmentation result of the question.

It should be noted that the word segmentation system can be an existing Chinese word segmentation system such as jieba, SnowNLP, or Stanford CoreNLP, or an existing English word segmentation system such as NLTK, spaCy, or Stanford CoreNLP. However, for questions that contain Chinese characters, English letters, English symbols, special mathematical symbols, and the like, an existing Chinese or English word segmentation system cannot be applied directly, because the questions contain too many types of characters.

S302: Obtain a word similarity result of the two questions in the question pair based on the question surface word segmentation results of the two questions in the question pair.

In the present application, the question surface word segmentation results of the two questions in the question pair can be input into the word level detection system to obtain the word similarity result of the two questions.

The word level detection system can construct a vector for the question surface word segmentation result of each of the two questions, and calculate the similarity of the two vectors as the word similarity result of the two questions. The similarity of the two vectors can be calculated in various ways. For example, the cosine similarity of the two vectors can be calculated, or the distribution and proportion of n-grams in the two word segmentation results can be counted and the similarity obtained by comparing the differences between the n-grams.
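To illustrate why mixed-character questions need special handling, the following is a deliberately naive tokenizer sketch. It is an assumption made for illustration, not the word segmentation system of the application: it splits LaTeX commands, numbers, and English words into tokens and treats each Chinese character and each remaining symbol as its own token.

```python
import re
from typing import List

# Alternatives are tried left to right: LaTeX command, number, English word,
# single CJK character, then any other non-space character.
TOKEN_PATTERN = re.compile(
    r"\\[A-Za-z]+|\d+(?:\.\d+)?|[A-Za-z]+|[\u4e00-\u9fff]|\S"
)


def tokenize(text: str) -> List[str]:
    """Split mixed Chinese / English / LaTeX text into coarse tokens."""
    return TOKEN_PATTERN.findall(text)
```

A production system would replace the single-character treatment of Chinese text with a proper Chinese word segmenter.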
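Both similarity measures mentioned above can be sketched with the standard library. The sketch assumes bag-of-words count vectors for the cosine similarity and a weighted Jaccard overlap for the n-gram comparison; the application does not fix either choice, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Sequence


def cosine_similarity(tokens_a: Sequence[str], tokens_b: Sequence[str]) -> float:
    """Cosine similarity of the token-count vectors of two token lists."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams (as tuples) of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(tokens_a: Sequence[str], tokens_b: Sequence[str], n: int = 2) -> float:
    """Weighted Jaccard overlap of the n-gram counts of two token lists."""
    ca, cb = ngram_counts(tokens_a, n), ngram_counts(tokens_b, n)
    union = sum((ca | cb).values())          # element-wise maximum of counts
    if union == 0:
        return 0.0
    return sum((ca & cb).values()) / union   # element-wise minimum of counts
```

Cosine similarity ignores token order, while the n-gram overlap is sensitive to it, so the two measures complement each other.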

The present application also discloses a specific implementation of determining the semantic similarity result of the two questions in the question pair according to the detection data of each question in the question pair, which may include the following steps:

S401: Perform semantic word segmentation on the combination of the question surface data and the analysis data in the detection data of each question in the question pair to obtain a first semantic word segmentation result of each question in the question pair.

Although the analyses of repeated questions have extremely similar formats and methods, analyses are varied and difficult to compare, so analysis data cannot be used on its own. Therefore, in the present application, the question surface data and analysis data in the detection data of each question in the question pair can be input together into the semantic word segmentation system to obtain the first semantic word segmentation result of the question.

It should be noted that the semantic word segmentation system is an improvement on the word segmentation system described in S301. Compared with the word segmentation system, the semantic word segmentation system additionally abstracts and normalizes the word segmentation result, and can eliminate word segmentation ambiguity to a certain extent. For example, delta and triangle in the output of the word segmentation system are both represented as triangle in the semantic word segmentation system.

S402: Obtain a semantic similarity result of the two questions in the question pair based on the first semantic word segmentation results of the two questions in the question pair.

In the present application, the first semantic word segmentation results of the two questions in the question pair can be input into the semantic level detection system to obtain the semantic similarity result of the two questions.

The semantic level detection system can construct coding information corresponding to the first semantic word segmentation result of each of the two questions, and map the coding information of the two questions to the semantic similarity result of the two questions.

The coding information corresponding to the first semantic word segmentation results of the two questions can be constructed as follows: first, initial low-dimensional vectors corresponding to the first semantic word segmentation results are obtained using methods such as word2vec, GloVe, and fastText; then, the initial low-dimensional vectors are passed through multiple LSTM layers to obtain the coding information corresponding to the first semantic word segmentation results, which can represent their deep semantics and syntax.
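The normalization role of the semantic word segmentation system can be sketched as a lookup applied to the output of the word segmentation system. The table below is an assumption for illustration: only the delta/triangle pair comes from the text, and the symbol entries are hypothetical additions.

```python
from typing import Dict, List

# Canonical forms for tokens that mean the same thing.
# "delta" -> "triangle" follows the example in the text; the symbol
# entries are hypothetical additions for illustration.
CANONICAL_TOKENS: Dict[str, str] = {
    "delta": "triangle",
    "Δ": "triangle",
    "△": "triangle",
}


def semantic_normalize(tokens: List[str]) -> List[str]:
    """Map each token to its canonical form, leaving unknown tokens unchanged."""
    return [CANONICAL_TOKENS.get(token, token) for token in tokens]
```

After this step, two questions that refer to the same object with different notation produce identical tokens.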

In this application, a specific implementation manner for determining a result of similarity of response distribution of two topics in a topic pair according to detection data of each topic in the topic pair is also disclosed, and the method includes the following steps:

s501: and performing semantic word segmentation on the student response data in the detection data of each question in the question pair to obtain a second semantic word segmentation result of each question in the question pair.

In the present application, the student response data in the detection data of each question in the question pair can be input into the semantic word segmentation system to obtain the second semantic word segmentation result of each question in the question pair. For the semantic word segmentation system, refer to the related descriptions in S301 and S401; the details are not repeated here.

S502: obtaining an answer distribution similarity result of the two questions in the question pair based on the question surface word segmentation results and the second semantic word segmentation results of the two questions in the question pair.

For repeated questions, when the specific application background is disregarded, the answer to one question can be regarded as the answer to the other. In other words, across the multiple answer results of the two questions, the distribution of error types or error logic in the wrong answers is similar, and, if there are multiple correct answers, the distribution of correct answer types or correct answer logic is also similar.

As one possible implementation, the question surface word segmentation results and the second semantic word segmentation results of the two questions in the question pair can be input into the student response detection system to obtain the answer distribution similarity result of the two questions in the question pair.

Because students answer in diverse ways, the manner of answering is strongly influenced by the individual, and noise may mask important information during encoding, it is not advisable to determine the answer distribution similarity of the two questions in the question pair by directly comparing the students' answers to the two questions. Moreover, the student responses to each question in the question pair can include both correct responses and wrong responses, and the correct responses and the wrong responses each exhibit a similar distribution.

Therefore, in the present application, the student response detection system can obtain a wrong answer distribution similarity result and a correct answer distribution similarity result for each question in the question pair based on the question surface word segmentation result and the second semantic word segmentation result of each question in the question pair, and take the wrong answer distribution similarity results and the correct answer distribution similarity results of the two questions as the answer distribution similarity result of the two questions in the question pair.

For ease of understanding, assume that the question pair contains question A and question B, that the correct student response data of question A is A_1 ~ A_k and its wrong student response data is A_{k+1} ~ A_n, and that the correct student response data of question B is B_1 ~ B_j and its wrong student response data is B_{j+1} ~ B_m.

For each correct student response data A_1 ~ A_k of question A, the vector code of the correct student response data is obtained through a multi-layer LSTM_right; attention is calculated between the vector code and each word segmentation code in the question surface word segmentation result of question A to obtain the attention calculation result of the correct student response data; and the attention calculation result is normalized to obtain the distribution of the attention calculation result of the correct student response data. The distributions of the attention calculation results of all the correct student response data are averaged to obtain the information distribution dr_A of question A corresponding to the correct student response data of question A;

For each wrong student response data A_{k+1} ~ A_n of question A, the vector code of the wrong student response data is obtained through a multi-layer LSTM_wrong; attention is calculated between the vector code and each word segmentation code in the question surface word segmentation result of question A to obtain the attention calculation result of the wrong student response data; and the attention calculation result is normalized to obtain the distribution of the attention calculation result of the wrong student response data. The distributions of the attention calculation results of all the wrong student response data are averaged to obtain the information distribution dw_A of question A corresponding to the wrong student response data of question A;

For each correct student response data B_1 ~ B_j of question B, the vector code of the correct student response data is obtained through the multi-layer LSTM_right; attention is calculated between the vector code and each word segmentation code in the question surface word segmentation result of question A to obtain the attention calculation result of the correct student response data; and the attention calculation result is normalized to obtain the distribution of the attention calculation result of the correct student response data. The distributions of the attention calculation results of all the correct student response data are averaged to obtain the information distribution dr_B of question A corresponding to the correct student response data of question B;

For each wrong student response data B_{j+1} ~ B_m of question B, the vector code of the wrong student response data is obtained through the multi-layer LSTM_wrong; attention is calculated between the vector code and each word segmentation code in the question surface word segmentation result of question A to obtain the attention calculation result of the wrong student response data; and the attention calculation result is normalized to obtain the distribution of the attention calculation result of the wrong student response data. The distributions of the attention calculation results of all the wrong student response data are averaged to obtain the information distribution dw_B of question A corresponding to the wrong student response data of question B;

The Kullback-Leibler distance between dr_A and dr_B is calculated to obtain the correct answer distribution similarity result of question A, and the Kullback-Leibler distance between dw_A and dw_B is calculated to obtain the wrong answer distribution similarity result of question A.

Similarly, the correct answer distribution similarity result of question B and the wrong answer distribution similarity result of question B can be obtained.
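The attention, normalization, averaging and Kullback-Leibler steps for the correct responses of question A can be sketched as follows. This is a minimal sketch under stated assumptions: dot-product attention and softmax normalization are assumed (the application does not name the attention function), the small `eps` guard is an added numerical safeguard, and the numeric vectors are hypothetical stand-ins for the outputs of LSTM_right and for the word segmentation codes of question A's question surface. All function names are illustrative.

```python
import math


def softmax(scores):
    """Normalize raw attention scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_distribution(response_vec, surface_token_vecs):
    """Dot-product attention of one encoded response over the question surface tokens."""
    scores = [sum(r * t for r, t in zip(response_vec, tok)) for tok in surface_token_vecs]
    return softmax(scores)


def information_distribution(response_vecs, surface_token_vecs):
    """Average the per-response attention distributions (dr or dw in the text)."""
    dists = [attention_distribution(r, surface_token_vecs) for r in response_vecs]
    return [sum(d[i] for d in dists) / len(dists) for i in range(len(dists[0]))]


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler distance KL(p || q); eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical encodings: three surface tokens of question A, two correct
# responses for each of question A and question B.
surface_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
correct_a = [[0.9, 0.1], [1.0, 0.0]]
correct_b = [[0.8, 0.2], [1.0, 0.1]]

dr_a = information_distribution(correct_a, surface_a)
dr_b = information_distribution(correct_b, surface_a)  # B's responses attend to A's surface
correct_similarity_a = kl_divergence(dr_a, dr_b)
```

A smaller value of `correct_similarity_a` indicates that the two sets of correct responses draw on the question surface of question A in a more similar way. The same functions apply to the wrong responses, with encodings from LSTM_wrong, to obtain the distance between dw_A and dw_B.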

It should be noted that, for repeated questions, the effective information of the questions is the same. Because the question information used to answer correctly is similar, the question information distribution corresponding to the correct responses can be obtained from the interaction between the correct responses and the question, and whether the two questions are repeated can be judged by comparing the difference between the question information distributions obtained when the correct responses of the two questions each interact with the same question. A similar determination can be made using the wrong responses.

The present application also discloses a specific implementation for determining whether the two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result and the answer distribution similarity result, which can be as follows:

inputting the word similarity result, the semantic similarity result and the answer distribution similarity result into a classification model to determine whether the two questions in the question pair are repeated questions.

It should be noted that the classification model is obtained by training with the word similarity result, the semantic similarity result and the answer distribution similarity result of the two training questions in each training question pair as training samples, and with labeling information identifying whether the two questions in the training question pair are repeated questions as sample labels.

The classification model may be a multi-layer perceptron. The output of the classification model can be the probability that the two questions in the question pair are repeated questions. When determining whether the two questions in the question pair are repeated questions based on the classification model, it can be judged whether the probability output by the classification model is greater than a preset threshold: if the probability is greater than the preset threshold, the two questions in the question pair are determined to be repeated questions; if the probability is less than or equal to the preset threshold, the two questions in the question pair are determined not to be repeated questions.
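The threshold decision described above can be illustrated with a tiny multi-layer perceptron forward pass. This is a sketch only: the weights below are hypothetical hand-picked values rather than trained parameters, the feature layout (word similarity, semantic similarity, Kullback-Leibler distance for correct responses, Kullback-Leibler distance for wrong responses) is an assumption, and the default threshold of 0.5 is illustrative.

```python
import math

# Hypothetical parameters; a real model would learn them from labeled
# training question pairs.
W_HIDDEN = [[1.5, 1.0, -0.5, -0.5], [0.5, 2.0, -1.0, -1.0]]
B_HIDDEN = [0.0, -0.5]
W_OUT = [1.2, 1.6]
B_OUT = -2.0


def relu(x):
    return x if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def duplicate_probability(features):
    """One hidden layer with ReLU, then a sigmoid output in (0, 1)."""
    hidden = [
        relu(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(W_HIDDEN, B_HIDDEN)
    ]
    return sigmoid(sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT)


def is_repeated(features, threshold=0.5):
    """Repeated if the probability is strictly greater than the preset threshold."""
    return duplicate_probability(features) > threshold


similar_pair = [0.95, 0.90, 0.01, 0.02]    # high similarities, small KL distances
dissimilar_pair = [0.10, 0.20, 1.50, 2.00]  # low similarities, large KL distances
```

With these illustrative weights, `is_repeated(similar_pair)` is true and `is_repeated(dissimilar_pair)` is false. A probability exactly equal to the threshold is treated as not repeated, matching the rule stated above.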

It should be further noted that, as shown in fig. 2, in the present application the word segmentation system, the semantic word segmentation system, the word level detection system, the semantic level detection system and the student response detection system may be combined into a similarity detection model, and the similarity detection model and the classification model may be combined into a repeated question detection model. When repeated questions are detected based on the repeated question detection model, the word segmentation system, the semantic word segmentation system, the word level detection system, the semantic level detection system and the student response detection system in the model perform the step of determining the word similarity result, the semantic similarity result and the answer distribution similarity result of the two questions in the question pair according to the detection data of each question in the question pair, and the classification model performs the step of determining whether the two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result and the answer distribution similarity result.
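The composition of a similarity detection model and a classification model into one repeated question detection model can be sketched as a small wrapper around two callables. The class name `RepeatedQuestionDetector`, the `Question` record type and the default threshold are hypothetical; the sketch shows only the wiring, not the internals of either model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Question = Dict[str, object]   # hypothetical record: surface, analysis, responses
Features = Tuple[float, ...]   # word, semantic and answer distribution results


@dataclass
class RepeatedQuestionDetector:
    """Chains the similarity detection model and the classification model."""

    similarity_model: Callable[[Question, Question], Features]
    classifier: Callable[[Features], float]
    threshold: float = 0.5

    def detect(self, question_a: Question, question_b: Question) -> bool:
        # Step 1: similarity results from the detection data of both questions.
        features = self.similarity_model(question_a, question_b)
        # Step 2: probability from the classifier, compared with the threshold.
        return self.classifier(features) > self.threshold


# Stub components, for illustration only.
def stub_similarity(qa: Question, qb: Question) -> Features:
    same = 1.0 if qa["surface"] == qb["surface"] else 0.0
    return (same, same, same)


def stub_classifier(features: Features) -> float:
    return sum(features) / len(features)


detector = RepeatedQuestionDetector(stub_similarity, stub_classifier)
```

Keeping the two stages behind plain callables means either one can be replaced, for example by a trained model, without changing the detection step.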

The following describes the repeated question detection apparatus disclosed in the embodiments of the present application. The repeated question detection apparatus described below and the repeated question detection method described above may be referred to in correspondence with each other.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a repeated question detection apparatus disclosed in an embodiment of the present application. As shown in fig. 3, the repeated question detection apparatus may include:

an obtaining unit 21, configured to obtain a question pair to be subjected to repeated question detection;

a detection data determining unit 22, configured to determine detection data of each question in the question pair, where the detection data includes question surface data, analysis data, and student response data;

a similarity determining unit 23, configured to determine a word similarity result, a semantic similarity result, and a response distribution similarity result of two topics in the topic pair according to the detection data of each topic in the topic pair;

and the repeated question determining unit 24 is configured to determine whether two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result, and the answer distribution similarity result.

Optionally, the obtaining unit includes:

wherein the detection data determining unit includes:

or the like, or, alternatively,

Optionally, the similarity determining unit includes:

and the semantic similarity result determining unit is used for obtaining a semantic similarity result of the two questions in the question pair based on a first semantic segmentation result of the two questions in the question pair.

Optionally, the similarity determining unit includes:

Optionally, the repeated question determining unit is specifically configured to:

Fig. 4 is a block diagram of the hardware structure of a repeated question detection system disclosed in an embodiment of the present application. Referring to fig. 4, the hardware structure of the repeated question detection system may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;

in the embodiment of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;

the processor 1 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application;

the memory 3 may include a high-speed RAM, and may further include a non-volatile memory, such as at least one disk memory;

wherein the memory stores a program and the processor can call the program stored in the memory, the program for:

obtaining a question pair to be subjected to repeated question detection;

determining a word similarity result, a semantic similarity result and a response distribution similarity result of two questions in the question pair according to the detection data of each question in the question pair;

Optionally, for the refined functions and the extended functions of the program, reference may be made to the description above.

Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:

obtaining a question pair to be subjected to repeated question detection;

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for detecting repeated questions, comprising:

obtaining a question pair to be subjected to repeated question detection;

determining whether two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result and the answer distribution similarity result;

determining a word similarity result, a semantic similarity result and a response distribution similarity result of two topics in the topic pair according to the detection data of each topic in the topic pair, wherein the determining comprises the following steps:

determining the word similarity result of the two questions in the question pair according to the question surface data of each question in the question pair, determining the semantic similarity result of the two questions in the question pair according to the question surface data and the analysis data of each question in the question pair, and determining the answer distribution similarity result of the two questions in the question pair according to the question surface data and the student response data of each question in the question pair.

2. The method of claim 1, wherein the determining detection data for each topic in the topic pair comprises:

acquiring original data of each question in the question pair;

or the like, or a combination thereof,

3. The method of claim 1, wherein determining a word similarity result for two topics in the topic pair based on the detected data for each topic in the topic pair comprises:

and obtaining a word similarity result of the two topics in the topic pair based on the topic segmentation results of the two topics in the topic pair.

4. The method of claim 1, wherein determining semantic similarity results for two topics in the topic pair based on the detection data for each topic in the topic pair comprises:

and obtaining a semantic similarity result of the two questions in the question pair based on the first semantic segmentation result of the two questions in the question pair.

5. The method of claim 3, wherein determining a distribution similarity result for answers to two topics in the topic pair based on the detected data for each topic in the topic pair comprises:

6. The method of claim 5, wherein obtaining a similarity result of response distribution of two topics in the topic pair based on a topic segmentation result and a second semantic segmentation result of the two topics in the topic pair comprises:

7. The method of claim 1, wherein the determining whether two topics in the topic pair are duplicates based on the word similarity result, the semantic similarity result, and the answer distribution similarity result comprises:

8. An apparatus for detecting repeated questions, the apparatus comprising:

the detection data determining unit is used for determining detection data of each question in the question pairs, and the detection data comprises question surface data, analysis data and student answering data;

a repeated question determining unit, configured to determine whether two questions in the question pair are repeated questions based on the word similarity result, the semantic similarity result, and the answer distribution similarity result;

the similarity determining unit is specifically configured to determine the word similarity result of the two questions in the question pair according to the question surface data of each question in the question pair, determine the semantic similarity result of the two questions in the question pair according to the question surface data and the analysis data of each question in the question pair, and determine the answer distribution similarity result of the two questions in the question pair according to the question surface data and the student response data of each question in the question pair.

9. A repeated question detection system, comprising a memory and a processor;

the memory is used for storing programs;

the processor, configured to execute the program, and implement the steps of the method according to any one of claims 1 to 7.

10. A readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for detecting repeated questions according to any one of claims 1 to 7.