CN113392196B - Question retrieval method and system based on multi-mode cross comparison - Google Patents

Question retrieval method and system based on multi-mode cross comparison

Info

Publication number
CN113392196B
CN113392196B (application CN202110622823.8A)
Authority
CN
China
Prior art keywords
topic
similarity
subject
representation
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110622823.8A
Other languages
Chinese (zh)
Other versions
CN113392196A (en)
Inventor
余胜泉
陈鹏鹤
刘杰飞
徐琪
陈玲
卢宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Normal University
Original Assignee
Beijing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Normal University filed Critical Beijing Normal University
Priority to CN202110622823.8A priority Critical patent/CN113392196B/en
Publication of CN113392196A publication Critical patent/CN113392196A/en
Application granted granted Critical
Publication of CN113392196B publication Critical patent/CN113392196B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a topic retrieval system and method based on multi-modal cross comparison. The system comprises a topic data analysis module, a topic similarity calculation module and a result output module. The topic data analysis module receives topic information input by a user and performs preprocessing and structured arrangement. The topic similarity calculation module cross-calculates the similarities between the text and picture representations of the topic input by the user and those of each topic in the topic library, and combines them into a weighted comprehensive similarity. The result output module returns to the user the topics, answers and related information from the topic library whose comprehensive similarity is greater than a preset subject threshold. The system makes the retrieval results for each subject in the topic library more accurate.

Description

Question retrieval method and system based on multi-mode cross comparison
Technical Field
The invention relates to the technical field of computers, in particular to a topic retrieval method and system based on multi-mode cross comparison.
Background
In recent years, with the development of the Internet and artificial intelligence technology, question-and-answer systems have developed rapidly and provide great help for personalized teaching. Being able to quickly and accurately retrieve questions from a question bank that are the same as or similar to a user-entered question, and then return their answers, is becoming increasingly important.
Currently, topic retrieval systems are generally implemented by comparing the text similarity between topics: the user sends text describing the topic information to the retrieval system, which compares it with the topic texts in the topic database, selects the topic with the highest similarity and returns it as the retrieval result. If the topic information input by the user is in the form of a picture, topic similarity is instead compared via picture similarity.
At present, text similarity calculation methods fall mainly into two types: character-based methods and vector-space methods.
Character-based methods, such as the traditional edit distance, Hamming distance, Jaccard and LCS, evaluate text similarity by directly comparing the shared characters and their ordering between two text strings.
Vector-space methods, such as TF-IDF and BM25, represent the texts as vectors and compute their cosine similarity, or compare the similarity between texts directly with a neural network.
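For illustration only, a minimal sketch of the vector-space approach follows (the use of scikit-learn and whitespace tokenization are assumptions, not part of the invention; Chinese text would additionally require word segmentation):

# Minimal sketch of vector-space text similarity (TF-IDF vectors + cosine similarity).
# Illustrates the prior-art approach only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_cosine(text_a: str, text_b: str) -> float:
    vectors = TfidfVectorizer().fit_transform([text_a, text_b])   # 2 x vocabulary matrix
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

print(tfidf_cosine("vector addition of two forces", "addition of two force vectors"))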
With the development of multimedia, users now more often describe topic information with text combined with pictures, rather than with plain text alone.
Currently, mainstream topic retrieval services and systems on the market only support retrieval of text topics or of picture topics, such as "search topic" (https://www.xuesai.cn/souti/), as shown in fig. 1 and fig. 2.
The prior art has mainly the following problems, which are described and illustrated below.
(1) When only text content can be entered, users often do not know how to express the topic, or express it unclearly, especially for mathematics or physics topics, or text-and-picture topics that refer to a drawing.
(2) When only picture content can be entered, a correct answer may be returned, yet the answer may still not meet the user's needs. For example, as shown in fig. 2, a user who is unfamiliar with the triangle rule of vector addition used in the answer may still not know how to solve the question even after seeing the answer. If the user could also input their specific requirement, they could be helped better.
(3) There are also question-answering systems that let users input a topic text part and a topic picture part at the same time, but the picture part is converted into text by picture text recognition, the recognized text is simply spliced onto the topic text, and the text similarity between topics is then compared. Because picture text recognition is imperfect, topic content can be recognized incorrectly; for example, the same topic photographed under different lighting or angles may yield different recognition results, so that two originally identical topics are judged to be dissimilar due to recognition errors.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a topic retrieval method and system based on multi-modal cross comparison, in which (1) a topic data analysis module structures and arranges the topic input by the user, (2) a topic similarity calculation module calculates the similarity between the input topic and the candidate topics one by one, and (3) a result output module returns the result information to the user.
The invention is realized by the following technical scheme.
According to an aspect of the present invention, there is provided a topic retrieval system based on multi-modal cross comparison, comprising: a topic data analysis module, a topic similarity calculation module and a result output module; wherein:
the topic data analysis module is used for receiving topic information input by a user and preprocessing the topic information;
the topic similarity calculation module is used for calculating the similarity between the topic input by the user and the topic in the topic library;
and the result output module is used for returning the questions in the question bank with the similarity larger than a preset subject threshold value to the user.
Further, the topic similarity calculation module:
a. splices the cleaned topic title text and the cleaned topic content text as the text representation, and uses the recognized text of the topic picture together with the topic picture as the picture representation;
b. denotes the text and picture representations of the topic input by the user as T1 and P1, and those of a topic in the topic library as T2 and P2, and calculates the similarities between (T1, T2), (T1, P2), (P1, T2) and (P1, P2), denoted S1, S2, S3 and S4 respectively;
c. calculates the comprehensive similarity s.
Further, the similarity S1 is calculated with the Jaccard method, and the similarities S2, S3 and S4 with cosine similarity. Preferably, the recognized text of the topic picture is converted into a vector representation by the BERT model, the topic picture is converted into a vector by the LeNet convolutional network model, and the two vectors are spliced to form the vectorized representation of the picture representation.
Further, the calculation formula of the comprehensive similarity s is as follows:
s = (w1·S1 + w2·S2 + w3·S3 + w4·S4) / 4
where (w1, w2, w3, w4) are the subject weights.
Further, the subject weights are:
Subject      Weights (w1, w2, w3, w4)      Subject        Weights (w1, w2, w3, w4)
Chinese      5.5, 2, 2, 0.5                Physics        5, 1, 1, 3
History      5, 2, 2, 1                    Mathematics    4, 2, 2, 2
Geography    5, 2, 2, 1                    Chemistry      5, 1, 1, 3
Politics     4, 2, 2, 2                    Biology        4, 2, 2, 2
Further, in the result output module, if the comprehensive similarity is greater than the subject threshold corresponding to the subject of the topic input by the user, the topic in the topic library is taken as a candidate topic.
Further, the subject threshold is:
Subject      Threshold      Subject        Threshold
Chinese      0.8            Physics        0.7
History      0.8            Mathematics    0.5
Geography    0.7            Chemistry      0.5
Politics     0.6            Biology        0.5
According to another aspect of the present invention, there is provided a topic retrieval method based on multi-modal cross comparison, including:
step 1, receiving question information input by a user;
step 2, calculating the similarity between the received questions and the questions in the question bank;
step 3, returning the questions in the question bank whose similarity is greater than the topic threshold to the user as candidate questions.
Further, the step 2 includes:
a. acquiring text representation and picture representation of the topic information;
b. the text representation and picture representation of the question are denoted T1 and P1, those of a question in the question bank are denoted T2 and P2, and the similarities S1, S2, S3 and S4 of the four content pairs are cross-compared; S1 is calculated with the Jaccard method; S2, S3 and S4 are calculated with cosine similarity;
c. calculating the comprehensive similarity s.
Further, in step 3, the questions in the question bank whose comprehensive similarity is greater than the subject threshold corresponding to the user-input question are used as candidate questions.
Compared with the prior art, the invention has the beneficial effects that:
(1) Compared with directly comparing the text similarity or the picture similarity of topics, the multi-modal cross comparison method adopted for topic retrieval greatly improves the accuracy of topic comparison.
(2) When combining the four cross-compared similarities into the comprehensive topic similarity, different weight combinations are set, based on experience, for topics of different subjects, which improves the comparison accuracy for topics of different subjects.
(3) The comprehensive similarity of the topic comparison is compared against preset thresholds that differ per subject when judging topic similarity, which further improves the accuracy of topic comparison.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an interface schematic diagram of a prior art question-answering system;
FIG. 2 is another interface schematic of a prior art question-answering system;
FIG. 3 is a schematic diagram of a topic retrieval system in accordance with one embodiment of the present invention;
FIG. 4 is a flow diagram of topic data parsing in accordance with one embodiment of the present invention;
FIG. 5 is a schematic diagram of a cross-comparison in accordance with one embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments will be clearly and fully described below with reference to the accompanying drawings; it is apparent that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
The invention provides a topic retrieval system based on multi-modal cross comparison, which can be applied to a question-and-answer system: it receives the topic input by a user, retrieves in a topic library, and returns the same or similar topics and their answers to the user, helping the user complete a learning task. As shown in fig. 3, the system includes the following modules:
(1) The topic data analysis module receives topic information input by the user, including subject information, topic title, topic description, topic picture and question type. The topic data analysis module can be deployed in a front-end web page or in a mobile APP.
After receiving the topic data, the topic data analysis module divides it into topic text data and topic picture data and preprocesses each separately. The topic data parsing flow is shown in fig. 4.
The preprocessing of the topic text data comprises text cleaning and word segmentation. Text cleaning includes unifying the character encoding and removing illegal characters, emoji, emoticon symbols, HTML tags and other markup, and invalid characters.
The invalid characters are characters and phrases that, after analysis and statistics of the topic data in the topic library, are found to be irrelevant to the topic content; they mainly include the following:
thank you, solve online, please, teacher, help, look at, help, will not, solve, each position, help, hard, you good, trouble, solve, help me talk, send, this hope can help me solve, answer how to do the question, how to answer, help me, check, be less likely, be less understood, and be more difficult to understand is not clear, not understood, not clear, explaining, helping me see, worry, solve, hope for teacher help to see, teach me, not understand questions, how to write, how to solve, how to ask, how to train, how to do in detail, do nothing to do, do not understand, read not understand questions, help me analysis, ask, speak to me, help me speak, do not know, not particularly understand, do not know, do not have thinking, do not think, speak, do not know, point to point, ask for instruction, wish to do detailed, mould, volume, middle school, examination, end of period, notification, version.
The topic picture data is the question picture and requires picture text recognition: the topic information in the picture is recognized as text by an existing picture recognition service, and the data are then structured and arranged.
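A minimal preprocessing sketch is given below; the regular expressions, the abbreviated invalid-phrase list and the OCR call are placeholders and assumptions, since the patent only states that an existing picture recognition service is used:

# Sketch of topic-data preprocessing: text cleaning plus picture text recognition.
import re

INVALID_PHRASES = ["thank you", "please teacher", "help me", "how to do this"]  # abbreviated placeholder list

def clean_topic_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML tags and other markup
    text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)  # strip emoji / pictographic symbols
    for phrase in INVALID_PHRASES:                       # drop phrases unrelated to the topic content
        text = text.replace(phrase, "")
    text = re.sub(r"\s+", " ", text).strip()             # normalize whitespace
    return text                                          # word segmentation would follow in practice

def recognize_picture_text(picture_url: str) -> str:
    # Placeholder for the existing picture text recognition (OCR) service mentioned above.
    raise NotImplementedError("call an OCR service here")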
After preprocessing, the topic text data and topic picture data are combined into a topic data record with the structure shown below; the picture information stored in the record is the picture address.
{
"question_id": unique identifier of the topic data,
"question_title_original": original topic title text,
"question_content_original": original topic content text,
"question_type": question type,
"question_create_time": topic creation time,
"question_subject": subject information,
"question_pic": topic picture information (picture address),
"question_pic_content": text recognized from the topic picture,
"question_title_clean": topic title text after cleaning,
"question_content_clean": topic content text after cleaning
}
Examples:
{
"question_id": "00063274-1008-4525-9b7d-9f297f72bde1",
"question_title_original": "Teacher, please help me with this question about 'Inscription on the Humble Room' (陋室铭)?",
"question_content_original": "Analyze the material in the question: 'A mountain need not be high; it is famous if an immortal dwells there.' What thought does the author want to express?",
"question_type": "free-response question",
"question_create_time": "2019-06-27 20:58:28",
"question_pic": "https://cs.101.com/v0.1/download/actions/direct?dentryId=9e563a62-a96a-49e2-a238-a810b550b001&serviceName=fep",
"question_subject": "Chinese",
"question_pic_content": "Inscription on the Humble Room: A mountain need not be high; it is famous if an immortal dwells there. Water need not be deep; it is numinous if a dragon lives in it. This is a humble room, but my virtue makes it fragrant. Moss traces climb the steps green; grass colour enters the screen green. Learned scholars converse and laugh here; no unlettered folk come and go.",
"question_title_clean": "about 'Inscription on the Humble Room'",
"question_content_clean": "Analyze the material in the question: 'A mountain need not be high; it is famous if an immortal dwells there.' What thought does the author want to express"
}
(2) The topic similarity calculation module compares the topic data with the topic data in the topic library one by one and returns the same or similar topic information to the user. As shown in fig. 5, the specific comparison method is as follows:
a. The cleaned topic title text and the cleaned topic content text are spliced as the text representation, and the recognized text of the topic picture together with the topic picture itself forms the picture representation.
b. The text and picture representations of topic 1 are denoted T1 and P1, and those of topic 2 are denoted T2 and P2; the similarities of the four content pairs (T1, T2), (T1, P2), (P1, T2) and (P1, P2) are denoted S1, S2, S3 and S4 respectively.
When calculating the similarity S1, i.e. comparing the similarity between the topic text representations, the Jaccard method is used: the text representations to be compared are first segmented into words, stop words and punctuation marks are removed, and the similarity is computed from the intersection and union of the word sets of the two texts, using the formula below.
J(A, B) = |A ∩ B| / |A ∪ B|
where J(A, B) ∈ [0, 1].
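A minimal sketch of the S1 calculation under these definitions (the word segmenter and the abbreviated stop-word list are placeholders; the patent does not name specific ones):

# Sketch of similarity S1: Jaccard similarity between two segmented text representations.
STOP_WORDS = {"的", "了", "是"}                       # abbreviated placeholder stop-word list
PUNCTUATION = {",", ".", "?", "，", "。", "？"}

def segment(text: str) -> list:
    # Placeholder: a Chinese word segmenter (e.g. jieba) would be used here.
    return text.split()

def jaccard_similarity(text_a: str, text_b: str) -> float:
    words_a = {w for w in segment(text_a) if w not in STOP_WORDS | PUNCTUATION}
    words_b = {w for w in segment(text_b) if w not in STOP_WORDS | PUNCTUATION}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)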
When calculating the similarities S2 and S3, i.e. the similarity between a text representation and a picture representation, the text representation is first vectorized by the BERT model, and the picture representation is then vectorized as well: the recognized text of the topic picture is converted into a vector representation by the BERT model, the topic picture is converted into a vector by the LeNet convolutional network model, and the two vectors are spliced as the vectorized representation of the picture representation. The similarity between the vectors is then computed by cosine similarity, using the formula below.
cos(A, B) = (A · B) / (‖A‖ · ‖B‖) = Σi Ai·Bi / (√(Σi Ai²) · √(Σi Bi²))
When calculating the similarity S4, i.e. comparing the picture representations, the two topic pictures are vectorized in the same way and the cosine similarity between them is computed.
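A sketch of how S2, S3 and S4 could be computed is shown below. The BERT checkpoint, the exact LeNet architecture, the pooling choice and the zero-padding used to align vector dimensions are all assumptions; the patent only names BERT, LeNet, vector splicing and cosine similarity.

# Sketch of similarities S2-S4: vectorize the text and picture representations with
# BERT and LeNet, splice the picture vectors, and compare with cosine similarity.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # checkpoint is an assumption
bert = BertModel.from_pretrained("bert-base-chinese")

class LeNet(nn.Module):
    # Small LeNet-style encoder mapping a 1x32x32 grayscale picture to a fixed-length vector.
    def __init__(self, out_dim: int = 84):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2))
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.ReLU(), nn.Linear(120, out_dim))

    def forward(self, x):
        return self.fc(self.features(x))

lenet = LeNet()
PIC_DIM = 84   # LeNet output size, used below for dimension alignment

def text_vector(text: str) -> torch.Tensor:
    # BERT [CLS] embedding as the vectorized text representation (768 dimensions).
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return bert(**inputs).last_hidden_state[:, 0].squeeze(0)

def picture_vector(ocr_text: str, picture: torch.Tensor) -> torch.Tensor:
    # Picture representation: BERT vector of the recognized text spliced with the LeNet picture vector.
    with torch.no_grad():
        return torch.cat([text_vector(ocr_text), lenet(picture).squeeze(0)])

def padded_text_vector(text: str) -> torch.Tensor:
    # Zero-pad the text vector so it can be compared with the spliced picture vector;
    # how the dimensions are aligned is not specified in the patent, so this is an assumption.
    return torch.cat([text_vector(text), torch.zeros(PIC_DIM)])

def cosine(u: torch.Tensor, v: torch.Tensor) -> float:
    return float(nn.functional.cosine_similarity(u.unsqueeze(0), v.unsqueeze(0)))

# S2 = cosine(padded_text_vector(T1), picture_vector(P2_text, P2_image)); S3 and S4 analogously.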
After the cross-compared similarity results are obtained, different weights wi, determined from prior experience after statistical analysis, are assigned to the different similarity values, and the comprehensive similarity of the two topics to be compared is expressed as:
s = (w1·S1 + w2·S2 + w3·S3 + w4·S4) / 4
For each subject, the corresponding weights are set as follows (the subject weights may be set empirically, or may be obtained through training of a neural network):
Table 1: Weight settings
Subject      Weights (w1, w2, w3, w4)      Subject        Weights (w1, w2, w3, w4)
Chinese      5.5, 2, 2, 0.5                Physics        5, 1, 1, 3
History      5, 2, 2, 1                    Mathematics    4, 2, 2, 2
Geography    5, 2, 2, 1                    Chemistry      5, 1, 1, 3
Politics     4, 2, 2, 2                    Biology        4, 2, 2, 2
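Using the weights of Table 1 and the four cross-compared similarities, a minimal sketch of the comprehensive similarity (the English subject names are translations; the division by 4 follows the formula above):

# Sketch of the comprehensive similarity: per-subject weighted combination of S1-S4.
SUBJECT_WEIGHTS = {                                # (w1, w2, w3, w4) from Table 1
    "Chinese": (5.5, 2, 2, 0.5), "History": (5, 2, 2, 1),
    "Geography": (5, 2, 2, 1),   "Politics": (4, 2, 2, 2),
    "Physics": (5, 1, 1, 3),     "Mathematics": (4, 2, 2, 2),
    "Chemistry": (5, 1, 1, 3),   "Biology": (4, 2, 2, 2),
}

def comprehensive_similarity(subject: str, s1: float, s2: float, s3: float, s4: float) -> float:
    w1, w2, w3, w4 = SUBJECT_WEIGHTS[subject]
    return (w1 * s1 + w2 * s2 + w3 * s3 + w4 * s4) / 4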
The advantage of the cross comparison method is that, compared with previous methods, the topic text content and the topic picture content are compared separately, so the topic content is analysed in a more refined way during comparison, which improves the accuracy of the comparison.
The following examples are illustrative:
Title 1:
Topic content: Analyze the material in the question: "A mountain need not be high; it is famous if an immortal dwells there." What thought does the author want to express?
Topic picture text content: Inscription on the Humble Room (陋室铭): A mountain need not be high; it is famous if an immortal dwells there. Water need not be deep; it is numinous if a dragon lives in it. This is a humble room, but my virtue makes it fragrant. Moss traces climb the steps green; grass colour enters the screen green. Learned scholars converse and laugh here; no unlettered folk come and go.
Title 2:
Topic content: What does "Moss traces climb the steps green; grass colour enters the screen green" mean?
Topic picture text content: Inscription on the Humble Room (陋室铭): A mountain need not be high; it is famous if an immortal dwells there. Water need not be deep; it is numinous if a dragon lives in it. This is a humble room, but my virtue makes it fragrant. Moss traces climb the steps green; grass colour enters the screen green.
If the topic text content and the topic picture text content are directly spliced and Jaccard is then used, the content similarity of title 1 and title 2 is 0.68.
After cross comparison, the similarity between the text content of title 1 and the text content of title 2 is 0.05, between the picture content of title 1 and the text content of title 2 is 0.14, between the text content of title 1 and the picture content of title 2 is 0.19, and between the picture content of title 1 and the picture content of title 2 is 0.95. With the preset weights (5.5, 2, 2, 0.5) for the Chinese subject, the comprehensive similarity of title 1 and title 2 is (5.5×0.05 + 2×0.14 + 2×0.19 + 0.5×0.95)/4 ≈ 0.35, indicating that the two are only weakly similar or dissimilar. A human judge would also consider the two topics dissimilar, so the cross-comparison result of about 0.35 reflects the topic similarity better than the 0.68 obtained by the previous method.
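Plugging the four similarities of this example into the formula with the Chinese-subject weights reproduces the result:

# Arithmetic check of the example above (Chinese-subject weights 5.5, 2, 2, 0.5).
w = (5.5, 2, 2, 0.5)
s = (0.05, 0.14, 0.19, 0.95)
print(sum(wi * si for wi, si in zip(w, s)) / 4)   # 0.3525, well below the 0.8 threshold for Chinese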
(3) The result output module compares the calculated comprehensive similarity with the preset subject threshold; different comparison thresholds are set for different subjects, as shown in Table 2. If the comprehensive similarity is greater than the threshold of the subject of the topic input by the user, the topic in the topic library is considered similar to the input topic and is taken as a candidate topic. The thresholds may also be obtained by training an artificial intelligence network on a dataset. After all topics in the topic library have been compared, the candidate topics are sorted by comprehensive similarity and the top five are returned to the user; if there are fewer than five candidates, all qualifying candidates are returned; if there are none, the user is prompted that no similar topic was found, and the topic input by the user is published as a new topic.
Table 2: Threshold settings
Subject      Threshold      Subject        Threshold
Chinese      0.8            Physics        0.7
History      0.8            Mathematics    0.5
Geography    0.7            Chemistry      0.5
Politics     0.6            Biology        0.5
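A minimal sketch of the result output logic (the data shapes and English subject names are assumptions; the thresholds are taken from Table 2):

# Sketch of the result output module: filter by subject threshold, sort, return at most five.
SUBJECT_THRESHOLDS = {
    "Chinese": 0.8, "History": 0.8, "Geography": 0.7, "Politics": 0.6,
    "Physics": 0.7, "Mathematics": 0.5, "Chemistry": 0.5, "Biology": 0.5,
}

def select_candidates(subject: str, scored_bank: list) -> list:
    # scored_bank: (topic record, comprehensive similarity) pairs for every topic in the library.
    threshold = SUBJECT_THRESHOLDS[subject]
    candidates = [(topic, s) for topic, s in scored_bank if s > threshold]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    # An empty result means "no similar topic found": the caller prompts the user
    # and publishes the input topic as a new topic.
    return [topic for topic, _ in candidates[:5]]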
According to another aspect of the present invention, there is provided a topic retrieval method based on multi-modal cross comparison, including:
step 1, receiving question information input by a user;
step 2, calculating the similarity between the received questions and the questions in the question bank;
step 3, returning the questions in the question bank whose similarity is greater than the topic threshold to the user as candidate questions.
In step 1, after the topic data is received, it is first divided into topic text data and topic picture data, which are preprocessed respectively. The specific process is as described for the topic data analysis module.
In step 2: a. the text representation and picture representation of the topic information input by the user are obtained; b. the text and picture representations of the input topic are denoted T1 and P1, those of a topic in the topic library are denoted T2 and P2, and the similarities S1, S2, S3 and S4 of the four content pairs are cross-compared; S1 is calculated with the Jaccard method; for S2 and S3, the text representation is first vectorized by the BERT model and the picture representation is vectorized as well (the recognized text of the topic picture is converted into a vector representation by the BERT model, the topic picture is converted into a vector by the LeNet convolutional network model, and the two vectors are spliced as the vectorized representation of the picture representation), and the similarity between the vectors is then computed by cosine similarity; for S4, the cosine similarity between the vectorized representations of the input topic picture and of the topic picture in the topic library is computed; c. the comprehensive similarity s is calculated. The specific calculation methods are as described above.
In step 3, the calculated similarity is compared with the preset subject threshold; different comparison thresholds are set for different subjects, as shown in Table 2. If the similarity is greater than the threshold of the subject of the topic input by the user, the question in the question bank is considered similar to the input question and is taken as a candidate question. The thresholds may also be obtained by training an artificial intelligence network on a dataset. After all questions in the question bank have been compared, the candidate questions are sorted by similarity and the top five are returned to the user; if there are fewer than five candidates, all qualifying candidates are returned; if there are none, the user is prompted that no similar question was found, and the question input by the user is published as a new question.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting thereof; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A topic retrieval system based on multi-modal cross comparison, the system comprising: a topic data analysis module, a topic similarity calculation module and a result output module; wherein:
the topic data analysis module is used for receiving topic information input by a user and preprocessing the topic information;
the topic similarity calculation module is used for calculating the similarity between the topic input by the user and the topic in the topic library;
the result output module is used for returning the questions in the question bank with the similarity larger than a preset subject threshold value to a user;
the topic similarity calculation module comprises:
a. the text after the topic title is cleaned and the text after the topic content is cleaned are spliced to be used as text representation, and the topic picture text identification content and the topic picture are used as picture representation;
b. the text representation and the picture representation of the subject input by the user are represented by T1 and P1, the text representation and the picture representation of the subject in the subject library are represented by T2 and P2, the similarity between T1 and T2, between T1 and P2, between P1 and T2, between P1 and P2 is calculated, and the similarity is represented by S1, S2, S3 and S4 respectively;
c. calculating comprehensive similarity s;
the similarity S1 is calculated by adopting a Jaccard method; calculating similarity S2, S3 and S4 through cosine similarity; the text recognition of the topic picture is converted into vector representation through a BERT model, the topic picture is converted into a vector through a LeNet convolutional network model, and then the two vectors are spliced to be used as vectorized representation of the picture representation;
the calculation formula of the comprehensive similarity s is as follows:
s = (w1·S1 + w2·S2 + w3·S3 + w4·S4) / 4
where (w1, w2, w3, w4) are the subject weights.
2. The topic retrieval system of claim 1 wherein the topic weights are:
Subject      Weights (w1, w2, w3, w4)      Subject        Weights (w1, w2, w3, w4)
Chinese      5.5, 2, 2, 0.5                Physics        5, 1, 1, 3
History      5, 2, 2, 1                    Mathematics    4, 2, 2, 2
Geography    5, 2, 2, 1                    Chemistry      5, 1, 1, 3
Politics     4, 2, 2, 2                    Biology        4, 2, 2, 2
3. The system of claim 1, wherein in the result output module, if the comprehensive similarity is greater than the subject threshold corresponding to the subject of the topic input by the user, the topic in the topic library is taken as a candidate topic.
4. The topic retrieval system of claim 3, wherein the topic threshold is:
Subject      Threshold      Subject        Threshold
Chinese      0.8            Physics        0.7
History      0.8            Mathematics    0.5
Geography    0.7            Chemistry      0.5
Politics     0.6            Biology        0.5
5. A topic retrieval method based on multi-mode cross comparison comprises the following steps:
step 1, receiving question information input by a user;
step 2, calculating the similarity between the received questions and the questions in the question bank;
step 3, returning the questions in the question bank whose similarity is greater than the topic threshold to the user as candidate questions;
the step 2 comprises the following steps:
a. acquiring text representation and picture representation of the topic information;
b. the text representation and picture representation of the question are denoted T1 and P1, those of a question in the question bank are denoted T2 and P2, and the similarities S1, S2, S3 and S4 of the four content pairs are cross-compared; S1 is calculated with the Jaccard method; S2, S3 and S4 are calculated with cosine similarity;
c. calculating comprehensive similarity s;
the calculation formula of the comprehensive similarity s is as follows:
s = (w1·S1 + w2·S2 + w3·S3 + w4·S4) / 4
where (w1, w2, w3, w4) are the subject weights; the recognized text of the topic picture is converted into a vector representation by a BERT model, the topic picture is converted into a vector by a LeNet convolutional network model, and the two vectors are spliced as the vectorized representation of the picture representation.
6. The method according to claim 5, wherein in step 3, the questions in the question bank whose comprehensive similarity is greater than the subject threshold corresponding to the user-input question are used as the candidate questions.
CN202110622823.8A 2021-06-04 2021-06-04 Question retrieval method and system based on multi-mode cross comparison Active CN113392196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110622823.8A CN113392196B (en) 2021-06-04 2021-06-04 Question retrieval method and system based on multi-mode cross comparison

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110622823.8A CN113392196B (en) 2021-06-04 2021-06-04 Question retrieval method and system based on multi-mode cross comparison

Publications (2)

Publication Number Publication Date
CN113392196A CN113392196A (en) 2021-09-14
CN113392196B true CN113392196B (en) 2023-04-21

Family

ID=77618168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110622823.8A Active CN113392196B (en) 2021-06-04 2021-06-04 Question retrieval method and system based on multi-mode cross comparison

Country Status (1)

Country Link
CN (1) CN113392196B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114048354B (en) * 2022-01-10 2022-04-26 广州启辰电子科技有限公司 Test question retrieval method, device and medium based on multi-element characterization and metric learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106886601A (en) * 2017-03-02 2017-06-23 大连理工大学 A kind of Cross-modality searching algorithm based on the study of subspace vehicle mixing
CN109558498A (en) * 2018-11-07 2019-04-02 南京邮电大学 Multi-modal hash method based on deep learning
CN110059217A (en) * 2019-04-29 2019-07-26 广西师范大学 A kind of image text cross-media retrieval method of two-level network
CN110196930A (en) * 2019-05-22 2019-09-03 山东大学 A kind of multi-modal customer service automatic reply method and system
CN111753190A (en) * 2020-05-29 2020-10-09 中山大学 Meta learning-based unsupervised cross-modal Hash retrieval method
CN112035669A (en) * 2020-09-09 2020-12-04 中国科学技术大学 Social media multi-modal rumor detection method based on propagation heterogeneous graph modeling
CN112287130A (en) * 2019-07-23 2021-01-29 小船出海教育科技(北京)有限公司 Searching method, device and equipment for graphic questions

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794169B (en) * 2015-03-30 2018-11-20 明博教育科技有限公司 A kind of subject terminology extraction method and system based on sequence labelling model
CN105760507B (en) * 2016-02-23 2019-05-03 复旦大学 Cross-module state topic relativity modeling method based on deep learning
CN108269110B (en) * 2016-12-30 2021-10-26 华为技术有限公司 Community question and answer based item recommendation method and system and user equipment
CN107145571B (en) * 2017-05-05 2020-02-14 广东艾檬电子科技有限公司 Searching method and device
CN107562812B (en) * 2017-08-11 2021-01-15 北京大学 Cross-modal similarity learning method based on specific modal semantic space modeling
CN109213853B (en) * 2018-08-16 2022-04-12 昆明理工大学 CCA algorithm-based Chinese community question-answer cross-modal retrieval method
CN109800294B (en) * 2019-01-08 2020-10-13 中国科学院自动化研究所 Autonomous evolution intelligent dialogue method, system and device based on physical environment game
CN110287951B (en) * 2019-06-21 2022-04-12 北京百度网讯科技有限公司 Character recognition method and device
CN112420202A (en) * 2019-08-23 2021-02-26 阿里巴巴集团控股有限公司 Data processing method, device and equipment
CN112905827B (en) * 2021-02-08 2024-02-27 中国科学技术大学 Cross-modal image-text matching method, device and computer readable storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106886601A (en) * 2017-03-02 2017-06-23 大连理工大学 A kind of Cross-modality searching algorithm based on the study of subspace vehicle mixing
CN109558498A (en) * 2018-11-07 2019-04-02 南京邮电大学 Multi-modal hash method based on deep learning
CN110059217A (en) * 2019-04-29 2019-07-26 广西师范大学 A kind of image text cross-media retrieval method of two-level network
CN110196930A (en) * 2019-05-22 2019-09-03 山东大学 A kind of multi-modal customer service automatic reply method and system
CN112287130A (en) * 2019-07-23 2021-01-29 小船出海教育科技(北京)有限公司 Searching method, device and equipment for graphic questions
CN111753190A (en) * 2020-05-29 2020-10-09 中山大学 Meta learning-based unsupervised cross-modal Hash retrieval method
CN112035669A (en) * 2020-09-09 2020-12-04 中国科学技术大学 Social media multi-modal rumor detection method based on propagation heterogeneous graph modeling

Also Published As

Publication number Publication date
CN113392196A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN108052577B (en) Universal text content mining method, device, server and storage medium
CN106777275B (en) Entity attribute and property value extracting method based on more granularity semantic chunks
US20170286835A1 (en) Concept Hierarchies
US20180075368A1 (en) System and Method of Advising Human Verification of Often-Confused Class Predictions
CN109635288B (en) Resume extraction method based on deep neural network
CN110363194A (en) Intelligently reading method, apparatus, equipment and storage medium based on NLP
CN110096698B (en) Topic-considered machine reading understanding model generation method and system
CN111177326A (en) Key information extraction method and device based on fine labeling text and storage medium
US20190042556A1 (en) Dynamic Homophone/Synonym Identification and Replacement for Natural Language Processing
CN107015969A (en) Can self-renewing semantic understanding System and method for
CN111159356B (en) Knowledge graph construction method based on teaching content
CN107301164B (en) Semantic analysis method and device for mathematical formula
US10552461B2 (en) System and method for scoring the geographic relevance of answers in a deep question answering system based on geographic context of a candidate answer
US10902342B2 (en) System and method for scoring the geographic relevance of answers in a deep question answering system based on geographic context of an input question
CN112613306A (en) Method, device, electronic equipment and storage medium for extracting entity relationship
CN107844531B (en) Answer output method and device and computer equipment
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN106897274B (en) Cross-language comment replying method
CN110263340B (en) Comment generation method, comment generation device, server and storage medium
CN116049367A (en) Visual-language pre-training method and device based on non-supervision knowledge enhancement
CN110969005B (en) Method and device for determining similarity between entity corpora
CN113392196B (en) Question retrieval method and system based on multi-mode cross comparison
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN113505786A (en) Test question photographing and judging method and device and electronic equipment
US11687796B2 (en) Document type-specific quality model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant