CN116226345A - Method and device for generating a question set, and electronic device

Method and device for generating a question set, and electronic device

Info

Publication number
CN116226345A
Authority
CN
China
Prior art keywords
text
question
target
texts
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310158523.8A
Other languages
Chinese (zh)
Inventor
刘坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310158523.8A
Publication of CN116226345A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a method for generating a question set, and relates to the technical fields of intelligent question answering, natural language processing, big data, and the like. The specific scheme is as follows: after a candidate question set including a plurality of candidate question texts is obtained from a search engine log, a preset reference text is segmented to obtain a plurality of paragraph texts, the relevance between each candidate question text and each paragraph text is determined, and target question texts associated with the reference text are screened from the plurality of candidate question texts according to each relevance. Therefore, based on the relevance between each candidate question text obtained from the search engine log and each paragraph of the reference text, the target question texts associated with the reference text are screened from the candidate question texts, which improves the accuracy and comprehensiveness of the generated question set.

Description

Method and device for generating a question set, and electronic device
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of intelligent question answering, natural language processing, big data and the like, and specifically relates to a method and a device for generating a question set and electronic equipment.
Background
An intelligent question-answering system generally queries question-answer pairs in a pre-built frequently-asked-question (FAQ) library according to a user's question, and quickly determines the answer corresponding to the question. Whether the questions contained in the FAQ library are accurate and comprehensive directly affects the effectiveness of the intelligent question-answering system. Therefore, to improve the effectiveness of the intelligent question-answering system, the questions that users may ask need to be collected comprehensively and accurately before the FAQ library is built.
Disclosure of Invention
The present disclosure provides a method and an apparatus for generating a question set, and an electronic device.
According to an aspect of the present disclosure, there is provided a method for generating a question set, including:
obtaining a candidate question set from a search engine log, wherein the candidate question set includes a plurality of candidate question texts;
segmenting a preset reference text to obtain a plurality of paragraph texts;
determining the relevance between each candidate question text and each paragraph text;
and screening target question texts associated with the reference text from the plurality of candidate question texts according to each relevance.
According to another aspect of the present disclosure, there is provided an apparatus for generating a question set, including:
an acquisition module, used for acquiring a candidate question set from a search engine log, wherein the candidate question set includes a plurality of candidate question texts;
a segmentation module, used for segmenting a preset reference text to obtain a plurality of paragraph texts;
a determining module, used for determining the relevance between each candidate question text and each paragraph text;
and a screening module, used for screening target question texts associated with the reference text from the plurality of candidate question texts according to each relevance.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of the embodiments described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method according to the above-described embodiments.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method for generating a question set according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of another method for generating a question set according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of another method for generating a question set according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of another method for generating a question set according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an apparatus for generating a question set according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of an electronic device used to implement the method for generating a question set according to embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Artificial intelligence is the discipline that studies how to use computers to simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, deep learning, big data processing, knowledge graph technologies, and the like.
NLP (Natural Language Processing) is an important direction in the fields of computer science and artificial intelligence. NLP research includes, but is not limited to, the following branch fields: text classification, information extraction, automatic summarization, intelligent question answering, topic recommendation, machine translation, topic word recognition, knowledge base construction, deep text representation, named entity recognition, text generation, text analysis (lexical, syntactic, grammatical, etc.), speech recognition and synthesis, and the like.
Big data, or massive data, refers to data whose scale is so large that it cannot be captured, managed, processed, and organized within a reasonable time by current mainstream software tools to support business decision-making.
Typically, the question texts of a target domain are screened from a candidate question set according to whether each question text contains domain words of the target domain. This approach focuses only on part of the information in the question text and ignores its overall semantics, so the accuracy of the screened question texts is low.
In the present disclosure, the target question texts associated with the reference text are screened from a plurality of candidate question texts obtained from a search engine log, based on the relevance between each candidate question text and each paragraph of the reference text, thereby improving the accuracy and comprehensiveness of the generated question set.
The method, apparatus, electronic device, and storage medium for generating a problem set according to the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
For ease of description, the method for generating a question set provided by the present disclosure is described as being configured in an apparatus for generating a question set (hereinafter referred to as the generating apparatus); the generating apparatus may be applied to any electronic device, so that the electronic device can perform the function of generating a question set.
The electronic device may be any device with computing capability, for example, may be a personal computer (Personal Computer, abbreviated as PC), a mobile terminal, and the mobile terminal may be a tablet computer, a personal digital assistant, a wearable device, and other hardware devices with various operating systems, touch screens, and/or display screens.
Fig. 1 is a flow chart of a method for generating a question set according to an embodiment of the present disclosure.
As shown in fig. 1, the method includes:
step 101, a candidate question set is obtained from a search engine log, wherein the candidate question set comprises a plurality of candidate question texts.
In the present disclosure, in order to collect more question texts, the logs of a general-purpose search engine or the logs of a search engine dedicated to the target domain may be analyzed and filtered through preset rules to obtain a candidate question set and a label corresponding to each candidate question text. The label may be the domain corresponding to the candidate question text, or the like, which is not limited by the present disclosure.
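As a purely illustrative sketch (the preset filtering rules are not specified in this disclosure), the following Python example shows one way such log filtering might be implemented; the question-word pattern, the length limits, and the optional `domain_labeler` callback are assumptions introduced only for this example.
```python
import re

# Hypothetical question-word heuristic; the preset rules in the disclosure are not
# specified, so this pattern and the length limits below are illustrative only.
QUESTION_HINTS = re.compile(r"(how|what|why|where|can|does|怎么|如何|为什么|吗)", re.IGNORECASE)

def extract_candidate_questions(log_lines, domain_labeler=None):
    """Filter raw search-log queries into (candidate question text, label) pairs."""
    candidates, seen = [], set()
    for line in log_lines:
        query = line.strip()
        # Drop empty, very short, very long, or duplicate queries.
        if not query or len(query) < 5 or len(query) > 100 or query in seen:
            continue
        # Keep only queries that look like questions.
        if not QUESTION_HINTS.search(query):
            continue
        seen.add(query)
        # The label may be the domain of the query; an optional callback supplies it here.
        label = domain_labeler(query) if domain_labeler else "unknown"
        candidates.append((query, label))
    return candidates
```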
It will be appreciated that a party generating the question set may have just launched a new product and therefore has no historical user search logs for that product, whereas the logs of a general-purpose search engine contain a large number of questions raised about products of the same type. Thus, a candidate question set may be obtained from the logs of a general-purpose search engine. On the one hand, this helps quickly meet the needs of parties that have no historical search logs for generating a question set; on the other hand, this helps improve the comprehensiveness of the generated question set.
Step 102, segmenting a preset reference text to obtain a plurality of paragraph texts.
The reference text may be descriptive text used to answer each question in the question set to be generated. For example, when a question set corresponding to a car of a certain model is generated, the reference text may be the manual of that car model; when a question set corresponding to a website is generated, the reference text may be the operation guide of that website.
In the present disclosure, segmentation rules may be set according to the composition format of the reference text, or the like, and the reference text is then segmented according to its reading order based on these rules to obtain a plurality of paragraph texts, so that target question texts can be accurately screened from the candidate question set based on each of the paragraph texts.
Alternatively, the reference text may be segmented in units of sentences, paragraphs, or the like according to the need. The present disclosure is not limited in this regard.
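A minimal Python sketch of such segmentation is given below for illustration; the assumption that blank lines separate paragraphs and the particular sentence-terminator pattern are choices made for this example rather than requirements of the present disclosure.
```python
import re

def split_reference_text(reference_text, unit="paragraph"):
    """Split a reference text into paragraph texts in reading order."""
    if unit == "paragraph":
        # Assumes blank lines separate paragraphs in the source document.
        parts = reference_text.split("\n\n")
    else:
        # Sentence-level segmentation on common Chinese/English sentence terminators.
        parts = re.split(r"(?<=[。！？.!?])\s*", reference_text)
    return [p.strip() for p in parts if p.strip()]
```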
Step 103, determining the relevance between each candidate question text and each paragraph text.
In the present disclosure, vector conversion may be performed on each candidate question text and each paragraph text to determine the vector corresponding to each candidate question text and the vector corresponding to each paragraph text. Then, the relevance between each candidate question text and each paragraph text is determined according to the distance between the vector corresponding to the candidate question text and the vector corresponding to the paragraph text.
Step 104, screening target question texts associated with the reference text from the plurality of candidate question texts according to each relevance.
In the present disclosure, when the relevance between any candidate question text and any paragraph text is greater than a threshold, this indicates that the candidate question text is a question that may be asked about that paragraph text. In this case, the candidate question text may be determined as a target question text associated with the reference text, thereby obtaining the question set corresponding to the reference text.
It will be appreciated that the reference text is descriptive text used to answer each question in the question set to be generated. Therefore, according to the relevance between each candidate question text and each paragraph of the reference text, the target question texts associated with the reference text can be accurately screened from the candidate question texts, so as to generate the question set for the reference text.
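For illustration, the following sketch covers steps 103 and 104 under the assumption that a pre-trained sentence encoder has already produced one vector per candidate question text and per paragraph text; cosine similarity stands in for the relevance measure, and the threshold value of 0.75 is purely illustrative.
```python
import numpy as np

def cosine_relevance(question_vecs, paragraph_vecs):
    """Relevance of every candidate question to every paragraph (rows: questions)."""
    q = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    p = paragraph_vecs / np.linalg.norm(paragraph_vecs, axis=1, keepdims=True)
    return q @ p.T  # shape: (num_questions, num_paragraphs)

def screen_target_questions(questions, relevance, threshold=0.75):
    """Keep a candidate question if it is sufficiently relevant to any paragraph."""
    keep = (relevance > threshold).any(axis=1)
    return [q for q, k in zip(questions, keep) if k]
```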
In the present disclosure, since the target question texts in the question set are screened based on the reference text, and the reference text corresponds to a fixed domain, there should be a large number of target question texts belonging to the same domain in the question set. When the number of target question texts of a certain domain in the question set is small, this indicates that the target question texts corresponding to that domain were screened in error.
Therefore, the target question texts may be classified according to their corresponding labels, the number of target question texts of each type may be determined, and a target question text may be deleted when the number corresponding to it is smaller than a third threshold, thereby further improving the accuracy of the generated question set.
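A minimal sketch of this label-count filtering follows; the helper name `drop_sparse_domains` and the example value of the third threshold are assumptions made for illustration.
```python
from collections import Counter

def drop_sparse_domains(target_questions, labels, third_threshold=5):
    """Delete target questions whose label occurs fewer than `third_threshold` times."""
    counts = Counter(labels)
    return [(q, lab) for q, lab in zip(target_questions, labels)
            if counts[lab] >= third_threshold]
```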
In the method, after a candidate question set comprising a plurality of candidate question texts is obtained from a search engine log, a preset reference text is segmented to obtain a plurality of paragraph texts, then the relevance between each candidate question text and each paragraph text is determined, and a target question text associated with the reference text is screened from the plurality of candidate question texts according to each relevance. Therefore, based on the correlation degree between each candidate question text and each paragraph in the reference text obtained from the search engine log, the target question text associated with the reference text is screened out from the candidate question texts, and therefore accuracy and comprehensiveness of generating the question set are improved.
Fig. 2 is a flow chart of a method for generating a question set according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes:
Step 201, a candidate question set is obtained from a search engine log, wherein the candidate question set comprises a plurality of candidate question texts.
Step 202, segmenting a preset reference text to obtain a plurality of paragraph texts.
In the present disclosure, the specific implementation process of step 201 to step 202 may refer to the detailed description of any embodiment of the present disclosure, which is not repeated herein.
In step 203, vector conversion is performed on each candidate question text and each paragraph text, and a first semantic vector corresponding to each candidate question text and a second semantic vector corresponding to each paragraph text are determined.
The dimensions of the first semantic vector and the second semantic vector are the same.
In the disclosure, vector conversion may be performed on each candidate question text and each paragraph text through any two natural language processing models trained in advance, so as to determine a first semantic vector corresponding to each candidate question text and a second semantic vector corresponding to each paragraph text.
Alternatively, vector conversion may be performed on each candidate question text and each paragraph text using the same natural language processing model, which is a matching model (e.g., a matching dual-tower model) jointly trained on question texts and paragraph texts.
Step 204, constructing a plurality of first matrixes and second matrixes based on the plurality of first semantic vectors and the plurality of second semantic vectors, wherein each first matrix comprises the plurality of first semantic vectors, and each second matrix comprises the plurality of second semantic vectors.
When the number of candidate question texts is large, or the number of paragraph texts is large, the efficiency of sequentially calculating the distance between each first semantic vector and each second semantic vector is low.
In the present disclosure, a plurality of first matrices including a plurality of first semantic vectors may be constructed based on a plurality of first semantic vectors, and a plurality of second matrices including a plurality of second semantic vectors may be constructed based on a plurality of second semantic vectors. Then, the distance between each first semantic vector and each second semantic vector is calculated based on the first matrix and the second matrix, and the efficiency of determining the distance between each first semantic vector and each second semantic vector can be improved.
Step 205, calculating the product between each first matrix and each second matrix to determine a first distance between each first semantic vector constituting the first matrix and each second semantic vector constituting the second matrix.
In the present disclosure, each first matrix is multiplied by each second matrix, respectively, and a first distance between each first semantic vector constituting the first matrix and each second semantic vector constituting the second matrix is determined.
It will be appreciated that a single matrix multiplication can simultaneously obtain the first distances between the plurality of first semantic vectors constituting a first matrix and the plurality of second semantic vectors constituting a second matrix, thereby improving the efficiency of determining the distance between each first semantic vector and each second semantic vector, and further improving the efficiency of generating the question set.
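The block-wise computation described above might look like the following sketch, assuming all semantic vectors are L2-normalized so that one minus the dot product can serve as the first distance; the block size is an illustrative choice.
```python
import numpy as np

def blockwise_first_distances(first_vecs, second_vecs, block_size=1024):
    """Compute first distances block by block via matrix products.

    Both inputs are assumed L2-normalized, so distance = 1 - dot product.
    """
    distances = np.empty((len(first_vecs), len(second_vecs)), dtype=np.float32)
    for i in range(0, len(first_vecs), block_size):
        first_matrix = first_vecs[i:i + block_size]        # one "first matrix"
        for j in range(0, len(second_vecs), block_size):
            second_matrix = second_vecs[j:j + block_size]  # one "second matrix"
            # A single matrix multiplication yields all pairwise dot products at once.
            distances[i:i + block_size, j:j + block_size] = 1.0 - first_matrix @ second_matrix.T
    return distances
```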
Step 206, determining the correlation degree between each candidate question text and each paragraph text according to each first distance.
In the present disclosure, the larger the first distance, the smaller the relevance between the corresponding candidate question text and paragraph text; the smaller the first distance, the greater the relevance between the corresponding candidate question text and paragraph text.
Step 207, screening target question text associated with the reference text from the candidate question texts according to each relevance.
In the present disclosure, the specific implementation process of step 207 may refer to the detailed description of any embodiment of the present disclosure, which is not repeated herein.
In the present disclosure, after a plurality of candidate question texts are obtained from a search engine log and a plurality of paragraph texts corresponding to the reference text are obtained, vector conversion is performed on each candidate question text and each paragraph text to determine a first semantic vector corresponding to each candidate question text and a second semantic vector corresponding to each paragraph text. A plurality of first matrices and second matrices are constructed based on the plurality of first semantic vectors and the plurality of second semantic vectors, wherein each first matrix includes a plurality of first semantic vectors and each second matrix includes a plurality of second semantic vectors. The product between each first matrix and each second matrix is calculated to determine the first distance between each first semantic vector constituting the first matrix and each second semantic vector constituting the second matrix. The relevance between each candidate question text and each paragraph text is then determined according to each first distance, and target question texts associated with the reference text are screened from the candidate question texts according to each relevance. Therefore, by calculating the product between each first matrix and each second matrix, the relevance between each first semantic vector and each second semantic vector can be determined quickly and accurately, and target question texts associated with the reference text are screened from the plurality of candidate question texts according to each relevance, thereby improving the accuracy and efficiency of generating the question set.
Fig. 3 is a flowchart of a method for generating a question set according to an embodiment of the present disclosure.
As shown in fig. 3, the method includes:
step 301, a candidate question set is obtained from a search engine log, wherein the candidate question set comprises a plurality of candidate question texts.
Step 302, segmenting a preset reference text to obtain a plurality of paragraph texts.
Step 303, determining a correlation between each candidate question text and each paragraph text.
And step 304, screening target question texts associated with the reference texts from a plurality of candidate question texts according to each relevance.
In the present disclosure, the specific implementation process of step 301 to step 304 may refer to the detailed description of any embodiment of the present disclosure, which is not repeated herein.
And 305, carrying out vector conversion on each target question text in the question set, and determining a third semantic vector corresponding to each target question text.
In the present disclosure, due to different language expressions, one question may correspond to a plurality of different question texts, but the answers corresponding to these different question texts are the same. Thus, to avoid repeatedly mining synonymous question-answer pairs when generating question-answer pairs based on the question set, the plurality of target question texts may be grouped so that target question texts corresponding to the same question form one group.
In the present disclosure, in order to improve the accuracy of determining the question groups, vector conversion may be performed on each target question text in the question set through any pre-trained natural language processing model (such as a matching dual-tower model) to determine a third semantic vector corresponding to each target question text. Then, the question group corresponding to each target question text is determined based on the third semantic vector corresponding to each target question text.
Step 306, constructing a plurality of third matrices based on the plurality of third semantic vectors, wherein each third matrix includes the plurality of third semantic vectors.
When the number of third semantic vectors is large, sequentially calculating the distances between pairs of third semantic vectors is inefficient. Moreover, all third semantic vectors would need to be read into memory, which occupies a large amount of memory and may risk memory overflow.
In the present disclosure, a plurality of third matrices, each including a plurality of third semantic vectors, may be constructed based on the plurality of third semantic vectors. Then, the second distance between every two of the plurality of third semantic vectors can be calculated based on the third matrices, so that the distances between multiple pairs of third semantic vectors can be determined simultaneously.
In step 307, the product of the plurality of third matrices is calculated to determine a second distance between the plurality of third semantic vectors.
In the present disclosure, two third matrices at a time are read into memory, the products between the third matrices are calculated, and the second distances between the third semantic vectors are obtained. This improves the efficiency and reliability of determining the second distance between every two of the plurality of third semantic vectors, and further improves the efficiency, accuracy, and reliability of determining the question groups.
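A memory-bounded sketch of this pairwise computation is given below; the `(row_offset, matrix)` chunk format and the assumption of L2-normalized vectors are introduced here only for illustration.
```python
import numpy as np
from itertools import combinations_with_replacement

def pairwise_second_distances(third_matrices):
    """Second distances between all third semantic vectors, two matrices at a time.

    `third_matrices` is a list of (row_offset, matrix) chunks of L2-normalized vectors
    whose row offsets partition the full set; only two chunks are multiplied per step,
    which bounds the working memory.
    """
    total = sum(mat.shape[0] for _, mat in third_matrices)
    distances = np.empty((total, total), dtype=np.float32)
    for (off_a, mat_a), (off_b, mat_b) in combinations_with_replacement(third_matrices, 2):
        block = 1.0 - mat_a @ mat_b.T  # pairwise distances between the two chunks
        distances[off_a:off_a + mat_a.shape[0], off_b:off_b + mat_b.shape[0]] = block
        distances[off_b:off_b + mat_b.shape[0], off_a:off_a + mat_a.shape[0]] = block.T
    return distances
```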
Step 308, determining the similarity between the plurality of target question texts according to each second distance.
In the present disclosure, the larger the second distance, the smaller the similarity between the corresponding two target question texts; the smaller the second distance, the greater the similarity between the corresponding two target question texts.
Step 309, grouping the target question texts according to the similarity, and determining the question group to which each target question text belongs.
In the disclosure, according to the similarity, a preset clustering algorithm may be adopted to group each target problem text, and determine a problem group to which each target problem text belongs. The higher the similarity between two target question texts, the greater the likelihood of belonging to the same question group.
It can be understood that, because the third semantic vector includes semantic information of the target question text, the similarity between the target question texts is determined based on the third semantic vector, and the target question texts are grouped according to the similarity, so that the accuracy of determining the question group is improved.
In the present disclosure, after the target question texts are determined, vector conversion may be performed on each target question text in the question set to determine a third semantic vector corresponding to each target question text; a plurality of third matrices, each including a plurality of third semantic vectors, are constructed based on the plurality of third semantic vectors; the products between the third matrices are then calculated to determine the second distances between the third semantic vectors; the similarity between the target question texts is determined according to each second distance; and the target question texts are grouped according to the similarity to determine the question group to which each target question text belongs. Therefore, by calculating the products between the third matrices, the similarity between the target question texts can be determined quickly and accurately, and the target question texts are grouped according to the similarity, thereby improving the efficiency and accuracy of determining the question groups.
Fig. 4 is a flowchart of a method for generating a question set according to an embodiment of the present disclosure.
As shown in fig. 4, the method includes:
Step 401, a candidate question set is obtained from a search engine log, wherein the candidate question set includes a plurality of candidate question texts.
Step 402, segmenting a preset reference text to obtain a plurality of paragraph texts.
Step 403, determining a correlation between each candidate question text and each paragraph text.
And step 404, screening target question text associated with the reference text from a plurality of candidate question texts according to each relevance.
And step 405, performing vector conversion on each target question text in the question set, and determining a third semantic vector corresponding to each target question text.
Step 406, constructing a plurality of third matrices based on the plurality of third semantic vectors, wherein each third matrix includes the plurality of third semantic vectors.
In step 407, the product of the plurality of third matrices is calculated to determine a second distance between the plurality of third semantic vectors.
Step 408, determining the similarity between the plurality of target question texts according to each second distance.
In the present disclosure, the specific implementation process of step 401 to step 408 may refer to the detailed description of any embodiment of the present disclosure, which is not repeated herein.
Step 409, randomly selecting, from the plurality of target question texts, a preset number of target question texts whose similarity to one another is smaller than a first threshold, as center question texts.
In the present disclosure, a preset number of target question texts whose similarity to one another is smaller than a first threshold may be randomly selected from the plurality of target question texts as center question texts. Then, the target question texts are grouped with the center question texts as references, and the question group to which each target question text belongs is determined.
In step 410, in the case that the similarity between any target question text and any central question text is greater than the second threshold, it is determined that any target question text and any central question text belong to the same question group.
In the disclosure, when the similarity between a certain target question text and a certain center question text is greater than a second threshold, it is indicated that the target question text and the center question text correspond to the same question, and at this time, it may be determined that the target question text and the center question text belong to the same question group. For example, the similarity between the target question text b and the center question text a is greater than the second threshold, and it may be determined that the target question text b and the center question text a belong to the same question group.
In step 411, under the condition that the similarity between any target question text and each central question text is smaller than or equal to the second threshold, determining, according to the similarity between any target question text and each target question text in each question group, the question group to which any target question text belongs, until determining the question group to which each target question text belongs.
In the present disclosure, there may be target question text having a similarity with each center question text less than or equal to the second threshold. At this time, the question group to which the target question text belongs may be determined according to the similarity between the target question text and each target question text in each question group until the question group to which each target question text belongs is determined.
For example, assume that there are two center question texts a and d, the target question text b and the center question text a belong to the same question group, the target question text c and the center question text d belong to the same question group, and the similarities between the target question text e and the center question texts a and d are both smaller than or equal to the second threshold. In this case, the question group to which the target question text e belongs may be determined according to the similarities between the target question text e and the target question texts b and c, respectively. When the similarity between the target question text e and the target question text b is greater than the second threshold, it may be determined that the target question text e and the center question text a belong to the same question group; or, when the similarity between the target question text e and the target question text b is greater than the similarity between the target question text e and the target question text c, it is determined that the target question text e and the center question text a belong to the same question group. In this way, the question groups are gradually propagated outwards until the question group to which each target question text belongs is determined.
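The center-seeded grouping with outward propagation described above might be sketched as follows; the threshold values, the number of centers, and the choice to leave never-matching questions ungrouped are illustrative assumptions rather than requirements of the disclosure.
```python
import random

def group_target_questions(questions, similarity, num_centers=3,
                           first_threshold=0.6, second_threshold=0.85):
    """Group target questions around randomly chosen, mutually dissimilar centers.

    `similarity[i][j]` is the similarity between questions i and j; all threshold
    values and the number of centers are illustrative.
    """
    n = len(questions)
    # Randomly pick centers whose similarity to already chosen centers stays below the first threshold.
    centers = []
    for i in random.sample(range(n), n):
        if all(similarity[i][c] < first_threshold for c in centers):
            centers.append(i)
        if len(centers) == num_centers:
            break
    group_of = {c: c for c in centers}  # question index -> center index of its group
    # First pass: attach each question to a center it clearly matches.
    unassigned = []
    for i in range(n):
        if i in group_of:
            continue
        best = max(centers, key=lambda c: similarity[i][c])
        if similarity[i][best] > second_threshold:
            group_of[i] = best
        else:
            unassigned.append(i)
    # Propagation: compare remaining questions against already grouped members and
    # repeat until no further question can be assigned; the rest stay ungrouped.
    changed = True
    while changed and unassigned:
        changed = False
        for i in list(unassigned):
            members = list(group_of)
            best = max(members, key=lambda j: similarity[i][j])
            if similarity[i][best] > second_threshold:
                group_of[i] = group_of[best]
                unassigned.remove(i)
                changed = True
    return group_of
```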
Therefore, the similarity between the target question texts is simply compared, and the question group to which each target question text belongs can be determined without complex calculation, so that the efficiency of determining the question group is improved.
Optionally, in the case that a certain target question text belongs to a plurality of question groups, the target question text may be deleted, so as to ensure accuracy of the question groups.
Optionally, in the case that the similarity between a certain target question text and each central question text is smaller than or equal to the second threshold, it may be determined whether the similarity between the target question text and each target question text in each question group is greater than the threshold. And determining that the target question text belongs to the question group under the condition that the similarity between the target question text and any one target question text in a certain question group is larger than a threshold value. And deleting the target question text under the condition that the similarity between the target question text and any one of the plurality of question groups is greater than a threshold value, so as to ensure the accuracy of the question groups.
In the present disclosure, after the target question texts associated with the reference text are screened from the plurality of candidate question texts and the similarities between the target question texts are determined, a preset number of target question texts whose similarity to one another is smaller than a first threshold may be randomly selected from the plurality of target question texts as center question texts. Then, when the similarity between any target question text and any center question text is greater than a second threshold, it is determined that the target question text and that center question text belong to the same question group; and when the similarities between any target question text and all of the center question texts are smaller than or equal to the second threshold, the question group to which that target question text belongs is determined according to the similarities between that target question text and the target question texts in each question group, until the question group to which each target question text belongs is determined. Therefore, the question group to which each target question text belongs is determined by comparing the similarities between target question texts, which reduces the complexity of determining the question groups and improves the accuracy and efficiency of determining the question groups.
In order to implement the above embodiments, the embodiments of the present disclosure further provide an apparatus for generating a question set.
Fig. 5 is a schematic structural diagram of an apparatus for generating a question set according to an embodiment of the present disclosure.
As shown in fig. 5, the question set generating apparatus 500 includes: an acquisition module 510, a segmentation module 520, a determination module 530, and a screening module 540.
An obtaining module 510, configured to obtain a candidate question set from a search engine log, where the candidate question set includes a plurality of candidate question texts;
a segmentation module 520, configured to segment a preset reference text to obtain a plurality of paragraph texts;
a determining module 530, configured to determine a relevance between each candidate question text and each paragraph text;
and a screening module 540, configured to screen the target question text associated with the reference text from the plurality of candidate question texts according to each relevance.
In one possible implementation manner of the embodiment of the present disclosure, the determining module 530 is configured to:
vector conversion is carried out on each candidate question text and each paragraph text, and a first semantic vector corresponding to each candidate question text and a second semantic vector corresponding to each paragraph text are determined;
constructing a plurality of first matrixes and second matrixes based on the plurality of first semantic vectors and the plurality of second semantic vectors, wherein each first matrix comprises the plurality of first semantic vectors, and each second matrix comprises the plurality of second semantic vectors;
Calculating the product between each first matrix and each second matrix respectively to determine a first distance between each first semantic vector forming the first matrix and each second semantic vector forming the second matrix;
and determining the relevance between each candidate question text and each paragraph text according to each first distance.
In a possible implementation manner of the embodiment of the disclosure, the device further includes a grouping module, configured to:
vector conversion is carried out on each target question text in the question set, and a third semantic vector corresponding to each target question text is determined;
constructing a plurality of third matrices based on the plurality of third semantic vectors, wherein each third matrix comprises the plurality of third semantic vectors;
calculating the product between every two of the plurality of third matrixes to determine a second distance between every two of the plurality of third semantic vectors;
according to each second distance, determining the similarity between a plurality of target question texts;
and grouping the target question texts according to the similarity, and determining the question group to which each target question text belongs.
In one possible implementation manner of the embodiment of the disclosure, the grouping module is configured to:
randomly selecting a preset number of target question texts with similarity smaller than a first threshold value from a plurality of target question texts to serve as center question texts;
Under the condition that the similarity between any target question text and any center question text is larger than a second threshold value, determining that any target question text and any center question text belong to the same question group;
and under the condition that the similarity between any target question text and each central question text is smaller than or equal to a second threshold value, determining the question group to which any target question text belongs according to the similarity between any target question text and each target question text in each question group until the question group to which each target question text belongs is determined.
In one possible implementation manner of the embodiment of the disclosure, the grouping module is further configured to:
and deleting any target question text in response to any target question text belonging to the plurality of question groups.
In one possible implementation manner of the embodiment of the present disclosure, the candidate question set further includes a label corresponding to each candidate question text, and the screening module 540 is further configured to:
classifying the target question texts according to the labels corresponding to the target question texts, and determining the number of each type of target question text;
and deleting any target question text in response to the corresponding number of any target question text being smaller than a third threshold.
It should be noted that the explanation of the foregoing embodiments of the method for generating a question set is also applicable to the apparatus of this embodiment, so details are not repeated here.
In the method, after a candidate question set comprising a plurality of candidate question texts is obtained from a search engine log, a preset reference text is segmented to obtain a plurality of paragraph texts, then the relevance between each candidate question text and each paragraph text is determined, and a target question text associated with the reference text is screened from the plurality of candidate question texts according to each relevance. Therefore, based on the correlation degree between each candidate question text and each paragraph in the reference text obtained from the search engine log, the target question text associated with the reference text is screened out from the candidate question texts, and therefore accuracy and comprehensiveness of generating the question set are improved.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device and a readable storage medium.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 602 or a computer program loaded from a storage unit 608 into a RAM (Random Access Memory) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other by a bus 604. An I/O (Input/Output) interface 605 is also connected to the bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processors, controllers, microcontrollers, and the like. The computing unit 601 performs the respective methods and processes described above, such as the method for generating a question set. For example, in some embodiments, the method for generating a question set may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method for generating a question set described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the method for generating a question set in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application Specific Standard Products), SOCs (Systems On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (15)

1. A method of generating a question set, comprising:
obtaining a candidate question set from a search engine log, wherein the candidate question set comprises a plurality of candidate question texts;
segmenting a preset reference text to obtain a plurality of paragraph texts;
determining the relevance between each candidate question text and each paragraph text;
and screening target question texts associated with the reference text from a plurality of candidate question texts according to each relevance.
2. The method of claim 1, wherein said determining a relevance between each of said candidate question text and each of said paragraph text comprises:
vector conversion is carried out on each candidate question text and each paragraph text, and a first semantic vector corresponding to each candidate question text and a second semantic vector corresponding to each paragraph text are determined;
constructing a plurality of first matrixes and second matrixes based on the plurality of first semantic vectors and the plurality of second semantic vectors, wherein each first matrix comprises a plurality of first semantic vectors, and each second matrix comprises a plurality of second semantic vectors;
calculating the product between each first matrix and each second matrix respectively to determine a first distance between each first semantic vector forming the first matrix and each second semantic vector forming the second matrix;
and determining the relevance between each candidate question text and each paragraph text according to each first distance.
3. The method of claim 1, further comprising:
performing vector conversion on each target question text in the question set, and determining a third semantic vector corresponding to each target question text;
constructing a plurality of third matrixes based on the plurality of third semantic vectors, wherein each third matrix comprises the plurality of third semantic vectors;
calculating the product of the third matrixes to determine the second distance between the third semantic vectors;
according to each second distance, determining the similarity among a plurality of target question texts;
and grouping the target question texts according to the similarity, and determining a question group to which each target question text belongs.
4. The method of claim 3, wherein the grouping the target question texts according to the similarity, determining a question group to which each of the target question texts belongs, comprises:
randomly selecting a preset number of target question texts with similarity smaller than a first threshold value from a plurality of target question texts to serve as center question texts;
determining that any target question text and any center question text belong to the same question group under the condition that the similarity between any target question text and any center question text is larger than a second threshold;
and under the condition that the similarity between any target question text and each central question text is smaller than or equal to the second threshold value, determining the question group to which any target question text belongs according to the similarity between any target question text and each target question text in each question group until determining the question group to which each target question text belongs.
5. The method of claim 4, further comprising:
deleting any target question text in response to the target question text belonging to a plurality of question groups.
6. The method of claim 1, wherein the candidate question set further comprises a label corresponding to each candidate question text, and the method further comprises:
classifying the target question texts according to the labels corresponding to the target question texts, and determining the number of target question texts of each type;
and deleting any target question text in response to the number corresponding to the target question text being smaller than a third threshold.
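As a sketch of claim 6, the per-label counts can be tallied and low-frequency classes dropped; the label mapping and the threshold of 5 are illustrative assumptions.

```python
from collections import Counter

def filter_by_label_frequency(target_questions: list[str],
                              labels: dict[str, str],
                              third_threshold: int = 5) -> list[str]:
    # Count how many target question texts fall under each label.
    counts = Counter(labels[q] for q in target_questions)
    # Delete any target question whose class occurs fewer times than the threshold.
    return [q for q in target_questions if counts[labels[q]] >= third_threshold]
```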
7. A question set generating apparatus, comprising:
an acquisition module configured to obtain a candidate question set from a search engine log, wherein the candidate question set comprises a plurality of candidate question texts;
a segmentation module configured to segment a preset reference text to obtain a plurality of paragraph texts;
a determining module configured to determine the relevance between each candidate question text and each paragraph text;
and a screening module configured to screen, according to each relevance, target question texts associated with the reference text from the plurality of candidate question texts.
8. The apparatus of claim 7, wherein the determining module is configured to:
performing vector conversion on each candidate question text and each paragraph text, and determining a first semantic vector corresponding to each candidate question text and a second semantic vector corresponding to each paragraph text;
constructing a plurality of first matrices and a plurality of second matrices based on the plurality of first semantic vectors and the plurality of second semantic vectors, wherein each first matrix comprises a plurality of first semantic vectors, and each second matrix comprises a plurality of second semantic vectors;
calculating a product between each first matrix and each second matrix to determine a first distance between each first semantic vector forming the first matrix and each second semantic vector forming the second matrix;
and determining the relevance between each candidate question text and each paragraph text according to each first distance.
9. The apparatus of claim 7, further comprising a grouping module configured to:
performing vector conversion on each target question text in the question set, and determining a third semantic vector corresponding to each target question text;
constructing a plurality of third matrices based on the plurality of third semantic vectors, wherein each third matrix comprises a plurality of third semantic vectors;
calculating a product between the third matrices to determine a second distance between the third semantic vectors;
determining, according to each second distance, the similarity among the plurality of target question texts;
and grouping the target question texts according to the similarity, and determining a question group to which each target question text belongs.
10. The apparatus of claim 9, wherein the grouping module is configured to:
randomly selecting, from the plurality of target question texts, a preset number of target question texts whose mutual similarity is smaller than a first threshold to serve as center question texts;
determining, in a case where the similarity between any target question text and any center question text is greater than a second threshold, that the target question text and the center question text belong to the same question group;
and in a case where the similarity between any target question text and each center question text is smaller than or equal to the second threshold, determining the question group to which the target question text belongs according to the similarity between the target question text and each target question text in each question group, until the question group to which each target question text belongs is determined.
11. The apparatus of claim 10, wherein the grouping module is further configured to:
deleting any target question text in response to the target question text belonging to a plurality of question groups.
12. The apparatus of claim 7, wherein the candidate question set further comprises a label corresponding to each candidate question text, and the screening module is further configured to:
classifying the target question texts according to the labels corresponding to the target question texts, and determining the number of target question texts of each type;
and deleting any target question text in response to the number corresponding to the target question text being smaller than a third threshold.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310158523.8A CN116226345A (en) 2023-02-13 2023-02-13 Method and device for generating problem set and electronic equipment

Publications (1)

Publication Number Publication Date
CN116226345A true CN116226345A (en) 2023-06-06

Family

ID=86590573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310158523.8A Pending CN116226345A (en) 2023-02-13 2023-02-13 Method and device for generating problem set and electronic equipment

Country Status (1)

Country Link
CN (1) CN116226345A (en)

Similar Documents

Publication Publication Date Title
US20220358292A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
CN113553412B (en) Question-answering processing method, question-answering processing device, electronic equipment and storage medium
US20230004819A1 (en) Method and apparatus for training semantic retrieval network, electronic device and storage medium
CN113128209B (en) Method and device for generating word stock
CN115062718A (en) Language model training method and device, electronic equipment and storage medium
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN114547301A (en) Document processing method, document processing device, recognition model training equipment and storage medium
CN112699237B (en) Label determination method, device and storage medium
CN113408280A (en) Negative example construction method, device, equipment and storage medium
US20230070966A1 (en) Method for processing question, electronic device and storage medium
CN116402166A (en) Training method and device of prediction model, electronic equipment and storage medium
US20220198358A1 (en) Method for generating user interest profile, electronic device and storage medium
US20230052623A1 (en) Word mining method and apparatus, electronic device and readable storage medium
US20210342379A1 (en) Method and device for processing sentence, and storage medium
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium
CN114444514A (en) Semantic matching model training method, semantic matching method and related device
CN116226345A (en) Method and device for generating problem set and electronic equipment
CN115033701B (en) Text vector generation model training method, text classification method and related device
CN116069914B (en) Training data generation method, model training method and device
CN114818683B (en) Operation and maintenance method and device based on mobile terminal
CN114925185B (en) Interaction method, model training method, device, equipment and medium
US11907668B2 (en) Method for selecting annotated sample, apparatus, electronic device and storage medium
CN115048523B (en) Text classification method, device, equipment and storage medium
CN114201607B (en) Information processing method and device
CN116244432B (en) Pre-training method and device for language model and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination