CN110990389A - Method and device for simplifying question bank and computer readable storage medium

Info

Publication number: CN110990389A
Application number: CN201911211335.7A
Authority: CN
Inventors: 徐涛; 吴峰; 郭伟
Original assignee: Shanghai Yidianshikong Network Co Ltd
Current assignee: Shanghai Yidianshikong Network Co Ltd
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-04-10

Abstract

The application discloses a method and a device for simplifying a question bank, and a computer-readable storage medium. The method comprises the following steps: performing word segmentation preprocessing on the text content of the test questions in the question bank; performing text similarity calculation on any two test questions subjected to word segmentation preprocessing to construct a similarity matrix; and comparing each similarity value in the similarity matrix with a preset threshold, deleting one of the two test questions corresponding to a similarity value when that value exceeds the preset threshold, and deleting from the similarity matrix the similarity values calculated between the remaining test questions and the deleted test question. This solves the problem that many highly similar test questions appear when a user practices with existing driving test Apps.

Description

Method and device for simplifying question bank and computer readable storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for simplifying a question bank, and a computer-readable storage medium.

Background

With rising incomes and living standards, the number of vehicles in China has increased greatly. As the number of vehicles grows rapidly, the number of drivers needed grows with it, which drives demand for driver training services. In China, a motor vehicle driving license is obtained by passing the driving examination, and Subject One, the theory examination, requires a great deal of repeated practice to pass smoothly. The Subject One question bank is a nationally unified version and contains many very similar questions. During practice, existing Apps draw questions at random from the entire question bank, so the user repeatedly answers a large number of very similar test questions.

Disclosure of Invention

The main purpose of the application is to provide a method and a device for simplifying a question bank, and a computer-readable storage medium, so as to solve the problem that many highly similar test questions appear when a user practices questions with a driving test App.

To achieve the above object, according to one aspect of the present application, there is provided a method for simplifying a question bank.

The method for simplifying the question bank comprises the following steps:

performing word segmentation preprocessing on the text content of the test questions in the question bank;

performing text similarity calculation on any two test questions subjected to word segmentation preprocessing to construct a similarity matrix;

and comparing each similarity value in the similarity matrix with a preset threshold, deleting one of the two test questions corresponding to a similarity value when that value exceeds the preset threshold, and deleting from the similarity matrix the similarity values calculated between the remaining test questions and the deleted test question.

Further, the word segmentation preprocessing is performed on the text content of the test questions in the question bank, and includes:

performing word segmentation preprocessing on the text content of the test questions in the question bank to obtain a word segmentation set;

wherein each item of the word segmentation set comprises the segmented words of one line of test question text content.

Further, the text similarity calculation for any two test questions subjected to the word segmentation preprocessing includes:

constructing a corpus by using the text content of the test questions subjected to word segmentation preprocessing;

performing TF-IDF calculation on the corpus to obtain a word vector array;

and calculating the cosine similarity between any two vectors in the word vector array.

Further, the constructing a corpus by using the text content of the test questions preprocessed by the word segmentation includes:

constructing a word segmentation dictionary by using the text content of the test questions subjected to word segmentation preprocessing;

and generating the corpus based on the word segmentation dictionary.

Further, the comparing each similarity value in the similarity matrix with a preset threshold includes:

traversing the similarity matrix row by row or column by column, and comparing each similarity value in the similarity matrix with a preset threshold value.

The device for simplifying the question bank comprises:

the word segmentation preprocessing module is used for carrying out word segmentation preprocessing on the text content of the test questions in the question bank;

the similarity calculation module is used for performing text similarity calculation on any two test questions subjected to word segmentation preprocessing to construct a similarity matrix;

the threshold comparison module is used for comparing each similarity value in the similarity matrix with a preset threshold;

and the repeated test question deleting module is used for deleting one of the two test questions corresponding to a similarity value when that value exceeds the preset threshold, and deleting from the similarity matrix the similarity values calculated between the remaining test questions and the deleted test question.

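Further, in the word segmentation preprocessing module, the word segmentation preprocessing of the text content of the test questions in the question bank includes:

performing word segmentation preprocessing on the text content of the test questions in the question bank to obtain a word segmentation set;

wherein each item of the word segmentation set comprises the segmented words of one line of test question text content.

Further, in the similarity calculation module, the text similarity calculation for any two test questions subjected to the word segmentation preprocessing includes:

constructing a corpus by using the text content of the test questions subjected to word segmentation preprocessing;

performing TF-IDF calculation on the corpus to obtain a word vector array;

and calculating the cosine similarity between any two vectors in the word vector array.

Further, in the similarity calculation module, the constructing a corpus by using the text content of the test questions preprocessed by the word segmentation includes:

constructing a word segmentation dictionary by using the text content of the test questions subjected to word segmentation preprocessing;

and generating the corpus based on the word segmentation dictionary.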

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any one of the above methods for simplifying the question bank.

In the embodiments of the application, text similarity is calculated, a similarity matrix is constructed, and a similarity threshold is set, so that test questions with high similarity are eliminated. The question bank is thereby simplified and the user's practice efficiency is improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:

Fig. 1 is a flowchart illustrating a method for simplifying a question bank according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of an apparatus for simplifying a question bank according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an apparatus for simplifying a question bank according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the invention and its embodiments and are not intended to limit the indicated devices, elements or components to a particular orientation or to be constructed and operated in a particular orientation.

Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meanings of these terms in the present invention can be understood by those skilled in the art as appropriate.

Furthermore, the terms "mounted," "disposed," "provided," "connected," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; can be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements or components. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art according to specific situations.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 is a schematic flow chart of a method for simplifying a question bank according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps S101 to S103:

Step S101: performing word segmentation preprocessing on the text content of the test questions in the question bank.

The method for simplifying the question bank may be executed by a server or by a terminal device. Specifically, the test question bank is stored in the kjz_query table, which mainly includes the fields query_id (test question id), query (test question content), option_a (option A content), option_B (option B content), option_C (option C content), option_D (option D content), and comments (answer analysis of the test question). The invention uses the fields query_id and content (the contents of the query and comments fields combined) to generate the question bank sample file (named query.txt in the embodiment below). Sample entries of this file are as follows:

1. What kind of behavior is it to drive a motor vehicle on a road in violation of the road traffic safety law? The question stem clearly states "violation of the road traffic safety law", so the behavior is an illegal act. Officially, the terms "breach of rules" and "breach of regulations" are no longer used.

2. What liability is pursued according to law when a motor vehicle driver's illegal driving causes a major traffic accident that constitutes a crime? Where a violation of road traffic safety laws and regulations causes a major traffic accident that constitutes a crime, criminal liability is pursued according to law, and the traffic management department of the public security organ revokes the driving license. Put simply: where there is a crime, criminal liability is certainly pursued.

3. When a motor vehicle driver flees after causing an accident and this constitutes a crime, the driving license is revoked; for how long may the driver not obtain a driving license again? There is a mnemonic: "revoked two, cancelled three, drunk five, fled for life", namely: two years after a license is revoked, three years after a license is cancelled, five years for drunk driving, and a lifetime ban for fleeing. This case is a revocation for fleeing.

4. A traffic accident caused by driving a motor vehicle in violation of road traffic safety laws and regulations is a breach of traffic rules. The question stem clearly states that road traffic safety laws and regulations are violated, and violating laws and regulations is an illegal act. Officially, the terms "breach of rules" and "breach of regulations" are no longer used.

5. A person who drives a motor vehicle on a road in violation of road traffic regulations shall accept the corresponding penalty. Anyone who breaks the law must be prepared to be penalized.
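As an illustration of how such sample lines could be produced, the following is a minimal sketch in Python. It assumes the records have already been read from the table; the `QuestionRecord` class, the `build_sample_lines` helper, and the tab-separated "id, then content" line layout are hypothetical choices made for this sketch and are not specified by the source.

```python
from dataclasses import dataclass


@dataclass
class QuestionRecord:
    """One row of the question table (only the fields used here)."""
    query_id: int
    query: str     # test question content (the stem)
    comments: str  # answer analysis of the test question


def build_sample_lines(records):
    """Combine stem and analysis into one 'content' string per question.

    Assumption: one question per line, formatted as 'id<TAB>content'.
    """
    lines = []
    for rec in records:
        content = f"{rec.query} {rec.comments}"
        # Collapse all whitespace (including newlines) so that each
        # question occupies exactly one line of the sample file.
        content = " ".join(content.split())
        lines.append(f"{rec.query_id}\t{content}")
    return lines


records = [
    QuestionRecord(1, "Question stem A?", "Analysis of A."),
    QuestionRecord(2, "Question stem B?", "Analysis\nof B."),
]
sample_lines = build_sample_lines(records)
```

Keeping each question on a single line matters for the next step, where each item of the word segmentation set corresponds to one line of content.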

Here, the word segmentation preprocessing of the text content may use jieba (a Python-based Chinese word segmentation tool) to segment the content text of the question bank sample file.

In one embodiment, the word segmentation preprocessing of the text content of the test questions in the question bank includes: performing word segmentation preprocessing on the text content of the test questions in the question bank to obtain a word segmentation set; wherein each item of the word segmentation set comprises the segmented words of one line of test question text content. Specifically, for example, the content text in the query.txt file is segmented by using jieba to obtain a word segmentation set content_set, in which each item represents one line of content.

Step S102: performing text similarity calculation on any two test questions subjected to word segmentation preprocessing to construct a similarity matrix.
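The shape of the word segmentation set can be sketched as follows. To keep the example free of third-party dependencies, it uses a simple stand-in tokenizer rather than jieba: runs of ASCII letters or digits form one token, and every other non-space character forms its own token. The names `simple_tokenize` and `build_content_set` are hypothetical; in practice a real segmenter such as `jieba.lcut` could be passed as the `tokenize` argument.

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer (NOT jieba): ASCII alphanumeric runs are kept
    whole, every other non-whitespace character is a separate token."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def build_content_set(lines, tokenize=simple_tokenize):
    """Return one token list per input line, preserving line order."""
    return [tokenize(line) for line in lines]


content_lines = ["驾驶机动车 ABC 123", "hello world"]
content_set = build_content_set(content_lines)
```

Because the order of lines is preserved, the position of an item in `content_set` can later be mapped back to the corresponding test question.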

In one embodiment, the performing of the text similarity calculation on any two test questions subjected to the word segmentation preprocessing includes: constructing a corpus by using the text content of the test questions subjected to word segmentation preprocessing; performing TF-IDF calculation on the corpus to obtain a word vector array; and calculating the cosine similarity between any two vectors in the word vector array.

In this embodiment, a dictionary dic formed from all the segmented words may be computed from the word segmentation result by calling corpora.Dictionary of gensim, and a corpus is generated by using doc2bow of gensim, which is specifically implemented as follows:

from gensim import corpora, models, similarities

dic = corpora.Dictionary(content_set)

corpus = [dic.doc2bow(text) for text in content_set]

Furthermore, TfidfModel of gensim can be called to perform TF-IDF calculation on the corpus to obtain the word vector array tfidf_array, which is specifically implemented as follows:

tfidf_array = models.TfidfModel(corpus)

Here, TF-IDF is short for Term Frequency-Inverse Document Frequency, a weighting technique commonly used in information retrieval and data mining; TF refers to term frequency, and IDF refers to inverse document frequency.
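For reference, one common textbook form of the TF-IDF weight and of the cosine similarity used in this step is given below. This is a general definition stated as an assumption, not a formula taken from the source: here $\mathrm{tf}(t,d)$ is the number of occurrences of term $t$ in test question $d$, $\mathrm{df}(t)$ is the number of test questions containing $t$, and $N$ is the total number of test questions. The logarithm base and any vector normalization vary between implementations.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)},
\qquad
\cos(\mathbf{u}, \mathbf{v}) =
  \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
```

A term that occurs in every test question has $\mathrm{df}(t) = N$ and therefore weight zero, so it does not contribute to the similarity between two questions.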

Then, SparseMatrixSimilarity of gensim may be called on the TF-IDF-weighted corpus tfidf_array[corpus] to calculate the cosine similarity between every two vectors, giving sim_array; each row of sim_array represents the similarity between one test question and the other test questions. The specific implementation is as follows:

sim_array = similarities.SparseMatrixSimilarity(tfidf_array[corpus], num_features=len(dic.keys()))

In one embodiment, the constructing of a corpus using the text content of the test questions preprocessed by the word segmentation includes: constructing a word segmentation dictionary by using the text content of the test questions subjected to word segmentation preprocessing; and generating the corpus based on the word segmentation dictionary.
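The computation performed in step S102 can also be illustrated without third-party libraries. The following is a minimal sketch, assuming a TF-IDF variant with a base-2 logarithm and L2-normalized vectors; it is not claimed to reproduce the numerical output of gensim, and the function names `tfidf_vectors`, `cosine` and `similarity_matrix` are hypothetical. Sparse vectors are represented as dictionaries keyed by token, which plays the role of the dictionary and bag-of-words corpus described above.

```python
import math
from collections import Counter


def tfidf_vectors(content_set):
    """Return one sparse, L2-normalized TF-IDF vector per document."""
    n_docs = len(content_set)
    doc_freq = Counter()
    for tokens in content_set:
        doc_freq.update(set(tokens))  # count each token once per document
    vectors = []
    for tokens in content_set:
        counts = Counter(tokens)
        vec = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Dot product of two sparse vectors; equals the cosine similarity
    because both vectors are already L2-normalized (or all zero)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_matrix(content_set):
    """Square matrix whose entry [i][j] is the similarity of questions i and j."""
    vecs = tfidf_vectors(content_set)
    return [[cosine(a, b) for b in vecs] for a in vecs]


content_set = [
    ["red", "light", "stop"],
    ["red", "light", "stop"],
    ["seat", "belt", "fasten"],
]
sim = similarity_matrix(content_set)
```

In this toy input the first two token lists are identical, so their similarity is 1 up to floating-point error, while the third shares no token with them and has similarity 0. Note that a document consisting only of tokens that occur in every document gets an all-zero vector under this weighting, so its similarity to everything, including itself, is 0.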

Step S103: comparing each similarity value in the similarity matrix with a preset threshold, deleting one of the two test questions corresponding to a similarity value when that value exceeds the preset threshold, and deleting from the similarity matrix the similarity values calculated between the remaining test questions and the deleted test question.

For example, the similarity threshold may be set to 0.9. Each similarity value in the similarity matrix is compared with 0.9; when a similarity value exceeds 0.9, one of the two test questions corresponding to that value is deleted, and the similarity values calculated between the remaining test questions and the deleted test question are deleted from the similarity matrix.

In one embodiment, the comparing of each similarity value in the similarity matrix with a preset threshold includes: traversing the similarity matrix row by row or column by column, and comparing each similarity value in the similarity matrix with the preset threshold. Specifically, for example, the matrix may be traversed row by row. If the first row is found to contain a similarity value of 0.95, which exceeds 0.9, and that value is the similarity between test question 1 and test question 4, the first row is deleted, that is, test question 1 is deleted, and the column holding the similarity values calculated between the remaining test questions and test question 1 is also deleted. The second row, the third row, and so on are then processed in the same manner; once all rows have been traversed, the elimination of duplicate test questions is complete.
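The row-by-row traversal can be sketched as follows. This is a minimal illustration under two assumptions that the source does not state explicitly: the diagonal entry of each row (a question compared with itself) is skipped, and a deleted row and column are tracked with a set of removed indices instead of physically shrinking the matrix. The function name `simplify_question_bank` is hypothetical.

```python
def simplify_question_bank(sim, threshold=0.9):
    """Return the indices of the questions kept after duplicate removal.

    Rows are visited in order. If row i contains a similarity above the
    threshold with another question that has not been deleted, question i
    is deleted; recording i in `removed` is equivalent to deleting row i
    and column i from the similarity matrix.
    """
    n = len(sim)
    removed = set()
    for i in range(n):
        for j in range(n):
            if j == i or j in removed:
                continue  # skip self-similarity and deleted columns
            if sim[i][j] > threshold:
                removed.add(i)
                break
    return [i for i in range(n) if i not in removed]


# Index 0 corresponds to test question 1, index 3 to test question 4.
sim = [
    [1.00, 0.20, 0.10, 0.95],
    [0.20, 1.00, 0.30, 0.10],
    [0.10, 0.30, 1.00, 0.92],
    [0.95, 0.10, 0.92, 1.00],
]
kept = simplify_question_bank(sim)
```

With the threshold 0.9, the first row contains 0.95 (question 1 against question 4), so question 1 is deleted; the third row contains 0.92 against question 4, so question 3 is deleted as well, and questions 2 and 4 remain. Because the columns of deleted questions are skipped, at least one question of every group of near-duplicates is always kept.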

From the above description, it can be seen that the present invention achieves the following technical effects:

The test questions are converted from text into a vector representation by means of word segmentation and TF-IDF calculation, the cosine similarity between vectors is calculated to represent the degree of similarity between test questions, and the number of test questions in the simplified question bank is controlled by setting different similarity thresholds, so that similar test questions are removed from the question bank.

According to an embodiment of the present invention, there is also provided an apparatus for implementing the method for simplifying the question bank. As shown in Fig. 2, the apparatus includes:

a word segmentation preprocessing module 201, configured to perform word segmentation preprocessing on text contents of test questions in the question bank;

the similarity calculation module 202 is configured to perform text similarity calculation on any two test questions subjected to the word segmentation preprocessing, and construct a similarity matrix;

a threshold comparison module 203, configured to compare each similarity value in the similarity matrix with a preset threshold;

and the repeated test question deleting module 204, configured to delete one of the two test questions corresponding to a similarity value when that value exceeds the preset threshold, and to delete from the similarity matrix the similarity values calculated between the remaining test questions and the deleted test question.

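In one embodiment, in the word segmentation preprocessing module 201, the performing of word segmentation preprocessing on the text content of the test questions in the question bank includes:

performing word segmentation preprocessing on the text content of the test questions in the question bank to obtain a word segmentation set;

wherein each item of the word segmentation set comprises the segmented words of one line of test question text content.

In one embodiment, in the similarity calculation module 202, the performing of text similarity calculation on any two test questions subjected to the word segmentation preprocessing includes:

constructing a corpus by using the text content of the test questions subjected to word segmentation preprocessing;

performing TF-IDF calculation on the corpus to obtain a word vector array;

and calculating the cosine similarity between any two vectors in the word vector array.

In one embodiment, in the similarity calculation module 202, the constructing of a corpus using the text content of the test questions subjected to the word segmentation preprocessing includes:

constructing a word segmentation dictionary by using the text content of the test questions subjected to word segmentation preprocessing;

and generating the corpus based on the word segmentation dictionary.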

In the embodiment of the invention, the device 200 for simplifying the question bank converts the test questions from text into a vector representation by using word segmentation and TF-IDF calculation, calculates the cosine similarity between the vectors to represent the degree of similarity between test questions, and controls the number of test questions in the simplified question bank by setting different similarity thresholds, thereby removing similar test questions from the question bank.

Those skilled in the art will appreciate that the functions implemented by the modules of the device 200 for simplifying the question bank shown in Fig. 2 can be understood by referring to the description of the method for simplifying the question bank. The functions of these modules can be implemented by a program running on a processor, or by specific logic circuits.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Fig. 3 is a schematic structural diagram of an apparatus for simplifying a question bank according to an embodiment of the present invention. The apparatus 300 for simplifying the question bank shown in Fig. 3 is disposed on a terminal and includes: at least one processor 301, a memory 302, a user interface 303, and at least one network interface 304. The components of the apparatus 300 are coupled together by a bus system 305. It will be appreciated that the bus system 305 is used to enable communication among these components. In addition to a data bus, the bus system 305 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 305 in Fig. 3.

The user interface 303 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.

The memory 302 in the embodiment of the present invention is used for storing various types of data to support the operation of the apparatus 300 for simplifying the question bank. Examples of such data include any computer program to be run on the apparatus 300, such as an operating system 3021 and application programs 3022. The operating system 3021 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and for processing hardware-based tasks. The application programs 3022 may include various application programs for implementing various application services. A program implementing the method of an embodiment of the present invention may be included in the application programs 3022.

The method disclosed in the above embodiments of the present invention may be applied to the processor 301, or implemented by the processor 301. The processor 301 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 301. The processor 301 described above may be a general purpose processor, a digital signal processor, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. Processor 301 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed by the embodiment of the invention can be directly implemented by a hardware decoding processor, or can be implemented by combining hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 302, and the processor 301 reads the information in the memory 302 and performs the steps of the aforementioned methods in conjunction with its hardware.

It will be appreciated that the memory 302 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 302 described in connection with the embodiments of the invention is intended to comprise, without being limited to, these and any other suitable types of memory.

Based on the method for simplifying the question bank provided by the embodiments of the present application, the present application further provides a computer-readable storage medium. As shown in Fig. 3, the computer-readable storage medium may include a memory 302 for storing a computer program that is executable by the processor 301 of the apparatus 300 for simplifying the question bank to perform the steps of the method described above. The computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disc, or CD-ROM.

It should be noted that: the technical schemes described in the embodiments of the present invention can be combined arbitrarily without conflict.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

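1. A method for simplifying a question bank, comprising:

performing word segmentation preprocessing on the text content of the test questions in the question bank;

performing text similarity calculation on any two test questions subjected to word segmentation preprocessing to construct a similarity matrix;

and comparing each similarity value in the similarity matrix with a preset threshold, deleting one of the two test questions corresponding to a similarity value when that value exceeds the preset threshold, and deleting from the similarity matrix the similarity values calculated between the remaining test questions and the deleted test question.

2. The method for simplifying the question bank according to claim 1, wherein the word segmentation preprocessing of the text content of the test questions in the question bank comprises:

performing word segmentation preprocessing on the text content of the test questions in the question bank to obtain a word segmentation set;

wherein each item of the word segmentation set comprises the segmented words of one line of test question text content.

3. The method for simplifying the question bank according to claim 1, wherein the performing of the text similarity calculation on any two test questions subjected to the word segmentation preprocessing comprises:

constructing a corpus by using the text content of the test questions subjected to word segmentation preprocessing;

performing TF-IDF calculation on the corpus to obtain a word vector array;

and calculating the cosine similarity between any two vectors in the word vector array.

4. The method according to claim 3, wherein the constructing of a corpus using the text content of the test questions preprocessed by the word segmentation comprises:

constructing a word segmentation dictionary by using the text content of the test questions subjected to word segmentation preprocessing;

and generating the corpus based on the word segmentation dictionary.

5. The method according to claim 1, wherein the comparing of each similarity value in the similarity matrix with a preset threshold comprises:

traversing the similarity matrix row by row or column by column, and comparing each similarity value in the similarity matrix with the preset threshold.

6. An apparatus for simplifying a question bank, comprising:

a word segmentation preprocessing module, configured to perform word segmentation preprocessing on the text content of the test questions in the question bank;

a similarity calculation module, configured to perform text similarity calculation on any two test questions subjected to word segmentation preprocessing to construct a similarity matrix;

a threshold comparison module, configured to compare each similarity value in the similarity matrix with a preset threshold;

and a repeated test question deleting module, configured to delete one of the two test questions corresponding to a similarity value when that value exceeds the preset threshold, and to delete from the similarity matrix the similarity values calculated between the remaining test questions and the deleted test question.

7. The apparatus for simplifying the question bank according to claim 6, wherein in the word segmentation preprocessing module, the word segmentation preprocessing of the text content of the test questions in the question bank comprises:

performing word segmentation preprocessing on the text content of the test questions in the question bank to obtain a word segmentation set;

wherein each item of the word segmentation set comprises the segmented words of one line of test question text content.

8. The apparatus for simplifying the question bank according to claim 6, wherein in the similarity calculation module, the text similarity calculation for any two test questions subjected to the word segmentation preprocessing comprises:

constructing a corpus by using the text content of the test questions subjected to word segmentation preprocessing;

performing TF-IDF calculation on the corpus to obtain a word vector array;

and calculating the cosine similarity between any two vectors in the word vector array.

9. The apparatus for simplifying the question bank according to claim 8, wherein in the similarity calculation module, the constructing of a corpus using the text content of the test questions preprocessed by the word segmentation comprises:

constructing a word segmentation dictionary by using the text content of the test questions subjected to word segmentation preprocessing;

and generating the corpus based on the word segmentation dictionary.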

10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method for simplifying a question bank according to any one of claims 1 to 5.