CN109885672A

CN109885672A - A question-answering intelligent retrieval system and method for online education

Info

Publication number: CN109885672A
Application number: CN201910159421.1A
Authority: CN
Inventors: 杨燕; 曲瑛琪; 李国斌; 周新运; 虞海江; 白琳; 孙禹
Original assignee: Beijing Open Distance Education Center Co ltd; Institute of Software of CAS
Current assignee: Beijing Open Distance Education Center Co ltd; Institute of Software of CAS
Priority date: 2019-03-04
Filing date: 2019-03-04
Publication date: 2019-06-14
Anticipated expiration: 2039-03-04
Also published as: CN109885672B

Abstract

The present invention relates to an intelligent retrieval system and method for online education, comprising a student status information module, a question parsing module, a document retrieval module, a passage retrieval module, and an answer extraction module. Based on a student state model, the invention provides accurate, personalized retrieval, and returns satisfactory answers to users by combining the BM25 retrieval algorithm with the DR-ASF intelligent retrieval algorithm, which adds hand-crafted similarity features. Question-answering passage retrieval reduces the system's time and space overhead and enables real-time, semantic-level interaction between user and system.

Description

A question-answering intelligent retrieval system and method for online education

Technical field

The present invention relates to a question-answering intelligent retrieval system and method, and belongs to the field of computer and Internet information technology.

Background art

With the rapid development of the Internet, the deepening of informatization, and the steadily improving performance of mobile devices, online education (e-learning) has emerged. By using Internet technology to spread knowledge quickly, it makes the ways people acquire knowledge diverse and flexible, and frees teaching and learning from constraints of time and place. As online education and big data technology continue to develop, many of the problems that online education users encounter can be solved with big data technology. Intelligent customer service, a typical application of artificial intelligence, has opened a new customer service model for the online education field. It can provide automatic question answering: given a question posed by a user in natural language, it returns a specific answer. This is a research area that combines information retrieval with natural language processing.

In a question-answering intelligent retrieval system for online education, users can ask questions in natural language rather than as keyword combinations. Having understood the user's need, the system responds accurately based on the student's current state, applying information retrieval techniques to a plain-text corpus and returning an accurate, personalized answer rather than a list of documents. The main technical problem is therefore how to take full account of the characteristics of online education users and data, extract more feature information from educational data resources and from the student's learning state, and at the same time satisfy the timeliness requirements of an online question answering system, so as to improve user satisfaction.

Summary of the invention

The technical problem solved by the invention: in view of the characteristics of online education users and data, a question-answering intelligent retrieval system and method for online education is provided. Using the BM25 algorithm and the DR-ASF intelligent retrieval algorithm with added hand-crafted similarity features, and based on a student state model, it provides accurate, personalized retrieval and returns satisfactory answers to users. Question-answering passage retrieval effectively reduces the system's time and space overhead and enables real-time, semantic-level interaction between user and system.

The technical solution of the invention: an intelligent retrieval system for online education, comprising a student status information module, a question parsing module, a document retrieval module, a passage retrieval module, and an answer extraction module, in which:

The student status information module: when a user asks a question through online question answering, this module accesses the user profile database and generates the user's state information from the user ID (in an online education system, the user is specifically a student, and the user state information is specifically the student status information). The module is called by the document retrieval module; its input parameter is the student ID and its output is the student status information.

The question parsing module is divided into an offline submodule and an online submodule. The offline submodule uses a training corpus and the SVM classification algorithm to train the intent classification model offline and deploys the model to the online submodule. The online submodule is called by the document retrieval module; it semantically parses the input user question, identifies the intent with the trained intent classification model, and then expands word senses through question restatement to obtain a more accurate and more complete set of question feature words. This feature word set is returned to the document retrieval module as the output of the question parsing module.

The document retrieval module combines student status information with domain business rules to retrieve, from the document knowledge base, the document that best matches the question. It comprises a knowledge base management submodule and a retrieval submodule. The management submodule stores notice-type documents. Such documents are usually issued to enrolled students by the colleges and universities affiliated with the online education institution, and their titles and contents follow a unified format; the submodule therefore parses titles and contents with regular expressions, extracts a keyword table, and saves it. The retrieval submodule takes the question feature word set produced by the question parsing module and the student status information supplied by the student status information module, and uses regular-expression matching and business rules to search the keyword tables and document contents in the knowledge base and locate the target document. Introducing student status information improves recognition of the question's intent and the accuracy of target document location. Finally, the target document and the question feature word set are passed together to the passage retrieval module as input parameters.

The passage retrieval module is the intermediate module between the document retrieval module and the answer extraction module. It takes the question feature word set and the target document passed in by the document retrieval module, performs semantic retrieval on the target document based on the feature word set, and extracts the target passages most relevant to the question and most likely to contain the answer. A retrieved target passage can be handled in two ways. First, it can be returned directly to the questioner as the answer; this uses probability-based retrieval, with low algorithmic complexity and low time and space overhead. Second, it can be passed as a parameter to the answer extraction module, giving that module a higher-quality, smaller input and balancing the system's answering efficiency and accuracy.

The answer extraction module is divided into an offline submodule and an online submodule. The offline submodule uses a training corpus and the deep-learning-based DR-ASF algorithm to train the DR-ASF intelligent retrieval model offline and deploys it to the online submodule. The online submodule takes the retrieval result passed in by the passage retrieval module, locates the start and end positions of the answer within the candidate target passage, and returns the extracted, more precise answer sentence to the questioner. Because it adds hand-crafted similarity features, the DR-ASF model extracts answers better than the plain DR model. And because passages serve as the middle layer between the document retrieval module and the answer extraction module, the size of the input is greatly reduced, which also helps meet the timeliness requirements of an online question answering system.

In the answer extraction module, the DR-ASF model builds on the DR (Document Reader) model in DrQA. During document vectorization it adds three hand-crafted similarity features: Q-D full match, Jaccard similarity, and relative edit distance. These help it understand the document semantically and extract matching information between the question and the document sentences. The question and the document are encoded with a multi-layer bidirectional long short-term memory (LSTM) network, and the degree of match between them is measured by bilinear similarity to predict the answer's position range. This improves question answering accuracy and yields a DR-ASF intelligent retrieval model optimized with hand-crafted similarity features.
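The bilinear matching step can be illustrated with a minimal sketch. It assumes the encoders have already produced one vector per passage token and a single question vector; the function names, the weight matrix W, the softmax over tokens, and the max_len limit are illustrative assumptions, not details taken from the patent. In a reader of this kind, separate weight matrices would typically be learned for the start and end positions.

```python
import math


def bilinear_scores(passage_vecs, question_vec, W):
    """Score each passage token i as p_i^T W q, then apply a softmax over tokens.

    passage_vecs: list of token vectors (assumed encoder output)
    question_vec: single question vector (assumed encoder output)
    W: weight matrix with one row per passage dimension (hypothetical, learned in training)
    """
    # W q: project the question into the passage vector space.
    wq = [sum(w * q for w, q in zip(row, question_vec)) for row in W]
    logits = [sum(p * x for p, x in zip(vec, wq)) for vec in passage_vecs]
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_probs, end_probs, max_len=15):
    """Pick the (start, end) pair with start <= end that maximizes P_start * P_end."""
    best, best_score = (0, 0), -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```

Calling bilinear_scores once with a start matrix and once with an end matrix gives the two distributions that best_span combines into an answer span.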

In the answer extraction module, the deep-learning-based DR-ASF algorithm is implemented as follows:

(1) The offline submodule trains the intelligent retrieval model offline using the training corpus, in the following steps:

1) The offline submodule produces a vector representation of the training corpus (that is, questions and answer documents):

A) Segment the question and the answer document into words and remove stop words, obtaining the word sequences of the question and of the answer document;

B) Using word vectors pre-trained on a large-scale corpus with the word2vec tool, represent each word in the question word sequence by its word vector, completing the vectorization of the question;

C) For the answer document, first obtain its word vectors as in step B). Then obtain the other feature values of the answer document: the POS (part-of-speech tagging) feature vector, the NER (named entity recognition) feature vector, and the hand-crafted similarity feature vector. The POS and NER feature vectors are sized by the total number of part-of-speech tag types and the total number of named entity types, respectively. The hand-crafted similarity features express the relationship between the question and the answer document and consist of three parts: Q-D full match, the Jaccard coefficient, and relative edit distance. Concatenating the document's word vector features, part-of-speech features, named entity features, and the three hand-crafted similarity features gives the vectorized representation of the answer document;

2) Once the question and answer document are vectorized, a multi-layer bidirectional LSTM (stacked BiLSTM) is used as the encoder to encode the input question matrix and answer document matrix into fixed-length vectors, giving the encoded representations of the question and of the answer document. With the hand-crafted similarity features added, the encoding of the answer document gives more weight to words in the document that are close to the question;

3) Taking the question encoding generated in step 2) as input, apply self-attention, a form of attention mechanism, to perform a weighted transformation within the sentence, learn the word dependencies inside it, and obtain a new vector representation of the question;

4) Finally, taking as input the answer document encoding from step 2) and the new question encoding from step 3), measure the degree of match between question and answer document by bilinear similarity and predict the start and end positions of the answer in the answer document.

(2) The offline submodule builds the DR-ASF intelligent retrieval model through iterative training;

(3) The offline submodule deploys the DR-ASF intelligent retrieval model to the online submodule;

(4) The online submodule first preprocesses the question and the target passage. The feature word set of the user question, passed in from the passage retrieval module, is vectorized using the pre-trained word vectors. The target passage is vectorized as the answer document, following the same steps the offline submodule applies to answer documents, to obtain its vectorization matrix. The question matrix and the target passage matrix are then given as input to the DR-ASF intelligent retrieval model, whose output is the precise answer to the question, which is returned to the questioner.
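The three hand-crafted similarity features named in step C) can be sketched as follows. This is an illustration under assumptions: the text above does not give exact definitions, so Q-D full match is taken here as a per-token flag marking document words that also occur in the question, and relative edit distance is taken as Levenshtein distance divided by the longer length. All function names are hypothetical.

```python
def exact_match(doc_tokens, question_tokens):
    """Q-D full match (assumed form): 1.0 for each document token that also occurs in the question."""
    question_words = set(question_tokens)
    return [1.0 if token in question_words else 0.0 for token in doc_tokens]


def jaccard(a_tokens, b_tokens):
    """Jaccard coefficient of two token sets: |A & B| / |A | B|."""
    a, b = set(a_tokens), set(b_tokens)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def levenshtein(a, b):
    """Classic dynamic-programming edit distance; works on strings or token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def relative_edit_distance(a, b):
    """Edit distance normalized by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

In a pipeline like the one described, these values would be concatenated with the word, POS, and NER features of each document token before encoding.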

The business rules of the document retrieval module work as follows:

(1) In the knowledge base management submodule, document titles and contents are parsed with regular expressions to extract a keyword table for each document, in the following steps:

1) Establish document type labels according to the business rules;

2) When a new document is stored in the knowledge base, the system's built-in regular expressions automatically process the notice title and content, extract the keyword fields, and generate the keyword table;

3) Normalize the extracted keyword table, mainly by automatically matching the document type and normalizing the batch;

4) Store the document content as files, classified under the directory of the corresponding document type;

5) Store the keyword table extracted from the document, the document title, and the document storage path as fields in a MySQL database table; the table automatically generates a document ID as the primary key;

(2) The knowledge base retrieval submodule takes the question feature word set produced by the question parsing module and the student status information supplied by the student status information module, uses regular-expression matching and business rules to search the document keyword tables and contents, locates the target document, and sends the answer document ID and the user question to the passage retrieval module, in the following steps:

1) Different types of user question are assigned to different notice types according to the business rules;

2) The relevant fields of the student status information are used with heuristic rules to narrow the range of document types searched;

3) Finally, within the set of notice documents of the same type, the one with the most recent issue date is returned as the target document.
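The three retrieval steps above can be sketched as a filter followed by a newest-first selection. The Notice record, its field names, and the use of school and batch as the narrowing fields are assumptions made for illustration; the patent stores keyword tables in MySQL and does not spell out the rule set.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Notice:
    """Hypothetical row of the document knowledge base."""
    doc_id: int
    doc_type: str    # notice type assigned by business rules
    school: str
    batch: str
    keywords: set    # keyword table extracted from title and content
    published: date


def locate_target(notices, doc_type, school, batch, feature_words):
    """Return the newest notice of the given type that matches the student's state.

    Step 1: keep notices of the type the question intent maps to.
    Step 2: narrow by student state (school and batch here).
    Step 3: among the remaining notices, return the most recently issued one.
    """
    wanted = set(feature_words)
    candidates = [
        n for n in notices
        if n.doc_type == doc_type
        and n.school == school
        and n.batch == batch
        and n.keywords & wanted
    ]
    return max(candidates, key=lambda n: n.published, default=None)
```

Returning None when nothing matches leaves the fallback behaviour, for example a generic reply, to the caller.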

In the question parsing module, intent classification and question restatement are implemented as follows:

(1) Question intent classification: the SVM classification algorithm classifies the question's intent, with TF-IDF as the feature vector. Because a question contains relatively little text, 1-gram and 2-gram models are used to increase the amount of information. The steps are as follows:

1) Select and extract the question feature items by word frequency: segment the questions and remove stop words; count the keywords of each question class and their frequencies, and take the 5 keywords with the highest frequency as that class's feature word set; merge the feature word sets of the different classes into a total feature word set;

2) Use TF-IDF features as the language model representation of the question;

3) Then apply linear (min-max) normalization to bound the range of the data and reduce the gap between the frequencies of different words, obtaining the final feature set representation of the question;

4) Compare the question's feature set representation with the feature set of each question class for similarity, and select the class whose feature set is most similar as the question's type;

5) The offline submodule of the question parsing module trains the intent classification model offline and deploys the trained model to the online submodule;

(2) Question restatement: based on users' frequently asked questions and the business norms, a synonym table is compiled and used to expand the keywords in the question. Each word in the keyword set is looked up in the synonym table, and if synonyms exist, all synonyms of that word are added to the keyword set;

(3) The online submodule of the question parsing module is called by the document retrieval module and semantically parses the user question: it first identifies the intent with the trained intent classification model, then expands word senses through question restatement. The resulting keyword set is the question feature word set, which is returned to the document retrieval module as the output of the question parsing module.
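The synonym expansion in step (2) amounts to a table lookup per keyword. A minimal sketch, assuming the synonym table is a plain mapping from a word to its list of synonyms (the function name and table layout are illustrative):

```python
def expand_keywords(keywords, synonym_table):
    """Append every synonym of every keyword, keeping the original order and skipping duplicates."""
    expanded = list(keywords)
    seen = set(keywords)
    for word in keywords:
        for synonym in synonym_table.get(word, ()):
            if synonym not in seen:
                seen.add(synonym)
                expanded.append(synonym)
    return expanded
```

For example, with a table mapping "exam" to ["test", "examination"], the keyword list ["exam", "fee"] expands to ["exam", "fee", "test", "examination"].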

An intelligent retrieval method for online education according to the invention comprises an online process and an offline process;

Online process:

(1) The student logs in to the system under their real name and enters a question in natural language on the system's retrieval question-answering interface;

(2) The system obtains the student ID from the user information and calls the document retrieval module, with the student ID and the user question as input parameters;

(3) The retrieval submodule of the document retrieval module first calls the student status information module with the student ID as input. The student status information module accesses the user profile database, generates the user's student status information from the student ID, and returns it to the retrieval submodule. The output comprises the student's school, enrollment batch, student-registration batch, thesis batch, examination batch, graduation batch, and study progress;

(4) The retrieval submodule then calls the question parsing module with the user question as input. The question parsing module parses the question, identifies its intent and feature information, and returns the resulting question feature word set to the retrieval submodule;

(5) Using the return values of these two modules, the student status information and the question feature word set, the retrieval submodule applies the business rules and regular-expression matching to search the keyword tables and document contents in the knowledge base and locate the target document that can answer the user question;

(6) The retrieval submodule then passes the target document and the question feature word set returned by the question parsing module to the passage retrieval module, which performs semantic retrieval on the target document with the BM25 algorithm and outputs the target passages most relevant to the question;

(7) The system offers two question-answering response modes, chosen by the user:

1) If the user keeps the default quick response mode, the retrieved target passage is returned directly to the questioner as the answer. This mode is simple to use, has low algorithmic complexity and low time and space overhead, and still achieves a relatively high recall;

2) If the user selects the precise answer mode, the retrieved target passage is passed as input to the answer extraction module, which uses the DR-ASF intelligent retrieval model, trained with the added hand-crafted similarity features, to locate the start and end positions of the answer in the candidate target passage, extract a more precise answer, and return it to the questioner.
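The passage retrieval in step (6) can be illustrated with a small BM25 scorer over the passages of one target document. This is a sketch, not the patent's implementation: the parameter values k1 = 1.5 and b = 0.75 are common defaults, the IDF uses the non-negative variant log((N - df + 0.5) / (df + 0.5) + 1), and passages are assumed to be pre-tokenized lists of words.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, passages, k1=1.5, b=0.75):
    """Score each tokenized passage against the query with Okapi BM25."""
    n = len(passages)
    avg_len = sum(len(p) for p in passages) / n if n else 0.0
    # Document frequency: number of passages containing each term.
    df = Counter()
    for passage in passages:
        df.update(set(passage))
    scores = []
    for passage in passages:
        tf = Counter(passage)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(passage) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_passage(query_tokens, passages):
    """Index of the best-scoring passage, or None when there are no passages."""
    if not passages:
        return None
    scores = bm25_scores(query_tokens, passages)
    return max(range(len(scores)), key=scores.__getitem__)
```

In quick response mode the passage at the returned index would be sent back as the answer; in precise answer mode it would be handed to the answer extraction module.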

Offline process:

(1) The knowledge base management submodule of the document retrieval module stores and processes notice-type documents in the offline phase. When a new notice document arrives, the submodule first saves the document title and content, parses them with regular expressions, and stores the extracted keyword table in the knowledge base as well. The submodule is called only when a new notice document arrives; its task ends once the notice has been parsed and the output, namely the document title, content, and keyword table, has been saved to the knowledge base.

(2) The offline submodule of the question parsing module uses the training corpus and the SVM classification algorithm to train question intent recognition, and deploys the trained intent classification model to the online submodule to provide online question parsing. The submodule runs while the system is offline; its input is the training corpus and its output is the trained intent classification model.

(3) The offline submodule of the answer extraction module uses the training corpus and the DR-ASF algorithm with added hand-crafted similarity features to train the DR-ASF intelligent retrieval model, and deploys the trained model to the online submodule to provide online answer extraction. The submodule runs while the system is offline; its input is the training corpus and its output is the trained DR-ASF intelligent retrieval model.

The advantages of the present invention over the prior art are that:

(1) Existing automatic customer-service question answering systems in online education cannot give personalized, accurate answers. To address this, the invention integrates student status information into the online question-answering retrieval system to improve recognition of question intent. Regular-expression matching based on business rules can precisely retrieve the relevant notice documents for a student's school, batch, and current study stage, instead of returning a generic answer such as "For details, please check the relevant school notice", thereby giving users a more personalized question answering experience.

(2) The answer extraction technique combines deep learning and improves on the existing DrQA model, proposing the DR-ASF intelligent retrieval model based on hand-crafted similarity features. Adding hand-crafted similarity features when vectorizing the document lets the model understand the document better semantically and extract matching information between the question and document sentences. The question and document are encoded with a multi-layer bidirectional LSTM network, and the degree of match between them is measured by bilinear similarity to predict the answer's position range, further improving question answering accuracy.

(3) In view of the characteristics of online education users and data, more feature information is extracted from educational data resources and from the student's learning state. The accuracy and timeliness requirements of an online question answering system are also considered: passages serve as the middle layer between the document retrieval module and the answer extraction module, reducing the size of the input. The system implements two answer generation modes: 1) quick response mode, which uses probability-based retrieval, is simple to use, has low algorithmic complexity and low time and space overhead, and achieves relatively high recall; 2) precise answer mode, which uses deep-learning-based answer extraction, has higher algorithmic complexity, and can provide higher accuracy.

Brief description of the drawings

Fig. 1 is a flow diagram of the question-answering intelligent retrieval system of the invention;

Fig. 2 is a flow diagram of the document retrieval module of the invention;

Fig. 3 is a flow diagram of the passage retrieval module of the invention;

Fig. 4 is a schematic diagram of the answer extraction model of the invention.

Specific embodiment

The invention is described in detail below with reference to the drawings and embodiments.

As shown in Fig. 1, the question-answering intelligent retrieval system for online education consists of a hardware platform and a software system. The hardware platform is arranged as follows: the offline submodules require separately deployed multi-GPU servers for model training, whose performance and number can be scaled with the size of the training corpus, and the offline submodules may share servers during training; the online modules are deployed on a server cluster that acts as the system's server side and provides the online question answering service; the databases used by the system are deployed on a dedicated database server; the client is a Web browser, with no special hardware requirements.

The software system consists of the student status information module, the question parsing module, the document retrieval module, the passage retrieval module, and the answer extraction module, and is implemented as follows:

(1) The student status information module is a basic component of the system. When a user logs in and asks a question through the online question-answering retrieval system, the module first accesses the user profile database and generates, from the student ID, the student status information relevant to question answering for that user, to be called by the document retrieval module.

(2) The document retrieval module calls the online submodule of the question parsing module to parse the question semantically; through intent recognition and question restatement it obtains the question feature word set. The offline submodule of the question parsing module trains the intent classification model offline on the training corpus and deploys the trained model to the online submodule.

(3) The system passes the question feature word set produced by the question parsing module and the student status information supplied by the student status information module to the document retrieval module as parameters. The document retrieval module is divided into a knowledge base management submodule and a retrieval submodule. The management submodule stores the notice-type documents that the affiliated colleges and universities issue to enrolled students at each stage, usually indexed by notice category and notice date. For every document stored in the knowledge base, the title and content are therefore parsed with regular expressions, and the document's keyword table is extracted and saved along with the content. The retrieval submodule uses regular expressions and business rules to search the keyword tables and document contents in the knowledge base, determines the target document, and passes the target document and the user question to the passage retrieval module as parameters.

(4) Using the question analysis result, the passage retrieval module performs semantic retrieval over the content of the target document with the BM25 algorithm and extracts the passages most relevant to the question as those most likely to contain the answer. The retrieved passages are handled in one of two ways: if the user keeps the default quick response mode, the target passage with the highest similarity is returned directly to the questioner as the answer; if the user selects the precise answer mode, the process continues with step (5).

(5) The system passes the target passage returned by the previous step to the answer extraction module, which is divided into an online submodule and an offline submodule. The offline submodule trains the DR-ASF intelligent retrieval model offline on the training corpus and deploys the trained model to the online submodule. The online submodule calls the trained model to extract the precise answer from the candidate target passage and returns it to the questioner.

Each module is implemented as follows:

1. Student status information module

The student status information module is one of the system's basic components; its main function is to obtain and save student status information. To provide better personalized service, online education providers usually build user profiles of students and their learning states. A user profile describes a student's static and dynamic information comprehensively: static information such as school, major, and batch, and dynamic information such as study progress, learning activity, and the geographical location from which the student attends class. The question-answering retrieval system only needs the part of the student status information relevant to question-answering retrieval.

When a user logs in to the question-answering retrieval system under their real name, the system uses the student's real-name information and accesses the user profile database with the student ID as the parameter, generating the status information of that student relevant to question retrieval: school, enrollment batch, student-registration batch, thesis batch, examination batch, graduation batch, and study progress. Of these, the student's school and study progress are the most critical: the school narrows the range of documents searched during document matching, and the study progress allows matching the document best suited to the user's current need.

Once obtained, the student status information is passed directly to the document retrieval module as a parameter.
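The record handed to the document retrieval module can be pictured as a small immutable structure. The sketch below is an assumption about shape only: the field names are English renderings of the items listed above, and a plain dictionary stands in for the user profile database.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class StudentState:
    """Question-answering-relevant subset of a user profile (field names are illustrative)."""
    student_id: str
    school: str
    enrollment_batch: str
    registration_batch: str
    thesis_batch: str
    exam_batch: str
    graduation_batch: str
    study_progress: str


def get_student_state(profile_db, student_id) -> Optional[StudentState]:
    """Look up a profile by student ID and keep only the fields used for retrieval.

    profile_db is a dict standing in for the user profile database.
    """
    row = profile_db.get(student_id)
    if row is None:
        return None
    wanted = [f.name for f in fields(StudentState) if f.name != "student_id"]
    return StudentState(student_id=student_id, **{name: row[name] for name in wanted})
```

Profile fields that do not matter for retrieval, such as learning activity or class location, are simply left out of the returned record.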

2. Question parsing module

The question parsing module performs intent classification and question restatement on the question to obtain its intent class and feature word set. It works in three steps:

(1) Question intent classification

The SVM classification algorithm is used to classify the question's intent, with TF-IDF as the feature vector. Because a question contains relatively little text, 1-gram and 2-gram models are used to increase the amount of information. The implementation steps are as follows:

1) Question feature extraction: the system selects features by word frequency. The feature items are extracted as follows:

A) Segment the questions in the training set and remove stop words using the open-source jieba word segmentation tool;

B) Count the keywords of each question class and their frequencies, sort the keywords by frequency, and take the top K as the feature word set of that class. K is a hyperparameter with a default of 5;

C) Remove words that appear in the feature word sets of more than one question class, then merge the feature word sets of the different classes into a total feature word set.

2) Question feature representation: TF-IDF features are used as the language-model representation of the question. TF (term frequency) is the frequency with which a word occurs in the question; the more often a word occurs, the more important it is. IDF (inverse document frequency) measures how distinctive a word is: a word that occurs in many documents is not representative, so its importance is low. The TF-IDF formula therefore yields a weight for each word, and these weights serve as the language-model representation of the question.

3) Normalization: normalization limits the range of the data. The system uses linear-function (min-max) normalization:

$$t' = \frac{t - t_{\min}}{t_{\max} - t_{\min}}$$

where $t$ is the word frequency of a keyword, $t'$ is its normalized value, $t_{\min}$ is the word frequency of the least frequent keyword across all questions, and $t_{\max}$ is the word frequency of the most frequent keyword across all questions. When raw word frequency is used as the index of comparison, the differences between words can be large; normalization reduces these gaps and improves question classification.

After normalization, the final feature-set representation of the question is obtained.

4) The feature-set representation of the question is compared for similarity with the feature set of each question class, and the class whose feature set has the highest similarity is selected as the type of the question.

5) The offline submodule of the problem analysis module trains the intent classification model offline and deploys the trained intent classifier to the online submodule.
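
The feature-word selection and the linear normalization described in the steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the questions are already segmented (the text uses jieba for that), and the function names `top_k_feature_words` and `min_max` as well as the sample data are hypothetical.

```python
from collections import Counter


def top_k_feature_words(tokenized_by_class, k=5):
    """Pick the k most frequent words per class, then drop words shared by classes.

    tokenized_by_class maps a class label to a list of token lists
    (questions that were already segmented and stripped of stop words).
    """
    per_class = {}
    for label, questions in tokenized_by_class.items():
        counts = Counter(tok for question in questions for tok in question)
        per_class[label] = [word for word, _ in counts.most_common(k)]
    # Count in how many class feature sets each word occurs.
    shared = Counter(w for words in per_class.values() for w in set(words))
    distinct = {label: [w for w in words if shared[w] == 1]
                for label, words in per_class.items()}
    merged = sorted({w for words in distinct.values() for w in words})
    return distinct, merged


def min_max(values):
    """Linear (min-max) normalization of a list of word frequencies to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all frequencies equal: nothing to spread out
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical, pre-segmented training questions for two intent classes.
TRAINING = {
    "exam": [["exam", "time"], ["exam", "time", "where"]],
    "enroll": [["enroll", "time"], ["enroll", "time", "fee"]],
}
DISTINCT, MERGED = top_k_feature_words(TRAINING, k=2)
```

With this sample data the word "time" is among the top two words of both classes, so it is removed and only the class-specific words remain in the merged set.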

(2) Question restatement

Question restatement means expressing the question again. In practice, questions with the same meaning may be worded differently, so questions need to be restated to improve question-answering quality. Here, restatement is used to obtain the feature words of the question. The implementation has two steps:

1) Word segmentation and stop-word removal: as in question feature extraction, the jieba word segmentation tool is used to segment the questions in the training set and remove stop words. The set of words that remains after stop-word removal is called the keyword set of the question.

2) Word-sense expansion: this step addresses inconsistent wording in students' questions by expanding the segmented vocabulary with synonyms. A synonym table is compiled from users' frequently asked questions and the business terminology, and is used to expand the keywords of the question: each word in the keyword set is looked up in the synonym table, and if synonyms exist, all synonyms of that word are added to the keyword set.
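
The synonym expansion step can be illustrated with a short sketch. The synonym table below and the function name `expand_with_synonyms` are hypothetical; in the described system the table is compiled by hand from frequently asked questions and business terminology.

```python
def expand_with_synonyms(keywords, synonym_table):
    """Append every synonym of every keyword, keeping order and avoiding duplicates."""
    expanded = list(keywords)
    seen = set(keywords)
    for word in keywords:
        for synonym in synonym_table.get(word, ()):
            if synonym not in seen:
                seen.add(synonym)
                expanded.append(synonym)
    return expanded


# Hypothetical synonym table (illustrative entries only).
SYNONYMS = {"exam": ["test", "examination"], "fee": ["tuition"]}
EXPANDED = expand_with_synonyms(["exam", "time"], SYNONYMS)
```

Keywords without an entry in the table, such as "time" here, are kept unchanged.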

(3) The online submodule of the problem analysis module is called by the document retrieval module and is responsible for semantic parsing of the user's question. It first performs intent recognition with the trained intent classification model and then performs word-sense expansion through question restatement. The resulting keyword set is the feature word set of the question, which is returned to the document retrieval module as the output of the problem analysis module.

3. document retrieval module

The document retrieval module is divided into a document knowledge base management submodule and a retrieval submodule. The flow is shown schematically in Figure 2:

(1) Document knowledge base management submodule

In online education, the documents used for question answering are usually published as notices that students can consult. By business rule, most document titles contain four kinds of fields: school name, time, batch, and notice type; for some documents, the batch information appears in the document content instead. The specific steps are as follows:

1) Establish the document type labels according to the business rules;

2) When a new document is stored in the document knowledge base, the system automatically generates its keyword table. Specifically, regular expressions built into the system process the notice title and content according to the business rules and extract the four kinds of fields above. Basing the extraction on business rules keeps accuracy high; for example, the "time" field can be extracted by regular-expression matching on keywords such as semester, spring, summer, autumn, winter, first half of the year, and second half of the year.

3) Normalize the extracted keyword table. In the title of a formal notice document, the school and the time are stated formally, so the extracted values are already fairly standard and need no extra processing. Normalization mainly concerns automatic matching of the document type and standardization of the batch.

A) Document type matching: for the document-type keyword obtained from the document title, cosine similarity is computed between the keyword and each document type label. The notice type label with the highest similarity becomes the candidate type of the document and is submitted to business staff for review; the business staff then assign the document type label manually according to the classification.

B) Batch standardization: because batch information has to be extracted from both the title and the content, the extracted values come in many formats; they are converted to a unified representation with a rule-based method. In addition, in the student status information the batch is subdivided into enrollment batch, student-status batch, thesis batch, examination batch, and graduation batch. The batch extracted from a document is therefore refined automatically into the specific kind of batch by a business-rule mapping based on the notice type of the document.

4) Store the document content as a file under the directory of the corresponding document type.

5) Store the keyword table extracted from the document, the document title, and the document storage path as database fields in a MySQL database table; the table automatically generates a document ID as the primary key.
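
Two of the steps above, regular-expression extraction of the "time" field and cosine-similarity matching of the document type, can be sketched as follows. Everything named here is an assumption made for illustration: the pattern `TIME_PATTERN` uses English stand-ins for the keywords listed in the text, the sample title and the labels in `TYPE_LABELS` are invented, and the cosine similarity is computed over simple bag-of-words counts.

```python
import math
import re
from collections import Counter

# Hypothetical pattern: a four-digit year followed by one of the time keywords.
TIME_PATTERN = re.compile(
    r"(\d{4})\s+(spring|summer|autumn|winter|first half|second half)",
    re.IGNORECASE,
)


def extract_time(title):
    """Return a normalized 'time' field such as '2019 spring', or None if absent."""
    match = TIME_PATTERN.search(title)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2).lower()}"


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def best_type_label(type_keyword_tokens, labels):
    """Pick the document type label most similar to the extracted type keyword."""
    return max(labels, key=lambda label: cosine(type_keyword_tokens, label.split()))


# Hypothetical document type labels.
TYPE_LABELS = ["examination notice", "enrollment notice", "graduation notice"]
```

For a title such as "Example University 2019 Spring Semester Examination Notice", `extract_time` yields "2019 spring". In the described workflow the label chosen by `best_type_label` is only a candidate that business staff still review.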

(2) Document retrieval submodule

This submodule searches the document knowledge base by combining student status information with domain business rules, and obtains the target document that best matches the question. The implementation steps are as follows:

1) Map each type of user question to the corresponding notice type according to the business rules;

2) On the basis of this result, use the relevant student status information and heuristic rules to narrow the document retrieval range;

3) Return the most recently published document in the set of notice documents of the same type as the target document.
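
The three retrieval steps can be sketched as a filter followed by a "newest first" selection. The field names (`type`, `school`, `published`), the mapping `INTENT_TO_NOTICE_TYPE`, and the sample records are all hypothetical; the actual business rules are not given in the text, and only the school is used here as the narrowing heuristic.

```python
from datetime import date

# Hypothetical business rule: question intent -> notice type.
INTENT_TO_NOTICE_TYPE = {
    "exam_time": "examination notice",
    "tuition": "fee notice",
}


def select_target_document(documents, intent, student):
    """Filter by notice type and school, then return the newest document (or None)."""
    notice_type = INTENT_TO_NOTICE_TYPE.get(intent)
    candidates = [
        doc for doc in documents
        if doc["type"] == notice_type and doc["school"] == student["school"]
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda doc: doc["published"])


# Hypothetical keyword-table records.
DOCS = [
    {"id": 1, "type": "examination notice", "school": "School A", "published": date(2018, 11, 1)},
    {"id": 2, "type": "examination notice", "school": "School A", "published": date(2019, 3, 1)},
    {"id": 3, "type": "examination notice", "school": "School B", "published": date(2019, 4, 1)},
    {"id": 4, "type": "fee notice", "school": "School A", "published": date(2019, 5, 1)},
]
TARGET = select_target_document(DOCS, "exam_time", {"school": "School A"})
```

Document 3 is newer than document 2 but belongs to another school, so the student status information excludes it.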

4. passage retrieval module

The passage retrieval module uses the question analysis result to perform semantic retrieval on the content of the target document and extracts the paragraphs that are most relevant to the question and most likely to contain the answer. The implementation flow is shown in Figure 3:

(1) Preprocess the document by dividing it into a finer-grained set of paragraphs. The division simply follows the natural paragraphs:

1) If the document is in HTML format, the HTML file is parsed into a DOM tree, the paragraph text is obtained from the <p> tags, and the paragraphs are combined into the document's paragraph text set.

2) If the document is in Word format, the text is split into paragraphs directly at the line breaks (Enter key) to obtain the document's paragraph text set.
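
Paragraph splitting can be sketched with the Python standard library. Note the substitutions: the sketch uses the streaming `html.parser.HTMLParser` rather than a DOM tree, and it treats the Word case as plain text split at line breaks, since reading a real Word file would need an additional library. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # list of text chunks while inside a <p>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None


def split_html_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def split_plain_paragraphs(text):
    """Split text exported from a word processor at line breaks."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Inline markup inside a paragraph (for example `<b>`) is dropped while its text is kept, so each paragraph comes out as one plain string.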

(2) Compute the similarity between each paragraph of the document and the question with the BM25 algorithm, and return the three paragraphs with the highest similarity to the user as the answer.

BM25 is a classic algorithm for evaluating the relevance between search terms and a document. The algorithm splits the question into words, computes the relevance of each word to the document, and obtains the relevance of the question to the document as a weighted sum. The relevance of a word to the document is determined mainly by two parts: the weight of the word and the relevance between the word and the document.

1) Preprocess the paragraph set of the target document: segment all paragraphs with the open-source jieba word segmentation tool and remove stop words, as described above;

2) Build a BM25 model with the Python gensim library;

3) Use the complete feature word set of the user's question, obtained from the problem analysis module, as the search terms;

4) Compute the relevance between each paragraph and the search terms with the BM25 model;

5) The retrieved paragraphs are handled in one of two ways. If the user uses the default quick-response mode, the retrieved target paragraphs are returned directly to the questioner as the answer, and the question-answering retrieval process ends. Because the algorithm is simple, this mode responds quickly; it also returns the paragraphs most relevant to the words of the question, which is intuitive and easy to interpret.

6) If the user selects the accurate-answer mode, the system passes the retrieved paragraphs to the answer extracting module as candidate target paragraphs. This reduces the object to be processed from document level to paragraph level, which greatly reduces the size of the input and gives the answer extracting module higher-quality content, balancing the system's answering efficiency and accuracy.
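
The paragraph ranking in the steps above can be sketched with a hand-written Okapi BM25 scorer. This is an illustration under stated assumptions rather than the gensim model named in the text: the parameters `k1=1.5` and `b=0.75` are common defaults chosen here, the IDF uses the non-negative variant `log(1 + (N - n + 0.5) / (n + 0.5))`, and the paragraphs are assumed to be already segmented.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, paragraphs, k1=1.5, b=0.75):
    """Score each tokenized paragraph against the query terms with Okapi BM25."""
    if not paragraphs:
        return []
    n_docs = len(paragraphs)
    avg_len = sum(len(p) for p in paragraphs) / n_docs
    # Number of paragraphs that contain each token.
    doc_freq = Counter(tok for p in paragraphs for tok in set(p))
    scores = []
    for paragraph in paragraphs:
        tf = Counter(paragraph)
        score = 0.0
        for tok in query_tokens:
            if tok not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5))
            denom = tf[tok] + k1 * (1 - b + b * len(paragraph) / avg_len)
            score += idf * tf[tok] * (k1 + 1) / denom
        scores.append(score)
    return scores


def top_paragraphs(query_tokens, paragraphs, n=3):
    """Return the indices of the n highest-scoring paragraphs, best first."""
    scores = bm25_scores(query_tokens, paragraphs)
    ranked = sorted(range(len(paragraphs)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical, pre-segmented paragraphs of a target document.
PARAGRAPHS = [
    ["exam", "time", "is", "june"],
    ["tuition", "fee", "payment"],
    ["exam", "room", "list"],
    ["holiday", "notice"],
]
```

In quick-response mode the paragraphs at the returned indices would be shown directly; in accurate-answer mode they would be handed on as candidate target paragraphs.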

5. answer extracting module

The answer extracting module is optional: if the user chooses the default quick-response mode when asking a question, this module is not called. If the user wants a more accurate answer and selects the accurate-answer mode, the flow proceeds to this module. The model is based on the DR (Document Reader) model in DrQA. When the document is vectorized, hand-crafted similarity features are added (Q-D full match, Jaccard similarity, and relative edit distance) so that the model can better understand the document semantically and extract matching information between the question and the sentences of the document. The question and the document are encoded with a multi-layer bidirectional long short-term memory (LSTM) network, and their degree of matching is measured by bilinear similarity in order to predict the position range of the answer. This further improves question-answering accuracy and yields the DR-ASF intelligent retrieval model, which is optimized with hand-crafted similarity features.

The answer extracting module is divided into an online submodule and an offline submodule.

(1) The offline submodule uses the training corpus to train the intelligent retrieval model offline and builds the DR-ASF intelligent retrieval model. The model is shown schematically in Figure 4:

A) First, the question is segmented into words to obtain the question word sequence $Q = \{x_1, x_2, \ldots, x_n\}$, and the answer document is segmented to obtain the answer-document word sequence $D = \{y_1, y_2, \ldots, y_m\}$, where each $x_i$ and $y_i$ is a single word, and $n$ and $m$ are the numbers of words in the question and in the paragraph, respectively.

B) Next, the question is vectorized. Because a question is generally short and contains relatively little semantic information, it is represented with word vectors only. The system directly uses 300-dimensional word vectors pre-trained on a large-scale corpus with the word2vec tool. Each word $x_i$ in the question sequence $Q = \{x_1, x_2, \ldots, x_n\}$ is represented by its word vector, which completes the vectorized representation of question $q$. The question then corresponds to a two-dimensional matrix $Q_{n \times v}$, where $n$ is the number of words in $q$ and $v$ is the dimension of the word vectors.

C) Next, the answer document is vectorized. Because the answer document is relatively long and semantically richer, several features are used when the paragraph is vectorized. In addition to the word vectors described above, these include POS (part-of-speech tagging), NER (named entity recognition), and hand-crafted similarity features. The hand-crafted similarity features express the relationship between the question and the answer document; adding them when the answer document is encoded gives more weight to document words that are close to the question, which improves the quality of the answer. The hand-crafted similarity features consist of three parts: Q-D full match, the Jaccard coefficient, and the relative edit distance.

Q-D full match indicates whether a word in the answer document also occurs in the question: the value is 1 if it does and 0 if it does not. The idea of this feature is to tell the model directly where the words of the question occur in the document, since the answer is likely to appear near those positions.

Let the Q-D full match feature vector be $T = [t_1, t_2, \ldots, t_m]$, where the feature of the $i$-th word is $t_i = f_{Q\text{-}D}(y_i)$:

$$f_{Q\text{-}D}(y_i) = \begin{cases} 1, & y_i \in Q \\ 0, & y_i \notin Q \end{cases}$$

The Jaccard coefficient measures the similarity and difference between sample sets; the larger the Jaccard coefficient, the more similar the samples. Here it is used to measure the similarity between the question and a local region of the document, so that local regions similar to the question receive a higher score. For two sets $A$ and $B$, the Jaccard coefficient is computed by finding the intersection and the union of $A$ and $B$ and dividing the size of the intersection by the size of the union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $J(A, B)$ is the Jaccard coefficient of $A$ and $B$.

In the model, the answer document $D$ is first cut into windows: the length $n$ of question $Q$ is the window length and $y_i$ is the window center; where there are no words before or after $y_i$, the window is filled with the placeholder $M$. The set of windows is denoted $B = \{B_1, B_2, \ldots, B_m\}$ and the question set is denoted $A = Q = \{x_1, x_2, \ldots, x_n\}$. The Jaccard similarity with $A$ is computed for each window $B_i$ of the answer document. For example, to compute $\mathrm{Jaccard}(A, B_1)$:

$A = \{x_1, x_2, \ldots, x_n\}$ and $B_1 = \{M, M, \ldots, M, y_1, y_2, \ldots, y_k\}$, where the number of placeholders $M$ is $\lfloor n/2 \rfloor$ (half the window length, rounded down) and $|A| = |B_1| = n$. The placeholder $M$ is ignored when $A \cap B_1$ and $A \cup B_1$ are computed. The formula then gives $j_1$. After $\mathrm{Jaccard}(A, B_i)$ has been computed for every $i \in \{1, \ldots, m\}$, the Jaccard similarity feature vector of the answer document is $J = [j_1, j_2, \ldots, j_m]$, where $j_m$ is the similarity between the $m$-th window of the answer document and the question.
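
The Q-D full match feature and the windowed Jaccard feature can be sketched as follows. The windowing is one reading of the description and should be treated as an assumption: each window has the question length n, starts n // 2 positions before its center word, and is padded with a placeholder (`PAD`, standing in for M) that is ignored in the set operations. All function names are hypothetical.

```python
PAD = None  # stands in for the placeholder M


def windows(doc_tokens, n):
    """One window of length n per document word, padded with PAD at the edges."""
    left = n // 2
    padded = [PAD] * left + list(doc_tokens) + [PAD] * (n - left - 1)
    return [padded[i:i + n] for i in range(len(doc_tokens))]


def jaccard(a, b):
    """Jaccard coefficient of two token collections, ignoring the placeholder."""
    set_a = {tok for tok in a if tok is not PAD}
    set_b = {tok for tok in b if tok is not PAD}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def full_match_feature(question, doc):
    """T: 1 if the document word also occurs in the question, else 0."""
    question_words = set(question)
    return [1 if tok in question_words else 0 for tok in doc]


def jaccard_feature(question, doc):
    """J: Jaccard similarity between the question and each document window."""
    return [jaccard(question, window) for window in windows(doc, len(question))]


QUESTION = ["when", "is", "exam"]
DOCUMENT = ["the", "exam", "is", "on", "monday"]
T = full_match_feature(QUESTION, DOCUMENT)
J = jaccard_feature(QUESTION, DOCUMENT)
```

Both features have one value per document word, so they can be concatenated with the other per-word features of the answer document.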

Edit distance is a commonly used measure of similarity between character strings: it is the minimum number of operations needed to convert one string into another, where the operations are substituting a character, deleting a character, and inserting a character. The relative edit distance is obtained by computing the edit distance between the question and a local region of the document, using the question length as the window size, and dividing it by the window size. The windows are divided and processed in the same way as for the Jaccard similarity. Edit distance is relatively complex to compute and requires a dynamic programming algorithm. The result is the relative edit distance feature vector $R = [r_1, r_2, \ldots, r_m]$, where $r_m$ is the relative edit distance between the $m$-th window of the answer document and the question.
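
A dynamic-programming sketch of the relative edit distance feature follows. Two choices here are assumptions, not statements from the text: the distance is computed over word tokens rather than characters, and placeholder positions are dropped from a window before the distance is computed while the divisor stays at the window length n. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (substitute, delete, insert)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def relative_edit_distance_feature(question, doc, pad=None):
    """R: edit distance between the question and each window, divided by window size."""
    n = len(question)
    left = n // 2
    padded = [pad] * left + list(doc) + [pad] * (n - left - 1)
    feature = []
    for i in range(len(doc)):
        window = [tok for tok in padded[i:i + n] if tok is not pad]
        feature.append(edit_distance(question, window) / n)
    return feature
```

For the question `["a", "b"]` and the document `["a", "b", "c"]`, the window centered on the second word equals the question, so its relative edit distance is 0.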

An open-source NLP tool is used to perform part-of-speech tagging and named entity recognition on the document $D = \{y_1, y_2, \ldots, y_m\}$. This gives the part-of-speech feature vector $P = [p_1, p_2, \ldots, p_m]$ with $p_i \in [0, \mathrm{typenum}(POS) - 1]$ and the named entity feature vector $N = [n_1, n_2, \ldots, n_m]$ with $n_i \in [0, \mathrm{typenum}(NER) - 1]$, where typenum(POS) and typenum(NER) are the total numbers of part-of-speech tag types and named entity types, respectively.

Finally, the word vector feature $V_d$ of the answer document, the part-of-speech feature $P$, the named entity feature $N$, and the hand-crafted similarity feature vectors $T$, $J$, and $R$ are concatenated to give the vectorized representation $D_{m \times k}$ of the answer document, with $k = |v_i| + |p_i| + |n_i| + |t_i| + |j_i| + |r_i|$, where $v_i$ is the word vector, $p_i$ the part-of-speech feature vector, $n_i$ the named entity feature vector, $t_i$ the Q-D full match feature vector, $j_i$ the Jaccard similarity feature vector, and $r_i$ the relative edit distance feature vector.

2) After the question and the answer document have been vectorized, both are encoded with a multi-layer bidirectional LSTM: the input question matrix $Q_{n \times v}$ and the answer-document matrix $D_{m \times k}$ are each encoded into vectors of fixed length. Because a plain RNN suffers from vanishing gradients when the sentence is long, the encoder chosen here is a multi-layer bidirectional long short-term memory network (stacked BiLSTM). The stacked BiLSTM network that encodes the question is called the Q-encoder, and the stacked BiLSTM network that encodes the answer document is called the D-encoder.

The BiLSTM model uses a forward LSTM network and a backward LSTM network to extract semantic information in the forward and backward directions. The forward network reads the sequence in forward order and computes its forward hidden states $\overrightarrow{h_i}$, which serve as one part of the word encoding. The backward network reads the sequence in reverse order and computes its backward hidden states $\overleftarrow{h_i}$, which serve as the other part. Concatenating the forward hidden state with the backward hidden state gives the word encoding $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, which contains both the preceding and the following context. A single-layer BiLSTM model thus yields the encoding of the question (one vector $h_i$ per question word) and the encoding of the answer document (one vector $h_i$ per document word), where $k_s$ is the dimension of $h_i$.

The encoding model uses a multi-layer BiLSTM. Each layer produces its own encoding of the question and of the answer document, and the hidden states of all layers are concatenated through a fully connected layer to give the final question encoding and answer-document encoding. The dimension of the final encoding is $k_c = k_s \times k$, where $k$ here denotes the number of BiLSTM layers and $k_s$ is the dimension of $h_i$.

3) The question encoding is transformed with self-attention, a form of the attention mechanism, which relearns the dependencies between the words of the sentence, captures its internal structure, and extracts the internal relationships between the words of the question. The weight of each word in the sentence is obtained and used for a weighted transformation of the question encoding within the sentence. First, a linear layer that maps $k_c$ dimensions to 1 computes a score from the encoding vector of each word; a softmax layer then turns the scores into the weights $W_n$ of the words in the sentence. Finally, the word weights $W_n$ are multiplied by the question encoding to obtain the new question encoding. This gives an overall semantic representation of the question in which each word contributes to the whole according to its importance.
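
The self-attention weighting of the question encoding can be sketched in plain Python. This is a minimal illustration assuming the linear layer has no bias and that the weighted word encodings are summed into a single question vector; the parameter vector `w` would be learned in training and is simply passed in here. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(encodings, w):
    """Weight each word encoding by softmax(w . h_j) and sum the results.

    encodings: one vector of dimension k_c per question word.
    w: weights of the linear layer that maps k_c dimensions to one score.
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in encodings]
    weights = softmax(scores)
    dim = len(encodings[0])
    pooled = [sum(weights[j] * encodings[j][d] for j in range(len(encodings)))
              for d in range(dim)]
    return weights, pooled


WEIGHTS, POOLED = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With an all-zero `w` every word receives the same weight; a trained `w` would shift the weights toward the words that matter most for the question.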

4) The degree of matching between the question and the document is measured by bilinear similarity in order to predict the start and end positions of the answer in the answer document.

Predicting the answer position requires determining the start position and the end position of the answer in the answer document. The intelligent retrieval model therefore learns two similarity functions, $S_s$ and $S_e$, which describe the probability of the answer start position and of the answer end position in the document, respectively. These probabilities are expressed through the similarity between the question vector and the vector of the word at each position in the document; a bilinear algorithm is used to compute the similarity.

Let the answer-document word vectors be $d_1, d_2, \ldots, d_m$ and the question vector be $q$, where $q$ and $d_i$ both have dimension $k_c$. The similarity functions $S_s$ and $S_e$ both take the question vector $q$ and an answer-document word vector $d_i$ as input:

$$S_s(d_i, q) = d_i W_s q$$

$$S_e(d_i, q) = d_i W_e q$$

where $W_s$ and $W_e$ are the parameters to be learned.

To predict the start and end positions of the answer in the answer document, a normalization layer is added after the similarity functions. It gives, for each word in the document, the probability $P_{start}$ that the word is the answer start position and the probability $P_{end}$ that it is the answer end position:

$$P_{start}(i) \propto \exp(d_i W_s q)$$

$$P_{end}(i) \propto \exp(d_i W_e q)$$

During training, the loss function is the negative log-likelihood, and the normalization layer uses the softmax function together with the log-likelihood cost to learn $W_s$ and $W_e$. During prediction, the normalization layer applies the softmax function directly, which gives for each word in the document the probability of being the first word and the probability of being the last word of the answer. Finally, the text between the position with the maximum $P_{start}$ and the position with the maximum $P_{end}$ in the answer document is selected as the predicted answer.
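
The bilinear scoring and span selection can be sketched as follows. The matrices `W_s` and `W_e` would be learned during training and are given by hand here. One detail is an assumption of this sketch: the text does not say what happens if the most probable end position precedes the start, so the end position is searched only from the start position onward. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_scores(doc_vectors, W, q):
    """Compute d_i W q for every document word vector d_i."""
    Wq = [sum(W[r][c] * q[c] for c in range(len(q))) for r in range(len(W))]
    return [sum(d[r] * Wq[r] for r in range(len(Wq))) for d in doc_vectors]


def predict_span(doc_vectors, W_s, W_e, q):
    """Return (start, end) word indices of the predicted answer span."""
    p_start = softmax(bilinear_scores(doc_vectors, W_s, q))
    p_end = softmax(bilinear_scores(doc_vectors, W_e, q))
    start = max(range(len(p_start)), key=p_start.__getitem__)
    # Assumption: only consider end positions at or after the start position.
    end = max(range(start, len(p_end)), key=p_end.__getitem__)
    return start, end


# Toy vectors and hand-set parameters for illustration only.
DOC_VECTORS = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
QUESTION_VECTOR = [1.0, 1.0]
W_START = [[1.0, 0.0], [0.0, 0.0]]
W_END = [[0.0, 0.0], [0.0, 1.0]]
SPAN = predict_span(DOC_VECTORS, W_START, W_END, QUESTION_VECTOR)
```

The returned indices delimit the words of the answer document that would be returned as the predicted answer.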

(3) The offline submodule deploys the trained DR-ASF intelligent retrieval model to the online submodule;

(4) The online submodule uses the DR-ASF intelligent retrieval model to extract a more accurate answer from the candidate target paragraphs and returns it to the questioner. The steps are:

1) Preprocess the target paragraphs passed in from the passage retrieval module in the same way as the answer documents in the training corpus: segment the candidate target paragraphs with an open-source NLP tool, remove stop words, and extract the part-of-speech feature $P$, the named entity feature $N$, and the initial word vectors $V_d$;

2) Vectorize the complete feature word set of the user's question, obtained from the problem analysis module, with the pre-trained word vectors to construct the two-dimensional matrix $Q_{n \times v}$;

3) Extract the hand-crafted similarity feature vectors of the target paragraph (Q-D full match, Jaccard coefficient, and relative edit distance) and concatenate them with the word vector, part-of-speech, and named entity features obtained above to get the vectorized matrix $D_{m \times k}$ of the target paragraph;

4) Call the DR-ASF intelligent retrieval model with $Q_{n \times v}$ and $D_{m \times k}$ as input;

5) The output of the model is the accurate answer to the question, which is returned to the questioner.

Claims

1. An intelligent retrieval system for online education, characterized by comprising a student status information module, a problem analysis module, a document retrieval module, a passage retrieval module, and an answer extracting module, wherein:

The student status information module: when a user asks a question through online question answering, this module accesses the user profile database and generates the user status information of that user from the user ID, where the user is a student in the online education system and the user status information specifically means student status information; the module is called by the document retrieval module, its input parameter is the student ID, and its output parameter is the student status information;

The problem analysis module is divided into an offline submodule and an online submodule; the offline submodule uses the training corpus to train the intent classification model offline with the SVM classification algorithm and deploys the intent classification model to the online submodule; the online submodule is called by the document retrieval module, performs semantic parsing of the input user question, performs intent recognition with the trained intent classification model, and then performs word-sense expansion through question restatement to obtain a question feature word set with fuller coverage; this feature word set is returned to the document retrieval module as the output of the problem analysis module;

The document retrieval module retrieves the document that best matches the question from the document knowledge base by combining student status information with domain business rules; the document retrieval module comprises a document knowledge base management submodule and a retrieval submodule; the document knowledge base management submodule stores notice documents, which are issued to enrolled students by institutions with online-education qualifications and whose titles and content follow a unified format specification, and it parses the document title and content with regular expressions to extract and save a keyword table; the document knowledge base retrieval submodule takes the question feature word set produced by the problem analysis module and the student status information provided by the student status information module, uses regular-expression matching and business rules to analyze and search the document keyword tables and document content stored in the document knowledge base, and locates the target document; introducing the student status information effectively improves recognition of the question intent and the accuracy with which the target document is located; finally, the target document and the question feature word set are passed together to the passage retrieval module as input parameters;

The passage retrieval module, as the intermediate module between the document retrieval module and the answer extracting module, processes the question feature word set and the target document passed in from the document retrieval module; it performs semantic retrieval on the target document based on the question feature word set and extracts the target paragraphs that are most relevant to the question and most likely to contain the answer;

The answer extracting module is divided into an offline submodule and an online submodule; the offline submodule uses the training corpus to train the DR-ASF intelligent retrieval model offline with the deep-learning DR-ASF algorithm and deploys the DR-ASF intelligent retrieval model to the online submodule; the online submodule, working on the retrieval result passed in from the passage retrieval module, uses the DR-ASF intelligent retrieval model to locate the start and end positions of the answer within the target paragraph, thereby extracts a more accurate answer from the paragraph, and returns it to the questioner.

2. The intelligent retrieval system for online education according to claim 1, characterized in that: in the answer extracting module, the DR-ASF model is based on the DR (Document Reader) model in DrQA; when the document is vectorized, the hand-crafted similarity features of Q-D full match, Jaccard similarity, and relative edit distance are added in order to understand the document semantically and extract matching information between the question and the sentences of the document; the question and the document are encoded with a multi-layer bidirectional long short-term memory network, and their degree of matching is measured by bilinear similarity in order to predict the position range of the answer and improve question-answering accuracy, yielding the DR-ASF intelligent retrieval model optimized with hand-crafted similarity features.

3. The intelligent retrieval system for online education according to claim 1, characterized in that: in the answer extracting module, the deep-learning DR-ASF algorithm is implemented as follows:

B) Using word vectors pre-trained on a large-scale corpus with the word2vec tool, each word of the question word sequence is represented by its word vector, which completes the vectorized representation of the question;

C) For the answer document, the word vectors are first obtained in the same way as in step B); the other features of the answer document are then obtained: the POS (part-of-speech tagging) feature vector, the NER (named entity recognition) feature vector, and the hand-crafted similarity feature vectors; the values of the POS and NER feature vectors range over the total number of part-of-speech tag types and the total number of named entity types, respectively; the hand-crafted similarity features express the relationship between the question and the answer document and consist of three parts: Q-D full match, the Jaccard coefficient, and the relative edit distance; the word vector feature, the part-of-speech feature, the named entity feature, and the three hand-crafted similarity feature vectors of the document are concatenated to give the vectorized representation of the answer document;

3) Taking the question encoding generated in step 2) as the input parameter, self-attention (a form of the attention mechanism) applies a weighted transformation to the question encoding within the sentence and learns the dependencies between the words of the sentence, giving a new vector representation of the question encoding;

4) Finally, taking the answer-document encoding obtained in step 2) and the new question encoding obtained in step 3) as input parameters, the degree of matching between the question and the answer document is measured by bilinear similarity, and the start and end positions of the answer in the answer document are predicted;

(4) The online submodule first preprocesses the question and the target paragraph: the feature word set of the user's question, passed in from the passage retrieval module, is vectorized with the pre-trained word vectors; the target paragraph is vectorized as an answer document, following the same steps the offline submodule applies to answer documents, to obtain the vectorized matrix of the target paragraph; the vectorized matrix of the question and the vectorized matrix of the target paragraph are then used as input to call the DR-ASF intelligent retrieval model, and the model output is the accurate answer to the question, which is returned to the questioner.

4. The intelligent retrieval system for online education according to claim 1, characterized in that: the business rules of the document retrieval module operate as follows:

(1) In the document knowledge base management submodule, the document title and content are parsed with regular expressions to extract the keyword table of each document, with the following specific steps:

1) Establish the document type labels according to the business rules;

2) When a new document is stored in the document knowledge base, the regular expressions built into the system automatically process the notice title and content, extract the keyword fields, and generate the keyword table;

3) Normalize the extracted keyword table, mainly by automatic matching of the document type and standardization of the batch;

5) Store the keyword table extracted from the document, the document title, and the document storage path as database fields in a MySQL database table; the table automatically generates a document ID as the primary key;

(2) The document knowledge base retrieval submodule takes the question feature word set produced by the problem analysis module and the student status information provided by the student status information module, uses regular-expression matching and business rules to analyze and search the document keyword tables and document content, locates the target document, and sends the answer document ID and the user's question to the passage retrieval module, with the following specific steps:

5. the intelligent retrieval system according to claim 1 towards online education, it is characterised in that: the problem parsing Module, the realization process repeated to problem progress intents and problem are as follows:

(1) question intent classification: an SVM classification algorithm is used, with TF-IDF as the feature vector, to classify the intent of the question; because a question contains relatively little text, 1-gram and 2-gram models are used to increase the amount of information; the specific implementation steps are as follows:

1) question feature terms are selected and extracted by term frequency: each question is segmented into words and stop words are removed; the keywords of each class of question pairs and their frequencies are counted, and the 5 keywords with the highest frequency are taken as the feature word set of that class; the feature word sets of the different classes are merged to form the total feature word set;

2) TF-IDF features are used as the language-model representation of the question;

3) linear-function normalization is then applied to bound the range of the data and reduce the gap between the frequencies of different words, yielding the final feature-set representation of the question;

4) the feature-set representation of the question is compared for similarity with the feature set of each question class, and the class whose feature set has the highest similarity is selected as the type of the question;

5) the offline submodule of the problem analysis module completes the training of the intent classification model offline and deploys the trained intent classification model to the online submodule;
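Steps 1) to 4) above can be sketched with the standard library alone. The example below is a rough illustration under several assumptions: English whitespace tokenization replaces Chinese word segmentation, the stop-word list and training questions are invented, linear-function normalization is read as min-max scaling, similarity is taken to be cosine similarity, and the smoothed IDF formula is one common variant. The SVM classifier named in the claim is not shown, since it would require a third-party library.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a", "to", "when", "how", "do", "i", "my", "for"}
TOP_K = 5  # the claim keeps the 5 most frequent keywords per class


def tokenize(text):
    """Whitespace segmentation, stop-word removal, then 1-grams plus 2-grams."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]


def class_feature_sets(corpus):
    """Step 1: the TOP_K most frequent terms of each class."""
    sets = {}
    for label, questions in corpus.items():
        counts = Counter(t for q in questions for t in tokenize(q))
        sets[label] = {t for t, _ in counts.most_common(TOP_K)}
    return sets


def build_idf(corpus, vocab):
    """Smoothed inverse document frequency over all training questions."""
    docs = [set(tokenize(q)) for qs in corpus.values() for q in qs]
    n = len(docs)
    return {t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1.0 for t in vocab}


def tfidf_vector(text, vocab, idf):
    """Step 2: TF-IDF representation over the total feature word set."""
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1
    return [counts[t] / total * idf[t] for t in vocab]


def min_max(vec):
    """Step 3: linear (min-max) normalization into [0, 1]."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [0.0] * len(vec)
    return [(v - lo) / (hi - lo) for v in vec]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def classify(text, feature_sets, vocab, idf):
    """Step 4: pick the class whose feature set is most similar to the question."""
    q = min_max(tfidf_vector(text, vocab, idf))
    scores = {
        label: cosine(q, [1.0 if t in fs else 0.0 for t in vocab])
        for label, fs in feature_sets.items()
    }
    return max(scores, key=scores.get)


CORPUS = {
    "exam": ["when is the exam", "how do i register for the exam", "exam date for my course"],
    "thesis": ["when is the thesis defence", "how to submit my thesis", "thesis topic deadline"],
}
FEATURE_SETS = class_feature_sets(CORPUS)
VOCAB = sorted(set().union(*FEATURE_SETS.values()))  # total feature word set
IDF = build_idf(CORPUS, VOCAB)
```

With this toy corpus, `classify("when is my exam", FEATURE_SETS, VOCAB, IDF)` selects the `exam` class, because only that class's feature set overlaps the question.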

(2) question restatement: a synonym table is compiled from users' frequently asked questions and the business norms, and is used for synonym expansion of the keywords in the question; that is, each word in the keyword set is looked up in the synonym table, and if the word has synonyms, all of its synonyms are added to the keyword set;

(3) the online submodule of the problem analysis module is called by the document retrieval module and performs semantic parsing of the user question: it first performs intent recognition with the trained intent classification model, then performs word-sense expansion through question restatement; the resulting keyword set is the question feature word set, which is returned to the document retrieval module as the output value of the problem analysis module.
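The synonym expansion described in item (2) of this claim amounts to a table lookup per keyword. A minimal sketch follows; the synonym table contents and the function name `expand_keywords` are invented for illustration.

```python
# Hypothetical synonym table, as might be compiled from frequently asked questions.
SYNONYMS = {
    "exam": ["examination", "test"],
    "thesis": ["dissertation"],
}


def expand_keywords(keywords, synonym_table):
    """Append every synonym of every keyword, keeping order and avoiding duplicates."""
    expanded = list(keywords)
    seen = set(keywords)
    for word in keywords:
        for synonym in synonym_table.get(word, []):
            if synonym not in seen:
                seen.add(synonym)
                expanded.append(synonym)
    return expanded


feature_words = expand_keywords(["exam", "date"], SYNONYMS)
```

Here `feature_words` keeps the original keywords first and appends the synonyms of `exam`, while `date` has no entry in the table and is left unchanged.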

6. An intelligent retrieval method for online education, characterized by comprising an online process and an offline process;

wherein the online process is as follows:

(1) the student logs in to the system under their real name and asks a question by entering a natural-language description at the retrieval question-answering interface;

(2) the student ID is obtained from the user information, and the document retrieval module is called with the student ID and the user question as input parameters;

(3) the document retrieval submodule in the document retrieval module first calls the student status information module, with the student ID as the input parameter;

the student status information module accesses the user profile database, generates the student status information of the user according to the student ID, and returns it to the document retrieval submodule as the output value; the output value includes the student's enrolled school, enrollment batch, student-status batch, thesis batch, examination batch, graduation batch, and study progress information;

(4) the document retrieval submodule then calls the problem analysis module with the user question as the input parameter; the problem analysis module parses the user question, identifies the question intent and feature information, and returns the parsed question feature word set to the document retrieval submodule as the output value;

(5) based on the return values of the above two modules, namely the student status information and the question feature word set, and on the business rules, the document retrieval submodule uses regular-expression matching to search the document keyword tables and document contents in the document repository and locates the target document that can answer the user question;

(6) the document retrieval submodule then passes the target document and the question feature word set returned by the problem analysis module as parameters to the passage retrieval module, which performs semantic retrieval on the target document using the BM25 algorithm and extracts the target paragraphs most relevant to the question as the output value;

(7) two question-answering response modes are provided according to user needs: if the user uses the default quick response mode, the retrieved target paragraph is returned directly to the questioner as the answer, and the system responds quickly in this mode; if the user selects the precise answer mode, the retrieved target paragraph is passed as an input parameter to the answer extracting module, which predicts the result using the DR-ASF intelligent retrieval model trained with added manual similarity features, locates the start and end positions of the answer within the candidate target paragraph, and extracts a more precise answer, which is returned to the questioner;
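Steps (6) and (7) of the online process can be illustrated with a small BM25 ranking followed by a mode switch. This is a sketch under assumptions rather than the claimed system: the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, whitespace tokenization, the sample paragraphs, and the `toy_extractor` placeholder that stands in for the DR-ASF model are all choices made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, paragraphs, k1=1.5, b=0.75):
    """Score each paragraph against the query terms with Okapi BM25."""
    docs = [p.lower().split() for p in paragraphs]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_extractor(query_terms, paragraph):
    """Placeholder for the DR-ASF model: first sentence mentioning a query term."""
    for sentence in paragraph.split(". "):
        if any(t in sentence.lower().split() for t in query_terms):
            return sentence.strip(". ")
    return paragraph


def answer(query_terms, paragraphs, mode="quick", extractor=None):
    """Quick mode returns the best paragraph; precise mode refines it further."""
    if not paragraphs:
        return ""
    scores = bm25_scores(query_terms, paragraphs)
    best = paragraphs[max(range(len(paragraphs)), key=scores.__getitem__)]
    if mode == "precise" and extractor is not None:
        return extractor(query_terms, best)
    return best


PARAS = [
    "Tuition fees are due in March. Late payment incurs a charge.",
    "Students must bring an ID card. The exam takes place on 15 June.",
    "The library is open daily.",
]
quick = answer(["exam"], PARAS)
precise = answer(["exam"], PARAS, mode="precise", extractor=toy_extractor)
```

The quick mode skips the extractor entirely, which is why it responds faster; the precise mode pays for an extra model call in exchange for a shorter answer span.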

and the offline process is as follows:

(1) the document repository management submodule of the document retrieval module stores and processes notice-type documents in the offline phase; when a new notice document arrives, the submodule first saves the document title and content, parses the title and content with regular expressions, and stores the extracted keyword table in the document repository as well; the submodule is called only when a new notice document arrives and, after parsing it, saves the output values, namely the document title, content, and keyword table, to the document repository and then ends the task;

(2) the offline submodule of the problem analysis module uses the training corpus and the SVM classification algorithm to complete question intent recognition training, and deploys the trained intent classification model to the online submodule to provide the online question parsing function; the submodule runs while the system is offline, its input value is the training corpus, and its output value is the trained intent classification model;

(3) the offline submodule of the answer extracting module uses the training corpus and the DR-ASF algorithm with added manual similarity features to complete the training of the DR-ASF intelligent retrieval model, and deploys the trained DR-ASF intelligent retrieval model to the online submodule to provide the online answer extraction function; the submodule runs while the system is offline, its input value is the training corpus, and its output value is the trained DR-ASF intelligent retrieval model.
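The offline-train, deploy, online-load pattern shared by offline steps (2) and (3) can be sketched as follows. Everything here is an assumption made for illustration: the "model" is a plain table of word counts rather than an SVM or DR-ASF model, deployment is reduced to writing a JSON file, and the file name and function names are hypothetical.

```python
import json
import os
import tempfile
from collections import Counter


def train_offline(corpus):
    """Offline submodule: turn a labelled corpus into word counts per intent."""
    return {
        label: dict(Counter(w for q in questions for w in q.lower().split()))
        for label, questions in corpus.items()
    }


def deploy(model, path):
    """Write the trained model where the online submodule can read it."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model, fh)


def load_online(path):
    """Online submodule: load the deployed model."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def predict(model, question):
    """Pick the intent whose training words overlap the question most."""
    words = question.lower().split()
    return max(model, key=lambda label: sum(model[label].get(w, 0) for w in words))


TRAINING_CORPUS = {"exam": ["exam date", "exam room"], "thesis": ["thesis topic"]}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "intent_model.json")  # hypothetical file name
    deploy(train_offline(TRAINING_CORPUS), model_path)
    online_model = load_online(model_path)
```

Keeping training and serving on either side of a serialized file means the online submodule never needs the training corpus, which matches the input and output values listed for each offline submodule.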