CN110909147A - Method and system for training sorting result selection model output standard question method - Google Patents

Method and system for training sorting result selection model output standard question method

Info

Publication number
CN110909147A
CN110909147A (application number CN201911217885.XA)
Authority
CN
China
Prior art keywords
sequence
question
selection model
extended
result selection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911217885.XA
Other languages
Chinese (zh)
Other versions
CN110909147B (en)
Inventor
孔心宇
张晓彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911217885.XA priority Critical patent/CN110909147B/en
Publication of CN110909147A publication Critical patent/CN110909147A/en
Application granted granted Critical
Publication of CN110909147B publication Critical patent/CN110909147B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/332: Query formulation
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/338: Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

One or more embodiments of the present specification disclose a method for training a ranking result selection model to output a standard question. The method includes: obtaining a ranked sequence output by a ranking model, and determining a first result from the ranked sequence using the ranking result selection model, where the first result corresponds to a predicted standard question A; determining the accurate standard question B corresponding to the ranked sequence; and comparing whether the predicted standard question A is consistent with the accurate standard question B; if so, the ranking result selection model is rewarded, and otherwise it is penalized.

Description

Method and system for training a ranking result selection model to output a standard question
Technical Field
This specification relates to the field of natural language processing, and in particular to a method and system for training a ranking result selection model to output a standard question.
Background
In online intelligent question-answering systems, different users may phrase questions with the same content in different ways, so the standard question corresponding to the question a user inputs must be identified before the correct answer can be returned. Typically, a user-question recognition module matches the user's question against the extended questions of standard questions stored offline, obtains a score representing the similarity between the user's question and each extended question, and returns the standard question corresponding to the highest-scoring extended question. However, during matching, an extended question whose content does not actually match the user's input may be ranked first because of a superficial similarity in form or content, or because of other incidental factors, and the module will then output that question.
A method and system for training a ranking result selection model to output the standard question are therefore desirable, to improve the system's ability to return correct answers.
Disclosure of Invention
One aspect of the embodiments of this specification provides a method for training a ranking result selection model to output a standard question. The method includes: obtaining a ranked sequence output by a ranking model, and determining a first result from the ranked sequence using the ranking result selection model, where the first result corresponds to a predicted standard question A; determining the accurate standard question B corresponding to the ranked sequence; and comparing whether the predicted standard question A is consistent with the accurate standard question B, rewarding the ranking result selection model if so, and otherwise penalizing it.
Another aspect of the embodiments of this specification provides a system for training a ranking result selection model to output a standard question. The system includes: a result determination module configured to obtain a ranked sequence output by a ranking model and determine a first result from the ranked sequence using the ranking result selection model, where the first result corresponds to a predicted standard question A; a judgment module configured to determine the accurate standard question B corresponding to the ranked sequence; and a reward and punishment module configured to compare whether the predicted standard question A and the accurate standard question B are consistent, reward the ranking result selection model if so, and otherwise penalize it.
Another aspect of the embodiments of this specification provides an apparatus for training a ranking result selection model to output a standard question, the apparatus comprising at least one processor and at least one memory; the at least one memory is configured to store computer instructions; the at least one processor is configured to execute at least some of the computer instructions to implement the method for training a ranking result selection model to output a standard question.
Drawings
The present specification is further illustrated by exemplary embodiments, which are described in detail with reference to the accompanying drawings. These embodiments are not limiting; in these embodiments, like numerals indicate like structures, wherein:
FIG. 1 is a block diagram of a system for training a ranking result selection model to output a standard question, according to some embodiments of the present specification;
FIG. 2 is an exemplary flowchart of a method for training a ranking result selection model to output a standard question, according to some embodiments of the present specification; and
FIG. 3 is a schematic diagram of a reinforcement learning system in accordance with some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.
It should be understood that "system", "device", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts, portions or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.
As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. In general, the terms "comprise" and "include" merely indicate that the explicitly identified steps and elements are included; the steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
Flowcharts are used in this specification to illustrate operations performed by a system according to embodiments of this specification. It should be understood that the operations need not be performed in the exact order shown. Rather, the steps may be processed in reverse order or concurrently. Other operations may also be added to these flows, or one or more steps may be removed from them.
FIG. 1 is a block diagram of a system for training a ranking result selection model output criteria question shown in some embodiments herein.
As shown in FIG. 1, the system for training the ranking result selection model to output a standard question may include a sequence feature determination module 110, an input module 120, a result determination module 130, a judgment module 140, and a reward and punishment module 150.
The sequence feature determination module 110 may be configured to determine a sequence feature of the ranked sequence based on the extended question sequence and the score sequence. For a detailed description of determining the sequence feature of the ranked sequence from the extended question sequence and the score sequence, refer to FIG. 2; details are not repeated here.
The input module 120 may be configured to input the sequence feature into the ranking result selection model. For a detailed description of inputting the sequence feature into the ranking result selection model, refer to FIG. 2; details are not repeated here.
The result determination module 130 may be configured to obtain a ranked sequence output by the ranking model and determine a first result from the ranked sequence using the ranking result selection model, where the first result corresponds to the predicted standard question A. For a detailed description of obtaining the ranked sequence output by the ranking model and determining the first result from it using the ranking result selection model, refer to FIG. 2; details are not repeated here.
The judgment module 140 may be configured to determine the accurate standard question B corresponding to the ranked sequence. For a detailed description of determining the accurate standard question B for the ranked sequence, refer to FIG. 2; details are not repeated here.
The reward and punishment module 150 may be configured to compare whether the predicted standard question A and the accurate standard question B are consistent, reward the ranking result selection model if so, and otherwise penalize it. For a detailed description of comparing whether the predicted standard question A and the accurate standard question B are consistent, refer to FIG. 2; details are not repeated here.
It should be understood that the system and its modules shown in FIG. 1 may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD- or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules in this specification may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field programmable gate arrays and programmable logic devices, but also by software executed by various types of processors, or by a combination of the above hardware circuits and software (e.g., firmware).
It should be noted that the above description of the system for training the ranking result selection model to output a standard question and of its modules is provided only for convenience of description and does not limit this specification to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, having understood the principle of the system, modules may be combined arbitrarily, or sub-systems may be formed and connected to other modules, without departing from this principle. For example, the sequence feature determination module 110, the input module 120, the result determination module 130, the judgment module 140, and the reward and punishment module 150 disclosed in FIG. 1 may be different modules in one system, or one module may implement the functions of two or more of the modules described above. For example, the judgment module 140 and the reward and punishment module 150 may be two separate modules, or a single module may provide both the judgment function and the reward and punishment function. Such variations are within the scope of the present disclosure.
FIG. 2 is an exemplary flow diagram of a method for training a ranking result selection model output criteria question shown in some embodiments herein.
As shown in FIG. 2, the method for training the ranking result selection model to output a standard question may include:
Step 210: obtain a ranked sequence output by the ranking model, and determine a first result from the ranked sequence using the ranking result selection model, where the first result corresponds to a predicted standard question A.
In particular, this step may be performed by the result determination module 130.
In some embodiments, the ranking model may include a mathematical-statistical model, in which the ranked sequence is derived by formula calculation (e.g., a degree-of-match calculation) on feature values. The ranking model may also include machine learning models, including but not limited to a Logistic Regression (LR) model, a K-Nearest Neighbor (KNN) model, a Naive Bayes (NB) model, a Support Vector Machine (SVM), a Decision Tree (DT) model, a Classification and Regression Tree (CART) model, a Gradient Boosting Decision Tree (GBDT) model, an XGBoost (eXtreme Gradient Boosting) model, and the like. In some embodiments, the ranking model may be used to match user query questions against the standard questions stored offline. For example, the user question "where is the Great Wall?" may be matched to the stored standard question "where is the Great Wall located?", and the online question-answering system can then recall the answer corresponding to that standard question from the system knowledge base and return it to the user.
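The embodiments do not prescribe a particular matching implementation. Purely as an illustrative sketch of the kind of similarity-based matching described above, the snippet below scores stored extended questions against a user query with a simple bag-of-words cosine similarity and returns them best-first; the function names, the similarity measure, and the example data are assumptions for illustration only.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity between two whitespace-tokenized strings."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_extended_questions(user_query, extended_questions):
    """Return (extended question, standard question id, score) tuples, highest score first."""
    scored = [(ext, std_id, cosine_similarity(user_query, ext))
              for ext, std_id in extended_questions]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Each stored extended question is mapped to the standard question it belongs to.
stored = [
    ("why did my sesame credit score get lower", "A"),
    ("my sesame credit score dropped what should I do", "A"),
    ("how do I raise my sesame credit score", "B"),
]
print(rank_extended_questions("why did my sesame credit score drop", stored))
```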
In some embodiments, the ranked sequence may include an extended question sequence. The extended question sequence may consist of a plurality of extended questions, e.g., 5, 10, or 15. In some embodiments, an extended question has the same semantics as the standard question but a different phrasing. For example, a standard question may be "why did my sesame credit score fall?", and the extended questions corresponding to it may include: "why did my sesame credit score get lower?", "my sesame credit score dropped, what should I do?", "why does the sesame credit score decline?", and so on. In some embodiments, the extended questions that make up the extended question sequence may correspond to at least one standard question. For example, an extended question sequence may include one or more extended questions a_i corresponding to the predicted standard question A and one or more extended questions b_i corresponding to the accurate standard question B. In some embodiments, the ranking model may rank the extended questions by their similarity to the user's query question to obtain the extended question sequence.
For example, for the user query question "why did my sesame credit score drop?", suppose there are multiple extended questions as shown in Table 1; the extended question sequence may then be a1 a2 b1 c1, where a1 and a2 correspond to standard question A, b1 corresponds to standard question B, and c1 corresponds to standard question C.
Table 1 (rendered as an image in the original publication): the example extended questions a1, a2, b1 and c1 and the standard questions A, B and C to which they correspond.
In some embodiments, the ranked sequence may include a score sequence in addition to the extended question sequence. The score sequence may consist of a plurality of scores, which represent the similarity of the extended questions to the query question. In some embodiments, a score may be represented as a probability value between 0 and 1. For example, for the extended question sequence in the example above, the score sequence corresponding to a1 a2 b1 c1 may be [0.98, 0.96, 0.94, 0.69]. The score of a1 is 0.98, indicating that a1 has a 98% probability of matching the query question; the score of c1 is 0.69, indicating that c1 has a 69% probability of matching the query question.
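Concretely, the ranked sequence from the example above can be represented as parallel extended-question and score sequences. The snippet below is only a data-representation sketch; the identifiers and score values are the hypothetical ones from the example.

```python
# Extended question sequence a1 a2 b1 c1 with its score sequence, as in the example above.
# Each entry: (extended question id, standard question it belongs to, similarity score).
ranked_sequence = [
    ("a1", "A", 0.98),
    ("a2", "A", 0.96),
    ("b1", "B", 0.94),
    ("c1", "C", 0.69),
]

extended_question_sequence = [item[0] for item in ranked_sequence]  # ["a1", "a2", "b1", "c1"]
score_sequence = [item[2] for item in ranked_sequence]              # [0.98, 0.96, 0.94, 0.69]
```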
In some embodiments, a sequence feature of the ranked sequence may be determined based on the extended question sequence and the score sequence, and then input into the ranking result selection model by the input module 120.
In some embodiments, the sequence feature of the ranked sequence is composed of a plurality of question features. A question feature fuses several pieces of information about each extended question in the extended question sequence, such as its content, score, and position. In one embodiment, a feature extraction model may be used to extract a feature vector for each extended question, and the feature vector, the extended question's score in the score sequence, and its position in the extended question sequence are then concatenated to obtain a question feature. In some embodiments, the feature extraction model may include, but is not limited to, a BERT model (Bidirectional Encoder Representations from Transformers), a Recurrent Neural Network (RNN) model, a Convolutional Neural Network (CNN) model, and the like. In some embodiments, the features of the extended questions may be extracted in other ways or obtained from other modules in the system, which is not limited by this specification.
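A minimal sketch of the concatenation described above is shown below. The feature extraction model is stubbed out by a hypothetical embed function (standing in for a BERT/RNN/CNN encoder), and the 8-dimensional embedding size is an arbitrary illustrative choice.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    """Stand-in for a feature extraction model (e.g. BERT/RNN/CNN); returns a fixed-size vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(8)  # toy 8-dimensional "embedding"


def question_feature(ext_question: str, score: float, position: int) -> np.ndarray:
    """Concatenate the extended question's feature vector, its score, and its position."""
    return np.concatenate([embed(ext_question), [score, float(position)]])


def sequence_feature(extended_questions, scores) -> np.ndarray:
    """Stack the question features of every extended question in the ranked sequence."""
    return np.stack([question_feature(q, s, pos)
                     for pos, (q, s) in enumerate(zip(extended_questions, scores))])


state = sequence_feature(["a1 text", "a2 text", "b1 text", "c1 text"],
                         [0.98, 0.96, 0.94, 0.69])
print(state.shape)  # (4, 10): four extended questions, 8-dim embedding + score + position
```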
In some embodiments, the ranking result selection model may be constructed based on a machine learning model, including but not limited to a Logistic Regression (LR) model, a K-Nearest Neighbor (KNN) model, a Gradient Boosting Decision Tree (GBDT) model, an XGBoost (eXtreme Gradient Boosting) model, and the like.
In some embodiments, the ranking result selection model may output a first result corresponding to the predicted standard question A. For example, for the extended question sequence b2 a1 a2 a4 a6 c1, because several extended questions a_i corresponding to standard question A appear in the sequence and the extended question a1 is near the front, the ranking result selection model selects a1 as the first result. In some embodiments, the extended question corresponding to the predicted standard question A may not be ranked first in the extended question sequence. For example, in the sequence b2 a1 a2 a4 a6 c1, the extended question b2 corresponding to standard question B is ranked first because its phrasing is most similar to the query question and it has the highest score. In some embodiments, which standard question the ranking result selection model selects may be related to the number, scores, and positions of the extended questions in the sequence that correspond to that standard question. For example, in the extended question sequence b2 a1 a2 a4 a6 c1: the closer the scores of a1, a2, a4 and a6 are to 1, the higher the probability that the ranking result selection model selects standard question A; the closer a1, a2, a4 and a6 are to the front of the sequence, the higher the probability that it selects standard question A; and the more of them appear in the sequence, the higher the probability that it selects standard question A.
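As an illustrative stand-in for the trained selection model (not the model of the embodiments itself), the sketch below aggregates exactly the three kinds of evidence just described: the number of extended questions a standard question contributes, their scores, and their positions in the sequence.

```python
from collections import defaultdict


def select_standard_question(ranked_sequence):
    """ranked_sequence: list of (extended question id, standard question id, score),
    ordered from highest- to lowest-ranked; returns the best-supported standard question."""
    evidence = defaultdict(float)
    for position, (_, std_id, score) in enumerate(ranked_sequence):
        position_weight = 1.0 / (position + 1)       # earlier positions count more
        evidence[std_id] += score * position_weight  # every occurrence adds evidence
    return max(evidence, key=evidence.get)


sequence = [("b2", "B", 0.97), ("a1", "A", 0.96), ("a2", "A", 0.95),
            ("a4", "A", 0.93), ("a6", "A", 0.90), ("c1", "C", 0.70)]
print(select_standard_question(sequence))  # "A": many high-scoring, well-placed a_i entries
```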
Step 220: determine the accurate standard question B corresponding to the ranked sequence.
Specifically, this step may be performed by the judgment module 140.
In some embodiments, each ranked sequence in the training data set may be labeled in advance with the accurate standard question to which it corresponds. The label data may be labeled manually or obtained in other ways. The judgment module 140 may determine the accurate standard question B corresponding to the ranked sequence from the label data of the training data.
Step 230: compare whether the predicted standard question A is consistent with the accurate standard question B; if so, reward the ranking result selection model, and otherwise penalize it.
In particular, this step may be performed by the reward and punishment module 150.
In some embodiments, the ranking result selection model is rewarded if the predicted standard question A and the accurate standard question B are consistent, and penalized if they are inconsistent.
In some embodiments, the ranking result selection model may be trained using supervised learning. In some embodiments, it may also be trained using reinforcement learning, which is the preferred approach. In many practical problems, such as chess and shogi, there are a vast number of possible positions during play, and with supervised learning it would be difficult to provide a label for every position in advance. Reinforcement learning, in contrast, can obtain a result without any labels by first trying some behaviors, and through feedback on whether the results are right or wrong it continuously adjusts its earlier behavior, so the algorithm learns which behavior to choose in which situation to obtain the best result. Reinforcement learning algorithms may include, but are not limited to, dynamic programming algorithms, Monte Carlo algorithms, temporal-difference algorithms, Q-learning algorithms, and the like.
For details on how the ranking result selection model is rewarded and penalized during reinforcement learning, see FIG. 3; they are not described here.
It should be noted that the above description of process 200 is intended for purposes of illustration and description only and is not intended to limit the applicability of one or more embodiments of the present disclosure. Various modifications and alterations to flow 200 may occur to those skilled in the art, as guided by one or more of the embodiments described herein. However, such modifications and variations are intended to be within the scope of one or more embodiments of the present disclosure. For example, step 210 may be divided into two sub-steps 210-1 and 210-2. A sorted sequence of sorted model outputs is obtained in step 210-1, and a first result is determined from the sorted sequence using a sorted result selection model in step 210-2.
FIG. 3 is a schematic diagram of a reinforcement learning system in accordance with some embodiments of the present description.
As shown in FIG. 3, a reinforcement learning system generally includes an agent and an environment, and the agent continuously learns and optimizes its strategy through interaction with, and feedback from, the environment. The agent observes the state of the environment and, following a certain policy, decides which action to take for the current state. The action acts on the environment, changes its state, and produces feedback to the agent, also known as a reward or reward score.
In some embodiments, the sequence feature obtained in step 210 may be taken as the environment state s, the ranking result selection model serves as the agent, and the selection it performs in the environment state s is the action act.
In some embodiments, if the ranking result selection model acting as the agent selects positive example data as the first result, a reward is given; otherwise a penalty is given. The positive example data are the positive samples in the training data. For example, for the extended question sequence b2 b1 a2 b4 a6 c1, if the corresponding accurate standard question is question B, then b1, b2 and b4 are all positive example data. In this example, if the ranking result selection model outputs any of b1, b2 or b4 as the first result, a reward is given; otherwise a penalty is given.
In some embodiments, if negative example data exist in the training sample data and the ranking result selection model does not select the negative example data as the first result, a reward is given; otherwise a penalty is given. For example, in the extended question sequence b2 x1 a2 a4 a6 c1, x1 is negative example data; if the ranking result selection model does not select x1 as the first result, a reward is given, otherwise a penalty is given.
Negative example data may be a question that has been determined not to correspond to the accurate standard question, while it remains unclear which standard question it does belong to. Such data cannot be used in supervised learning because no accurate label exists for it. In some embodiments, negative example data may be collected from multi-turn dialogues in which the user's reply indicates that the returned answer was wrong. In some embodiments, negative example data may also be collected in other application scenarios, which is not limited by this specification. In some embodiments, the immediate reward or penalty for the ranking result selection model may be implemented by a reward function. The reward function feeds back to the ranking result selection model which behaviors are right and which are wrong. For example, if the ranking result selection model makes a correct selection, the reward function adds a certain score to it as a reward, and otherwise deducts a certain score as a penalty. In some embodiments, the long-term return of the ranking result selection model, i.e., the cumulative reward of its long-term behavior, may be implemented by a value function.
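A minimal sketch of such a reward function follows, assuming the labels described above: a set of positive examples (extended questions of the accurate standard question B) and an optional set of negative examples. The +1/-1 magnitudes are illustrative assumptions, not values fixed by the embodiments.

```python
def reward(selected, positive_examples=frozenset(), negative_examples=frozenset()):
    """Immediate reward for the selection made by the ranking result selection model.

    positive_examples: extended questions of the accurate standard question B (may be empty
        when only negative examples are labelled for this training sample).
    negative_examples: extended questions known not to correspond to the accurate standard question.
    """
    if selected in negative_examples:
        return -1.0                                   # penalty: picked known-bad data
    if positive_examples:
        return 1.0 if selected in positive_examples else -1.0
    return 1.0                                        # no positive label: rewarded for avoiding negatives


print(reward("b1", positive_examples={"b1", "b2", "b4"}))   # 1.0
print(reward("x1", negative_examples={"x1"}))               # -1.0
print(reward("a2", negative_examples={"x1"}))               # 1.0 (avoided the negative example)
```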
The goal of reinforcement learning is to obtain an optimal strategy through the rewards and penalties given to the agent. At each step of training the ranking result selection model, a policy is used to select an action based on the environment state. In some embodiments, each training round selects, with higher probability, the known action expected to receive the greatest reward, and, with lower probability, a randomly generated exploratory action. Combining the two allows the ranking result selection model to learn a strategy that obtains the best long-term return. When, under this strategy, the probability that the standard question A predicted by the ranking result selection model is consistent with the accurate standard question B exceeds a preset probability threshold, model training ends. For example, if the preset probability threshold is 99.5%, training ends when the probability that the ranking result selection model selects the correct result reaches 99.6%.
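A sketch of this epsilon-greedy action selection and the stopping rule is given below; the action representation, the epsilon value, and the 99.5% threshold are illustrative assumptions rather than values fixed by the embodiments.

```python
import random


def choose_action(q_values: dict, epsilon: float = 0.1):
    """Epsilon-greedy policy: usually exploit the best-known action, occasionally explore.

    q_values maps each candidate action (e.g. a position in the ranked sequence) to its
    currently estimated long-term return."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # exploratory, randomly generated action
    return max(q_values, key=q_values.get)     # known action with the greatest expected reward


def training_finished(correct: int, total: int, threshold: float = 0.995) -> bool:
    """Stop training once selection accuracy exceeds the preset probability threshold."""
    return total > 0 and correct / total > threshold


print(choose_action({0: 0.2, 1: 0.9, 2: 0.1}))  # usually 1, sometimes a random exploration
print(training_finished(996, 1000))             # True: 99.6% exceeds the 99.5% threshold
```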
As described in one or more embodiments of this specification, when the ranking result selection model is trained with reinforcement learning, the model first obtains results by trying some selections, and then adjusts its earlier selection behavior according to feedback on whether those results were right or wrong. Through such continuous adjustment, the ranking result selection model can learn an optimal strategy for correctly selecting the accurate standard question from the ranked sequence output by the ranking model, thereby improving the accuracy of user-question recognition in the online dialogue system.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.
Also, the description uses specific words to describe embodiments of the description. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present description may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.
Computer program code required for the operation of various portions of this specification may be written in any one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, or Python, a conventional programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, or ABAP, a dynamic programming language such as Python, Ruby, or Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or as a service, such as software as a service (SaaS).
Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that, in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not to be interpreted as implying that the claimed subject matter requires more features than are expressly recited in each claim. Indeed, an embodiment may have fewer than all of the features of a single embodiment disclosed above.
Some embodiments use numbers to describe quantities of components, attributes, and the like; it should be understood that such numbers used in describing the embodiments are in some instances qualified by the modifiers "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the stated number allows a variation of ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending on the desired properties of the individual embodiment. In some embodiments, a numerical parameter should be read in light of the specified significant digits and interpreted using ordinary rounding. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable.
For each patent, patent application, patent application publication, and other material, such as articles, books, specifications, publications, and documents, cited in this specification, the entire contents thereof are hereby incorporated by reference into this specification, except for any application history document that is inconsistent with or conflicts with the contents of this specification, and any document (whether now or later appended to this specification or its claims) that limits the broadest scope of the claims of this specification. It is to be understood that, if the descriptions, definitions, and/or uses of terms in the materials accompanying this specification are inconsistent with or contrary to those in this specification, the descriptions, definitions, and/or uses of terms in this specification shall control.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims (19)

1. A method for training a ranking result selection model to output a standard question, the method comprising:
obtaining a ranked sequence output by a ranking model, and determining a first result from the ranked sequence using the ranking result selection model, wherein the first result corresponds to a predicted standard question A;
determining an accurate standard question B corresponding to the ranked sequence;
and comparing whether the predicted standard question A is consistent with the accurate standard question B, rewarding the ranking result selection model if so, and otherwise penalizing the ranking result selection model.
2. The method of claim 1, wherein the ranked sequence comprises at least:
an extended question sequence consisting of a plurality of extended questions, the plurality of extended questions corresponding to at least one standard question.
3. The method of claim 2, wherein the ranked sequence further comprises:
a score sequence consisting of a plurality of scores, the plurality of scores representing the similarity of the plurality of extended questions to the query question.
4. The method of claim 3, further comprising, after obtaining the ranked sequence output by the ranking model:
determining a sequence feature of the ranked sequence according to the extended question sequence and the score sequence;
and inputting the sequence feature into the ranking result selection model.
5. The method of claim 4, wherein determining the sequence feature of the ranked sequence according to the extended question sequence and the score sequence comprises:
extracting feature vectors of the plurality of extended questions in the extended question sequence using a feature extraction model;
concatenating the feature vector of each extended question, the score corresponding to the extended question, and the position of the extended question in the extended question sequence to obtain a plurality of question features;
the sequence feature being composed of the plurality of question features.
6. The method of claim 5, wherein comparing whether the predicted standard question A and the accurate standard question B are consistent, rewarding the ranking result selection model if so, and otherwise penalizing the ranking result selection model comprises:
training the ranking result selection model by reinforcement learning, wherein the sequence feature serves as an environment state s, the ranking result selection model serves as the agent, and the selection performed in the environment state s serves as the action act;
and giving a reward if the agent selects positive example data as the first result, and otherwise giving a penalty, wherein the positive example data correspond to the accurate standard question B.
7. The method of claim 6, wherein comparing whether the predicted standard question A and the accurate standard question B are consistent, rewarding the ranking result selection model if so, and otherwise penalizing the ranking result selection model further comprises:
giving a reward if negative example data exist in the training sample data and the agent does not select the negative example data as the first result, and otherwise giving a penalty, wherein the negative example data have been determined not to correspond to the accurate standard question B.
8. The method of claim 7, wherein comparing whether the predicted standard question A and the accurate standard question B are consistent, rewarding the ranking result selection model if so, and otherwise penalizing the ranking result selection model further comprises:
ending the model training if the probability that the predicted standard question A is consistent with the accurate standard question B is greater than a preset probability threshold.
9. The method of claim 8, wherein the extended question sequence comprises at least:
one or more extended questions a_i corresponding to the predicted standard question A, and one or more extended questions b_i corresponding to the accurate standard question B.
10. A system for training a ranking result selection model to output a standard question, the system comprising:
a result determination module configured to obtain a ranked sequence output by a ranking model and determine a first result from the ranked sequence using the ranking result selection model, wherein the first result corresponds to a predicted standard question A;
a judgment module configured to determine an accurate standard question B corresponding to the ranked sequence;
and a reward and punishment module configured to compare whether the predicted standard question A and the accurate standard question B are consistent, reward the ranking result selection model if so, and otherwise penalize the ranking result selection model.
11. The system of claim 10, wherein the ranked sequence comprises at least:
an extended question sequence consisting of a plurality of extended questions, the plurality of extended questions corresponding to at least one standard question.
12. The system of claim 11, wherein the ranked sequence further comprises:
a score sequence consisting of a plurality of scores, the plurality of scores representing the similarity of the plurality of extended questions to the query question.
13. The system of claim 12, further comprising:
a sequence feature determination module configured to determine a sequence feature of the ranked sequence according to the extended question sequence and the score sequence;
and an input module configured to input the sequence feature into the ranking result selection model.
14. The system of claim 13, wherein determining the sequence feature of the ranked sequence according to the extended question sequence and the score sequence comprises:
extracting feature vectors of the plurality of extended questions in the extended question sequence using a feature extraction model;
concatenating the feature vector of each extended question, the score corresponding to the extended question, and the position of the extended question in the extended question sequence to obtain a plurality of question features;
the sequence feature being composed of the plurality of question features.
15. The system of claim 14, wherein comparing whether the predicted standard question A and the accurate standard question B are consistent, rewarding the ranking result selection model if so, and otherwise penalizing the ranking result selection model comprises:
training the ranking result selection model by reinforcement learning, wherein the sequence feature serves as an environment state s, the ranking result selection model serves as the agent, and the selection performed in the environment state s serves as the action act;
and giving a reward if the agent selects positive example data as the first result, and otherwise giving a penalty, wherein the positive example data correspond to the accurate standard question B.
16. The system of claim 15, wherein comparing whether the predicted standard question A and the accurate standard question B are consistent, rewarding the ranking result selection model if so, and otherwise penalizing the ranking result selection model further comprises:
giving a reward if negative example data exist in the training sample data and the agent does not select the negative example data as the first result, and otherwise giving a penalty, wherein the negative example data have been determined not to correspond to the accurate standard question B.
17. The system of claim 16, wherein comparing whether the predicted standard question A and the accurate standard question B are consistent, rewarding the ranking result selection model if so, and otherwise penalizing the ranking result selection model further comprises:
ending the model training if the probability that the predicted standard question A is consistent with the accurate standard question B is greater than a preset probability threshold.
18. The system of claim 17, wherein the extended question sequence comprises at least:
one or more extended questions a_i corresponding to the predicted standard question A, and one or more extended questions b_i corresponding to the accurate standard question B.
19. An apparatus for training a ranking result selection model to output a standard question, wherein the apparatus comprises at least one processor and at least one memory;
the at least one memory is for storing computer instructions;
the at least one processor is configured to execute at least some of the computer instructions to implement the method of any of claims 1-9.
CN201911217885.XA 2019-12-02 2019-12-02 Method and system for training sorting result selection model output standard question method Active CN110909147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911217885.XA CN110909147B (en) 2019-12-02 2019-12-02 Method and system for training sorting result selection model output standard question method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911217885.XA CN110909147B (en) 2019-12-02 2019-12-02 Method and system for training sorting result selection model output standard question method

Publications (2)

Publication Number Publication Date
CN110909147A true CN110909147A (en) 2020-03-24
CN110909147B CN110909147B (en) 2022-06-21

Family

ID=69821576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911217885.XA Active CN110909147B (en) 2019-12-02 2019-12-02 Method and system for training sorting result selection model output standard question method

Country Status (1)

Country Link
CN (1) CN110909147B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112083806A (en) * 2020-09-16 2020-12-15 华南理工大学 Self-learning emotion interaction method based on multi-modal recognition
CN112395405A (en) * 2020-12-30 2021-02-23 支付宝(杭州)信息技术有限公司 Query document sorting method and device and electronic equipment

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090248667A1 (en) * 2008-03-31 2009-10-01 Zhaohui Zheng Learning Ranking Functions Incorporating Boosted Ranking In A Regression Framework For Information Retrieval And Ranking
CN103530321A (en) * 2013-09-18 2014-01-22 上海交通大学 Sequencing system based on machine learning
CN105912673A (en) * 2016-04-11 2016-08-31 天津大学 Optimization method for Micro Blog search based on personalized characteristics of user
CN108509461A (en) * 2017-02-28 2018-09-07 华为技术有限公司 A kind of sequence learning method and server based on intensified learning
CN107169534A (en) * 2017-07-04 2017-09-15 北京京东尚科信息技术有限公司 Model training method and device, storage medium, electronic equipment
CN107423437A (en) * 2017-08-04 2017-12-01 逸途(北京)科技有限公司 A kind of Question-Answering Model optimization method based on confrontation network intensified learning
CN110110169A (en) * 2018-01-26 2019-08-09 上海智臻智能网络科技股份有限公司 Man-machine interaction method and human-computer interaction device
US20190295004A1 (en) * 2018-03-23 2019-09-26 Adobe Inc. Recommending sequences of content with bootstrapped reinforcement learning
CN109062919A (en) * 2018-05-31 2018-12-21 腾讯科技(深圳)有限公司 A kind of content recommendation method and device based on deeply study
CN109087178A (en) * 2018-08-28 2018-12-25 清华大学 Method of Commodity Recommendation and device
CN109685673A (en) * 2018-12-19 2019-04-26 前海企保科技(深圳)有限公司 A kind of insurance coupled customer service system and method based on artificial intelligence
CN109766423A (en) * 2018-12-29 2019-05-17 上海智臻智能网络科技股份有限公司 Answering method and device neural network based, storage medium, terminal
CN109918489A (en) * 2019-02-28 2019-06-21 上海乐言信息科技有限公司 A kind of knowledge question answering method and system of more strategy fusions
CN109948152A (en) * 2019-03-06 2019-06-28 北京工商大学 A kind of Chinese text grammer error correcting model method based on LSTM
CN109978138A (en) * 2019-03-20 2019-07-05 哈尔滨工业大学 The structural reliability methods of sampling based on deeply study
CN109978058A (en) * 2019-03-28 2019-07-05 腾讯科技(深圳)有限公司 Determine the method, apparatus, terminal and storage medium of image classification
CN110197730A (en) * 2019-04-28 2019-09-03 平安科技(深圳)有限公司 A kind of method, apparatus of intelligent diagnosis, electronic equipment and storage medium
CN110223180A (en) * 2019-05-10 2019-09-10 北京航空航天大学 Portfolio Selection Based method based on depth attention network and intensified learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MAI XU: "Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach", IEEE Transactions on Pattern Analysis and Machine Intelligence *
何柳柳 et al.: "A reinforcement learning reward mechanism for continuous-integration test optimization", 《软件学报》 (Journal of Software) *
马世龙: "A survey of big data and deep learning", 《智能系统学报》 (CAAI Transactions on Intelligent Systems) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112083806A (en) * 2020-09-16 2020-12-15 华南理工大学 Self-learning emotion interaction method based on multi-modal recognition
CN112083806B (en) * 2020-09-16 2021-10-26 华南理工大学 Self-learning emotion interaction method based on multi-modal recognition
CN112395405A (en) * 2020-12-30 2021-02-23 支付宝(杭州)信息技术有限公司 Query document sorting method and device and electronic equipment
CN112395405B (en) * 2020-12-30 2021-04-27 支付宝(杭州)信息技术有限公司 Query document sorting method and device and electronic equipment

Also Published As

Publication number Publication date
CN110909147B (en) 2022-06-21

Similar Documents

Publication Publication Date Title
CN110188272B (en) Community question-answering website label recommendation method based on user background
CN111949787A (en) Automatic question-answering method, device, equipment and storage medium based on knowledge graph
Sohrabi et al. The effect of keyword repetition in abstract and keyword frequency per journal in predicting citation counts
US11640505B2 (en) Systems and methods for explicit memory tracker with coarse-to-fine reasoning in conversational machine reading
US11030404B1 (en) System, method, and computer program for using machine learning to calibrate job description based on diversity criteria
CN110909147B (en) Method and system for training sorting result selection model output standard question method
Chang et al. Towards an improved Adaboost algorithmic method for computational financial analysis
Saleh Machine Learning Fundamentals: Use Python and scikit-learn to get up and running with the hottest developments in machine learning
CN110704627B (en) Method and system for training classification model
CN116991996A (en) Construction method and application of question-answer interaction model with cognitive reasoning capability
CN114996464A (en) Text grading method and device using ordered information
CN112949746B (en) Big data processing method applied to user behavior analysis and artificial intelligence server
Beni et al. Aligning the free-energy principle with Peirce’s logic of science and economy of research
US10949759B2 (en) Identification of a series of compatible components using artificial intelligence
Drăgulescu et al. Predicting assignment submissions in a multi-class classification problem
US20200302541A1 (en) Resource processing method, storage medium, and computer device
CN111753554A (en) Method and device for generating intention knowledge base
CN110717028B (en) Method and system for eliminating interference problem pairs
CN115191002A (en) Matching system, matching method, and matching program
Miyazaki et al. Proposal and evaluation of the active course classification support system with exploitation-oriented learning
CN113011689A (en) Software development workload assessment method and device and computing equipment
CN111324722B (en) Method and system for training word weight model
Ciaburro Keras reinforcement learning projects: 9 projects exploring popular reinforcement learning techniques to build self-learning agents
WO2019053488A1 (en) Identification of a series of compatible components using artificial intelligence
Wedell Multialternative choice models

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant