CN113010667A - Training method for machine learning decision model by using natural language corpus - Google Patents

Info

Publication number
CN113010667A
CN113010667A (application CN201911327987.7A)
Authority
CN
China
Prior art keywords
data, natural, vector, natural text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911327987.7A
Other languages
Chinese (zh)
Inventor
李亚伦
林昀娴
王道维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201911327987.7A
Publication of CN113010667A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for training a machine learning decision model on a natural language corpus, implemented by a computer device storing a plurality of natural texts. Each natural text is marked with a target decision result among a plurality of decision results and includes a plurality of reason data related to at least one object to be described by the natural text. The method comprises the following steps: for each reason data corresponding to each natural text, obtaining a corresponding reason data vector group using a word segmentation algorithm and a sentence-to-vector algorithm; for each natural text, concatenating the reason data vector groups corresponding to the natural text in a fixed order to form an equivalent vector group; and obtaining a decision model using a supervised classification algorithm according to the equivalent vector group corresponding to each natural text and its corresponding target decision result. As a result, no option-category questionnaire needs to be additionally defined, and the accuracy of classification decisions can be effectively improved.

Description

Training method for machine learning decision model by using natural language corpus
Technical Field
The invention relates to a method for training an artificial intelligence model, and in particular to a machine-learning-based model training method for making classification decisions on natural language texts.
Background
Past methods for decision prediction using machine learning models have relied primarily on manually labeled, typed (structured) data.
Outside the field of natural language processing, one example is the practice of Huang Shi Chun and Shao Xuan Lei, "Applying Machine Learning to Predict Court Judgments" (2017, legal informatics): annotators manually fill in a predefined option-category questionnaire according to the key information described in a court judgment (for example, if the judgment states that one party's annual income is 800,000, the annual-income option "800,000 to 1,000,000" is selected in the questionnaire), thereby converting the linguistic data into typed data that serves as model training data, after which a model is built by data mining. However, this method requires manually redefining a new option-category questionnaire for every type of corpus, which makes it difficult to extend the training method to a wider range of corpora.
In the field of natural language (including text and speech) processing, there are methods for classifying large-scale corpora, such as the topic models LSA (Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation), which can group corpora by similar topics. These methods, however, are suitable only for coarse classification; for closely related topics they cannot yet provide classification effective enough for decision prediction.
In view of the above, a novel training method for machine learning decision models is needed to overcome the problems faced by the foregoing techniques.
Disclosure of Invention
The invention aims to provide a machine learning decision model training method that is based on natural language technology, does not require an additional option-category questionnaire, and effectively improves the accuracy of classification decisions.
The training method of the invention for a machine learning decision model using a natural language corpus is implemented by a computer device and comprises the following steps.
In step (A), for each reason data corresponding to each natural text, the computer device obtains a reason data vector group corresponding to the reason data.
In step (B), for each natural text, the computer device concatenates the reason data vector groups corresponding to the natural text into an equivalent vector group according to a first sequence.
In step (C), the computer device uses a supervised classification algorithm, at least according to the equivalent vector group corresponding to each natural text and the target decision result corresponding to each natural text, to obtain a decision model for marking an unmarked natural text to be decided as one of the decision results.
In the disclosed training method, step (A) comprises the following sub-steps:
(A-1) for each reason data corresponding to each natural text, obtaining, by the computer device, preprocessed reason data corresponding to the reason data using a preprocessing algorithm; and
(A-2) for each preprocessed reason data corresponding to each natural text, obtaining, by the computer device, the reason data vector group corresponding to the preprocessed reason data using a sentence-to-vector algorithm.
Alternatively, in the disclosed training method, step (A) comprises the following sub-steps:
(A-1) for each reason data corresponding to each natural text, using machine reading (text-to-speech), by the computer device, to obtain reason voice data corresponding to the reason data; and
(A-2) for each reason voice data corresponding to each natural text, obtaining, by the computer device, the reason data vector group corresponding to the reason voice data using a speech-to-vector algorithm.
In the disclosed training method, each natural text further includes a plurality of pre-marked neutral data that do not relate to any object to be described by the natural text, and before step (C) the method further comprises the following steps:
(D) for each neutral data corresponding to each natural text, obtaining, through the computer device, a neutral vector group corresponding to the neutral data;
(E) obtaining, through the computer device, at least one selected reason data vector group from the reason data vector groups corresponding to a selected natural text chosen from the natural texts;
(F) for each selected reason data vector group, obtaining, through the computer device, a recombined reason data vector group related to the selected reason data vector group according to the selected reason data vector group and any neutral vector group corresponding to the natural texts; and
(G) concatenating, through the computer device, the at least one recombined reason data vector group and the unselected reason data vector groups of the selected natural text into another equivalent vector group according to the first sequence; and
in step (C), the computer device obtains the decision model using a supervised classification algorithm according to the target decision result corresponding to the selected natural text and the other equivalent vector group, as well as the equivalent vector group corresponding to each natural text.
In the disclosed training method, the reason data corresponding to each natural text includes positive reason data having positive meaning for each object of the natural text and negative reason data having negative meaning for each object of the natural text, wherein:
in step (A), the reason data vector groups corresponding to each natural text include a positive reason data vector group converted from the positive reason data of each object and a negative reason data vector group converted from the negative reason data of each object; and
in step (B), for each natural text, the computer device concatenates the positive and negative reason data vector groups corresponding to each object of the natural text into the equivalent vector group according to the first sequence.
The disclosed training method further comprises the following steps after step (B):
(H) concatenating, by the computer device, the reason data vector groups corresponding to a selected natural text chosen from the natural texts into another equivalent vector group according to a second sequence, wherein the second sequence exchanges the positions that two positive reason data vector groups of different selected objects occupy in the first sequence, and likewise exchanges the positions of the two corresponding negative reason data vector groups;
(I) obtaining, by the computer device, the target decision result corresponding to the other equivalent vector group of step (H) based on the target decision result corresponding to the selected natural text of step (H); and
in step (C), the computer device obtains the decision model using a supervised classification algorithm according to the equivalent vector group and target decision result corresponding to each natural text, as well as the other equivalent vector group and its corresponding target decision result.
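Steps (H) and (I) above, which swap the two objects' vector groups and remap the decision result, can be sketched as follows. The group layout [pos1, neg1, pos2, neg2] and the label encoding (0 = first object wins, 1 = second object wins, 2 = tie) are assumptions made for illustration, not part of the patent text.

```python
import numpy as np

def swap_augment(groups, label):
    """Second-sequence augmentation sketch: exchange the two objects'
    positive reason data vector groups and their negative groups, then
    flip the winner label to match the swapped order."""
    pos1, neg1, pos2, neg2 = groups
    swapped = [pos2, neg2, pos1, neg1]
    new_label = {0: 1, 1: 0, 2: 2}[label]  # tie stays a tie
    return np.concatenate(swapped), new_label

# Toy reason data vector groups for two objects.
groups = [np.array([1.0, 1.0]), np.array([2.0, 2.0]),
          np.array([3.0, 3.0]), np.array([4.0, 4.0])]
vec, lab = swap_augment(groups, 0)
# vec -> [3, 3, 4, 4, 1, 1, 2, 2]; lab -> 1
```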
The disclosed training method further comprises the following steps before step (C):
(I) obtaining, through the computer device, at least one selected reason data from the reason data corresponding to a selected natural text chosen from the natural texts;
(J) for each selected reason data, rewriting, through the computer device, the selected reason data into rewritten reason data using a synonymous rewriting algorithm;
(K) for each rewritten reason data, obtaining, through the computer device, a rewritten vector group corresponding to the rewritten reason data;
(L) concatenating, by the computer device, the at least one rewritten vector group and the unselected reason data vector groups of the selected natural text into another equivalent vector group according to the first sequence; and
in step (C), the computer device obtains the decision model using a supervised classification algorithm not only according to the equivalent vector group and target decision result corresponding to each natural text, but also according to the other equivalent vector group and the target decision result corresponding to the selected natural text.
In the disclosed training method, each natural text is related to an event and includes a plurality of pre-marked document type data, and before step (C) the method further comprises:
(M) for each document type data corresponding to each natural text, converting the document type data into a document vector by the computer device; and
in step (B), for each natural text, the computer device concatenates the corresponding reason data vector groups and document vectors of the natural text into the equivalent vector group according to the first sequence.
In step (M), the document type data corresponding to each natural text includes location information related to the occurrence of the event.
In the disclosed training method, each natural text further includes a plurality of pre-marked object background data related to an object to be described by the natural text, and before step (C) the method further comprises:
(N) for each object background data corresponding to each natural text, converting the object background data into an object vector by the computer device; and
in step (B), for each natural text, the computer device concatenates the corresponding reason data vector groups and object vectors of the natural text into the equivalent vector group according to the first sequence.
In step (N), the object background data corresponding to each object of each natural text includes object gender information.
The beneficial effects of the invention are as follows: the computer device converts the pre-marked reason data of at least one object to be described in each natural text into corresponding reason data vector groups, training on "sentence" or "paragraph" units of the natural text so that the resulting vectors preserve the substantive meaning of the text. The decision model trained from the equivalent vector group and target decision result corresponding to each natural text can then effectively improve the accuracy of decision prediction without defining an additional option-category questionnaire.
Drawings
Other features and effects of the present invention will become apparent from the following detailed description of the embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram illustrating a computer device that performs an embodiment of the present invention of a method for training natural language corpora for machine learning decision models;
FIG. 2 is a flow chart illustrating a standard training procedure of the embodiment;
FIG. 3 is a flow chart illustrating a neutral data augmentation training procedure of the embodiment;
FIG. 4 is a flowchart illustrating a swap data augmentation training process of the present embodiment; and
FIG. 5 is a flowchart illustrating a modified data augmentation training procedure of the embodiment.
Detailed Description
Before the present invention is described in detail, it should be noted that in the following description, similar components are denoted by the same reference numerals.
Referring to fig. 1, a computer device 1 for executing the training method of the natural language corpus for the machine learning decision model according to an embodiment of the present invention includes a storage module 11, a display module 12, and a processing module 13 electrically connected to the storage module 11 and the display module 12.
The storage module 11 stores a plurality of natural texts, each marked with a target decision result among the plurality of decision results. Each natural text is related to an event and includes a plurality of pre-marked reason data related to at least one object to be described by the natural text, a plurality of pre-marked neutral data unrelated to any object to be described by the natural text, a plurality of pre-marked object background data related to the object(s) to be described, and a plurality of pre-marked document type data.
Each reason data corresponding to each natural text is content in the natural text that can be marked as describing a positive (favorable) or negative (unfavorable) meaning for an object to be described. Content with positive meaning for an object of the natural text is positive reason data, and content with negative meaning for an object is negative reason data.
Each neutral data corresponding to each natural text is content in the natural text that cannot be marked as describing, relative to the target decision result, a positive (favorable) or negative (unfavorable) meaning for any of the at least one object to be described.
The object background data corresponding to each object of each natural text is background information about that object in the natural text. For example, the object background data include, but are not limited to, object gender information, object occupation information, object nationality information, object place-of-residence information, object personality trait information, object prior-record information, object age information, object income information, object study time information, object mood description information, and object growth environment description information.
The document type data corresponding to each natural text are data in the natural text that cannot be classified as reason data, neutral data, or object background data. For example, the document type data include, but are not limited to, time information related to the occurrence of the event, location information related to the occurrence of the event, publishing location information of the natural text, writer information of the natural text, source information of the natural text, the unit responsible for the event, the law applied by a judge in deciding the event, and a medical classification of the event.
Specifically, each natural text may include pre-marked reason data related to a single object to be described, such as student admission data. Each student admission data (natural text) includes reason data concerning family background in the applicant's personal statement, reason data concerning personal interests, reason data concerning academic direction, reason data concerning teamwork experience, and reason data concerning the future study plan, and the decision results corresponding to each natural text include an admission result indicating that the applicant is admitted to the school and a non-admission result indicating that the applicant is not admitted, but the examples are not limited thereto.
Alternatively, each natural text may be a medical record including pre-marked reason data related to a single object to be described. Each medical record (natural text) includes reason data associated with a first physiological symptom of the subject and reason data associated with a second physiological symptom of the subject, and the decision results corresponding to each natural text include a decision result indicating that the medical record belongs to a first disease and a decision result indicating that it belongs to a second disease, but the examples are not limited thereto.
For example, when the natural text is a medical record, the content "sneezing, runny nose, stuffy nose, headache, dizziness, sore throat, heavy cough, much white sputum and slight fever" is marked as reason data for a symptom of the first physiological part; the content "the patient had a poor appetite, then continuous weight loss, a constant urge to defecate, bleeding, abdominal distension, and excessive gas" is marked as reason data for a symptom of the second physiological part; and the content "after taking the medicine, please avoid dangerous activities that can easily cause injury, such as driving or operating machinery" is marked as neutral data unrelated to the disease to be described.
Specifically, each natural text may include pre-marked reason data for a plurality of objects to be described, for example a court decision (e.g., descriptions favorable or unfavorable to the applicant and to the opposing party) or a news commentary article (e.g., descriptions favorable or unfavorable to a first political party and to a second political party). Such a natural text includes first positive reason data having positive meaning for a first object to be described, first negative reason data having negative meaning for the first object, second positive reason data having positive meaning for a second object to be described, and second negative reason data having negative meaning for the second object. The decision results for each such natural text include a winning result indicating that the first object wins, a losing result indicating that the first object loses, and a tie result indicating that the first and second objects tie, but the examples are not limited thereto.
For example, when the natural text is a custody decision, the content "the applicant is stable in parenting ability, educational ability, and support system; since the child was born, the applicant has been the child's primary caregiver; the applicant also has a strong desire for custody; and during visits the child interacts naturally with the applicant and shows a stable parent-child attachment" is marked as having positive meaning for the object to be described and serves as positive reason data; the content "according to the theory of domestic violence prevention, a person with a history of violence is less suitable to care for underage children" is marked as having negative meaning for the object to be described and serves as negative reason data.
In this embodiment, the implementation of the computer device 1 is, for example, a personal computer, but not limited thereto.
The operation details of the computer device 1 are described below with reference to the embodiment of the training method of the natural language corpus for the machine learning decision model. The training method of the invention includes a standard training procedure, a neutral data augmentation training procedure, a swap data augmentation training procedure, and a modified data augmentation training procedure.
Referring to fig. 2, the standard training procedure is applied to a natural text having a plurality of pieces of reason data of at least one object to be described, and is trained by using a plurality of natural texts stored in the storage module 11, and includes steps 50 to 55.
In step 50, for each reason data corresponding to each natural text, the processing module 13 uses a preprocessing algorithm to perform word segmentation (tokenization), stopword removal, stemming, part-of-speech (POS) tagging, named entity recognition (NER), and N-gram generation on the reason data, so as to obtain preprocessed data corresponding to the reason data. It should be noted that the preprocessing algorithm used for a Chinese corpus is the Jieba word segmentation toolkit in Python, and the one used for an English corpus is the Natural Language Toolkit (NLTK) in Python, but they are not limited thereto.
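The preprocessing stage above can be sketched in Python. This is a minimal stand-in with a toy stopword list and regex tokenization; in practice the text names Jieba (Chinese) or NLTK (English), which also provide the stemming, POS-tagging, and NER stages.

```python
import re

# Toy stopword list for illustration only; NLTK ships a full one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is"}

def preprocess(text, n=2):
    """Tokenize, lowercase, drop stopwords, and emit word n-grams,
    mirroring the tokenize / remove-stopwords / N-gram steps of step 50."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens, ngrams

tokens, bigrams = preprocess("The applicant is the primary caregiver of the child")
# tokens -> ['applicant', 'primary', 'caregiver', 'child']
```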
In step 51, for each preprocessed reason data corresponding to each natural text, the processing module 13 uses a sentence-to-vector algorithm to obtain the reason data vector group (a multi-dimensional vector) corresponding to the preprocessed reason data. It should be noted that the sentence-to-vector algorithm used is the Doc2Vec algorithm, but is not limited thereto.
It is noted that the reason data vector groups can also be obtained using a speech-to-vector algorithm. In detail, the processing module 13 can use machine reading (text-to-speech) to convert each reason data corresponding to each natural text into reason voice data, and then obtain the reason data vector group corresponding to the reason voice data using the Speech2Vec algorithm. In addition, the use and training of the various preprocessing algorithms, the Doc2Vec algorithm, and the Speech2Vec algorithm are prior art and not the focus of the invention, and are not described further here.
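The sentence-to-vector step can be illustrated with a toy normalized bag-of-words stand-in. The patent names Doc2Vec (available in e.g. gensim), which learns dense paragraph embeddings rather than sparse counts, but the interface is the same: each sentence or paragraph maps to one fixed-length vector.

```python
import numpy as np

def build_vocab(sentences):
    """Assign each distinct word an index (toy vocabulary)."""
    vocab = sorted({w for s in sentences for w in s.split()})
    return {w: i for i, w in enumerate(vocab)}

def sentence_vector(sentence, vocab):
    """Normalized bag-of-words vector: a toy stand-in for the Doc2Vec
    paragraph vector produced for each reason data in step 51."""
    v = np.zeros(len(vocab))
    for w in sentence.split():
        if w in vocab:
            v[vocab[w]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

corpus = ["stable parenting ability", "history of family violence"]
vocab = build_vocab(corpus)
vec = sentence_vector("stable parenting ability", vocab)
```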
In step 52, for each document type data corresponding to each natural text, the processing module 13 converts the document type data into a document vector.
In step 53, for each object background data corresponding to each natural text, the processing module 13 converts the object background data into an object vector.
It should be noted that the processing module 13 converts each document type data and each object background data into the corresponding document vector and object vector through a mapping table predefined by the user.
In step 54, for each natural text, the processing module 13 concatenates the reason data vector groups, the document vector, and the object vector corresponding to the natural text into a first equivalent vector group according to a user-definable first sequence. In other embodiments, the first equivalent vector group may comprise only the reason data vector groups; or the reason data vector groups and the document vector; or the reason data vector groups and the object vector.
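Step 54's concatenation, together with the mapping-table conversion of steps 52 and 53, might look like the following sketch. The table keys and vector values are hypothetical placeholders, not values from the patent.

```python
import numpy as np

# Hypothetical user-defined mapping tables (steps 52-53) converting
# categorical document/object data into vectors.
DOC_TABLE = {"district_court": [1.0, 0.0], "high_court": [0.0, 1.0]}
OBJ_TABLE = {"male": [1.0, 0.0], "female": [0.0, 1.0]}

def equivalent_vector(reason_vecs, doc_key, obj_key):
    """Concatenate the reason data vector groups, the document vector,
    and the object vector in one fixed (first-sequence) order."""
    parts = list(reason_vecs) + [DOC_TABLE[doc_key], OBJ_TABLE[obj_key]]
    return np.concatenate([np.asarray(p, dtype=float) for p in parts])

# e.g. a positive and a negative reason data vector group of length 4 each
reason_vecs = [np.ones(4), np.zeros(4)]
ev = equivalent_vector(reason_vecs, "district_court", "female")
# ev has length 4 + 4 + 2 + 2 = 12
```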
Specifically, when each natural text is student admission data containing pre-marked reason data related to a single object to be described, for each natural text the processing module 13 concatenates, in the user-defined first sequence, the reason data vector groups corresponding to the family background, personal interests, academic direction, teamwork experience, and future study plan in the applicant's personal statement, together with the document vector and the object vector, into the first equivalent vector group.
Specifically, when each natural text is a court decision containing pre-marked reason data of a plurality of objects to be described, for each natural text the processing module 13 concatenates, in order, the first positive reason data vector group obtained by the sentence-to-vector algorithm from the first positive reason data, the first negative reason data vector group obtained from the first negative reason data, the second positive reason data vector group obtained from the second positive reason data, the second negative reason data vector group obtained from the second negative reason data, the document vector, and the object vector into the first equivalent vector group. In other words, every natural text has its reason data vector groups, document vector, and object vector concatenated into the first equivalent vector group in the same user-definable first sequence, and the examples are not limited thereto.
In step 55, the processing module 13 uses a supervised classification algorithm, at least according to the first equivalent vector group and the target decision result corresponding to each natural text, to obtain a decision model for marking an unmarked natural text to be decided as one of the decision results. The processing module 13 can mark the natural text to be decided with a classification result among the decision results and display the result on the display module 12. It should be noted that the supervised classification algorithm used is an artificial neural network (ANN), but is not limited thereto.
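Step 55's supervised training can be sketched with a minimal perceptron standing in for the artificial neural network named above; the toy equivalent vectors and labels are illustrative only.

```python
import numpy as np

def train_perceptron(X, y, epochs=50, lr=0.1):
    """Minimal supervised classifier (perceptron) standing in for the
    ANN of step 55: learn weights separating two decision results."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return w, b

def predict(w, b, X):
    """Mark each (equivalent-vector) row with decision result 0 or 1."""
    return (X @ w + b > 0).astype(int)

# Toy first equivalent vector groups labeled with two decision results.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([1, 1, 0, 0])
w, b = train_perceptron(X, y)
```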
It should be noted that, for each natural text, the processing module 13 may also store the reason data vector groups corresponding to the natural text as a reason data vector data set in any storage device, so that a future user can directly execute steps 54 and 55 on any computer device using that data set to obtain the decision model.
Referring to fig. 3, the neutral data augmentation training procedure is applied to a natural text having reason data of at least one object to be described, and uses the natural texts stored in the storage module 11 to generate a new equivalent vector set different from the first equivalent vector set corresponding to the natural text, so as to augment the vectors required for training the decision model. It includes steps 60 to 64 and 55.
In step 60, for each neutral data corresponding to each natural text, the processing module 13 obtains neutral preprocessed data corresponding to the neutral data by using a preprocessing algorithm.
In step 61, for each neutral preprocessed data corresponding to each natural text, the processing module 13 obtains a neutral vector set corresponding to the neutral preprocessed data by using the sentence-to-vector algorithm. Similarly, the processing module 13 can also convert each neutral data corresponding to each natural text into neutral speech data by machine reading, and obtain the neutral vector set corresponding to the neutral speech data by using a speech-to-vector algorithm.
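The patent does not fix a particular sentence-to-vector algorithm; as a hedged toy stand-in, the sketch below averages per-word vectors from a hand-made embedding table (a real embodiment would use a trained sentence-embedding model, and the table contents here are invented for illustration):

```python
import numpy as np

# Toy word-embedding table; real systems would load trained embeddings.
EMB = {"judge": [1.0, 0.0], "ruled": [0.0, 1.0], "today": [1.0, 1.0]}

def sentence_vector(words, emb):
    # Mean-of-word-vectors: one simple way to turn a sentence into a vector.
    return np.mean([emb[w] for w in words], axis=0)

vec = sentence_vector(["judge", "ruled", "today"], EMB)
```

Whatever algorithm is chosen, the key property used later is that each sentence (reason data, neutral data, or rewrite) maps to a fixed-length vector that can be concatenated and averaged.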
In step 62, the processing module 13 obtains at least one selected reason data vector set according to the reason data vector sets corresponding to a first selected natural text selected from the natural texts.
In step 63, for each selected reason data vector set, the processing module 13 obtains a recombined reason data vector set associated with the selected reason data vector set according to the selected reason data vector set and any neutral vector set corresponding to any of the natural texts. Specifically, the processing module 13 averages the selected reason data vector set with any of the neutral vector sets to obtain the recombined reason data vector set.
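Since the recombination is stated to be an element-wise average, the step can be sketched in a few lines (the function name and vector values are illustrative):

```python
import numpy as np

def recombine(selected_vec, neutral_vec):
    # Step 63: average a selected reason-data vector set with any neutral
    # vector set to produce a recombined reason-data vector set.
    return (selected_vec + neutral_vec) / 2.0

sel = np.array([1.0, 0.0, 2.0])  # selected reason-data vector (toy values)
neu = np.array([0.0, 2.0, 2.0])  # neutral vector (toy values)
out = recombine(sel, neu)
```

Because the neutral data does not affect the target decision result, the recombined vector inherits the label of the selected natural text, which is what makes this usable as augmentation.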
In step 64, the processing module 13 combines, according to the first order, the at least one recombined reason data vector set, the reason data vector sets of the first selected natural text that were not selected, the document vector corresponding to the first selected natural text, and the object vector corresponding to the first selected natural text into a second equivalent vector set. In other embodiments, the second equivalent vector set may include only the at least one recombined reason data vector set and the unselected reason data vector sets of the first selected natural text; or it may include the at least one recombined reason data vector set, the unselected reason data vector sets, and the document vector corresponding to the first selected natural text; or it may include the at least one recombined reason data vector set, the unselected reason data vector sets, and the object vector corresponding to the first selected natural text.
It should be particularly noted that the neutral vector sets are converted from neutral preprocessed data on the premise that the neutral preprocessed data does not affect the target decision result (the classification result) of any natural text; therefore, the target decision result corresponding to the second equivalent vector set is the target decision result corresponding to the first selected natural text.
Specifically, when each natural text is the court decision containing the pre-labeled reason data of a plurality of objects to be described, in step 62 the processing module 13 uses the first positive reason data vector set and the first negative reason data vector set corresponding to the first selected natural text as the at least one selected reason data vector set; next, in step 63, the processing module 13 obtains two recombined reason data vector sets, respectively corresponding to the first positive reason data vector set and the first negative reason data vector set, according to those two sets and any neutral vector set; next, in step 64, the processing module 13 concatenates, according to the first order, the recombined reason data vector set corresponding to the first positive reason data vector set, the recombined reason data vector set corresponding to the first negative reason data vector set, the second positive reason data vector set corresponding to the first selected natural text, the second negative reason data vector set corresponding to the first selected natural text, the document vector corresponding to the first selected natural text, and the object vector corresponding to the first selected natural text, thereby generating a second equivalent vector set different from the first equivalent vector set corresponding to the natural text. The target decision result corresponding to the second equivalent vector set is the target decision result corresponding to the first selected natural text; therefore, the second equivalent vector set and its corresponding target decision result can be used as a new training data.
Finally, in step 55 of the standard training procedure, the processing module 13 can obtain the decision model by using a supervised classification algorithm according not only to the first equivalent vector set and the target decision result corresponding to each natural text, but also to the second equivalent vector set and the target decision result corresponding to the first selected natural text. Similarly, the processing module 13 may also use the at least one recombined reason data vector set, the unselected reason data vector sets of the first selected natural text, and the target decision result corresponding to the first selected natural text as a neutral augmented data set, and store the reason data vector data set and the neutral augmented data set in any storage device, so that a future user can directly execute steps 64 and 55 on any computer device according to those data sets in the storage device to obtain the decision model.
Referring to fig. 4, the exchange data augmentation training procedure is applied to a natural text having reason data of a plurality of objects to be described, and uses the natural texts stored in the storage module 11 to generate a new equivalent vector set different from the first equivalent vector set corresponding to the natural text, so as to augment the vectors required for training the decision model. It includes steps 70 to 71 and 55.
In step 70, the processing module 13 combines, according to a second order, the positive reason data vector set and the negative reason data vector set corresponding to each object in a second selected natural text selected from the natural texts, the document vector corresponding to the second selected natural text, and the object vector corresponding to the second selected natural text into a third equivalent vector set, where the second order is obtained from the first order by exchanging the positions of the two positive reason data vector sets corresponding to different selected objects and exchanging the positions of the two negative reason data vector sets corresponding to different selected objects. In other embodiments, the third equivalent vector set may include only the corresponding reason data vector sets in the second selected natural text; or it may include those reason data vector sets and the document vector corresponding to the second selected natural text; or it may include those reason data vector sets and the object vector corresponding to the second selected natural text.
Take four reason data vector sets (two objects) as an example: the processing module 13 concatenates the first positive reason data vector set, the first negative reason data vector set, the second positive reason data vector set, and the second negative reason data vector set corresponding to the second selected natural text into the third equivalent vector set according to the second order, where the second order swaps the first positive reason data vector set with the second positive reason data vector set and swaps the first negative reason data vector set with the second negative reason data vector set.
In detail, the first positive reason data vector set of the first equivalent vector set and the second positive reason data vector set of the third equivalent vector set corresponding to the second selected natural text both represent the positive reason data vector set associated with a first party; the first negative reason data vector set of the first equivalent vector set and the second negative reason data vector set of the third equivalent vector set both represent the negative reason data vector set associated with the first party; the second positive reason data vector set of the first equivalent vector set and the first positive reason data vector set of the third equivalent vector set both represent the positive reason data vector set associated with a second party; and the second negative reason data vector set of the first equivalent vector set and the first negative reason data vector set of the third equivalent vector set both represent the negative reason data vector set associated with the second party. By changing the order, the first positive reason data vector set originally corresponding to the first party becomes the second positive reason data vector set, and the first negative reason data vector set originally corresponding to the first party becomes the second negative reason data vector set; likewise, the second positive reason data vector set originally corresponding to the second party becomes the first positive reason data vector set, and the second negative reason data vector set originally corresponding to the second party becomes the first negative reason data vector set, so as to generate the third equivalent vector set.
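To make the exchange concrete, here is a minimal sketch with placeholder labels standing in for the vector groups (the function name is an illustrative assumption):

```python
def swap_parties(first_order):
    # Swap the two parties' positive groups and their negative groups:
    # [P1+, P1-, P2+, P2-, DOC, OBJ] -> [P2+, P2-, P1+, P1-, DOC, OBJ].
    pos1, neg1, pos2, neg2, doc, obj = first_order
    return [pos2, neg2, pos1, neg1, doc, obj]

first_order = ["P1+", "P1-", "P2+", "P2-", "DOC", "OBJ"]
second_order = swap_parties(first_order)
```

The document and object vectors keep their positions; only the party-specific reason vector groups trade places, which is why the label must be corrected in the next step.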
Take six reason data vector sets (three objects) as an example: the processing module 13 first obtains, in the same manner, the first positive, first negative, second positive, second negative, third positive, and third negative reason data vector sets corresponding to another second selected natural text selected from the natural texts; the processing module 13 then concatenates the reason data vector sets corresponding to that natural text into another third equivalent vector set according to a third order, where the third order is obtained from the first order by exchanging the positions of the two positive reason data vector sets corresponding to two different selected objects and exchanging the positions of the two negative reason data vector sets corresponding to those objects.
In step 71, the processing module 13 obtains the target decision result corresponding to the third equivalent vector set according to the target decision result corresponding to the second selected natural text. Similarly, the processing module 13 may also use the converted third equivalent vector set and its corresponding target decision result as an exchange augmented data set, and store the reason data vector data set and the exchange augmented data set in any storage device, so that a future user can directly execute step 55 on any computer device according to those data sets in the storage device to obtain the decision model.
Take four reason data vector sets (two objects) as an example: the target decision result corresponding to the second selected natural text is a win-or-lose result concerning the first party and the second party. When the target decision result corresponding to the second selected natural text indicates that the first party wins, the target decision result corresponding to the third equivalent vector set is modified to indicate that the second party wins; when it indicates that the second party wins, the target decision result corresponding to the third equivalent vector set is modified to indicate that the first party wins; and when it indicates a tie between the two parties, the target decision result is left unchanged.
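The label correction for the two-party case is a simple mapping; as a sketch (the result names are invented placeholders, not the patent's labels):

```python
def flip_label(result):
    # After the parties' descriptions are swapped, the winner swaps too;
    # a tie is unaffected by the exchange.
    return {"first_wins": "second_wins",
            "second_wins": "first_wins",
            "tie": "tie"}[result]
```

A usage check: `flip_label("first_wins")` yields `"second_wins"`, matching the rule that the swapped text's label must follow the swapped parties.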
Taking six reason data vector sets (three objects) as an example: if the target decision result corresponding to the other second selected natural text indicates that the first party wins or the second party wins, the winner is changed from the first party to the second party, or from the second party to the first party, and this serves as the target decision result corresponding to the other third equivalent vector set; but if the target decision result indicates that the third party wins, no correction is needed, and the target decision result corresponding to the other second selected natural text is used directly as the target decision result corresponding to the other third equivalent vector set.
In particular, when each natural text is the court decision containing the pre-labeled reason data of the objects to be described, the first positive reason data vector set of the first equivalent vector set (corresponding to the second selected natural text) represents a positive description related to the first party (e.g., the applicant), the first negative reason data vector set represents a negative description related to the first party, the second positive reason data vector set represents a positive description related to the second party (e.g., the opposing party), and the second negative reason data vector set represents a negative description related to the second party. After the swapping of step 70, the second positive reason data vector set of the third equivalent vector set represents the positive description related to the first party, the second negative reason data vector set represents the negative description related to the first party, the first positive reason data vector set represents the positive description related to the second party, and the first negative reason data vector set represents the negative description related to the second party; in this way a third equivalent vector set different from the first equivalent vector set corresponding to the natural text can be generated. In addition, the target decision result is corrected in step 71, so the third equivalent vector set and its corresponding target decision result can be used as a new training data.
In other words, step 70 exchanges the applicant's positive and negative descriptions in the court decision (corresponding to the second selected natural text) with those of the opposing party, generating a new court decision (the third equivalent vector set). Because both the positive and negative descriptions are exchanged, when the original decision determines that the applicant wins, the decision result is changed in step 71 so that the opposing party wins; similarly, when the original decision determines that the opposing party wins, the decision result of the generated new court decision is changed in step 71 so that the applicant wins; and when the original decision is a tie, the decision result of the generated new court decision remains the original result in step 71.
Finally, in step 55 of the standard training procedure, the processing module 13 can obtain the decision model by using a supervised classification algorithm according not only to the first equivalent vector set and the target decision result corresponding to each natural text, but also to the third equivalent vector set and the target decision result corresponding to the third equivalent vector set.
It should be noted that when the natural texts belong to a class such as student personal statements or medical case histories, which involves no comparison of multiple objects and has no "positive or negative" reason data, the exchange data augmentation training procedure cannot be used, and the decision model can only be trained using the neutral data augmentation training procedure and the rewriting data augmentation training procedure.
Referring to fig. 5, the rewriting data augmentation training procedure is applied to a natural text having reason data of at least one object to be described, and uses the natural texts stored in the storage module 11 to generate a new equivalent vector set different from the first equivalent vector set corresponding to the natural text, so as to augment the vectors required for training the decision model. It includes steps 80 to 84 and 55.
In step 80, the processing module 13 obtains at least one selected reason data according to the reason data corresponding to a third selected natural text selected from the natural texts.
In step 81, for each selected reason data, the processing module 13 rewrites the selected reason data into reason data rewrite data corresponding to the selected reason data by using a synonymous rewriting algorithm. In other embodiments, the processing module 13 may also use computer translation to translate the selected reason data into any foreign language (e.g., English) and then translate it back into the original language (Chinese in this embodiment) to obtain the reason data rewrite data. It should be noted that the synonymous rewriting algorithm used in this embodiment is EDA NLP for Chinese, but it is not limited thereto.
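The embodiment names the EDA NLP for Chinese toolkit; rather than guess that toolkit's API, the sketch below shows the underlying idea of synonym replacement with a hand-made dictionary (the dictionary, tokens, and function name are all invented for illustration):

```python
import random

# Toy synonym table; EDA-style tools draw replacements from a real thesaurus.
SYNONYMS = {"often": ["frequently"], "cares for": ["looks after"]}

def synonym_rewrite(tokens, synonyms, rng):
    # Replace each token that has a synonym entry with a random synonym,
    # changing the wording but not the meaning.
    out = []
    for tok in tokens:
        choices = synonyms.get(tok)
        out.append(rng.choice(choices) if choices else tok)
    return out

rng = random.Random(0)
tokens = ["the", "mother", "often", "cares for", "the", "child"]
rewritten = synonym_rewrite(tokens, SYNONYMS, rng)
```

The rewritten sentence differs only in surface wording, which is precisely why the target decision result can be carried over unchanged in the later steps.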
In step 82, for each reason data rewrite data, the processing module 13 obtains rewrite preprocessed data corresponding to the reason data rewrite data by using a preprocessing algorithm.
In step 83, for each rewrite preprocessed data, the processing module 13 obtains a rewrite vector set corresponding to the rewrite preprocessed data by using the sentence-to-vector algorithm. Similarly, the processing module 13 can also use machine reading to convert each reason data rewrite data corresponding to each natural text into rewrite speech data, and use a speech-to-vector algorithm to obtain the rewrite vector set corresponding to the rewrite speech data.
In step 84, the processing module 13 combines, according to the first order, the at least one rewrite vector set, the reason data vector sets of the third selected natural text that were not selected, the document vector corresponding to the third selected natural text, and the object vector corresponding to the third selected natural text into a fourth equivalent vector set. In other embodiments, the fourth equivalent vector set may include only the at least one rewrite vector set and the unselected reason data vector sets of the third selected natural text; or it may include the at least one rewrite vector set, the unselected reason data vector sets, and the document vector corresponding to the third selected natural text; or it may include the at least one rewrite vector set, the unselected reason data vector sets, and the object vector corresponding to the third selected natural text.
It should be noted that, in this embodiment, rewriting each selected reason data corresponding to the third selected natural text with the synonymous rewriting algorithm only produces a difference in wording and does not change the semantic meaning itself. In other embodiments, translating each selected reason data into any foreign language by computer translation and then translating it back into the original language likewise only produces a difference in wording without changing the semantic meaning. Therefore, on the premise that the semantic meaning is unchanged and the target decision result of the corresponding natural text (its classification result) is unaffected, the target decision result corresponding to the fourth equivalent vector set, which includes the at least one rewrite vector set converted from the at least one rewrite preprocessed data, is reasonably the same as the target decision result corresponding to the third selected natural text.
Specifically, when each natural text is the court decision containing the pre-labeled reason data of the objects to be described, in step 80 the processing module 13 uses the first positive reason data and the first negative reason data corresponding to the third selected natural text as the at least one selected reason data; then, in step 81, the processing module 13 obtains first positive reason data rewrite data and first negative reason data rewrite data respectively corresponding to them by using the synonymous rewriting algorithm, where each rewrite data differs from the corresponding original only in wording, with the semantic meaning unchanged; then, in steps 82 and 83, the processing module 13 obtains a first positive rewrite vector set and a first negative rewrite vector set respectively corresponding to the first positive and first negative reason data rewrite data, where the rewrite vector set corresponding to each rewrite preprocessed data differs from the reason data vector set corresponding to the selected reason data before rewriting (different sentences convert to different vector sets); next, in step 84, the processing module 13 concatenates, according to the first order, the first positive rewrite vector set, the first negative rewrite vector set, the second positive reason data vector set corresponding to the third selected natural text, the second negative reason data vector set corresponding to the third selected natural text, the document vector corresponding to the third selected natural text, and the object vector corresponding to the third selected natural text, thereby generating the fourth equivalent vector set different from the first equivalent vector set corresponding to the natural text.
The target decision result corresponding to the fourth equivalent vector set is the target decision result corresponding to the third selected natural text; therefore, the fourth equivalent vector set and its corresponding target decision result can be used as a new training data.
Finally, in step 55 of the standard training procedure, the processing module 13 can obtain the decision model by using a supervised classification algorithm according not only to the first equivalent vector set and the target decision result corresponding to each natural text, but also to the fourth equivalent vector set and the target decision result corresponding to the third selected natural text. Similarly, the processing module 13 may also use the at least one rewrite vector set, the unselected reason data vector sets of the third selected natural text, and the target decision result corresponding to the third selected natural text as a rewrite augmented data set, and store the reason data vector data set and the rewrite augmented data set in any storage device, so that a future user can directly execute steps 84 and 55 on any computer device according to those data sets in the storage device to obtain the decision model.
In summary, the training method of the invention for machine learning decision models using natural language corpora is applicable to many different types of corpora. Through the standard training procedure, the pre-labeled reason data of each natural text is converted into vectors and used as training data, and the trained decision model achieves better accuracy without requiring an additional multiple-choice questionnaire to be defined. In addition, the neutral data augmentation training procedure, the exchange data augmentation training procedure, and the rewriting data augmentation training procedure can augment the required training data to compensate for the low machine learning efficiency caused by insufficient original data. The exchange data augmentation training procedure can also effectively mitigate the misleading of the decision process caused by biased sampling of the original training data, thereby compensating for the bias that biased data may introduce into ordinary machine learning, and better meeting the requirements of social fairness and justice when machine learning is applied to decision judgment. Therefore, the object of the present invention is achieved.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the scope of the invention; any person skilled in the art can make further modifications and variations without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined by the claims of the present application.

Claims (12)

1. A method for training a machine learning decision model using natural language corpus, implemented by a computer device, wherein the computer device stores a plurality of natural texts, each natural text being labeled with a target decision result among a plurality of decision results and comprising a plurality of pre-labeled reason data related to at least one object to be described in the natural text, the method comprising the following steps:
a: for each piece of reason data corresponding to each natural text, obtaining a reason data vector group corresponding to the reason data through the computer device according to the reason data;
b: for each natural text, connecting and combining the reason data vector groups corresponding to the natural text into an equivalent vector group according to a first order through the computer device; and
c: obtaining, by the computer device using a supervised classification algorithm, a decision model for labeling an unlabeled natural text to be decided as one of the decision results, according to at least the equivalent vector group corresponding to each natural text and the target decision result corresponding to each natural text.
2. The method for training a machine learning decision model using natural language corpus according to claim 1, wherein step A comprises the following steps:
a-1: for each piece of reason data corresponding to each natural text, acquiring, by the computer device, reason data preprocessing data corresponding to the reason data by using a preprocessing algorithm according to the reason data; and
a-2: for each reason data preprocessing data corresponding to each natural text, obtaining the reason data vector group corresponding to the reason data preprocessing data through the computer device by using a sentence-to-vector algorithm.
3. The method for training a machine learning decision model using natural language corpus according to claim 1, wherein step A comprises the following steps:
a-1: for each reason data corresponding to each natural text, obtaining, by the computer device using machine reading according to the reason data, reason speech data corresponding to the reason data; and
a-2: for each reason speech data corresponding to each natural text, obtaining the reason data vector group corresponding to the reason speech data through the computer device by using a speech-to-vector algorithm.
4. The method for training a machine learning decision model using natural language corpus according to claim 1, wherein each natural text further includes a plurality of pre-labeled neutral data not related to any object to be described in the natural text, and wherein before step C the method further includes the following steps:
d: for each neutral data corresponding to each natural text, obtaining a neutral vector group corresponding to the neutral data through the computer device according to the neutral data;
e: obtaining, by the computer device, at least one selected reason data vector group according to the reason data vector groups corresponding to a selected natural text selected from the natural texts;
f: for each selected reason data vector group, obtaining, by the computer device, a recombined reason data vector group associated with the selected reason data vector group according to the selected reason data vector group and any neutral vector group corresponding to any of the natural texts; and
g: combining, by the computer device according to the first order, the at least one recombined reason data vector group and the reason data vector groups of the selected natural text that were not selected into another equivalent vector group; and
wherein in step C, the decision model is obtained by the computer device using a supervised classification algorithm according not only to the equivalent vector group corresponding to each natural text and the target decision result corresponding to each natural text, but also to the other equivalent vector group and the target decision result corresponding to the selected natural text.
5. The method for training a machine learning decision model using natural language corpus according to claim 1, wherein the reason data corresponding to each natural text comprises positive reason data having a positive meaning for each object of the natural text and negative reason data having a negative meaning for each object of the natural text, wherein:
in step A, the reason data vector groups corresponding to each natural text comprise a positive reason data vector group converted from the positive reason data of each object and a negative reason data vector group converted from the negative reason data of each object; and
in step B, for each natural text, the computer device combines the positive reason data vector group and the negative reason data vector group corresponding to each object of the natural text into the equivalent vector group according to the first order.
6. The method for training a machine learning decision model with a natural language corpus according to claim 5, further comprising, after step B, the following steps:
h: connecting and combining, by the computer device, the rationale data vector sets corresponding to a selected natural text selected from the natural texts into another equivalent vector set according to a second order, wherein the second order is obtained from the first order by exchanging the positions of the positive rationale data vector sets respectively corresponding to two different selected objects and by exchanging the positions of the negative rationale data vector sets respectively corresponding to the two different selected objects;
i: obtaining, by the computer device, the target decision result corresponding to the another equivalent vector set of step H according to the target decision result corresponding to the selected natural text of step H; and
in step C, the computer device obtains the decision model by using the supervised classification algorithm according to the equivalent vector set corresponding to each natural text and the target decision result corresponding to each natural text, and also according to the another equivalent vector set and the target decision result corresponding to the another equivalent vector set.
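The order-swap augmentation of steps H–I can be sketched as follows. The layout assumed here (all positive sets first, then all negative sets) and the function names are illustrative; the claim fixes only that the positive sets of two objects trade places and, likewise, their negative sets.

```python
# Sketch of step H: build a second order from the first order by swapping
# the positive rationale vector sets of objects i and j, and likewise
# their negative sets, then concatenating into another equivalent vector set.

def swap_order(pos_sets, neg_sets, i, j):
    """Concatenate in a second order with objects i and j exchanged."""
    pos = list(pos_sets)
    neg = list(neg_sets)
    pos[i], pos[j] = pos[j], pos[i]   # swap positive sets of the two objects
    neg[i], neg[j] = neg[j], neg[i]   # swap negative sets of the two objects
    equivalent = []
    for s in pos + neg:               # assumed layout: positives then negatives
        equivalent.extend(s)
    return equivalent

# two objects, one positive and one negative vector set each
swapped = swap_order([[1.0], [2.0]], [[3.0], [4.0]], 0, 1)
```

Per step I, the label for the swapped sample is derived from the original text's target decision result (for instance, a decision "in favor of object A" would become "in favor of object B" after the swap).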
7. The method for training a machine learning decision model with a natural language corpus according to claim 1, further comprising, before step C, the following steps:
i: obtaining, by the computer device, at least one selected rationale datum from among the rationale data corresponding to a selected natural text selected from the natural texts;
j: for each selected rationale datum, rewriting, by the computer device, the selected rationale datum into a rewritten rationale datum corresponding to the selected rationale datum by using a synonymous rewriting algorithm;
k: for each rewritten rationale datum, obtaining, by the computer device, a rewritten vector set corresponding to the rewritten rationale datum according to the rewritten rationale datum;
l: connecting and combining, by the computer device, the at least one rewritten vector set and the rationale data vector sets of the selected natural text that were not selected into another equivalent vector set according to the first order; and
in step C, the computer device obtains the decision model by using the supervised classification algorithm not only according to the equivalent vector set corresponding to each natural text and the target decision result corresponding to each natural text, but also according to the another equivalent vector set and the target decision result corresponding to the selected natural text.
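The synonymous-rewriting augmentation of steps I–L can be sketched as follows. The tiny synonym lookup table and the bag-of-words vectorizer are placeholders of my own; the claims prescribe neither the rewriting algorithm nor the vectorization method.

```python
# Sketch of steps J-K: rewrite a selected rationale datum with synonyms,
# then vectorize the rewrite so it can replace the original vector set
# before the per-text concatenation of step L.

SYNONYMS = {"good": "fine", "bad": "poor"}   # assumed lookup table

def rewrite(sentence):
    """Replace each word with a synonym when one is known (step J)."""
    return " ".join(SYNONYMS.get(w, w) for w in sentence.split())

def vectorize(sentence, vocab):
    """Toy bag-of-words vector over a fixed vocabulary (step K)."""
    words = sentence.split()
    return [words.count(v) for v in vocab]

vocab = ["care", "fine", "good", "poor"]
original = "good care"
rewritten = rewrite(original)   # a synonymous variant of the rationale datum
```

Because the rewrite preserves the meaning of the rationale, the augmented equivalent vector set of step L reuses the target decision result of the selected natural text.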
8. The method for training a machine learning decision model with a natural language corpus according to claim 1, wherein each natural text is related to an event and comprises a plurality of pre-marked file category data, and the method further comprises, before step C, the following step:
m: for each file category datum corresponding to each natural text, converting, by the computer device, the file category datum into a file vector; and
in step B, for each natural text, the computer device connects and combines the rationale data vector sets and the file vectors corresponding to the natural text into the equivalent vector set according to the first order.
9. The method for training a machine learning decision model with a natural language corpus according to claim 8, wherein, in step M, the file category data corresponding to each natural text include location information of the related event.
10. The method for training a machine learning decision model with a natural language corpus according to claim 1, wherein each natural text further comprises a plurality of pre-marked object background data related to the objects described in the natural text, and the method further comprises, before step C, the following step:
n: for each object background datum corresponding to each natural text, converting, by the computer device, the object background datum into an object vector; and
in step B, for each natural text, the computer device connects and combines the rationale data vector sets and the object vectors corresponding to the natural text into the equivalent vector set according to the first order.
11. The method for training a machine learning decision model with a natural language corpus according to claim 10, wherein, in step N, the object background data corresponding to each object of each natural text include gender information of the object.
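The extended equivalent vector set of claims 8–11, with file vectors (e.g., the event location) and object vectors (e.g., each object's gender) appended to the rationale data vector sets, can be sketched as follows. The one-hot encodings, category lists, and function names are illustrative assumptions, not taken from the claims.

```python
# Sketch of claims 8-11: the equivalent vector set concatenates, in the
# first order, the rationale data vector sets, a file vector encoding the
# location of the related event, and one object vector per object.

LOCATIONS = ["city", "rural"]      # assumed file-category vocabulary
GENDERS = ["female", "male"]       # assumed object-background vocabulary

def one_hot(value, categories):
    """Encode a categorical value as a one-hot vector."""
    return [1.0 if value == c else 0.0 for c in categories]

def build_equivalent(rationale_sets, location, genders):
    """Join rationale, file, and object vectors in a fixed first order."""
    equivalent = []
    for s in rationale_sets:
        equivalent.extend(s)
    equivalent.extend(one_hot(location, LOCATIONS))   # file vector (steps M)
    for g in genders:                                 # object vectors (step N)
        equivalent.extend(one_hot(g, GENDERS))
    return equivalent

vec = build_equivalent([[0.5]], "city", ["female", "male"])
```

Because the first order is fixed across all natural texts, every equivalent vector set has the same layout and length, which is what makes the concatenation usable as input to a single supervised classifier in step C.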
12. A method for training a machine learning decision model with a natural language corpus, the method being implemented by a computer device in which a plurality of vector data sets are stored, each vector data set being labeled with a target decision result among a plurality of decision results and comprising a plurality of rationale data vector sets, each rationale data vector set being generated from one of a speech and a natural sentence related to an object to be described, the method comprising the following steps:
a: for each vector data set, connecting and combining, by the computer device, the rationale data vector sets of the vector data set into an equivalent vector set according to a first order; and
b: obtaining, by the computer device, by using a supervised classification algorithm and at least according to the equivalent vector set corresponding to each vector data set and the target decision result corresponding to each vector data set, a decision model for labeling an unlabeled data set to be decided with one of the decision results.
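The two steps of claim 12 can be sketched end to end as follows. A nearest-centroid classifier stands in for the unspecified supervised classification algorithm, and the decision labels are invented for illustration; any trainable classifier over the concatenated equivalent vectors would fit the claim.

```python
# End-to-end sketch of claim 12: concatenate each vector data set's
# rationale vector sets in a first order (step a), then train a supervised
# classifier mapping equivalent vectors to decision results (step b).

def concatenate(vector_sets):
    """Step a: join the rationale vector sets in the first order."""
    out = []
    for s in vector_sets:
        out.extend(s)
    return out

def train(samples):
    """Step b (stand-in): nearest-centroid training.
    samples: list of (list of rationale vector sets, decision result)."""
    sums, counts = {}, {}
    for sets, label in samples:
        vec = concatenate(sets)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def decide(model, vector_sets):
    """Label an unlabeled data set with the nearest decision centroid."""
    vec = concatenate(vector_sets)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(model, key=lambda lab: dist(model[lab]))

model = train([
    ([[1.0, 0.0]], "grant to A"),   # hypothetical decision results
    ([[0.0, 1.0]], "grant to B"),
])
result = decide(model, [[0.9, 0.1]])
```

The design choice illustrated here is the one the claim hinges on: because training consumes only fixed-order concatenated vectors and their labels, the vectorization of the natural texts is fully decoupled from the choice of supervised algorithm.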
CN201911327987.7A 2019-12-20 2019-12-20 Training method for machine learning decision model by using natural language corpus Pending CN113010667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911327987.7A CN113010667A (en) 2019-12-20 2019-12-20 Training method for machine learning decision model by using natural language corpus

Publications (1)

Publication Number Publication Date
CN113010667A 2021-06-22

Family

ID=76381799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911327987.7A Pending CN113010667A (en) 2019-12-20 2019-12-20 Training method for machine learning decision model by using natural language corpus

Country Status (1)

Country Link
CN (1) CN113010667A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108573047A (en) * 2018-04-18 2018-09-25 Guangdong University of Technology Training method and device for an automatic Chinese document classification module
CN108920622A (en) * 2018-06-29 2018-11-30 Beijing QIYI Century Science and Technology Co., Ltd. Training method, training device, and recognition device for intention recognition
CN108984523A (en) * 2018-06-29 2018-12-11 Chongqing University of Posts and Telecommunications Commodity review sentiment analysis method based on a deep learning model
CN110188199A (en) * 2019-05-21 2019-08-30 Beijing Honglian 95 Information Industry Co., Ltd. Text classification method for intelligent voice interaction
CN110298032A (en) * 2019-05-29 2019-10-01 Southwest China Institute of Electronic Technology (No. 10 Research Institute of China Electronics Technology Group Corporation) Text classification corpus labeling and training system
CN110413769A (en) * 2018-04-25 2019-11-05 Beijing Jingdong Shangke Information Technology Co., Ltd. Scene classification method and device, storage medium, and electronic device thereof

Similar Documents

Publication Publication Date Title
CN109670179B (en) Medical record text named entity identification method based on iterative expansion convolutional neural network
Anandarajan et al. Text preprocessing
CN108090049B (en) Multi-document abstract automatic extraction method and system based on sentence vectors
Dligach et al. Neural temporal relation extraction
CN110287480B (en) Named entity identification method, device, storage medium and terminal equipment
CN109635280A (en) A kind of event extraction method based on mark
CN114822812A (en) Character dialogue simulation method, device, equipment and storage medium
Chen et al. Modified bidirectional encoder representations from transformers extractive summarization model for hospital information systems based on character-level tokens (AlphaBERT): development and performance evaluation
CN113722483A (en) Topic classification method, device, equipment and storage medium
Nio et al. Japanese sentiment classification using bidirectional long short-term memory recurrent neural network
CN112885478A (en) Medical document retrieval method, medical document retrieval device, electronic device, and storage medium
CN116386800B (en) Medical record data segmentation method and system based on pre-training language model
CN115472252A (en) Electronic medical record generation method, device, equipment and storage medium based on conversation
CN117271759A (en) Text abstract generation model training method, text abstract generation method and device
CN115238026A (en) Medical text subject segmentation method and device based on deep learning
Madanagopal et al. Reinforced sequence training based subjective bias correction
Yao et al. A unified approach to researcher profiling
Fabregat et al. De-Identification through Named Entity Recognition for Medical Document Anonymization.
CN113010667A (en) Training method for machine learning decision model by using natural language corpus
TW202125309A (en) Training method of natural language corpus for the decision making model of machine learning
CN112307757B (en) Emotion analysis method, device, equipment and storage medium based on auxiliary task
CN111814433B (en) Uygur language entity identification method and device and electronic equipment
Sarwar et al. AGI-P: A Gender Identification Framework for Authorship Analysis Using Customized Fine-Tuning of Multilingual Language Model
Vu et al. LCT-MALTA's submission to RepEval 2017 shared task
CN114372467A (en) Named entity extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination