CN110458207A - A kind of corpus Intention Anticipation method, corpus labeling method and electronic equipment - Google Patents
A kind of corpus Intention Anticipation method, corpus labeling method and electronic equipment
- Publication number
- CN110458207A CN110458207A CN201910669701.7A CN201910669701A CN110458207A CN 110458207 A CN110458207 A CN 110458207A CN 201910669701 A CN201910669701 A CN 201910669701A CN 110458207 A CN110458207 A CN 110458207A
- Authority
- CN
- China
- Prior art keywords
- corpus
- prediction
- initial
- samples
- intention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor › G06F16/30—Information retrieval of unstructured textual data › G06F16/35—Clustering; Classification › G06F16/355—Class or cluster creation or modification
- G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation › G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/24—Classification techniques › G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Abstract
The present invention relates to natural language processing and provides a corpus intent prediction method comprising the steps of: training N prediction models based on preprocessed samples; predicting the corpus to be predicted with each prediction model to obtain N prediction results; and matching the N prediction results against a preset rule to determine the intent information corresponding to the corpus to be predicted, where N is an odd number greater than or equal to 3. The preset rule comprises: if an identical prediction result exists among the N prediction results and its count is greater than N/2, that prediction result is determined to be the intent information corresponding to the corpus to be predicted. The method provided by this embodiment realizes intent prediction for corpora and improves prediction accuracy, so that repetitive manual processing can be greatly reduced. In addition, the present invention also provides a corpus tagging method and an electronic device.
Description
Technical Field
The present invention relates to natural language processing technologies, and in particular, to a corpus intent prediction method, a corpus tagging method, and an electronic device.
Background
A corpus is the basic resource of corpus linguistics research and the main resource of empirical approaches to language study. Traditional corpora are mainly applied to lexicography, language teaching, traditional linguistic research, and statistics- or example-based research in natural language processing. With the development of internet big data and artificial intelligence technology, corpora have found ever wider application.
A corpus has three characteristics. First, it stores language material that actually occurred in real language use, such as user messages and customer-service dialogues collected directly from web pages. Second, a corpus is a basic resource that carries linguistic knowledge, but it is not itself equivalent to that knowledge. Third, raw corpora become useful resources only after processing, which may include removing dirty data, semantic tagging, and part-of-speech tagging. When corpora are tagged, each piece of corpus data is still labeled largely by hand, and because corpus data often contains a large amount of duplicated material, tagging the repeated corpora consumes a great deal of manpower.
Take the training corpus of an intent recognition classifier as an example: training a medical-cosmetology industry intent recognition classifier with a supervised learning algorithm requires a large amount of labeled corpora. Most of this labeling is done manually. In most cases the corpora are not processed in advance and contain a large amount of duplicated data; if the duplicates are not filtered out, labeling efficiency suffers and manpower is wasted.
Disclosure of Invention
In order to solve the above problem, an embodiment of the present invention provides a corpus intent prediction method, including: training N prediction models based on preprocessed samples; predicting the corpus to be predicted with each prediction model to obtain N prediction results; and matching the N prediction results against a preset rule to determine the intent information corresponding to the corpus to be predicted, where N is an odd number greater than or equal to 3. The preset rule comprises: if an identical prediction result exists among the N prediction results and its count is greater than N/2, determining that prediction result to be the intent information corresponding to the corpus to be predicted.
In one implementation, the sample preprocessing method comprises: collecting initial corpus data; performing intent recognition on the initial corpus data based on regular expressions; selecting N equal parts of the initial corpus data containing the target intents; and performing word segmentation and text vectorization on the N equal parts of initial corpus data to obtain N equal parts of samples.
In one implementation, performing intent recognition on the initial corpus data based on regular expressions includes: collecting intent information and the corresponding keywords; and constructing the regular expressions based on the target intents and the corresponding keywords.
In one embodiment, selecting N equal parts of the initial corpus data containing the target intents comprises: determining the target intents contained in all the initial corpus data; dividing the initial corpus data containing the same target intent into N equal parts; and selecting one part for each target intent and merging them, to obtain N equal parts of initial corpus data containing the target intents.
In one embodiment, training N prediction models based on the preprocessed samples includes: constructing N initial prediction models based on different algorithms; and training each initial prediction model on the preprocessed samples to obtain the N prediction models.
In one implementation, the method further comprises: periodically performing iterative training on each prediction model, and stopping the iterative training when the accuracy of every prediction model exceeds a preset threshold; if the count of the identical result is smaller than N/2, recording the sample and its manually identified result as an iteration sample for every prediction model; if the count is greater than N/2, recording the sample and the identical prediction result as an iteration sample for the prediction models whose predictions differed.
The corpus intent prediction method provided by the invention can therefore automatically predict corpus data and obtain the corresponding intent information, saving labor cost and improving data processing efficiency. The method predicts the corpus to be predicted with N prediction models and determines its intent information by a voting scheme over the prediction results, improving the accuracy of the result. Furthermore, in constructing the N prediction models, different algorithms are selected for the initial prediction models and the training samples are preprocessed to ensure sample balance, further improving the accuracy of the prediction results. Meanwhile, through periodic iteration, the prediction precision of the models can be continuously improved, the accuracy of the results ensured, and the method can adapt to the expanding range of corpora to be predicted.
In addition, the invention also provides a corpus tagging method, comprising the steps of: performing intent prediction on the corpus to be processed based on the corpus intent prediction method above to obtain intent information; and tagging the corpus to be processed based on the intent information, thereby providing an auxiliary reference for manual tagging.
The present invention further provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the corpus intent prediction method described above.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
One or more embodiments are illustrated by way of example in the accompanying drawings, which correspond to the figures in which like reference numerals refer to similar elements and which are not to scale unless otherwise specified.
FIG. 1 is a flow chart illustrating a method for predicting corpus intent according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a sample preprocessing method according to a first embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention more apparent, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the various embodiments to provide a better understanding of the present application; the technical solution claimed in the present application can nevertheless be implemented without these technical details, and with various changes and modifications based on the following embodiments.
The first embodiment of the present invention is a corpus intent prediction method, which will be described in detail with reference to the drawings.
Referring to fig. 1, fig. 1 is a flowchart illustrating a corpus intent prediction method according to a first embodiment of the present invention.
As shown in fig. 1, the corpus intent prediction method provided by the present invention includes the following steps:
Step 101: training to obtain N prediction models based on the preprocessed samples.
In the embodiment of the present invention, intent recognition is implemented mainly by a plurality of prediction models, where N is an odd number greater than or equal to 3. In the construction of the prediction models, the training samples may be acquired by the method shown in fig. 2, which is a flowchart of the sample preprocessing method in the first embodiment of the present invention.
As shown in fig. 2, the method for preprocessing the sample may comprise the following steps:
Step 201: collecting initial corpus data.
The corpus data can be obtained from the network, a service database, or other channels. Preferably, corpus data related to the application scenario is selected as the initial corpus data based on the requirements of the actual application. After the initial corpus data is obtained, it can be screened, cleaned, and otherwise processed to filter out invalid data.
And 202, performing intention identification on the initial corpus data based on the regular expression.
Since the initial corpus data may contain non-target data, that is, data not containing target intention information. Specifically, in an actual application scenario, effective intention information is limited, and what is called effective means that a machine can process the intention information, so that the intention information can be realized based on a regular expression when the intention of the initial corpus data is recognized.
The method for constructing the regular expression can comprise the following steps: and collecting intention information and corresponding keywords, and constructing the regular expression based on the target intention and the corresponding keywords.
For example, corpora containing the query-price intent may include keywords such as: expense, cost, presumably, generally, need, total, possibly, and how much. A corresponding regular expression can then be constructed along the lines of:
(expense|cost|presumably|generally|need|total|possibly).*(how much)
On this basis, the corpora in the application scenario (that is, the industry corpora) can be summarized manually or by other means to obtain the keywords corresponding to each piece of target intent information, and regular expressions for recognizing the target intent information are constructed from them. The initial corpus data is then matched against each regular expression to determine the target intent information it corresponds to. The target intents may be selected from all intents contained in the initial corpus data, or set based on actual requirements. Because the intents and keywords are collected from industry corpora, they fit the application scenario well, the target intents can be obtained quickly, and the prediction results of the trained models stay within the target range.
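As a minimal sketch of the keyword-to-regular-expression construction described above (the intent names, keywords, and function names below are illustrative assumptions, not taken from the patent):

```python
import re

# Hypothetical intent -> keyword table; real keywords would be summarized
# manually from the industry corpora, as the description suggests.
INTENT_KEYWORDS = {
    "query price": ["cost", "expense", "price", "how much"],
    "after-sale consultation": ["refund", "return", "warranty"],
}

def build_patterns(intent_keywords):
    """Compile one alternation regex per target intent from its keyword list."""
    return {
        intent: re.compile("|".join(re.escape(k) for k in kws), re.IGNORECASE)
        for intent, kws in intent_keywords.items()
    }

def recognize_intent(text, patterns):
    """Return the first target intent whose pattern matches, else None.

    None marks the corpus as invalid data containing no target intent.
    """
    for intent, pattern in patterns.items():
        if pattern.search(text):
            return intent
    return None

patterns = build_patterns(INTENT_KEYWORDS)
print(recognize_intent("How much does the treatment cost?", patterns))  # query price
```

In practice each target intent would get its own, richer expression (as in the query-price example above) rather than a bare keyword alternation.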
In other embodiments of the present invention, the regular expressions may be derived by summarizing a wider range of corpus data; the more complete the keyword collection, the higher the recognition accuracy of the regular expressions.
Step 203: selecting N equal parts of the initial corpus data containing the target intents.
Through regular-expression recognition, the target intent corresponding to each piece of initial corpus data can be determined, so the initial corpus data can be screened by target intent. The specific process may comprise: determining the target intents contained in all the initial corpus data; dividing the initial corpus data containing the same target intent into N equal parts; and selecting one part for each target intent and merging them, to obtain N equal parts of initial corpus data containing the target intents.
For example, suppose 10000 pieces of initial corpus data are recognized with the regular expressions, and it is determined that 4000 pieces contain the target intent "query price", 2000 pieces contain "discount query", 3000 pieces contain "product consultation", 400 pieces contain "after-sale consultation", and 600 pieces are invalid data, that is, data containing no target intent. If N equals 4, the data of each target intent type can be divided into 4 equal parts and then merged into 4 data sets, each comprising 1000 query-price corpora, 500 discount-query corpora, 750 product-consultation corpora, and 100 after-sale-consultation corpora, thereby obtaining 4 equal parts of initial corpus data containing the target intents. Dividing the initial corpus data equally ensures, to a certain extent, the integrity of the samples.
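The equal-part selection in this example can be sketched as follows (the helper name and the placeholder strings are hypothetical; real items would be the corpus texts themselves):

```python
def split_into_n_parts(corpus_by_intent, n):
    """Stratified N-way split: cut each intent's corpora into n equal parts,
    then merge part i of every intent into sample set i (remainders dropped)."""
    parts = [[] for _ in range(n)]
    for items in corpus_by_intent.values():
        size = len(items) // n
        for i in range(n):
            parts[i].extend(items[i * size:(i + 1) * size])
    return parts

# Counts from the example above: 4000 / 2000 / 3000 / 400 valid corpora.
corpus = {
    "query price": ["p%d" % i for i in range(4000)],
    "discount query": ["d%d" % i for i in range(2000)],
    "product consultation": ["c%d" % i for i in range(3000)],
    "after-sale consultation": ["a%d" % i for i in range(400)],
}
parts = split_into_n_parts(corpus, 4)
print(len(parts[0]))  # 1000 + 500 + 750 + 100 = 2350
```

Each of the four resulting parts keeps the same intent proportions, which is the sample-balance property the description relies on.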
Step 204: performing word segmentation on the N equal parts of initial corpus data, followed by text vectorization, to obtain N equal parts of samples.
After the N equal parts of initial corpus data are obtained, word segmentation and text vectorization are performed on each part, yielding N equal parts of samples for training the prediction models.
By the method described in the above steps 201 to 204, the training sample can be preprocessed, so that the effectiveness and integrity of the sample can be improved.
After obtaining N equal portions of the preprocessed samples, the prediction model may be trained based on the samples, which specifically includes:
first, N initial prediction models are constructed based on different algorithms.
The initial prediction models may be constructed based on binary-classification, multi-class, or deep learning algorithms, including: naive Bayes, support vector machines, random forests, XGBoost, convolutional neural networks, and so on. The specific choice of algorithm can be based on actual requirements; the embodiment of the invention imposes no limitation.
Then, training each initial prediction model based on the preprocessed samples respectively to obtain the N prediction models.
Specifically, each of the N equal parts of samples can be used to train one initial prediction model; that is, the training samples used by the initial prediction models differ, but their number and the target intents they contain are consistent. The specific training method may be any existing model training method selected as needed; the embodiment of the present invention imposes no limitation.
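Under the assumption that scikit-learn stands in for the unspecified algorithm implementations, constructing and training several differently-built models might be sketched as follows (the texts, labels, and the particular choice of three algorithms are illustrative, and for brevity all models share one tiny training set rather than one equal part each):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

texts = ["how much does it cost", "what is the total price",
         "can i get a refund", "how do i return this item"]
labels = ["query price", "query price",
          "after-sale consultation", "after-sale consultation"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# N = 3 models, one per algorithm; in the patent's scheme each model would
# instead be trained on its own equal part of the preprocessed samples.
models = [MultinomialNB(), LinearSVC(), RandomForestClassifier(random_state=0)]
for model in models:
    model.fit(X, labels)

query = vectorizer.transform(["what is the total cost"])
predictions = [m.predict(query)[0] for m in models]
print(predictions)
```

The N per-model prediction results collected this way are what the preset voting rule of step 103 operates on.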
By the method, N prediction models can be obtained by training based on the preprocessed samples.
Step 102: predicting the corpus to be predicted with each prediction model to obtain N prediction results.
When the corpus to be predicted is to be predicted, it can be predicted with each of the N prediction models respectively, thereby obtaining N prediction results.
Step 103: matching the N prediction results against the preset rule and determining the intent information corresponding to the corpus to be predicted.
It can be understood that, since the N prediction models are constructed with different algorithms and trained on different samples, their prediction accuracy may differ to some extent, and the N prediction results may therefore differ (they may be partly different, or all the same). To improve the accuracy of the result, a voting scheme can be adopted to determine the intent information corresponding to the corpus to be predicted.
Specifically, the preset rule may comprise: if an identical prediction result exists among the N prediction results and its count is greater than N/2, determining that prediction result to be the intent information corresponding to the corpus to be predicted.
That is, a voting scheme determines the intent information corresponding to the corpus to be predicted from the N prediction results. To further ensure the accuracy of the result, a threshold can be set for deciding whether a correct intent exists among the N prediction results. In this embodiment the threshold is N/2: only when more than half of the prediction results are identical is the intent information of the corpus to be predicted determined; if no result occurs more than half of the time, the correct intent information cannot be determined and the prediction is considered invalid.
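The preset voting rule can be sketched as follows (the function name is illustrative; returning None stands for "prediction invalid"):

```python
from collections import Counter

def vote(predictions):
    """Return the majority prediction if its count exceeds N/2, else None."""
    n = len(predictions)
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > n / 2 else None

print(vote(["query price", "query price", "discount query"]))  # 2 > 3/2 -> query price
print(vote(["query price", "discount query", "product consultation"]))  # no majority -> None
```

Because N is odd, a count can never equal exactly N/2, so the strict "greater than" comparison cleanly separates majority from no-majority cases.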
In the embodiment of the present invention, the prediction result for each corpus to be predicted can be recorded for subsequent model iteration.
Specifically, since the N prediction models are trained on only the initial samples before being put into use, there is considerable room to improve their prediction accuracy. To further ensure the accuracy of the prediction results, iterative training may be performed periodically on each prediction model. The samples used for each round of iterative training may include the samples used for the initial training as well as records derived from prediction.
Specifically, for a given prediction result: if the count of the identical result is smaller than N/2, the sample and its manually identified result are recorded as an iteration sample for every prediction model; if the count is greater than N/2, the sample and the identical prediction result are recorded as iteration samples for the prediction models whose predictions differed.
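The recording rule in the preceding paragraph can be sketched as follows (the function name and the returned mapping shape are assumptions made for illustration):

```python
from collections import Counter

def iteration_records(sample, predictions, manual_label=None):
    """Map model index -> (sample, correct label) for the next iteration round.

    No majority (agreement below N/2): every model receives the manually
    identified label. Majority exists: only the dissenting models receive
    the majority label as their iteration sample.
    """
    n = len(predictions)
    label, count = Counter(predictions).most_common(1)[0]
    if count > n / 2:
        return {i: (sample, label)
                for i, p in enumerate(predictions) if p != label}
    return {i: (sample, manual_label) for i in range(n)}

print(iteration_records("how much is it", ["price", "price", "refund"]))
```

In the majority case the models that voted with the majority get no new sample, which keeps the iteration corpus focused on each model's actual mistakes.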
Iterative training is then performed on each prediction model with these samples; when the accuracy of every prediction model exceeds a preset threshold, the iterative training can be stopped and a new round of corpus prediction begins. It can be understood that a new round of corpus prediction again yields iteration samples for the next round of iterative training, and the addition of new types of corpora to be predicted can, to a certain extent, expand the prediction range of the prediction models.
The setting of the iteration cycle may be based on a fixed time period, or may be determined based on an actual data amount or the number of iteration samples obtained from the prediction result.
If, after several rounds of iterative training, the prediction results of the prediction models are essentially consistent, the iteration can be stopped.
With the corpus intent prediction method described above, the corpus to be predicted can be predicted to obtain the corresponding intent information, saving the cost of manual identification and improving data processing efficiency.
Based on the same inventive concept, a second embodiment of the present invention provides a corpus tagging method. The method may specifically comprise:
First, intent recognition is performed on the corpus to be processed to obtain the corresponding intent information; for the specific method, refer to the corpus intent prediction method provided in the embodiment of fig. 1, which is not repeated here.
Then, the corpus to be processed is tagged based on the obtained intent information.
The method provided by this embodiment realizes automatic recognition and automatic tagging of corpus intents, so that tagged corpus data can be obtained and used directly in other application scenarios, or used as a reference that speeds up manual tagging.
Another embodiment of the invention relates to an electronic device comprising at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the corpus intent prediction method in the embodiment shown in fig. 1.
Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.
Yet another embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the above-described method embodiments.
Those skilled in the art can understand that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions enabling a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (8)
1. A method for predicting corpus intent, the method comprising the steps of:
training to obtain N prediction models based on the preprocessed samples;
predicting the corpus to be predicted respectively based on each prediction model to obtain N prediction results;
matching a preset rule based on the N prediction results, and determining intent information corresponding to the corpus to be predicted;
wherein N is an odd number of 3 or more;
the preset rule comprises:
and if the same prediction results exist in the N prediction results and the same number is larger than N/2, determining that the same prediction results are the intention information corresponding to the corpus to be predicted.
2. The method of claim 1, wherein the sample is pre-processed by a method comprising:
collecting initial corpus data;
performing intention recognition on the initial corpus data based on a regular expression;
selecting N equal parts of the initial corpus data containing the target intention;
and performing word segmentation on the N equal parts of initial corpus data, and performing text vectorization to obtain N equal parts of samples.
3. The method according to claim 2, wherein the method for performing intent recognition on the initial corpus data based on regular expressions comprises:
collecting intention information and corresponding keywords;
and constructing the regular expression based on the target intention and the corresponding key words.
4. The method of claim 2, wherein said selecting N equal portions of said initial corpus data containing a target intent comprises:
determining the target intention contained in all the initial corpus data;
and respectively dividing the initial corpus data containing the same target intention into N equal parts, and respectively selecting one part from the initial corpus data containing different target intentions to merge to obtain the N equal parts of the initial corpus data containing the target intention.
5. The method of claim 1, wherein the training of the N predictive models based on the preprocessed samples comprises:
constructing N initial prediction models based on different algorithms;
and training each initial prediction model based on the preprocessed samples to obtain the N prediction models.
6. The method of claim 1, further comprising the steps of:
periodically carrying out iterative training on each prediction model;
stopping the iterative training when the accuracy of each prediction model exceeds a preset threshold;
if the number of identical results is smaller than N/2, recording the samples and the corresponding manual-identification results as iterative samples for each prediction model;
if the number of identical results is larger than N/2, recording the sample and the identical prediction result as iterative samples for the prediction models whose predictions differ.
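For illustration only: the two routing branches of claim 6 can be sketched as follows — when a majority exists, only the dissenting models receive the sample (labeled with the majority result); otherwise every model receives it with a manual label. All names are assumptions.

```python
from collections import Counter

def collect_iteration_samples(sample, predictions, manual_label=None):
    """Route one sample into per-model retraining sets per claim 6.

    Returns {model_index: (sample, label)} mapping each model that
    should retrain on this sample to the label it should use.
    """
    n = len(predictions)
    label, count = Counter(predictions).most_common(1)[0]
    if count > n / 2:
        # majority exists: only models whose prediction differed retrain
        return {i: (sample, label)
                for i, pred in enumerate(predictions) if pred != label}
    # no majority: every model retrains on the manually identified label
    return {i: (sample, manual_label) for i in range(n)}
```

This concentrates new training data on exactly the models that were wrong, so periodic retraining corrects disagreement without re-labeling the whole corpus.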
7. A corpus tagging method, comprising the steps of:
performing intention prediction on a corpus to be processed by using the corpus intention prediction method according to any one of claims 1 to 6, to obtain the intention information;
and labeling the corpus to be processed based on the intention information.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the corpus intent prediction method according to any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910669701.7A CN110458207A (en) | 2019-07-24 | 2019-07-24 | A kind of corpus Intention Anticipation method, corpus labeling method and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110458207A true CN110458207A (en) | 2019-11-15 |
Family
ID=68483217
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910669701.7A Pending CN110458207A (en) | 2019-07-24 | 2019-07-24 | A kind of corpus Intention Anticipation method, corpus labeling method and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110458207A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109522556A (en) * | 2018-11-16 | 2019-03-26 | 北京九狐时代智能科技有限公司 | A kind of intension recognizing method and device |
CN109871543A (en) * | 2019-03-12 | 2019-06-11 | 广东小天才科技有限公司 | Intention acquisition method and system |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11537660B2 (en) | 2020-06-18 | 2022-12-27 | International Business Machines Corporation | Targeted partial re-enrichment of a corpus based on NLP model enhancements |
CN111914936A (en) * | 2020-08-05 | 2020-11-10 | 平安科技(深圳)有限公司 | Data feature enhancement method and device for corpus data and computer equipment |
CN111914936B (en) * | 2020-08-05 | 2023-05-09 | 平安科技(深圳)有限公司 | Data characteristic enhancement method and device for corpus data and computer equipment |
CN112069786A (en) * | 2020-08-25 | 2020-12-11 | 北京字节跳动网络技术有限公司 | Text information processing method and device, electronic equipment and medium |
WO2022183547A1 (en) * | 2021-03-03 | 2022-09-09 | 平安科技(深圳)有限公司 | Corpus intention recognition method and apparatus, storage medium, and computer device |
CN113742399A (en) * | 2021-09-07 | 2021-12-03 | 天之翼(苏州)科技有限公司 | Data tracing method and system based on cloud edge cooperation |
CN113742399B (en) * | 2021-09-07 | 2023-10-17 | 天之翼(苏州)科技有限公司 | Cloud edge collaboration-based data tracing method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110458207A (en) | A kind of corpus Intention Anticipation method, corpus labeling method and electronic equipment | |
CN106874292B (en) | Topic processing method and device | |
CN108875059B (en) | Method and device for generating document tag, electronic equipment and storage medium | |
CN109086265B (en) | Semantic training method and multi-semantic word disambiguation method in short text | |
CN110516063A (en) | A kind of update method of service system, electronic equipment and readable storage medium storing program for executing | |
CN110019668A (en) | A kind of text searching method and device | |
CN104111925B (en) | Item recommendation method and device | |
CN112464656A (en) | Keyword extraction method and device, electronic equipment and storage medium | |
CN109947902A (en) | A kind of data query method, apparatus and readable medium | |
CN104298776A (en) | LDA model-based search engine result optimization system | |
CN111401065A (en) | Entity identification method, device, equipment and storage medium | |
US11594054B2 (en) | Document lineage management system | |
CN115526171A (en) | Intention identification method, device, equipment and computer readable storage medium | |
Chen et al. | Improved Naive Bayes with optimal correlation factor for text classification | |
CN110019670A (en) | A kind of text searching method and device | |
CN114676346A (en) | News event processing method and device, computer equipment and storage medium | |
CN114282513A (en) | Text semantic similarity matching method and system, intelligent terminal and storage medium | |
US20230142351A1 (en) | Methods and systems for searching and retrieving information | |
AU2019290658B2 (en) | Systems and methods for identifying and linking events in structured proceedings | |
CN116933782A (en) | E-commerce text keyword extraction processing method and system | |
CN115328945A (en) | Data asset retrieval method, electronic device and computer-readable storage medium | |
Shahade et al. | Deep learning approach-based hybrid fine-tuned Smith algorithm with Adam optimiser for multilingual opinion mining | |
Patel et al. | Optimized Text Summarization Using Abstraction and Extraction | |
CN118094019B (en) | Text associated content recommendation method and device and electronic equipment | |
Andersen et al. | More Sustainable Text Classification via Uncertainty Sampling and a Human-in-the-Loop |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191115 |