CN110188204B

CN110188204B - Extended corpus mining method and device, server and storage medium

Info

Publication number: CN110188204B
Application number: CN201910501365.5A
Authority: CN
Inventors: 周辉阳
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-06-11
Filing date: 2019-06-11
Publication date: 2022-10-04
Anticipated expiration: 2039-06-11
Also published as: CN110188204A

Abstract

The application provides an extended corpus mining method, device, server and storage medium. Based on the score that a pre-trained domain prediction model assigns to a corpus in a target domain, it is determined whether the corpus is a fuzzy corpus of the target domain (namely, a first candidate corpus, which may or may not belong to the target domain). If the corpus is a first candidate corpus of the target domain, the first candidate corpus is expanded through a raw corpus set to obtain the second candidate corpus, that is, the corpus in the raw corpus set with the highest similarity to the first candidate corpus. A binary classification model then determines whether a candidate corpus (the candidate corpora include the second candidate corpus) really is an extended corpus of the target domain. Because keywords, standard corpora or standard templates do not need to be matched one by one, time consumption can be reduced and the efficiency of mining extended corpora can be improved compared with the prior art; in addition, expanding the first candidate corpus with the most similar raw corpus enables deep mining of extended corpora.

Description

Extended corpus mining method and device, server and storage medium

Technical Field

The present invention relates to the technical field of corpus mining, and more particularly, to an extended corpus mining method, apparatus, server, and storage medium.

Background

In the process of domain construction, the domain prediction model plays a very important role: it can predict the domain to which a corpus belongs, thereby providing a technical basis for making products intelligent. The capability of the domain prediction model usually depends on its corpus samples. Among these corpus samples, the extended corpora play a decisive role in the generalization and recall capability of the domain prediction model. An extended corpus is a corpus that belongs to a certain domain but is uncommon in that domain.

In the prior art, keyword mining, corpus similarity mining and template similarity mining are commonly used to mine the extended corpora of a domain. Keyword mining mainly uses entities in the domain as keywords and recalls extended corpora through the keywords (for example, the keyword in the music field is "head", and an extended corpus that may be recalled through keyword mining is "next song"). Corpus similarity mining determines a corpus to be an extended corpus of a domain when the corpus matches any standard corpus of that domain. Template similarity mining replaces the entities in a corpus with variables to obtain a corpus template, and determines the corpus to be an extended corpus of the domain when the corpus template matches any standard template in the template library of that domain.

Although the prior art can mine extended corpora, the following problems generally exist: 1. keywords, standard corpora or standard templates need to be matched one by one, which is time-consuming and makes extended corpus mining inefficient; 2. the mined extended corpora tend to be homogeneous, that is, they stay close to the keywords, to the standard corpora in the corpus library or to the standard templates in the template library, so extended corpora cannot be mined in depth.

Disclosure of Invention

In view of the above, to solve the above problems, the present invention provides an extended corpus mining method, apparatus, server and storage medium, so as to implement deep mining of extended corpus on the basis of reducing extended corpus mining time consumption and improving mining efficiency. The technical scheme is as follows:

an extended corpus mining method, comprising:

determining whether a corpus is a first candidate corpus of a target domain according to the score that a pre-trained domain prediction model assigns to the corpus in the target domain;

if the corpus is a first candidate corpus of the target domain, determining, from at least one corpus of a raw corpus set, a second candidate corpus with the highest similarity to the first candidate corpus;

and determining whether a candidate corpus is an extended corpus of the target domain by using a pre-trained binary classification model of the target domain, wherein the binary classification model is obtained by training a classification algorithm with corpora belonging to the target domain as positive samples and corpora not belonging to the target domain as negative samples, and the candidate corpus comprises the second candidate corpus.

An extended corpus mining device, comprising:

the first corpus candidate determining unit is used for determining whether the corpus belongs to the first corpus candidate of the target field according to the score of a pre-trained field prediction model on the corpus in the target field;

a second corpus candidate determining unit, configured to determine, if the corpus belongs to a first corpus candidate in the target field, a second corpus candidate with a highest similarity to the first corpus candidate from at least one corpus of a raw corpus set;

and the extended corpus determining unit is used for determining whether a candidate corpus is the extended corpus of the target field by using a pre-trained two-classification model of the target field, wherein the two-classification model is obtained by training a classification algorithm by using the corpus belonging to the target field as a positive sample and the corpus not belonging to the target field as a negative sample, and the candidate corpus comprises the second candidate corpus.

A server, comprising: at least one memory and at least one processor; the memory stores a program, the processor calls the program stored in the memory, and the program is used for realizing the extended corpus mining method.

A storage medium having stored therein computer-executable instructions for performing the extended corpus mining method.

The application provides an extended corpus mining method, device, server and storage medium. Based on the score that a pre-trained domain prediction model assigns to a corpus in a target domain, it is determined whether the corpus is a fuzzy corpus of the target domain (namely, a first candidate corpus, which may or may not belong to the target domain). If the corpus is a first candidate corpus of the target domain, the first candidate corpus is expanded through a raw corpus set to obtain the second candidate corpus, that is, the corpus in the raw corpus set with the highest similarity to the first candidate corpus. A binary classification model then determines whether a candidate corpus (including the second candidate corpus) really is an extended corpus of the target domain. Because keywords, standard corpora or standard templates do not need to be matched one by one, time consumption can be reduced and the efficiency of mining extended corpora can be improved compared with the prior art; in addition, expanding the first candidate corpus with the most similar raw corpus enables deep mining of extended corpora.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a block diagram of a hardware structure of a server according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of a method for generating a domain prediction model according to an embodiment of the present application;

Fig. 3 is a flowchart of a domain prediction model verification method according to an embodiment of the present application;

Fig. 4 is a flowchart of a method for generating a binary classification model of a target domain according to an embodiment of the present application;

Fig. 5 is a flowchart of an extended corpus mining method according to an embodiment of the present application;

Fig. 6 is a flowchart of a method for determining, according to the score that a pre-trained domain prediction model assigns to a corpus in a target domain, whether the corpus is a first candidate corpus of the target domain, according to an embodiment of the present application;

Fig. 7 is a flowchart of a method for determining, by using a pre-trained binary classification model of a target domain, whether a candidate corpus is an extended corpus of the target domain, according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of an extended corpus mining device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Embodiment:

The embodiment of the present application provides an extended corpus mining method, which can solve the problems of the prior art: extended corpus mining is time-consuming and inefficient, and the mined extended corpora tend to be homogeneous with the keywords, standard corpora or standard templates, so that mining is not deep.

In order to facilitate understanding of the method for mining the extended corpus provided in the embodiment of the present application, the extended corpus will now be described first.

The corpus can be understood as a search sentence of the user, including voice, text, picture input, etc. of the user.

An extended corpus refers to a corpus that belongs to a certain domain but is not common in that domain. For example, in the music domain, common corpora are "I want to listen to a song", "play a popular song" and "come music". Accordingly, the common keywords in the music domain are "head", "song", "listen", "music" and "play". In real life, however, people's queries and needs are varied, and users cannot be required to ask for a song in a fixed way: there are a thousand Hamlets in a thousand people's eyes, and even for the same need people have countless ways of expressing it. For some utterances, the user's intention is clearly to listen to a song, yet the utterance does not contain any of the common keywords. A sentence such as "i feel the weather when jazz compares" is particularly easy to be captured by the weather domain, because it also contains "weather", a keyword of the weather domain. Therefore, mining extended corpora plays an important role in clarifying domain boundaries and in improving the accuracy with which the domain prediction model classifies corpora by domain.

The mining of extended corpora plays a significant role in making intelligent products more intelligent. If an intelligent product is expected to understand the user better, to come closer to what the user has in mind, and to grasp the user's real needs in different contexts, then deeper and more efficient mining of extended corpora is a necessary path.

The above description refers to a domain prediction model. The domain prediction model can be regarded as a semantic classifier, learned with a deep learning algorithm, that predicts which domain and intention a corpus belongs to.

When a corpus is input into the domain prediction model, the model outputs the probability of the corpus in each of the different domains, from which the domain to which the corpus belongs is determined.

The foregoing describes the basic concepts involved in the extended corpus mining method; the method itself is further explained below.

The method for mining the expanded corpus provided by the embodiment of the application can be applied to a server, wherein the server can be a service device which provides service for a user at a network side, can be a server cluster formed by a plurality of servers, and can also be a single server.

Optionally, fig. 1 shows a block diagram of a hardware structure of a server, and referring to fig. 1, the hardware structure of the server may include: a processor 11, a communication interface 12, a memory 13 and a communication bus 14;

in the embodiment of the present invention, the number of the processor 11, the communication interface 12, the memory 13, and the communication bus 14 may be at least one, and the processor 11, the communication interface 12, and the memory 13 complete mutual communication through the communication bus 14;

the processor 11 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;

the memory 13 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory) or the like, such as at least one disk memory;

wherein the memory stores a program, the processor may invoke the program stored in the memory, and the program is operable to:

determining whether a corpus is a first candidate corpus of a target domain according to the score that a pre-trained domain prediction model assigns to the corpus in the target domain;

if the corpus is a first candidate corpus of the target domain, determining, from at least one corpus of a raw corpus set, a second candidate corpus with the highest similarity to the first candidate corpus;

and determining whether a candidate corpus is an extended corpus of the target domain by using a pre-trained binary classification model of the target domain, wherein the binary classification model is obtained by training a classification algorithm with corpora belonging to the target domain as positive samples and corpora not belonging to the target domain as negative samples, and the candidate corpus comprises the second candidate corpus.
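The three operations listed above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `mine_extended_corpus`, `domain_score`, `score_range` and `is_target_domain` are hypothetical, the interval test used to recognize a first candidate corpus is an assumption (the precise criterion is described with reference to Fig. 6), and token-overlap (Jaccard) similarity merely stands in for whatever similarity measure an implementation actually uses.

```python
from typing import Callable, Iterable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (an assumed stand-in for the similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mine_extended_corpus(
    corpus: str,
    domain_score: Callable[[str], float],      # hypothetical: score in the target domain
    score_range: Tuple[float, float],          # assumed interval marking a fuzzy corpus
    raw_corpus_set: Iterable[str],
    is_target_domain: Callable[[str], bool],   # hypothetical binary classification model
) -> List[str]:
    low, high = score_range
    # Step 1: the corpus is a first candidate corpus only if its score is "fuzzy".
    if not (low <= domain_score(corpus) <= high):
        return []
    # Step 2: pick the raw corpus most similar to the first candidate corpus.
    raw = list(raw_corpus_set)
    if not raw:
        return []
    second_candidate = max(raw, key=lambda r: jaccard(corpus, r))
    # Step 3: keep only candidates the binary classification model accepts.
    candidates = [second_candidate]
    return [c for c in candidates if is_target_domain(c)]


raw_set = [
    "play some jazz for me",
    "what is the weather today",
    "navigate to the airport",
]
result = mine_extended_corpus(
    "some jazz would be nice",
    domain_score=lambda c: 0.65,
    score_range=(0.6, 0.8),
    raw_corpus_set=raw_set,
    is_target_domain=lambda c: "jazz" in c,
)
```

In this toy run the scoring function and the classifier are stubs; in practice they would be the pre-trained domain prediction model and the binary classification model of the target domain.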

Alternatively, the detailed function and the extended function of the program may be described with reference to the following.

In order to facilitate understanding of the extended corpus mining method applied to the server, a detailed description will now be given of the extended corpus mining method provided in the embodiment of the present application.

In the implementation process of the extended corpus mining method provided by the embodiment of the application, a pre-trained domain prediction model and a pre-trained binary classification model are required, and a generation method of the domain prediction model and the binary classification model is explained first.

The domain prediction model may be considered a semantic classifier that is used to predict the domain to which the corpus belongs. The pre-trained domain prediction model can be generated through the generation process of the domain prediction model.

Fig. 2 is a flowchart of a method for generating a domain prediction model according to an embodiment of the present disclosure.

As shown in fig. 2, the method includes:

S201, obtaining at least one training sample, wherein the at least one training sample comprises corpora belonging to each of a plurality of domains;

the domain prediction model may be considered to include a plurality of submodels, where different submodels correspond to different domains. When the corpus is predicted through the domain prediction model, the probability that the corpus belongs to the domain can be obtained for each domain, and then the domain with the maximum probability is determined as the domain to which the corpus belongs.
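As a minimal sketch of this "one sub-model per domain, pick the maximum" behavior, assuming each sub-model is simply a callable that returns a probability-like score: the function `predict_domain` and the keyword-based toy scorers below are hypothetical and are not the trained sub-models of the embodiment.

```python
from typing import Callable, Dict, Set, Tuple


def predict_domain(
    corpus: str, submodels: Dict[str, Callable[[str], float]]
) -> Tuple[str, Dict[str, float]]:
    """Score the corpus with every domain sub-model and return the best domain."""
    scores = {domain: model(corpus) for domain, model in submodels.items()}
    best = max(scores, key=scores.get)
    return best, scores


def keyword_scorer(keywords: Set[str]) -> Callable[[str], float]:
    """Toy sub-model (assumption): fraction of tokens that are domain keywords."""
    def score(corpus: str) -> float:
        tokens = corpus.lower().split()
        hits = sum(1 for t in tokens if t in keywords)
        return hits / len(tokens) if tokens else 0.0
    return score


submodels = {
    "weather": keyword_scorer({"weather", "rain", "sunny"}),
    "music": keyword_scorer({"song", "music", "play", "listen"}),
    "geographic": keyword_scorer({"where", "city", "map"}),
}
domain, scores = predict_domain("play a popular song", submodels)
```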

When a domain prediction model is generated, at least one training sample is required to be obtained, each training sample can be regarded as a corpus, and the at least one training sample comprises the corpora belonging to each of a plurality of domains. For example, when the plurality of fields are a weather field, a music field, and a geographic field, the obtained at least one training sample includes a corpus of the weather field, a corpus of the music field, and a corpus of the geographic field.

S202, training a logistic regression algorithm to be trained based on at least one training sample to obtain an initial field prediction model;

in the embodiment of the present application, a logistic regression algorithm to be trained may be trained based on the at least one training sample to obtain an initial domain prediction model. The initial domain prediction model can already predict the domain to which a corpus belongs; however, to improve the accuracy of this prediction, the embodiment of the present application may further train the initial domain prediction model to obtain a more accurate domain prediction model. For the specific training process, refer to steps S203 to S207 below.

S203, obtaining at least one corpus sample;

in the embodiment of the present application, at least one corpus sample may be obtained. A corpus sample may be regarded as a corpus that a user inputs into an intelligent product while using that product.

S204, detecting whether the score of the initial domain prediction model on the corpus sample in the domain is within a second threshold range, wherein the second threshold range is related to a threshold value of the initial domain prediction model on the domain;

in the embodiment of the present application, the initial domain prediction model obtained by training the logistic regression algorithm based on the at least one training sample provides a threshold for each domain. For example, when the plurality of domains are a weather domain, a music domain and a geographic domain, the initial domain prediction model provides a threshold for the weather domain, a threshold for the music domain and a threshold for the geographic domain. For example, the threshold of the weather domain is 0.6, the threshold of the music domain is 0.7, and the threshold of the geographic domain is 0.4.

The extended corpus mining method provided by the embodiment of the present application aims to find the extended corpora of a domain, namely corpora that belong to the domain but are uncommon. In what score interval are such corpora likely to be distributed? The inventor found that this interval is related to the threshold of the initial domain prediction model for the domain: it is an interval around that threshold. For example, the threshold of the weather domain is 0.6, which means that corpora whose probability in the weather domain is around 0.6 are fuzzy and difficult to distinguish; they may or may not belong to the weather domain. Corpora in this interval may be the extended corpora that are needed, so they should be collected. For example, an interval threshold of 0.1 above and below the threshold may be preset, so that the second threshold range related to the threshold of the weather domain is 0.5-0.7, the second threshold range related to the threshold of the music domain is 0.6-0.8, and the second threshold range related to the threshold of the geographic domain is 0.3-0.5. The above is only a preferred value of the interval threshold provided in the embodiment of the present application; the specific value may be set as required, for example to 0.11, 0.2 or 0.25, which is not limited herein.
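The derivation of the second threshold range from a domain threshold can be illustrated with a short Python sketch. The function name `second_threshold_range` is hypothetical; rounding to avoid floating-point noise and clamping the range to [0, 1] are assumptions added for the sketch and are not stated in the embodiment.

```python
from typing import Dict, Tuple


def second_threshold_range(threshold: float, margin: float = 0.1) -> Tuple[float, float]:
    """Return the interval [threshold - margin, threshold + margin].

    Rounding removes floating-point noise; clamping to [0, 1] is an assumption,
    made because the scores are probabilities.
    """
    low = max(0.0, round(threshold - margin, 10))
    high = min(1.0, round(threshold + margin, 10))
    return low, high


# Domain thresholds taken from the example in the text.
thresholds: Dict[str, float] = {"weather": 0.6, "music": 0.7, "geographic": 0.4}
ranges = {domain: second_threshold_range(t) for domain, t in thresholds.items()}
```

With the interval threshold of 0.1 this reproduces the ranges of the example: 0.5-0.7 for weather, 0.6-0.8 for music and 0.3-0.5 for the geographic domain.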

The corpus samples are input into the initial domain prediction model, and the scores of the initial domain prediction model on the corpus samples in the weather domain (namely, the probability that the corpus samples belong to the weather domain, for example, 0.55), the scores of the initial domain prediction model on the corpus samples in the music domain (namely, the probability that the corpus samples belong to the music domain, for example, 0.9), and the scores of the initial domain prediction model on the corpus samples in the geographic domain (namely, the probability that the corpus samples belong to the geographic domain, for example, 0.45) are obtained.

S205, if the score of the initial domain prediction model on the corpus sample in the domain is within a second threshold range, determining the corpus sample as a target corpus sample of the domain;

based on the above detailed description of step S204: the score of the initial domain prediction model for the corpus sample in the weather domain is 0.55, and the second threshold range related to the threshold of the weather domain is 0.5-0.7; since 0.55 lies within 0.5-0.7, the corpus sample is determined to be a target corpus sample of the weather domain. The score of the initial domain prediction model for the corpus sample in the music domain is 0.9, and the second threshold range related to the threshold of the music domain is 0.6-0.8; since 0.9 does not lie within 0.6-0.8, the corpus sample is determined not to be a target corpus sample of the music domain. The score of the initial domain prediction model for the corpus sample in the geographic domain is 0.45, and the second threshold range related to the threshold of the geographic domain is 0.3-0.5; since 0.45 lies within 0.3-0.5, the corpus sample is determined to be a target corpus sample of the geographic domain.

Further, in this embodiment, if the score of the initial domain prediction model for the corpus sample in the domain is not within the second threshold range, it is determined that the corpus sample is not the target corpus sample of the domain, and then a training sample corresponding to the corpus sample is not generated.
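The decision in steps S204 and S205 can be replayed with the numbers of the example. The sketch below is illustrative only: `target_domains` is a hypothetical helper, and treating the range bounds as inclusive is an assumption, since the text only says "within".

```python
from typing import Dict, List, Tuple


def target_domains(
    scores: Dict[str, float], ranges: Dict[str, Tuple[float, float]]
) -> List[str]:
    """Domains for which the corpus sample is a target corpus sample.

    A domain qualifies when the sample's score in that domain falls inside the
    domain's second threshold range (bounds treated as inclusive, an assumption).
    """
    return [d for d, s in scores.items() if ranges[d][0] <= s <= ranges[d][1]]


# Values from the worked example in the text.
ranges = {"weather": (0.5, 0.7), "music": (0.6, 0.8), "geographic": (0.3, 0.5)}
scores = {"weather": 0.55, "music": 0.9, "geographic": 0.45}
selected = target_domains(scores, ranges)
```

Here the sample is selected for the weather and geographic domains but not for the music domain, even though music has the highest score: the method looks for ambiguous scores, not confident ones.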

S206, responding to the calibration operation of the user to the field to which the target corpus sample belongs, and generating a training sample corresponding to the target corpus sample;

based on the above detailed description of step S205, the corpus sample is determined to be a target corpus sample of the weather domain and a target corpus sample of the geographic domain. This result can be displayed to a user, who determines whether the target corpus sample really is a corpus of the weather domain; if so, the user calibrates the target corpus sample as belonging to the weather domain, and, in response to this calibration operation, a training sample corresponding to the target corpus sample is generated, the training sample being the corpus sample calibrated as belonging to the weather domain. Likewise, the user determines whether the target corpus sample really is a corpus of the geographic domain; if so, the user calibrates the target corpus sample as belonging to the geographic domain, and, in response to this calibration operation, a training sample corresponding to the target corpus sample is generated, the training sample being the corpus sample calibrated as belonging to the geographic domain.

In this embodiment of the application, if the user determines that the target corpus sample belongs to both the weather field and the geographic field, a training sample corresponding to the weather field and a training sample corresponding to the geographic field may be generated based on the target corpus sample.
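Step S206 can be sketched as a small function that turns the user's calibration into labeled training samples. The `TrainingSample` type and the function `samples_from_calibration` are hypothetical names introduced for illustration; how the calibration is actually collected from the user is not specified here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TrainingSample:
    text: str    # the corpus sample
    domain: str  # the domain the user calibrated it as belonging to


def samples_from_calibration(
    text: str, candidate_domains: Iterable[str], confirmed_domains: Iterable[str]
) -> List[TrainingSample]:
    """Generate one training sample per candidate domain that the user confirmed.

    Candidate domains the user rejects produce no training sample.
    """
    confirmed = set(confirmed_domains)
    return [TrainingSample(text, d) for d in candidate_domains if d in confirmed]


# The sample was a target corpus sample of two domains; the user confirms both.
both = samples_from_calibration(
    "is it warmer in the south", ["weather", "geographic"], ["weather", "geographic"]
)
# The user confirms only the geographic domain.
only_geo = samples_from_calibration(
    "is it warmer in the south", ["weather", "geographic"], ["geographic"]
)
```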

And S207, updating and training the initial domain prediction model based on the generated training samples to obtain a pre-trained domain prediction model.

According to the method for generating the domain prediction model, after the training sample is generated, the initial domain prediction model can be further updated and trained according to the generated training sample, so that the pre-trained domain prediction model is obtained.

Furthermore, in order to improve the efficiency of processing the corpus by the pre-trained domain prediction model provided in the embodiment of the present application, a memory optimization mode, a multi-process starting mode, and the like may be further employed.

The domain prediction model pre-trained according to the embodiment of the present application can be used to mine the extended corpora of the target domain. After the extended corpora of the target domain are mined, they can further be used as training samples, so that the domain prediction model is further updated and trained based on these training samples.

In this embodiment, the target domain may be a weather domain, a music domain, a geographic domain, or the like, and after the extended corpus of the music domain is mined, the extended corpus of the music domain may be determined as a training sample, so as to perform further update training on the domain prediction model based on the training sample.

Furthermore, after the pre-trained domain prediction model is generated, the generated domain prediction model can be further verified to verify whether the output result of the domain prediction model is accurate.

Fig. 3 is a flowchart of a domain prediction model verification method according to an embodiment of the present disclosure.

As shown in fig. 3, the method includes:

S301, obtaining at least one test corpus, wherein the test corpus carries domain information;

in the embodiment of the application, the determined extension corpus of the target field is used as the test corpus to verify the field prediction model. At this time, the second domain indicated by the domain information carried by the extended corpus of the target domain is the target domain.

S302, scoring the test corpus in each field according to a pre-trained field prediction model, and predicting a first field to which the test corpus belongs;

in the embodiment of the application, the test corpus can be input into the domain prediction model to obtain the scores of the test corpus in each domain, and then the domain with the highest score is determined as the first domain to which the test corpus belongs.

For example, suppose the pre-trained domain prediction model is obtained by training a logistic regression algorithm on corpora of the music domain, the geographic domain and the weather domain. After a test corpus (whose carried domain information indicates the music domain as the second domain) is input into the pre-trained domain prediction model, the obtained result includes score 1 of the test corpus in the music domain, score 2 of the test corpus in the weather domain and score 3 of the test corpus in the geographic domain. If score 2 is the highest of score 1, score 2 and score 3, the first domain to which the test corpus belongs is considered to be the weather domain. Comparison shows that the first domain (the weather domain) differs from the second domain (the music domain), which indicates that the output of the pre-trained domain prediction model is inaccurate and that further training is needed.

With the same model and the same test corpus, if score 1 is the highest of score 1, score 2 and score 3, the first domain to which the test corpus belongs is considered to be the music domain. Comparison shows that the first domain (the music domain) is the same as the second domain (the music domain), which indicates that the output of the pre-trained domain prediction model is accurate.

S303, verifying the domain prediction model based on the first domain to which the predicted test corpus belongs and the second domain indicated by the domain information carried by the test corpus.

According to the present application, the pre-trained domain prediction model can be verified with at least one test corpus, so that problems of the domain prediction model are found in a timely manner, the accuracy of its output is ensured, and the accuracy of the extended corpus mining method provided by the embodiment of the present application is improved.
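The verification of steps S301 to S303 can be sketched as follows. The function `verify_model` and the stubbed score table are hypothetical, and reporting an accuracy figure together with the list of mismatches is an assumption about how the comparison result might be summarized.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def verify_model(
    test_corpora: Iterable[Tuple[str, str]],           # (text, labeled second domain)
    score_all: Callable[[str], Dict[str, float]],      # stand-in for the trained model
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Compare the predicted first domain with the labeled second domain."""
    total = 0
    mismatches: List[Tuple[str, str, str]] = []
    for text, second_domain in test_corpora:
        scores = score_all(text)
        first_domain = max(scores, key=scores.get)     # highest-scoring domain
        total += 1
        if first_domain != second_domain:
            mismatches.append((text, first_domain, second_domain))
    accuracy = (total - len(mismatches)) / total if total else 0.0
    return accuracy, mismatches


# Stubbed scores playing the role of score 1, score 2 and score 3 in the text.
fake_scores = {
    "corpus a": {"music": 0.8, "weather": 0.1, "geographic": 0.1},
    "corpus b": {"music": 0.2, "weather": 0.7, "geographic": 0.1},
}
accuracy, mismatches = verify_model(
    [("corpus a", "music"), ("corpus b", "music")], lambda t: fake_scores[t]
)
```

A non-empty mismatch list corresponds to the case in which the model's output is judged inaccurate and further training is needed.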

The above embodiments provide a method for generating the domain prediction model. A method for generating the binary classification model of the target domain will now be described in detail.

Fig. 4 is a flowchart of a method for generating a binary classification model of a target domain according to an embodiment of the present application.

As shown in fig. 4, the method includes:

S401, obtaining corpora belonging to a target domain and corpora not belonging to the target domain;

in the embodiment of the present application, the target domain may be a music domain, a weather domain, a geographic domain, or the like. For each different target domain, the embodiment of the present application can generate a corresponding binary classification model, namely the binary classification model of that target domain. For example, a binary classification model of the music domain, a binary classification model of the weather domain, a binary classification model of the geographic domain, and so on may be generated.

When generating the binary classification model of the target domain, training samples are obtained first; here the training samples are corpora belonging to the target domain and corpora not belonging to the target domain.

S402, taking the corpora belonging to the target domain as positive samples and the corpora not belonging to the target domain as negative samples, and training a classification algorithm to obtain the binary classification model of the target domain.

In the embodiment of the present application, when generating the binary classification model of the target domain, the corpora belonging to the target domain may be regarded as positive samples and the corpora not belonging to the target domain as negative samples, and the classification algorithm is then trained on the positive samples and the negative samples to obtain the binary classification model of the target domain.

The classification algorithm may be the XGBoost (eXtreme Gradient Boosting) algorithm. This is only a preferred choice in the embodiment of the present application; the specific classification algorithm may be set according to actual requirements and is not limited herein. For example, the classification algorithm may instead be a BERT algorithm, an SVM (Support Vector Machine) algorithm, an LR (Logistic Regression) algorithm, an LSTM (Long Short-Term Memory) algorithm, or the like.
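Steps S401 and S402 can be sketched as follows. To stay free of external dependencies, this sketch swaps in a plain logistic regression (the LR alternative listed above) over whitespace tokens rather than the preferred XGBoost; the tokenizer, the hyperparameters, and the sample sentences are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def tokenize(text):
    """Whitespace tokenizer (an assumption; real corpora may need segmentation)."""
    return text.lower().split()


def train_binary_model(positives, negatives, epochs=200, lr=0.5):
    """Fit logistic-regression weights with plain stochastic gradient descent.

    positives: corpora belonging to the target domain (label 1).
    negatives: corpora not belonging to the target domain (label 0).
    """
    samples = [(tokenize(t), 1.0) for t in positives]
    samples += [(tokenize(t), 0.0) for t in negatives]
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for tokens, label in samples:
            z = bias + sum(weights[tok] for tok in tokens)
            pred = 1.0 / (1.0 + math.exp(-z))
            error = pred - label          # gradient of the log loss w.r.t. z
            bias -= lr * error
            for tok in tokens:
                weights[tok] -= lr * error
    return dict(weights), bias


def score(model, text):
    """Probability that the text belongs to the target domain."""
    weights, bias = model
    z = bias + sum(weights.get(tok, 0.0) for tok in tokenize(text))
    return 1.0 / (1.0 + math.exp(-z))


# Made-up training samples for a music-domain model.
music_model = train_binary_model(
    positives=["play a song", "play some jazz music"],
    negatives=["will it rain today", "how far is the airport"],
)
```

Calling `score(music_model, "play a song")` returns the model's score for that corpus, which can then be compared with the model's threshold for the target domain.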

Furthermore, in the extended corpus mining method provided in the embodiment of the present application, the extended corpus of the target domain is mined using the binary classification model of the target domain; after the extended corpus of the target domain has been mined, it may in turn be used as positive samples to update and further train the binary classification model of the target domain.

The foregoing embodiments describe in detail how the pre-trained domain prediction model and the binary classification model of the target domain are generated. The extended corpus mining method provided in the embodiment of the present application, which mines extended corpora of the target domain based on these two models, is now described in detail.

Fig. 5 is a flowchart of an extended corpus mining method according to an embodiment of the present application.

As shown in fig. 5, the method includes:

S501, determining whether a corpus belongs to a first candidate corpus of a target domain according to the score given to the corpus in the target domain by a pre-trained domain prediction model;

In this embodiment of the application, the corpus may be a corpus that a user inputs into an intelligent product while using that product.

When mining extended corpora of the target domain, the corpus can be input into the pre-trained domain prediction model to obtain the score that the model gives the corpus in the target domain. That is, the domain prediction model outputs the probability that the corpus belongs to the target domain. For example, when the target domain is the music domain, the corpus is input into the pre-trained domain prediction model to obtain its score in the music domain, namely the probability that the corpus belongs to the music domain; based on this probability, it can then be determined whether the corpus belongs to a first candidate corpus of the music domain.

In the embodiment of the present application, whether the corpus belongs to the first candidate corpus of the music domain may be determined as follows: determine the threshold of the pre-trained domain prediction model for the music domain, and generate a first threshold range around that threshold according to a preset up-down floating interval (the amount by which the range extends above and below the threshold); then detect whether the score given to the corpus in the music domain by the pre-trained domain prediction model is within the first threshold range. If so, the corpus is determined to belong to the first candidate corpus of the music domain; if not, the corpus is determined not to belong to the first candidate corpus of the music domain.

For example, suppose the threshold of the pre-trained domain prediction model for the music domain is 0.5 and the preset up-down floating interval is 0.1, so that the first threshold range around the threshold of the music domain is 0.4 to 0.6. If the score given to the corpus in the music domain by the pre-trained domain prediction model is 0.45, the score lies within the first threshold range of 0.4 to 0.6, and the corpus is determined to be a first candidate corpus of the music domain.
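The arithmetic of this example can be written as a short Python sketch. The function names are hypothetical, and treating both ends of the range as inclusive is an assumption, since the text does not say how boundary scores are handled.

```python
def first_threshold_range(threshold, float_interval):
    """Range extending float_interval below and above the domain threshold."""
    return threshold - float_interval, threshold + float_interval


def is_first_candidate(score, threshold=0.5, float_interval=0.1):
    """True if the score falls inside the first threshold range.

    Inclusive bounds are an assumption made for this sketch.
    """
    low, high = first_threshold_range(threshold, float_interval)
    return low <= score <= high
```

With the values from the example, `is_first_candidate(0.45)` is true, while a clearly in-domain score such as 0.9 or a clearly out-of-domain score such as 0.2 falls outside the range and is not treated as a fuzzy corpus.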

S502, if the corpus belongs to the first candidate corpus of the target domain, determining, from at least one corpus in a living corpus set, a second candidate corpus with the highest similarity to the first candidate corpus;

To increase the depth of extended corpus mining, after the corpus is determined to be a first candidate corpus of the target domain, more living corpora can be recalled based on the first candidate corpus, and the depth of extended corpus mining is then increased based on these living corpora.

Specifically, in the embodiment of the present application, a living corpus set may be provided. The corpora in the living corpus set are comparatively living corpora, that is, corpora close to everyday speech, and the set includes at least one corpus. The corpora in the living corpus set may come from question-answer pairs crawled from Sogou, question-answer pairs crawled from Baidu, and everyday chat sentences provided by some open-source platforms. The living corpus set can be updated periodically or in real time so that it stays close to the sentences people use in daily life.

After the corpus is determined to be a first candidate corpus of the target domain, the second candidate corpus with the highest similarity to the first candidate corpus may be determined from the at least one corpus of the living corpus set through ES (ElasticSearch, a search server) retrieval.

ES: ElasticSearch is a Lucene-based search server. It provides a distributed, multi-user full-text search engine with a RESTful web interface, supports real-time search, and is stable, reliable, fast, and convenient to install and use.
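The retrieval in S502 can be illustrated without a running search server. The sketch below does not call ElasticSearch; it substitutes a character-bigram Jaccard similarity as a stand-in for the search server's relevance ranking, and the function names and sample sentences are hypothetical.

```python
def bigrams(text):
    """Set of character bigrams of the lower-cased text."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard similarity of the bigram sets of two texts."""
    sa, sb = bigrams(a), bigrams(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def most_similar(first_candidate, living_corpus_set):
    """Return the corpus with the highest similarity to the first candidate.

    Assumes the living corpus set contains at least one corpus.
    """
    return max(living_corpus_set, key=lambda c: jaccard(first_candidate, c))


# Made-up living corpus set.
living = [
    "put on some music",
    "what is the weather like",
    "play that song again",
]
second_candidate = most_similar("play a song", living)
```

In a deployment that follows the text, the body of `most_similar` would be replaced by a query against the ES index holding the living corpus set, returning the top-ranked hit.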

S503, determining whether a candidate corpus is an extended corpus of the target domain by using a pre-trained binary classification model of the target domain, wherein the binary classification model is obtained by training a classification algorithm with corpora belonging to the target domain as positive samples and corpora not belonging to the target domain as negative samples, and the candidate corpus includes the second candidate corpus.

In the embodiment of the present application, after the corpus is determined to be a first candidate corpus of the target domain and the second candidate corpus with the highest similarity to it is determined from the living corpus set, the pre-trained binary classification model of the target domain may be used to determine whether the second candidate corpus is an extended corpus of the target domain.

Specifically, the pre-trained binary classification model of the target domain has a threshold for the target domain. The second candidate corpus is input into the model to obtain the score that the model gives the second candidate corpus (i.e., the probability that the second candidate corpus belongs to the target domain). When the score is greater than the threshold, the second candidate corpus may be considered an extended corpus of the target domain; when the score is not greater than the threshold, it may be considered not to be an extended corpus of the target domain.

In the embodiment of the present application, after the second candidate corpus is determined to be an extended corpus of the target domain by the binary classification model of the target domain, a user may further confirm whether it really is an extended corpus of the target domain, so as to further ensure the accuracy of the mined extended corpus.

In this embodiment of the present application, the pre-trained binary classification model of the target domain has a threshold for the target domain. The extended corpus mining method provided in this embodiment may further input the first candidate corpus into the pre-trained binary classification model of the target domain to obtain the score that the model gives the first candidate corpus (i.e., the probability that the first candidate corpus belongs to the target domain). When the score is greater than the threshold, the first candidate corpus may be considered an extended corpus of the target domain; when the score is not greater than the threshold, the first candidate corpus may be considered not to be an extended corpus of the target domain.

In the embodiment of the application, after the first candidate corpus is determined to be an extended corpus of the target domain by the binary classification model of the target domain, a user may further confirm whether the first candidate corpus really is an extended corpus of the target domain, so as to further ensure the accuracy of the mined extended corpus.

The application may determine whether a candidate corpus is an extended corpus of the target domain by using the pre-trained binary classification model of the target domain, where the candidate corpus includes the second candidate corpus (i.e., a second candidate corpus is regarded as a candidate corpus), or the candidate corpus includes both the first candidate corpus and the second candidate corpus (i.e., a first candidate corpus is regarded as a candidate corpus and a second candidate corpus is also regarded as a candidate corpus).

In this embodiment of the present application, when the candidate corpus includes only the second candidate corpus, after the binary classification model of the target domain determines the second candidate corpus to be an extended corpus, a user may further confirm whether that second candidate corpus really is an extended corpus of the target domain, and the user may also confirm directly whether the first candidate corpus really is an extended corpus of the target domain, so as to further ensure the accuracy of the mined extended corpus.

To explain the extended corpus mining method provided in the embodiment of the present application more clearly, the step of determining whether a corpus belongs to a first candidate corpus of the target domain, according to the score given to the corpus in the target domain by the pre-trained domain prediction model, is now described in detail.

Fig. 6 is a flowchart of a method, according to an embodiment of the present application, for determining whether a corpus belongs to a first candidate corpus of a target domain according to the score given to the corpus in the target domain by a pre-trained domain prediction model.

As shown in fig. 6, the method includes:

S601, inputting the corpus into the pre-trained domain prediction model to obtain the score given to the corpus in the target domain by the domain prediction model;

S602, detecting whether the score given to the corpus in the target domain by the domain prediction model is within a first threshold range; if the score is within the first threshold range, executing step S603; if the score is not within the first threshold range, executing step S604;

In an embodiment of the application, the first threshold range is related to the threshold of the domain prediction model for the target domain.

S603, determining that the corpus belongs to the first candidate corpus of the target domain;

S604, determining that the corpus does not belong to the first candidate corpus of the target domain.

To explain the extended corpus mining method provided in the embodiment of the present application more clearly, the step of determining whether a candidate corpus is an extended corpus of the target domain by using the pre-trained binary classification model of the target domain is now described in detail.

Fig. 7 is a flowchart of a method, according to an embodiment of the present application, for determining whether a candidate corpus is an extended corpus of a target domain by using a pre-trained binary classification model of the target domain.

As shown in fig. 7, the method includes:

S701, inputting the candidate corpus into the pre-trained binary classification model of the target domain to obtain the score given to the candidate corpus by the binary classification model;

S702, detecting whether the score given to the candidate corpus by the binary classification model is greater than the threshold of the binary classification model for the target domain; if the score is greater than the threshold, executing step S703; if the score is not greater than the threshold, executing step S704;

S703, determining that the candidate corpus is an extended corpus of the target domain;

S704, determining that the candidate corpus is not an extended corpus of the target domain.
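Steps S701 to S704 amount to a strict threshold comparison applied to each candidate corpus. A minimal Python sketch follows, assuming the binary classification model is available as a callable that returns a score; `classify_candidates` and the sample scores are hypothetical.

```python
def classify_candidates(candidates, binary_score, threshold):
    """Split candidate corpora into extended and rejected lists.

    candidates: the second candidate corpus alone, or the first and second
    candidate corpora together.
    binary_score: callable returning the model's score for a corpus
    (hypothetical stand-in for the pre-trained binary classification model).
    """
    extended, rejected = [], []
    for corpus in candidates:
        if binary_score(corpus) > threshold:   # S702 -> S703
            extended.append(corpus)
        else:                                  # S702 -> S704
            rejected.append(corpus)
    return extended, rejected


# Made-up scores for three candidate corpora.
scores = {"a": 0.9, "b": 0.5, "c": 0.1}
extended, rejected = classify_candidates(["a", "b", "c"], scores.get, 0.5)
```

Because the comparison is "greater than", a candidate whose score equals the threshold (here `"b"`) is not treated as an extended corpus.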

The application provides an extended corpus mining method. Based on the score given to a corpus in the target domain by a pre-trained domain prediction model, the method determines whether the corpus is a fuzzy corpus of the target domain (namely, a first candidate corpus, which may or may not belong to the target domain). If the corpus is a first candidate corpus of the target domain, the first candidate corpus is expanded through a living corpus set to obtain a living second candidate corpus with the highest similarity to the first candidate corpus. A binary classification model then determines whether the candidate corpus (which includes the second candidate corpus) really is an extended corpus of the target domain. Because keywords, standard corpora, or standard templates do not need to be matched one by one, the method takes less time than the prior art and mines extended corpora more efficiently; and because it expands each first candidate corpus with the living second candidate corpus most similar to it, it achieves deep mining of extended corpora.
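The overall flow summarized above can be sketched end to end in Python. The three models and the retrieval step are passed in as callables so that the sketch stays independent of any particular library; all names, the inclusive range bounds, and the sample values are assumptions for illustration only.

```python
def mine_extended_corpus(corpus, domain_score, retrieve, binary_score,
                         domain_threshold, float_interval, binary_threshold,
                         include_first=False):
    """Return the candidate corpora accepted as extended corpora.

    domain_score: callable, score of a corpus in the target domain.
    retrieve: callable, most similar corpus from the living corpus set.
    binary_score: callable, score from the binary classification model.
    include_first: whether the first candidate corpus is also classified.
    """
    # Keep only fuzzy corpora near the domain threshold.
    score = domain_score(corpus)
    low = domain_threshold - float_interval
    high = domain_threshold + float_interval
    if not (low <= score <= high):
        return []

    # Expand the first candidate with its most similar living corpus.
    second = retrieve(corpus)
    candidates = [second] + ([corpus] if include_first else [])

    # Accept candidates scoring strictly above the binary threshold.
    return [c for c in candidates if binary_score(c) > binary_threshold]


# Made-up stand-ins for the models and the living corpus lookup.
binary_scores = {"play a song": 0.4, "play that song again": 0.9}
mined = mine_extended_corpus(
    "play a song",
    domain_score=lambda c: 0.45,
    retrieve=lambda c: "play that song again",
    binary_score=binary_scores.get,
    domain_threshold=0.5,
    float_interval=0.1,
    binary_threshold=0.6,
    include_first=True,
)
```

In this made-up run the first candidate is rejected by the binary classification model while the retrieved second candidate is accepted, which is the situation the expansion step is meant to exploit.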

As shown in fig. 8, the apparatus includes:

a first candidate corpus determining unit 81, configured to determine whether a corpus belongs to a first candidate corpus of a target domain according to the score given to the corpus in the target domain by a pre-trained domain prediction model;

a second candidate corpus determining unit 82, configured to determine, if the corpus belongs to the first candidate corpus of the target domain, a second candidate corpus with the highest similarity to the first candidate corpus from at least one corpus in a living corpus set;

and an extended corpus determining unit 83, configured to determine whether a candidate corpus is an extended corpus of the target domain by using a pre-trained binary classification model of the target domain, where the binary classification model is obtained by training a classification algorithm with corpora belonging to the target domain as positive samples and corpora not belonging to the target domain as negative samples, and the candidate corpus includes the second candidate corpus.

In this embodiment of the present application, preferably, the first candidate corpus determining unit includes:

a first scoring unit, configured to input the corpus into the pre-trained domain prediction model to obtain the score given to the corpus in the target domain by the domain prediction model;

a first detection unit, configured to detect whether the score given to the corpus in the target domain by the domain prediction model is within a first threshold range, where the first threshold range is related to the threshold of the domain prediction model for the target domain;

a first determining unit, configured to determine that the corpus belongs to the first candidate corpus of the target domain if the score given to the corpus in the target domain by the domain prediction model is within the first threshold range;

and a second determining unit, configured to determine that the corpus does not belong to the first candidate corpus of the target domain if the score given to the corpus in the target domain by the domain prediction model is not within the first threshold range.

In this embodiment, preferably, the extended corpus determining unit includes:

a second scoring unit, configured to input the candidate corpus into the pre-trained binary classification model of the target domain to obtain the score given to the candidate corpus by the binary classification model;

a second detection unit, configured to detect whether the score given to the candidate corpus by the binary classification model is greater than the threshold of the binary classification model for the target domain;

a third determining unit, configured to determine that the candidate corpus is an extended corpus of the target domain if the score given to the candidate corpus by the binary classification model is greater than the threshold of the binary classification model for the target domain;

and a fourth determining unit, configured to determine that the candidate corpus is not an extended corpus of the target domain if the score given to the candidate corpus by the binary classification model is not greater than the threshold of the binary classification model for the target domain.

Further, the extended corpus mining device provided in the embodiment of the present application further includes a domain prediction model generating unit, which includes:

a first obtaining unit, configured to obtain at least one training sample, where the at least one training sample includes corpora belonging to each of a plurality of domains;

an initial domain prediction model generating unit, configured to train a logistic regression algorithm to be trained based on the at least one training sample to obtain an initial domain prediction model;

a second obtaining unit, configured to obtain at least one corpus sample;

a third detection unit, configured to detect whether the score given to the corpus sample in a domain by the initial domain prediction model is within a second threshold range, where the second threshold range is related to the threshold of the initial domain prediction model for that domain;

a fifth determining unit, configured to determine the corpus sample as a target corpus sample of the domain if the score given to the corpus sample in the domain by the initial domain prediction model is within the second threshold range;

a training sample generating unit, configured to generate a training sample corresponding to the target corpus sample in response to a user's calibration operation on the domain to which the target corpus sample belongs;

and a domain prediction model generating subunit, configured to update and train the initial domain prediction model based on the generated training samples to obtain the pre-trained domain prediction model.
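The selection performed by the third detection unit and the fifth determining unit can be sketched in Python as below. The text only says that the second threshold range is related to the threshold for each domain; building it as a symmetric interval around the threshold, by analogy with the first threshold range, is an assumption, as are the function name and the sample scores.

```python
def select_for_calibration(samples, domain_scores, thresholds, float_interval):
    """Return (sample, domain) pairs whose score lies in the second range.

    domain_scores: callable mapping a sample to {domain: score}; a
    hypothetical stand-in for the initial domain prediction model.
    thresholds: {domain: threshold of the initial model for that domain}.
    float_interval: half-width of the assumed symmetric second range.
    """
    selected = []
    for sample in samples:
        for domain, score in domain_scores(sample).items():
            t = thresholds[domain]
            if t - float_interval <= score <= t + float_interval:
                selected.append((sample, domain))
    return selected


# Made-up scores from an initial domain prediction model.
initial_scores = {
    "play rain sounds": {"music": 0.55, "weather": 0.45},
    "play a song": {"music": 0.95, "weather": 0.02},
}
to_label = select_for_calibration(
    ["play rain sounds", "play a song"],
    initial_scores.get,
    thresholds={"music": 0.5, "weather": 0.5},
    float_interval=0.1,
)
```

The pairs in `to_label` are the target corpus samples that a user would calibrate; the calibrated samples then serve as additional training samples for updating the initial domain prediction model.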

Further, the extended corpus mining device provided in an embodiment of the present application further includes:

a domain prediction model updating unit, configured to determine the extended corpus as a training sample and to update and train the domain prediction model based on the determined training sample.

Further, the extended corpus mining device provided in the embodiment of the present application further includes a domain prediction model verification unit, which includes:

a third obtaining unit, configured to obtain at least one test corpus, where the test corpus carries domain information;

a prediction unit, configured to predict a first domain to which the test corpus belongs according to the scores given to the test corpus in each domain by the pre-trained domain prediction model;

and a verification unit, configured to verify the domain prediction model based on the predicted first domain to which the test corpus belongs and the second domain indicated by the domain information carried by the test corpus.

In this embodiment of the present application, preferably, the second corpus candidate determining unit is specifically configured to determine, by searching through the search server, a second corpus candidate with the highest similarity to the first corpus candidate from at least one corpus in the living corpus set.

Furthermore, an embodiment of the present application further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and the computer-executable instructions are used to execute the extended corpus mining method according to the embodiment.

For a detailed description of the program stored in the storage medium provided in the embodiments of the present application, reference may be made to the above embodiments, which are not described herein again.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.

Those of skill in the art would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), internal memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An extended corpus mining method, comprising:

and determining whether the corpus candidate is the expanded corpus of the target field by using a pre-trained two-classification model of the target field, wherein the two-classification model is obtained by training a classification algorithm by using the corpus belonging to the target field as a positive sample and the corpus not belonging to the target field as a negative sample, and the corpus candidate comprises the second corpus candidate.

2. The method according to claim 1, wherein said determining whether said corpus belongs to a first candidate corpus of a target domain according to the score given to said corpus in said target domain by a pre-trained domain prediction model comprises:

inputting the corpus into the pre-trained domain prediction model to obtain the score given to the corpus in the target domain by the domain prediction model;

detecting whether the score given to the corpus in the target domain by the domain prediction model is within a first threshold range, wherein the first threshold range is related to the threshold of the domain prediction model for the target domain;

if the score given to the corpus in the target domain by the domain prediction model is within the first threshold range, determining that the corpus belongs to the first candidate corpus of the target domain;

and if the score given to the corpus in the target domain by the domain prediction model is not within the first threshold range, determining that the corpus does not belong to the first candidate corpus of the target domain.

3. The method according to claim 1, wherein the determining whether the candidate corpus is an extended corpus of the target domain by using the pre-trained binary classification model of the target domain comprises:

inputting the candidate corpus into the pre-trained binary classification model of the target domain to obtain the score given to the candidate corpus by the binary classification model;

detecting whether the score given to the candidate corpus by the binary classification model is greater than the threshold of the binary classification model for the target domain;

if the score given to the candidate corpus by the binary classification model is greater than the threshold of the binary classification model for the target domain, determining that the candidate corpus is an extended corpus of the target domain;

and if the score given to the candidate corpus by the binary classification model is not greater than the threshold of the binary classification model for the target domain, determining that the candidate corpus is not an extended corpus of the target domain.

4. The method of claim 1, further comprising a domain prediction model generation process comprising:

obtaining at least one training sample, wherein the at least one training sample comprises corpora belonging to each field in a plurality of fields respectively;

training a logistic regression algorithm to be trained based on the at least one training sample to obtain an initial field prediction model;

obtaining at least one corpus sample;

detecting whether the score of the initial domain prediction model on the corpus sample in the domain is within a second threshold range, wherein the second threshold range is related to a threshold value of the initial domain prediction model on the domain;

if the score given to the corpus sample in the domain by the initial domain prediction model is within the second threshold range, determining the corpus sample as a target corpus sample of the domain;

responding to the calibration operation of a user to the field to which the target corpus sample belongs, and generating a training sample corresponding to the target corpus sample;

and updating and training the initial domain prediction model based on the generated training samples to obtain a pre-trained domain prediction model.

5. The method of claim 4, further comprising:

and determining the extended corpus as a training sample, and updating and training the field prediction model based on the determined training sample.

6. The method of any one of claims 1-5, further comprising:

obtaining at least one test corpus, wherein the test corpus carries field information;

predicting, according to the scores given to the test corpus in each domain by the pre-trained domain prediction model, a first domain to which the test corpus belongs;

and verifying the domain prediction model based on the predicted first domain to which the test corpus belongs and the second domain indicated by the domain information carried by the test corpus.

7. The method according to claim 1, wherein said determining a second candidate corpus having the highest similarity to said first candidate corpus from at least one corpus of said living corpus set comprises: determining, through retrieval by a search server, the second candidate corpus with the highest similarity to the first candidate corpus from the at least one corpus in the living corpus set.

8. An extended corpus mining device, comprising:

a second candidate corpus determining unit, configured to determine, if the corpus belongs to a first candidate corpus of the target domain, a second candidate corpus with the highest similarity to the first candidate corpus from at least one corpus in a living corpus set;

9. A server, comprising: at least one memory and at least one processor; the memory stores a program, and the processor calls the program stored in the memory, and the program is used for realizing the extended corpus mining method according to any one of claims 1 to 7.

10. A storage medium having stored thereon computer-executable instructions for performing the method of extended corpus mining of any one of claims 1-7.