CN110188204A

CN110188204A - Extended corpus mining method, apparatus, server and storage medium

Info

Publication number: CN110188204A
Application number: CN201910501365.5A
Authority: CN
Inventors: 周辉阳
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-06-11
Filing date: 2019-06-11
Publication date: 2019-08-30
Anticipated expiration: 2039-06-11
Also published as: CN110188204B

Abstract

The application provides an extended corpus mining method, apparatus, server and storage medium. Based on the score that a pre-trained domain prediction model assigns to a corpus in a target domain, it is determined whether the corpus is a fuzzy corpus of the target domain (i.e., a first candidate corpus, which may or may not belong to the target domain). If the corpus is a first candidate corpus of the target domain, the first candidate corpus is expanded by means of an everyday-language corpus library to obtain a second candidate corpus, namely the everyday-language corpus with the highest similarity to the first candidate corpus. A binary classification model then determines whether a candidate corpus (the candidate corpora include the second candidate corpus) really is an extended corpus of the target domain. The application does not need to match keywords, standard corpora or standard templates one by one; compared with the prior art it therefore takes less time and mines extended corpora more efficiently. By expanding to the everyday-language second candidate corpus with the highest similarity to the first candidate corpus, it also achieves deep mining of extended corpora.

Description

Extended corpus mining method, apparatus, server and storage medium

Technical field

The present invention relates to the technical field of corpus mining, and more specifically to an extended corpus mining method, apparatus, server and storage medium.

Background art

In building out a domain, the domain prediction model plays a very important role: it can predict the domain to which a corpus belongs, providing the technical foundation for product intelligence. The capability of a domain prediction model generally depends on its corpus samples. Among corpus samples, extended corpora are decisive for the generalization and recall of the domain prediction model. An extended corpus is a corpus that belongs to a given domain but is uncommon in that domain.

When mining the extended corpora of a domain, the prior art most commonly uses keyword mining, corpus similarity mining and template similarity mining. Keyword mining mainly takes the entities of the domain as keywords and recalls extended corpora by keyword (for example, if a keyword of the music domain is the measure word used for songs, keyword mining might recall the extended corpus "play a song"). Corpus similarity mining mainly determines that a corpus is an extended corpus of the domain when the corpus matches any standard corpus in the corpus library of the domain. Template similarity mining mainly replaces the entities in a corpus with variables to obtain a corpus template, and determines that the corpus is an extended corpus of the domain when the corpus template matches any standard template in the template library of the domain.

Although the prior art can mine extended corpora, it usually has the following problems: 1. keywords, standard corpora or standard templates must be matched one by one, which is time-consuming and makes extended corpus mining inefficient; 2. the mined extended corpora tend to be homogeneous, that is, they closely resemble the keywords, the standard corpora in the corpus library or the standard templates in the template library, so deep mining of extended corpora cannot be achieved.

Summary of the invention

In view of this, to solve the above problems, the present invention provides an extended corpus mining method, apparatus, server and storage medium, so as to achieve deep mining of extended corpora while reducing mining time and improving mining efficiency. The technical solution is as follows:

An extended corpus mining method, comprising:

determining, according to the score that a pre-trained domain prediction model assigns to a corpus in a target domain, whether the corpus is a first candidate corpus of the target domain;

if the corpus is a first candidate corpus of the target domain, determining, from at least one corpus of an everyday-language corpus library, the second candidate corpus that has the highest similarity to the first candidate corpus;

determining, using a pre-trained binary classification model of the target domain, whether a candidate corpus is an extended corpus of the target domain, wherein the binary classification model is obtained by training a classification algorithm with corpora belonging to the target domain as positive samples and corpora not belonging to the target domain as negative samples, and the candidate corpus includes the second candidate corpus.

An extended corpus mining apparatus, comprising:

a first candidate corpus determination unit, configured to determine, according to the score that a pre-trained domain prediction model assigns to a corpus in a target domain, whether the corpus is a first candidate corpus of the target domain;

a second candidate corpus determination unit, configured to determine, if the corpus is a first candidate corpus of the target domain, from at least one corpus of an everyday-language corpus library, the second candidate corpus that has the highest similarity to the first candidate corpus;

an extended corpus determination unit, configured to determine, using a pre-trained binary classification model of the target domain, whether a candidate corpus is an extended corpus of the target domain, wherein the binary classification model is obtained by training a classification algorithm with corpora belonging to the target domain as positive samples and corpora not belonging to the target domain as negative samples, and the candidate corpus includes the second candidate corpus.

A server, comprising at least one memory and at least one processor, wherein the memory stores a program, the processor calls the program stored in the memory, and the program is used to implement the extended corpus mining method.

A storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the extended corpus mining method.

The application provides an extended corpus mining method, apparatus, server and storage medium. Based on the score that a pre-trained domain prediction model assigns to a corpus in a target domain, it is determined whether the corpus is a fuzzy corpus of the target domain (that is, a first candidate corpus, which may or may not belong to the target domain). If the corpus is a first candidate corpus of the target domain, the first candidate corpus is expanded by means of an everyday-language corpus library to obtain a second candidate corpus, namely the everyday-language corpus with the highest similarity to the first candidate corpus. A binary classification model then determines whether a candidate corpus (the candidate corpora include the second candidate corpus) really is an extended corpus of the target domain. The application does not need to match keywords, standard corpora or standard templates one by one; compared with the prior art it therefore takes less time and mines extended corpora more efficiently. By expanding to the everyday-language second candidate corpus with the highest similarity to the first candidate corpus, it also achieves deep mining of extended corpora.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1 is a hardware block diagram of a server provided by the embodiment of the present application;

Fig. 2 is a flowchart of a method for generating a domain prediction model provided by the embodiment of the present application;

Fig. 3 is a flowchart of a method for verifying a domain prediction model provided by the embodiment of the present application;

Fig. 4 is a flowchart of a method for generating a binary classification model of a target domain provided by the embodiment of the present application;

Fig. 5 is a flowchart of an extended corpus mining method provided by the embodiment of the present application;

Fig. 6 is a flowchart of a method, provided by the embodiment of the present application, for determining whether a corpus is a first candidate corpus of a target domain according to the score that a pre-trained domain prediction model assigns to the corpus in the target domain;

Fig. 7 is a flowchart of a method, provided by the embodiment of the present application, for determining whether a candidate corpus is an extended corpus of a target domain using a pre-trained binary classification model of the target domain;

Fig. 8 is a schematic structural diagram of an extended corpus mining apparatus provided by the embodiment of the present application.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Embodiment:

The embodiment of the present application provides an extended corpus mining method. Based on this method, the problems of the prior art in mining extended corpora can be avoided: long mining time, low mining efficiency, and mined extended corpora that tend to be homogeneous with the keywords, standard corpora or standard templates, so that mining is not deep.

To facilitate understanding of the extended corpus mining method provided by the embodiment of the present application, extended corpora are explained first.

A corpus can be understood as a user's search statement, including the user's voice, text, picture input and so on.

An extended corpus is a corpus that belongs to a given domain but is uncommon in that domain. For example, in the music domain, the corpora people commonly say are "I want to listen to a song", "play a hit song", "play a piece of music" and so on. The common keywords of the music domain are thus "song", "listen", "music", "play" and the like. In real life, however, people's ways of asking and their needs are diverse, and we cannot require everyone to ask for a song in the same way. There are a thousand Hamlets in a thousand people's eyes; even for one and the same need, people have thousands of ways of expressing it. People in different scenes have the same need but phrase it differently. For example, someone who wants to listen to music may say "I think jazz suits this weather", "I am feeling rather low, play something cheerful", or "I prefer the blues style". The intention behind all of these is clearly to listen to a song, yet none of them contains a common keyword. The first sentence, "I think jazz suits this weather", is particularly easy for the weather domain to capture, because it also contains the weather-domain keyword "weather". It can be seen that mining extended corpora plays a very important role in delineating domain boundaries and in improving the accuracy with which the domain prediction model classifies the domain to which a corpus belongs.

Mining extended corpora plays a very important role in building the intelligence of smart products. If we want a smart product to understand users better, to grasp more clearly what users really mean, and to recognize their real needs in different contexts, then mining extended corpora more deeply and more efficiently is the only way forward.

The domain prediction model mentioned above may be regarded as a semantic classifier: a classifier, learned with a deep learning algorithm, that predicts which domain and intention a corpus belongs to.

When a corpus is input to the domain prediction model, the model outputs the probability of the corpus in each of the different domains, from which the domain to which the corpus belongs is determined.

The above embodiment explains the basic concepts of the extended corpus mining method; the way in which extended corpora are mined is now described further.

The extended corpus mining method provided by the embodiment of the present application can be applied to a server. The server may be a service device on the network side that provides services to users; it may be a server cluster composed of multiple servers or a single server.

Optionally, Fig. 1 shows a hardware block diagram of the server. Referring to Fig. 1, the hardware structure of the server may include: a processor 11, a communication interface 12, a memory 13 and a communication bus 14;

In the embodiment of the present invention, there may be at least one each of the processor 11, the communication interface 12, the memory 13 and the communication bus 14, and the processor 11, the communication interface 12 and the memory 13 communicate with one another through the communication bus 14;

The processor 11 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;

The memory 13 may include high-speed RAM and may also include non-volatile memory, for example at least one magnetic disk storage;

The memory stores a program, the processor can call the program stored in the memory, and the program is used for:

determining, according to the score that the pre-trained domain prediction model assigns to a corpus in the target domain, whether the corpus is a first candidate corpus of the target domain;

if the corpus is a first candidate corpus of the target domain, determining, from at least one corpus of the everyday-language corpus library, the second candidate corpus that has the highest similarity to the first candidate corpus;

determining, using the pre-trained binary classification model of the target domain, whether a candidate corpus is an extended corpus of the target domain, wherein the binary classification model is obtained by training a classification algorithm with corpora belonging to the target domain as positive samples and corpora not belonging to the target domain as negative samples, and the candidate corpus includes the second candidate corpus.
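The three program steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the application: `score_in_domain`, `is_in_domain` and the band limits `low` and `high` are hypothetical stand-ins for the pre-trained domain prediction model, the binary classification model and the score range, and the token-overlap (Jaccard) similarity is an assumed choice, since the text here does not fix a similarity measure.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; an assumed stand-in measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mine_extended_corpus(
    corpus: str,
    score_in_domain: Callable[[str], float],  # hypothetical domain model
    low: float,                               # assumed lower band limit
    high: float,                              # assumed upper band limit
    everyday_corpora: Sequence[str],          # at least one corpus
    is_in_domain: Callable[[str], bool],      # hypothetical binary classifier
) -> List[str]:
    # Step 1: keep only "fuzzy" corpora whose score falls inside the band.
    if not (low <= score_in_domain(corpus) <= high):
        return []
    # Step 2: pick the everyday-language corpus most similar to the
    # first candidate corpus.
    second = max(everyday_corpora, key=lambda c: jaccard(corpus, c))
    # Step 3: the binary classifier decides on each candidate corpus
    # (here the candidates contain only the second candidate corpus).
    candidates = [second]
    return [c for c in candidates if is_in_domain(c)]
```

A corpus whose score lies outside the band is dropped at step 1, so the similarity search and the binary classifier run only for fuzzy corpora.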

Optionally, for the refined and extended functions of the program, refer to the description below.

To facilitate understanding of the extended corpus mining method applicable to the above server, the method provided by the embodiment of the present application is now described in detail.

The extended corpus mining method provided by the embodiment of the present application uses a pre-trained domain prediction model and a pre-trained binary classification model during implementation. The methods for generating these two models are therefore described first.

The domain prediction model may be regarded as a semantic classifier used to predict the domain to which a corpus belongs. The pre-trained domain prediction model is produced by the generation process described below.

Fig. 2 is a flowchart of a method for generating a domain prediction model provided by the embodiment of the present application.

As shown in Fig. 2, this method comprises:

S201, obtain at least one training sample, the at least one training sample including corpora belonging to each of multiple domains;

The domain prediction model can be considered to comprise many sub-models, each corresponding to a different domain. When a corpus is predicted by the domain prediction model, the probability that the corpus belongs to each domain is obtained, and the domain with the highest probability is determined as the domain to which the corpus belongs.
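Selecting the domain (field) with the highest probability can be illustrated with a short sketch. The function name `predict_domain` and the dictionary layout are assumptions made for illustration; the scores reuse the example values given later in the description.

```python
from typing import Dict


def predict_domain(scores: Dict[str, float]) -> str:
    """Return the domain whose score (probability) is highest."""
    return max(scores, key=scores.get)


# Per-domain scores as the domain prediction model might output them.
scores = {"weather": 0.55, "music": 0.9, "geography": 0.45}
predicted = predict_domain(scores)
```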

When generating the domain prediction model, at least one training sample needs to be obtained. Each training sample may be regarded as one corpus, and the at least one training sample includes corpora belonging to each of the multiple domains. For example, when the multiple domains are the weather domain, the music domain and the geography domain, the obtained training samples include corpora of the weather domain, corpora of the music domain and corpora of the geography domain.

S202, train the logistic regression algorithm to be trained based on the at least one training sample, to obtain an initial domain prediction model;

In the embodiment of the present application, the logistic regression algorithm to be trained can be trained based on the at least one training sample to obtain the initial domain prediction model. The initial domain prediction model can already predict the domain to which a corpus belongs. To improve the accuracy of this prediction, however, the embodiment of the present application can further train the initial domain prediction model to obtain a domain prediction model that predicts the domain of a corpus more accurately; for the specific training process, see steps S203-S207 below.

S203, obtain at least one corpus sample;

In the embodiment of the present application, at least one corpus sample can be obtained. A corpus sample may be regarded as a corpus that a user of a smart product inputs into the smart product while using it.

S204, detect whether the score that the initial domain prediction model assigns to the corpus sample in a domain lies within a second threshold range, the second threshold range being related to the initial domain prediction model's gate threshold for that domain;

In the embodiment of the present application, training the logistic regression algorithm to be trained on the at least one training sample yields the initial domain prediction model, which provides a gate threshold for each domain. For example, when the multiple domains are the weather domain, the music domain and the geography domain, the obtained training samples include corpora of each of the three domains, and the initial domain prediction model obtained by training provides a gate threshold for the weather domain, a gate threshold for the music domain and a gate threshold for the geography domain. For example, the gate threshold of the weather domain is 0.6, that of the music domain is 0.7, and that of the geography domain is 0.4.

The extended corpus mining method provided by the embodiment of the present application aims to find the extended corpora of a domain, that is, corpora that belong to a domain but are uncommon in it. In what interval are such corpora likely to be distributed? Research has found that the interval is related to the initial domain prediction model's gate threshold for the domain: it is the interval around the gate threshold of the domain. For example, the gate threshold of the weather domain is 0.6, and corpora whose probability in the weather domain is close to 0.6 are all very fuzzy and very hard to tell apart; they may belong to the weather domain or may not. The corpora in this interval may be the extended corpora we need, so they have to be picked out. For example, an interval threshold of 0.1 for the up-and-down float can be preset. The second threshold range related to the gate threshold of the weather domain is then 0.5-0.7; the second threshold range related to the gate threshold of the music domain is 0.6-0.8; and the second threshold range related to the gate threshold of the geography domain is 0.3-0.5. The above is only a preferred value of the interval threshold provided by the embodiment of the present application; the specific value of the interval threshold can be set as required, for example to 0.11, 0.2 or 0.25, and is not limited here.

The corpus sample is input to the initial domain prediction model to obtain the model's score for the corpus sample in the weather domain (that is, the probability that the corpus sample belongs to the weather domain, for example 0.55), its score in the music domain (that is, the probability that the corpus sample belongs to the music domain, for example 0.9), and its score in the geography domain (that is, the probability that the corpus sample belongs to the geography domain, for example 0.45).

S205, if the score that the initial domain prediction model assigns to the corpus sample in the domain lies within the second threshold range, determine the corpus sample as a target corpus sample of the domain;

From the detailed description of step S204 above: the initial domain prediction model's score for the corpus sample in the weather domain is 0.55, and the second threshold range related to the gate threshold of the weather domain is 0.5-0.7. The score lies within this range, so the corpus sample is determined as a target corpus sample of the weather domain. The score for the corpus sample in the music domain is 0.9, and the second threshold range related to the gate threshold of the music domain is 0.6-0.8. The score does not lie within this range, so the corpus sample is determined not to be a target corpus sample of the music domain. The score for the corpus sample in the geography domain is 0.45, and the second threshold range related to the gate threshold of the geography domain is 0.3-0.5. The score lies within this range, so the corpus sample is determined as a target corpus sample of the geography domain.

Further, in the embodiment of the present application, if the score that the initial domain prediction model assigns to the corpus sample in a domain does not lie within the second threshold range, the corpus sample is determined not to be a target corpus sample of the domain, and no training sample corresponding to the corpus sample is generated.
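The range check of steps S204 and S205 can be reproduced with the example numbers from the description: gate (door) thresholds of 0.6, 0.7 and 0.4, an interval threshold of 0.1, and scores of 0.55, 0.9 and 0.45. The names `GATE_THRESHOLDS`, `MARGIN` and `target_domains` are hypothetical, and treating both ends of the range as inclusive is an assumption, since the text does not say how boundary values are handled.

```python
from typing import Dict, List

# Example values taken from the description.
GATE_THRESHOLDS = {"weather": 0.6, "music": 0.7, "geography": 0.4}
MARGIN = 0.1  # preset interval threshold for the up-and-down float


def target_domains(
    scores: Dict[str, float],
    gate_thresholds: Dict[str, float] = GATE_THRESHOLDS,
    margin: float = MARGIN,
) -> List[str]:
    """Return the domains in which the sample is a target corpus sample."""
    hits = []
    for domain, score in scores.items():
        gate = gate_thresholds[domain]
        # Second threshold range: [gate - margin, gate + margin],
        # assumed inclusive at both ends.
        if gate - margin <= score <= gate + margin:
            hits.append(domain)
    return hits


sample_scores = {"weather": 0.55, "music": 0.9, "geography": 0.45}
result = target_domains(sample_scores)
```

With these numbers the sample is a target corpus sample of the weather and geography domains but not of the music domain, matching the worked example in the text.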

S206, in response to the user's labeling operation on the domain to which the target corpus sample belongs, generate a training sample corresponding to the target corpus sample;

From the detailed description of step S205 above, the corpus sample can be determined to be a target corpus sample of the weather domain and a target corpus sample of the geography domain. This result can be displayed so that the user decides whether the target corpus sample really is a target corpus of the weather domain. If so, the user labels the target corpus sample as belonging to the weather domain; correspondingly, in response to the user's labeling operation on the target corpus sample, a training sample corresponding to the target corpus sample is generated, namely the corpus sample labeled as belonging to the weather domain. Likewise, the user can decide whether the target corpus sample really is a target corpus of the geography domain. If so, the user labels the target corpus sample as belonging to the geography domain; correspondingly, in response to the user's labeling operation on the target corpus sample, a training sample corresponding to the target corpus sample is generated, namely the corpus sample labeled as belonging to the geography domain.

In the embodiment of the present application, if the user determines that the target corpus sample belongs to both the weather domain and the geography domain, one training sample corresponding to the weather domain and one training sample corresponding to the geography domain can be generated from the target corpus sample.
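The generation of training samples in step S206 can be sketched as below. The function `build_training_samples`, the representation of a training sample as a `(corpus, domain)` pair and the example sentence are all assumptions made for illustration; the text only states that one training sample is generated per domain the user confirms.

```python
from typing import List, Set, Tuple


def build_training_samples(
    corpus: str,
    candidate_domains: List[str],  # domains where the sample is a target sample
    confirmed_domains: Set[str],   # domains the user labeled as correct
) -> List[Tuple[str, str]]:
    """Create one (corpus, domain) training sample per confirmed domain."""
    return [(corpus, d) for d in candidate_domains if d in confirmed_domains]


# Illustrative corpus; the user confirms both candidate domains.
samples = build_training_samples(
    "is it raining in the south",
    ["weather", "geography"],
    {"weather", "geography"},
)
```

If the user confirms only one of the candidate domains, only that one training sample is produced.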

S207, perform update training on the initial domain prediction model based on the generated training samples, to obtain the pre-trained domain prediction model.

After generating the training samples, the method for generating a domain prediction model provided by the embodiment of the present application can further perform update training on the initial domain prediction model according to the generated training samples, to obtain the pre-trained domain prediction model.

Further, to improve the efficiency with which the pre-trained domain prediction model provided by the embodiment of the present application processes corpora, techniques such as memory optimization and multi-processing may also be used.

The domain prediction model pre-trained in the embodiment of the present application can be used to mine the extended corpora of a target domain. After the extended corpora of the target domain have been mined, they can further be determined as training samples, so that the domain prediction model is further update-trained on the determined training samples.

In the embodiment of the present application, the target domain may be the weather domain, the music domain, the geography domain and so on. After the extended corpora of the music domain have been mined, they can be determined as training samples, so that the domain prediction model is further update-trained on these training samples.

Further, after the pre-trained domain prediction model has been generated, the embodiment of the present application can also verify the generated domain prediction model, to check whether its output results are accurate.

Fig. 3 is a flowchart of a method for verifying a domain prediction model provided by the embodiment of the present application.

As shown in figure 3, this method comprises:

S301, obtain at least one testing corpus, the testing corpus carrying domain information;

In the embodiment of the present application, the extended corpora determined for the target domain are used as testing corpora to verify the domain prediction model. In this case, the second domain indicated by the domain information carried by an extended corpus of the target domain is the target domain.

S302, predict the first domain to which the testing corpus belongs according to the scores that the pre-trained domain prediction model assigns to the testing corpus in each domain;

In the embodiment of the present application, the testing corpus can be input to the domain prediction model to obtain the score of the testing corpus in each domain, and the domain with the highest score is determined as the first domain to which the testing corpus belongs.

For example, suppose the pre-trained domain prediction model was obtained by training a logistic regression algorithm on corpora of the music domain, corpora of the geography domain and corpora of the weather domain. After a testing corpus (the second domain indicated by the domain information carried by the testing corpus is the music domain) is input to the pre-trained domain prediction model, the result includes: score 1 of the testing corpus in the music domain, score 2 of the testing corpus in the weather domain, and score 3 of the testing corpus in the geography domain. If score 2 is the highest of score 1, score 2 and score 3, the first domain to which the testing corpus belongs is considered to be the weather domain. Comparison shows that the first domain (the weather domain) differs from the second domain (the music domain), which indicates that the output of the pre-trained domain prediction model is inaccurate and that further training is needed.

Under the same assumptions, if score 1 is the highest of score 1, score 2 and score 3, the first domain to which the testing corpus belongs is considered to be the music domain. Comparison shows that the first domain (the music domain) is the same as the second domain (the music domain), which indicates that the output of the pre-trained domain prediction model is accurate.

S303, verify the domain prediction model based on the predicted first domain to which the testing corpus belongs and the second domain indicated by the domain information carried by the testing corpus.

The embodiment of the present application can verify the pre-trained domain prediction model with at least one testing corpus, so as to discover problems in the domain prediction model promptly and to guarantee the accuracy of its output results, thereby improving the accuracy of the extended corpus mining method provided by the embodiment of the present application.
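The verification in steps S301 to S303 amounts to comparing, for each testing corpus, the highest-scoring (first) domain with the labeled (second) domain. The sketch below assumes a hypothetical interface `score_fn` that returns per-domain scores; the returned accuracy figure is an added convenience, as the text only describes the per-corpus comparison.

```python
from typing import Callable, Dict, List, Tuple


def verify_model(
    score_fn: Callable[[str], Dict[str, float]],  # hypothetical model interface
    test_items: List[Tuple[str, str]],            # (testing corpus, second domain)
) -> Tuple[float, List[str]]:
    """Return the share of matches and the corpora whose domains differ."""
    mismatches = []
    for corpus, second_domain in test_items:
        scores = score_fn(corpus)
        # First domain: the domain with the highest score.
        first_domain = max(scores, key=scores.get)
        if first_domain != second_domain:
            mismatches.append(corpus)
    accuracy = 1 - len(mismatches) / len(test_items)
    return accuracy, mismatches
```

Any corpus in the returned mismatch list signals that the model's output is inaccurate for it and that further training may be needed.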

The above embodiment provides the way in which the domain prediction model is generated; the method for generating the binary classification model of the target domain is now described in detail.

Fig. 4 is a flowchart of a method for generating a binary classification model of a target domain provided by the embodiment of the present application.

As shown in figure 4, this method comprises:

S401, acquisition belong to the corpus of target domain and are not belonging to the corpus of target domain；

In the embodiment of the present application, the target domain may be the music domain, the weather domain, the geography domain, and so on. For each target domain, the embodiment can generate a corresponding binary classification model, that is, the binary classification model of that target domain. For example, a binary classification model of the music domain, a binary classification model of the weather domain, and a binary classification model of the geography domain can each be generated.

When generating the binary classification model of the target domain, training samples must first be obtained. Here the training samples are corpora that belong to the target domain and corpora that do not belong to the target domain.

S402: train a classification algorithm using the corpora that belong to the target domain as positive samples and the corpora that do not belong to the target domain as negative samples, to obtain the binary classification model of the target domain.

In the embodiment of the present application, when generating the binary classification model of the target domain, the corpora that belong to the target domain are treated as positive samples and the corpora that do not belong to the target domain are treated as negative samples. The classification algorithm is then trained on the positive and negative samples to obtain the binary classification model of the target domain.

The classification algorithm may be the XGBoost (eXtreme Gradient Boosting) algorithm. This is only a preferred choice of classification algorithm in the embodiment of the present application; the inventor may configure the specific classification algorithm as needed, and it is not limited here. For example, the classification algorithm may be the BERT algorithm, the SVM (Support Vector Machine) algorithm, the LR (Logistic Regression) algorithm, the LSTM (Long Short-Term Memory) algorithm, and so on.
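As an illustration of S401 and S402, the following Python sketch trains a binary classification (two-class) model from positive and negative samples. It swaps in LR, one of the alternatives listed above, rather than the preferred XGBoost, so that the example needs only the standard library. The whitespace tokenizer, the bag-of-words features, the hyperparameters and the class name `BinaryDomainClassifier` are all assumptions of this sketch.

```python
import math
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Whitespace tokenizer; an assumption made for this sketch."""
    return text.lower().split()


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class BinaryDomainClassifier:
    """Bag-of-words logistic regression trained by stochastic gradient descent."""

    def __init__(self, learning_rate: float = 0.5, epochs: int = 200) -> None:
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weights: Dict[str, float] = {}
        self.bias = 0.0

    def score(self, text: str) -> float:
        """Estimated probability that the corpus belongs to the target domain."""
        z = self.bias + sum(self.weights.get(tok, 0.0) for tok in tokenize(text))
        return sigmoid(z)

    def fit(self, positives: List[str], negatives: List[str]) -> "BinaryDomainClassifier":
        # Positive samples are labeled 1.0, negative samples 0.0.
        samples = [(t, 1.0) for t in positives] + [(t, 0.0) for t in negatives]
        for _ in range(self.epochs):
            for text, label in samples:
                error = self.score(text) - label  # gradient of the log loss w.r.t. z
                self.bias -= self.learning_rate * error
                for tok in tokenize(text):
                    self.weights[tok] = self.weights.get(tok, 0.0) - self.learning_rate * error
        return self
```

The `score` method plays the role of the model's score for a corpus in the target domain, which is later compared with the model's threshold.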

Further, the extension corpus mining method provided by the embodiment of the present application can use the binary classification model of the target domain to mine extension corpora of the target domain. After extension corpora of the target domain have been mined, they can also be used as positive samples to perform update training on the current binary classification model of the target domain.

The above embodiments describe in detail how the pre-trained domain prediction model and the binary classification model of the target domain are generated. The extension corpus mining method provided by the embodiment of the present application is now described in detail from the perspective of mining extension corpora of the target domain based on the pre-trained domain prediction model and the binary classification model of the target domain.

Fig. 5 is a flowchart of an extension corpus mining method provided by an embodiment of the present application.

As shown in Fig. 5, the method comprises:

S501: determine, according to the score that the pre-trained domain prediction model assigns to a corpus in the target domain, whether the corpus is a first candidate corpus of the target domain;

In the embodiment of the present application, the corpus may be a corpus that a user inputs into an intelligent product while using it.

When mining extension corpora of the target domain, the corpus can be input to the pre-trained domain prediction model to obtain the model's score for the corpus in the target domain. That is, the domain prediction model can output the probability that the corpus belongs to the target domain. For example, when the target domain is the music domain, the corpus is input to the pre-trained domain prediction model to obtain its score in the music domain, that is, the probability that the corpus belongs to the music domain. Based on that probability, it can then be determined whether the corpus is a first candidate corpus of the music domain.

In the embodiment of the present application, whether a corpus is a first candidate corpus of the music domain may be determined as follows: determine the threshold of the pre-trained domain prediction model for the music domain, and generate, according to a preset up-and-down floating interval value, a first threshold range associated with that threshold; then detect whether the score that the pre-trained domain prediction model assigns to the corpus in the music domain lies within the first threshold range. If so, the corpus is determined to be a first candidate corpus of the music domain; if not, the corpus is determined not to be a first candidate corpus of the music domain.

For example, suppose the threshold of the pre-trained domain prediction model for the music domain is 0.5, the score that the model assigns to the corpus in the music domain is 0.45, and the preset floating interval value is 0.1. The first threshold range associated with the music-domain threshold is then 0.4 to 0.6. Because the score of 0.45 lies within the range 0.4 to 0.6, the corpus is a first candidate corpus of the music domain.
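The range check in this example can be written as a short Python sketch. The function names are hypothetical, and treating both endpoints of the range as inside it is an assumption, since the text does not say whether the boundaries are included.

```python
from typing import Tuple


def first_threshold_range(threshold: float, interval: float) -> Tuple[float, float]:
    """Range obtained by floating the domain threshold up and down by `interval`."""
    if interval < 0:
        raise ValueError("interval must be non-negative")
    return threshold - interval, threshold + interval


def is_first_candidate(score: float, threshold: float, interval: float) -> bool:
    """True if the score is 'fuzzy', i.e. lies inside the first threshold range.

    Treating both endpoints as inside the range is an assumption of this sketch.
    """
    low, high = first_threshold_range(threshold, interval)
    return low <= score <= high
```

With the values from the example, `is_first_candidate(0.45, 0.5, 0.1)` is true, while a confident score such as 0.9 falls outside the range and is not treated as a first candidate corpus.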

S502: if the corpus is a first candidate corpus of the target domain, determine, from at least one corpus in a life-stylized corpus library, the second candidate corpus that has the highest similarity to the first candidate corpus;

To mine extension corpora more deeply, after the corpus has been determined to be a first candidate corpus of the target domain, more life-stylized corpora can be recalled based on the first candidate corpus, and the depth of extension corpus mining can then be improved based on those life-stylized corpora.

Specifically, a life-stylized corpus library can be set up in the embodiment of the present application. The corpora in the library lean toward everyday, colloquial language, and the library contains at least one corpus. The corpora in the library may come from question-answer pairs crawled from Sogou, question-answer pairs crawled from Baidu, and everyday chat sentences provided by some open-source platforms. The library can be updated periodically or in real time so that it stays close to current everyday language.

After the corpus has been determined to be a first candidate corpus of the target domain, the second candidate corpus with the highest similarity to the first candidate corpus can be determined from the at least one corpus in the life-stylized corpus library by retrieval through ES (ElasticSearch, a search server).

ES: ElasticSearch is a search server based on Lucene. It provides a distributed, multi-user full-text search engine with a RESTful web interface, and offers real-time search that is stable, reliable and fast, and is easy to install and use.
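The retrieval step can be sketched in Python as follows. This is not the ElasticSearch API: a simple token-overlap (Jaccard) similarity over an in-memory list is swapped in for the search server so that the example is self-contained, and the names `jaccard` and `most_similar` are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple


def token_set(text: str) -> Set[str]:
    """Lower-cased whitespace tokens; an assumption made for this sketch."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for a search engine's relevance score."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def most_similar(first_candidate: str, library: Iterable[str]) -> Optional[Tuple[str, float]]:
    """Return (corpus, similarity) for the library entry most similar to the first candidate."""
    best: Optional[Tuple[str, float]] = None
    for entry in library:
        sim = jaccard(first_candidate, entry)
        if best is None or sim > best[1]:
            best = (entry, sim)
    return best
```

In a deployment along the lines described above, the library would be indexed in the search server and the top-ranked hit for the first candidate corpus would be taken as the second candidate corpus.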

S503: determine, using the pre-trained binary classification model of the target domain, whether a candidate corpus is an extension corpus of the target domain, where the binary classification model is obtained by training a classification algorithm with corpora that belong to the target domain as positive samples and corpora that do not belong to the target domain as negative samples, and the candidate corpus includes the second candidate corpus.

In the embodiment of the present application, after the corpus has been determined to be a first candidate corpus of the target domain and the second candidate corpus with the highest similarity to the first candidate corpus has been determined from the life-stylized corpus library, the pre-trained binary classification model of the target domain can be used to determine whether the second candidate corpus is an extension corpus of the target domain.

Specifically, the pre-trained binary classification model of the target domain provides a threshold for the target domain. The second candidate corpus is input to the model to obtain the model's score for the second candidate corpus in the target domain (that is, the probability that the second candidate corpus belongs to the target domain). When the score is greater than the threshold, the second candidate corpus can be considered an extension corpus of the target domain; when the score is not greater than the threshold, the second candidate corpus can be considered not to be an extension corpus of the target domain.
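The threshold decision can be sketched in Python as below. The binary classification (two-class) model is represented by a hypothetical scoring function `score_fn`; the function names are assumptions of this sketch. The comparison is strict, matching the rule that a score equal to the threshold does not qualify.

```python
from typing import Callable, Iterable, List


def is_extension_corpus(score: float, threshold: float) -> bool:
    """Strictly greater than the threshold counts as an extension corpus; equal does not."""
    return score > threshold


def select_extension_corpora(
    candidates: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep the candidates that the classification model scores above the threshold."""
    return [c for c in candidates if is_extension_corpus(score_fn(c), threshold)]
```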

In the embodiment of the present application, after the binary classification model of the target domain has determined that the second candidate corpus is an extension corpus of the target domain, a user can further confirm whether the second candidate corpus really is an extension corpus of the target domain, to further ensure the accuracy of the mined extension corpora.

Further, in the extension corpus mining method provided by the embodiment of the present application, the first candidate corpus can also be input to the pre-trained binary classification model of the target domain, which provides a threshold for the target domain, to obtain the model's score for the first candidate corpus in the target domain (that is, the probability that the first candidate corpus belongs to the target domain). When the score is greater than the threshold, the first candidate corpus can be considered an extension corpus of the target domain; when the score is not greater than the threshold, the first candidate corpus can be considered not to be an extension corpus of the target domain.

In the embodiment of the present application, after the binary classification model of the target domain has determined that the first candidate corpus is an extension corpus of the target domain, a user can further confirm whether the first candidate corpus really is an extension corpus of the target domain, to further ensure the accuracy of the mined extension corpora.

The present application can use the pre-trained binary classification model of the target domain to determine whether a candidate corpus is an extension corpus of the target domain. The candidate corpus includes the second candidate corpus (that is, a second candidate corpus can be regarded as a candidate corpus), or the candidate corpus includes both the first candidate corpus and the second candidate corpus (that is, a first candidate corpus can be regarded as a candidate corpus, and a second candidate corpus can also be regarded as a candidate corpus).

In the embodiment of the present application, when the candidate corpus includes the second candidate corpus, after the binary classification model of the target domain has determined that the second candidate corpus is an extension corpus, a user can further confirm whether the second candidate corpus so determined really is an extension corpus of the target domain, and a user can also confirm whether the first candidate corpus really is an extension corpus of the target domain, to further ensure the accuracy of the mined extension corpora.

To describe the extension corpus mining method provided by the embodiment of the present application more clearly, the method for determining, according to the score that the pre-trained domain prediction model assigns to a corpus in the target domain, whether the corpus is a first candidate corpus of the target domain is now described in detail.

Fig. 6 is a flowchart of a method, provided by an embodiment of the present application, for determining whether a corpus is a first candidate corpus of the target domain according to the score that the pre-trained domain prediction model assigns to the corpus in the target domain.

As shown in Fig. 6, the method comprises:

S601: input the corpus to the pre-trained domain prediction model to obtain the domain prediction model's score for the corpus in the target domain;

S602: detect whether the domain prediction model's score for the corpus in the target domain lies within the first threshold range; if it does, execute step S603; if it does not, execute step S604;

In the embodiment of the present application, the first threshold range is associated with the threshold of the domain prediction model for the target domain.

S603: determine that the corpus is a first candidate corpus of the target domain;

S604: determine that the corpus is not a first candidate corpus of the target domain.

To describe the extension corpus mining method provided by the embodiment of the present application more clearly, the method for determining, using the pre-trained binary classification model of the target domain, whether a candidate corpus is an extension corpus of the target domain is now described in detail.

Fig. 7 is a flowchart of a method, provided by an embodiment of the present application, for determining whether a candidate corpus is an extension corpus of the target domain using the pre-trained binary classification model of the target domain.

As shown in Fig. 7, the method comprises:

S701: input the candidate corpus to the pre-trained binary classification model of the target domain to obtain the binary classification model's score for the candidate corpus;

S702: detect whether the binary classification model's score for the candidate corpus is greater than the threshold of the binary classification model for the target domain; if it is, execute step S703; if it is not, execute step S704;

S703: determine that the candidate corpus is an extension corpus of the target domain;

S704: determine that the candidate corpus is not an extension corpus of the target domain.

The present application provides a corpus mining method. Based on the score that the pre-trained domain prediction model assigns to a corpus in the target domain, it is determined whether the corpus is a fuzzy corpus of the target domain (that is, a first candidate corpus, which may or may not belong to the target domain). If the corpus is a first candidate corpus of the target domain, the first candidate corpus is expanded through the life-stylized corpus library to obtain the life-stylized second candidate corpus with the highest similarity to the first candidate corpus. The binary classification model then determines whether a candidate corpus (the candidate corpus includes the second candidate corpus) really is an extension corpus of the target domain. Because the present application does not need to match keywords, standard corpora or standard templates one by one, it takes less time than the prior art and improves the efficiency of extension corpus mining; and by expanding with the life-stylized second candidate corpus that is most similar to the first candidate corpus, it achieves deep mining of extension corpora.
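The three steps S501 to S503 can be tied together in one hedged Python sketch. The domain prediction model, the similarity retrieval and the binary classification (two-class) model are passed in as hypothetical callables, and the function name, the parameter names and the inclusive range check are assumptions of this sketch rather than details fixed by the application.

```python
from typing import Callable, List, Optional


def mine_extension_corpora(
    corpus: str,
    domain_score: Callable[[str], float],
    retrieve_most_similar: Callable[[str], Optional[str]],
    classifier_score: Callable[[str], float],
    domain_threshold: float,
    interval: float,
    classifier_threshold: float,
    include_first_candidate: bool = False,
) -> List[str]:
    """Run S501 to S503 for one corpus and return the extension corpora found."""
    # S501: keep only "fuzzy" corpora whose score is near the domain threshold.
    low = domain_threshold - interval
    high = domain_threshold + interval
    if not (low <= domain_score(corpus) <= high):
        return []

    # S502: expand with the most similar entry of the life-stylized corpus library.
    candidates: List[str] = []
    second = retrieve_most_similar(corpus)
    if second is not None:
        candidates.append(second)
    if include_first_candidate:
        candidates.append(corpus)

    # S503: a candidate is an extension corpus if its score exceeds the threshold.
    return [c for c in candidates if classifier_score(c) > classifier_threshold]
```

The `include_first_candidate` flag reflects the two variants described above, in which the candidate corpus is either the second candidate corpus alone or both the first and the second candidate corpus.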

As shown in figure 8, the device includes:

First candidate corpus determination unit 81, configured to determine, according to the score that the pre-trained domain prediction model assigns to a corpus in the target domain, whether the corpus is a first candidate corpus of the target domain;

Second candidate corpus determination unit 82, configured to, if the corpus is a first candidate corpus of the target domain, determine, from at least one corpus in the life-stylized corpus library, the second candidate corpus with the highest similarity to the first candidate corpus;

Extension corpus determination unit 83, configured to determine, using the pre-trained binary classification model of the target domain, whether a candidate corpus is an extension corpus of the target domain, where the binary classification model is obtained by training a classification algorithm with corpora that belong to the target domain as positive samples and corpora that do not belong to the target domain as negative samples, and the candidate corpus includes the second candidate corpus.

In the embodiment of the present application, preferably, the first candidate corpus determination unit comprises:

First scoring unit, configured to input the corpus to the pre-trained domain prediction model to obtain the domain prediction model's score for the corpus in the target domain;

First detection unit, configured to detect whether the domain prediction model's score for the corpus in the target domain lies within the first threshold range, the first threshold range being associated with the threshold of the domain prediction model for the target domain;

First determination unit, configured to determine that the corpus is a first candidate corpus of the target domain if the domain prediction model's score for the corpus in the target domain lies within the first threshold range;

Second determination unit, configured to determine that the corpus is not a first candidate corpus of the target domain if the domain prediction model's score for the corpus in the target domain does not lie within the first threshold range.

In the embodiment of the present application, preferably, the extension corpus determination unit comprises:

Second scoring unit, configured to input the candidate corpus to the pre-trained binary classification model of the target domain to obtain the binary classification model's score for the candidate corpus;

Second detection unit, configured to detect whether the binary classification model's score for the candidate corpus is greater than the threshold of the binary classification model for the target domain;

Third determination unit, configured to determine that the candidate corpus is an extension corpus of the target domain if the binary classification model's score for the candidate corpus is greater than the threshold of the binary classification model for the target domain;

Fourth determination unit, configured to determine that the candidate corpus is not an extension corpus of the target domain if the binary classification model's score for the candidate corpus is not greater than the threshold of the binary classification model for the target domain.

Further, the extension corpus mining device provided by the embodiment of the present application also includes a domain prediction model generation unit, which comprises:

First acquisition unit, configured to obtain at least one training sample, the at least one training sample including corpora that respectively belong to each of multiple domains;

Initial domain prediction model generation unit, configured to train a to-be-trained logistic regression algorithm based on the at least one training sample to obtain an initial domain prediction model;

Second acquisition unit, configured to obtain at least one corpus sample;

Third detection unit, configured to detect whether the initial domain prediction model's score for the corpus sample in a domain lies within a second threshold range, the second threshold range being associated with the threshold of the initial domain prediction model for that domain;

Fifth determination unit, configured to determine the corpus sample to be a target corpus sample of the domain if the initial domain prediction model's score for the corpus sample in the domain lies within the second threshold range;

Training sample generation unit, configured to generate, in response to a user's verification operation on the domain to which the target corpus sample belongs, a training sample corresponding to the target corpus sample;

Domain prediction model generation subunit, configured to perform update training on the initial domain prediction model based on the generated training sample to obtain the pre-trained domain prediction model.
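The selection of target corpus samples and the generation of training samples from user verification can be sketched in Python as follows. The sketch assumes that the second threshold range is built like the first one, by floating each domain's threshold up and down by an interval, and that the user's verification operation can be modeled as a callback returning whether the sample belongs to the domain. Both points, and all function names, are assumptions not stated in this passage.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def select_target_samples(
    samples: Iterable[str],
    domain_scores: Callable[[str], Dict[str, float]],
    thresholds: Dict[str, float],
    interval: float,
) -> List[Tuple[str, str]]:
    """Return (sample, domain) pairs whose score lies in that domain's second threshold range."""
    selected: List[Tuple[str, str]] = []
    for sample in samples:
        for domain, score in domain_scores(sample).items():
            threshold = thresholds[domain]
            # Assumed construction of the second threshold range.
            if threshold - interval <= score <= threshold + interval:
                selected.append((sample, domain))
    return selected


def build_training_samples(
    targets: Iterable[Tuple[str, str]],
    confirm: Callable[[str, str], bool],
) -> List[Tuple[str, str, bool]]:
    """Turn user verification results into labeled samples: (sample, domain, belongs)."""
    return [(sample, domain, confirm(sample, domain)) for sample, domain in targets]
```

Only the samples that the initial model is unsure about are sent for verification, which keeps the manual effort focused on the cases most likely to improve the updated model.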

Further, the extension corpus mining device provided by the embodiment of the present application also includes:

Domain prediction model update unit, configured to determine the extension corpus to be a training sample and to perform update training on the domain prediction model based on the determined training sample.

Further, the extension corpus mining device provided by the embodiment of the present application also includes a domain prediction model verification unit, which comprises:

Third acquisition unit, configured to obtain at least one test corpus, the test corpus carrying domain information;

Prediction unit, configured to predict the first domain to which the test corpus belongs according to the scores that the pre-trained domain prediction model assigns to the test corpus in each domain;

Verification unit, configured to verify the domain prediction model based on the predicted first domain to which the test corpus belongs and the second domain indicated by the domain information carried by the test corpus.

In the embodiment of the present application, preferably, the second candidate corpus determination unit is specifically configured to determine, by retrieval through a search server, the second candidate corpus with the highest similarity to the first candidate corpus from at least one corpus in the life-stylized corpus library.

Further, an embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to execute the extension corpus mining method described in the above embodiments.

For a detailed description of the program stored in the storage medium provided by the embodiment of the present application, refer to the above embodiments; it is not repeated here.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant details, refer to the description of the method.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.

The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An extension corpus mining method, characterized by comprising:

determining, according to the score that a pre-trained domain prediction model assigns to a corpus in a target domain, whether the corpus is a first candidate corpus of the target domain;

if the corpus is a first candidate corpus of the target domain, determining, from at least one corpus in a life-stylized corpus library, the second candidate corpus with the highest similarity to the first candidate corpus;

determining, using a pre-trained binary classification model of the target domain, whether a candidate corpus is an extension corpus of the target domain, the binary classification model being obtained by training a classification algorithm with corpora that belong to the target domain as positive samples and corpora that do not belong to the target domain as negative samples, and the candidate corpus including the second candidate corpus.

2. The method according to claim 1, characterized in that the determining, according to the score that the pre-trained domain prediction model assigns to the corpus in the target domain, whether the corpus is a first candidate corpus of the target domain comprises:

inputting the corpus to the pre-trained domain prediction model to obtain the domain prediction model's score for the corpus in the target domain;

detecting whether the domain prediction model's score for the corpus in the target domain lies within a first threshold range, the first threshold range being associated with the threshold of the domain prediction model for the target domain;

if the domain prediction model's score for the corpus in the target domain lies within the first threshold range, determining that the corpus is a first candidate corpus of the target domain;

if the domain prediction model's score for the corpus in the target domain does not lie within the first threshold range, determining that the corpus is not a first candidate corpus of the target domain.

3. The method according to claim 1, characterized in that the determining, using the pre-trained binary classification model of the target domain, whether a candidate corpus is an extension corpus of the target domain comprises:

inputting the candidate corpus to the pre-trained binary classification model of the target domain to obtain the binary classification model's score for the candidate corpus;

detecting whether the binary classification model's score for the candidate corpus is greater than the threshold of the binary classification model for the target domain;

if the binary classification model's score for the candidate corpus is greater than the threshold of the binary classification model for the target domain, determining that the candidate corpus is an extension corpus of the target domain;

if the binary classification model's score for the candidate corpus is not greater than the threshold of the binary classification model for the target domain, determining that the candidate corpus is not an extension corpus of the target domain.

4. The method according to claim 1, characterized by further comprising a domain prediction model generation process, the process comprising:

obtaining at least one training sample, the at least one training sample including corpora that respectively belong to each of multiple domains;

training a to-be-trained logistic regression algorithm based on the at least one training sample to obtain an initial domain prediction model;

obtaining at least one corpus sample;

detecting whether the initial domain prediction model's score for the corpus sample in a domain lies within a second threshold range, the second threshold range being associated with the threshold of the initial domain prediction model for the domain;

if the initial domain prediction model's score for the corpus sample in the domain lies within the second threshold range, determining the corpus sample to be a target corpus sample of the domain;

generating, in response to a user's verification operation on the domain to which the target corpus sample belongs, a training sample corresponding to the target corpus sample;

performing update training on the initial domain prediction model based on the generated training sample to obtain the pre-trained domain prediction model.

5. The method according to claim 4, characterized by further comprising:

determining the extension corpus to be a training sample, and performing update training on the domain prediction model based on the determined training sample.

6. The method according to any one of claims 1 to 5, characterized by further comprising:

obtaining at least one test corpus, the test corpus carrying domain information;

predicting the first domain to which the test corpus belongs according to the scores that the pre-trained domain prediction model assigns to the test corpus in each domain;

verifying the domain prediction model based on the predicted first domain to which the test corpus belongs and the second domain indicated by the domain information carried by the test corpus.

7. The method according to claim 1, characterized in that the determining, from at least one corpus in the life-stylized corpus library, the second candidate corpus with the highest similarity to the first candidate corpus comprises: determining, by retrieval through a search server, the second candidate corpus with the highest similarity to the first candidate corpus from at least one corpus in the life-stylized corpus library.

8. An extension corpus mining device, characterized by comprising:

a first candidate corpus determination unit, configured to determine, according to the score that a pre-trained domain prediction model assigns to a corpus in a target domain, whether the corpus is a first candidate corpus of the target domain;

a second candidate corpus determination unit, configured to, if the corpus is a first candidate corpus of the target domain, determine, from at least one corpus in a life-stylized corpus library, the second candidate corpus with the highest similarity to the first candidate corpus;

an extension corpus determination unit, configured to determine, using a pre-trained binary classification model of the target domain, whether a candidate corpus is an extension corpus of the target domain, the binary classification model being obtained by training a classification algorithm with corpora that belong to the target domain as positive samples and corpora that do not belong to the target domain as negative samples, and the candidate corpus including the second candidate corpus.

9. A server, characterized by comprising: at least one memory and at least one processor; the memory stores a program, the processor calls the program stored in the memory, and the program is used to implement the extension corpus mining method according to any one of claims 1 to 7.

10. A storage medium, characterized in that the storage medium stores computer-executable instructions, and the computer-executable instructions are used to execute the extension corpus mining method according to any one of claims 1 to 7.