CN110188204A - Extension corpus mining method, apparatus, server, and storage medium
- Publication number
- CN110188204A
- CN201910501365.5A
- Authority
- CN
- China
- Prior art keywords
- corpus
- candidate
- domain
- target domain
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides an extension corpus mining method, apparatus, server, and storage medium. A pre-trained domain prediction model scores a corpus in a target domain, and the score determines whether the corpus is a fuzzy corpus of the target domain (that is, a first candidate corpus, which may or may not belong to the target domain). If the corpus is a first candidate corpus of the target domain, it is expanded using an everyday-language (life-stylized) corpus library to obtain the everyday-language second candidate corpus with the highest similarity to the first candidate corpus. A binary classification model then determines whether a candidate corpus (the candidate corpora include the second candidate corpus) really is an extension corpus of the target domain. Because the application does not need to match keywords, standard corpora, or standard templates one by one, it takes less time than the prior art and improves mining efficiency. Because it expands each first candidate corpus with the most similar everyday-language second candidate corpus, it also mines extension corpora in greater depth.
Description
Technical field
The present invention relates to the technical field of corpus mining, and more specifically to an extension corpus mining method, apparatus, server, and storage medium.
Background technique
In building out a domain, the domain prediction model plays a very important role: it predicts the domain to which a corpus belongs and so provides the technical basis for making a product intelligent. The capability of a domain prediction model usually depends on its corpus samples. Among corpus samples, extension corpora have a decisive effect on the generalization and recall of the domain prediction model. An extension corpus is a corpus that belongs to a domain but is uncommon in that domain.
When mining the extension corpora of a domain, the prior art most commonly uses keyword mining, corpus similarity mining, and template similarity mining. Keyword mining takes the entities of the domain as keywords and recalls extension corpora by keyword (for example, if a keyword of the music domain is the measure word used for songs, keyword mining may recall the extension corpus "play a song"). Corpus similarity mining determines that a corpus is an extension corpus of the domain when the corpus matches any standard corpus in the corpus library of the domain. Template similarity mining replaces the entities in a corpus with variables to obtain a corpus template, and determines that the corpus is an extension corpus of the domain when the corpus template matches any standard template in the template library of the domain.
Although the prior art can mine extension corpora, it usually has the following problems: 1. keywords, standard corpora, or standard templates must be matched one by one, which takes a long time and makes extension corpus mining inefficient; 2. the mined extension corpora tend toward homogeneity, that is, they stay close to the keywords, to the standard corpora in the corpus library, or to the standard templates in the template library, so extension corpora cannot be mined in depth.
Summary of the invention
In view of this, to solve the above problems, the present invention provides an extension corpus mining method, apparatus, server, and storage medium that reduce the time taken to mine extension corpora, improve mining efficiency, and mine extension corpora in depth. The technical solution is as follows:
An extension corpus mining method, comprising:
determining, according to the score that a pre-trained domain prediction model gives a corpus in a target domain, whether the corpus is a first candidate corpus of the target domain;
if the corpus is a first candidate corpus of the target domain, determining, from at least one corpus of an everyday-language (life-stylized) corpus library, the second candidate corpus with the highest similarity to the first candidate corpus;
determining, using a pre-trained binary classification model of the target domain, whether a candidate corpus is an extension corpus of the target domain, wherein the binary classification model is obtained by training a classification algorithm with corpora that belong to the target domain as positive samples and corpora that do not belong to the target domain as negative samples, and the candidate corpora include the second candidate corpus.
An extension corpus mining apparatus, comprising:
a first candidate corpus determination unit, configured to determine, according to the score that a pre-trained domain prediction model gives a corpus in a target domain, whether the corpus is a first candidate corpus of the target domain;
a second candidate corpus determination unit, configured to determine, if the corpus is a first candidate corpus of the target domain, from at least one corpus of an everyday-language corpus library, the second candidate corpus with the highest similarity to the first candidate corpus;
an extension corpus determination unit, configured to determine, using a pre-trained binary classification model of the target domain, whether a candidate corpus is an extension corpus of the target domain, wherein the binary classification model is obtained by training a classification algorithm with corpora that belong to the target domain as positive samples and corpora that do not belong to the target domain as negative samples, and the candidate corpora include the second candidate corpus.
A server, comprising at least one memory and at least one processor, wherein the memory stores a program and the processor calls the program stored in the memory, the program being used to implement the extension corpus mining method.
A storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the extension corpus mining method.
The application provides an extension corpus mining method, apparatus, server, and storage medium. A pre-trained domain prediction model scores a corpus in a target domain, and the score determines whether the corpus is a fuzzy corpus of the target domain (that is, a first candidate corpus, which may or may not belong to the target domain). If the corpus is a first candidate corpus of the target domain, it is expanded using an everyday-language corpus library to obtain the everyday-language second candidate corpus with the highest similarity to the first candidate corpus. A binary classification model then determines whether a candidate corpus (the candidate corpora include the second candidate corpus) really is an extension corpus of the target domain. Because the application does not need to match keywords, standard corpora, or standard templates one by one, it takes less time than the prior art and improves mining efficiency. Because it expands each first candidate corpus with the most similar everyday-language second candidate corpus, it also mines extension corpora in greater depth.
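The three stages summarized above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `mine_extension_corpus`, `domain_score`, and `is_in_domain` are hypothetical, the token-overlap similarity is a stand-in because the text does not name a similarity measure at this point, and checking the first candidate alongside the second is an assumption.

```python
from typing import Callable, Iterable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumed stand-in for the unspecified measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def mine_extension_corpus(
    corpus: str,
    domain_score: Callable[[str], float],  # hypothetical: score in the target domain
    score_range: Tuple[float, float],      # fuzzy band of scores for the target domain
    everyday_corpora: Iterable[str],       # the everyday-language corpus library
    is_in_domain: Callable[[str], bool],   # hypothetical: binary classification model
) -> List[str]:
    low, high = score_range
    # Stage 1: only a corpus with a fuzzy score becomes a first candidate corpus.
    if not (low <= domain_score(corpus) <= high):
        return []
    # Stage 2: the most similar everyday-language corpus is the second candidate.
    second = max(everyday_corpora, key=lambda c: jaccard(corpus, c), default=None)
    # Assumption: the first candidate is checked together with the second.
    candidates = [corpus] if second is None else [corpus, second]
    # Stage 3: the binary classification model keeps true extension corpora.
    return [c for c in candidates if is_in_domain(c)]
```

Passing the two models in as callables keeps the sketch independent of how they are trained.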
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from the drawings provided without creative effort.
Fig. 1 is a hardware block diagram of a server provided by an embodiment of the present application;
Fig. 2 is a flowchart of a method for generating a domain prediction model provided by an embodiment of the present application;
Fig. 3 is a flowchart of a method for verifying a domain prediction model provided by an embodiment of the present application;
Fig. 4 is a flowchart of a method for generating a binary classification model of a target domain provided by an embodiment of the present application;
Fig. 5 is a flowchart of an extension corpus mining method provided by an embodiment of the present application;
Fig. 6 is a flowchart of a method, provided by an embodiment of the present application, for determining whether a corpus is a first candidate corpus of a target domain according to the score that a pre-trained domain prediction model gives the corpus in the target domain;
Fig. 7 is a flowchart of a method, provided by an embodiment of the present application, for determining whether a candidate corpus is an extension corpus of a target domain using a pre-trained binary classification model of the target domain;
Fig. 8 is a schematic structural diagram of an extension corpus mining apparatus provided by an embodiment of the present application.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment:
The embodiment of the present application provides an extension corpus mining method. The method avoids the problems of the prior art when mining extension corpora: long mining time, low mining efficiency, and mined extension corpora that are homogeneous with the keywords, standard corpora, or standard templates and are therefore not mined in depth.
To make the extension corpus mining method provided by the embodiment of the present application easier to understand, extension corpora are explained first.
A corpus can be understood as a user's search statement, including the user's voice, text, and picture input.
An extension corpus is a corpus that belongs to a domain but is uncommon in that domain. For example, in the music domain, commonly spoken corpora are "I want to listen to a song", "play a hit song", "play a piece of music", and so on, so the common keywords of the music domain are words such as "song", "listen", "music", and "play". In real life, however, the ways people ask and the needs they express are diverse, and we cannot require everyone to ask for a song in these terms. There are a thousand Hamlets in a thousand people's eyes: even for the same need, people have countless ways of saying it. People in different scenarios may have the same need but phrase it differently. For example, someone who wants to listen to music may say "I think jazz suits this weather", "I'm feeling rather low, play something cheerful", or "I prefer the blues style". The intent behind each of these is clearly to listen to a song, yet none of them contains a common keyword. The first sentence, "I think jazz suits this weather", is particularly easy for the weather domain to capture, because it also contains the weather-domain keyword "weather". Mining extension corpora therefore plays a very important role in drawing clear domain boundaries and in improving the accuracy with which the domain prediction model classifies the domain of a corpus.
Mining extension corpora is very important to how intelligent an intelligent product can become. If we want an intelligent product to understand users better, to stay close to what users mean, and to grasp users' real needs in different contexts, then mining extension corpora more deeply and more efficiently is the only way forward.
The domain prediction model mentioned above can be regarded as a semantic classifier: a classifier, learned with a deep learning algorithm, that predicts which domain and intent a corpus belongs to.
When a corpus is input to the domain prediction model, the model outputs the probability of the corpus in each domain, and the domain to which the corpus belongs is then determined from these probabilities.
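As a small illustration of the last step, the domain can be read off the per-domain probabilities by taking the highest one. The function name `predict_domain` and the score values are illustrative assumptions, not part of the original text.

```python
def predict_domain(scores):
    """Return the domain whose probability is highest."""
    return max(scores, key=scores.get)


# Illustrative per-domain probabilities for one corpus.
scores = {"weather": 0.55, "music": 0.9, "geography": 0.45}
print(predict_domain(scores))  # music
```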
The above explains the basic concepts behind the extension corpus mining method. How extension corpora are mined is explained further below.
The extension corpus mining method provided by the embodiment of the present application can be applied to a server. The server may be a service device on the network side that provides services to users, and may be a server cluster composed of multiple servers or a single server.
Optionally, Fig. 1 shows a hardware block diagram of the server. Referring to Fig. 1, the hardware structure of the server may include: a processor 11, a communication interface 12, a memory 13, and a communication bus 14;
In the embodiment of the present invention, there may be at least one each of the processor 11, the communication interface 12, the memory 13, and the communication bus 14, and the processor 11, the communication interface 12, and the memory 13 communicate with one another through the communication bus 14;
The processor 11 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiment of the present invention;
The memory 13 may include high-speed RAM and may also include non-volatile memory, for example at least one disk memory;
The memory stores a program that the processor can call, and the program is used for:
determining, according to the score that a pre-trained domain prediction model gives a corpus in a target domain, whether the corpus is a first candidate corpus of the target domain;
if the corpus is a first candidate corpus of the target domain, determining, from at least one corpus of an everyday-language corpus library, the second candidate corpus with the highest similarity to the first candidate corpus;
determining, using a pre-trained binary classification model of the target domain, whether a candidate corpus is an extension corpus of the target domain, wherein the binary classification model is obtained by training a classification algorithm with corpora that belong to the target domain as positive samples and corpora that do not belong to the target domain as negative samples, and the candidate corpora include the second candidate corpus.
Optionally, the refined and extended functions of the program are described below.
To make the extension corpus mining method applied to the above server easier to understand, the method provided by the embodiment of the present application is now described in detail.
The extension corpus mining method provided by the embodiment of the present application uses a pre-trained domain prediction model and a pre-trained binary classification model. How the domain prediction model and the binary classification model are generated is explained first.
The domain prediction model can be regarded as a semantic classifier used to predict the domain to which a corpus belongs. The pre-trained domain prediction model is produced by the generation process described below.
Fig. 2 is a flowchart of a method for generating a domain prediction model provided by the embodiment of the present application.
As shown in Fig. 2, the method includes:
S201, obtaining at least one training sample, the at least one training sample including corpora that belong to each of multiple domains;
The domain prediction model can be regarded as comprising many submodels, each corresponding to a different domain. When the domain prediction model predicts on a corpus, it obtains, for each domain, the probability that the corpus belongs to that domain, and the domain with the highest probability is then determined to be the domain of the corpus.
To generate the domain prediction model, at least one training sample must be obtained. Each training sample can be regarded as one corpus, and the at least one training sample includes corpora that belong to each of the multiple domains. For example, when the multiple domains are the weather domain, the music domain, and the geography domain, the at least one training sample includes corpora of the weather domain, corpora of the music domain, and corpora of the geography domain.
S202, training a logistic regression algorithm to be trained on the at least one training sample to obtain an initial domain prediction model;
In the embodiment of the present application, the logistic regression algorithm to be trained can be trained on the at least one training sample to obtain an initial domain prediction model. The initial domain prediction model can already predict the domain of a corpus. To make its predictions more accurate, the embodiment of the present application can train the initial domain prediction model further to obtain a domain prediction model that predicts the domain of a corpus more accurately. The specific training process is described in steps S203-S207 below.
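Steps S201-S202 can be sketched in plain Python as one binary logistic-regression submodel per domain. This is a minimal sketch under stated assumptions: the bag-of-words features, the learning rate, the number of epochs, and the sample sentences are all illustrative choices that the original text does not specify, and `train_one_vs_rest` and `score` are hypothetical names.

```python
import math
from collections import defaultdict


def featurize(text):
    """Bag-of-words features: the set of lowercase tokens (assumed representation)."""
    return set(text.lower().split())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_one_vs_rest(samples, epochs=200, lr=0.5):
    """Train one logistic-regression submodel per domain.

    samples: list of (corpus, domain) pairs. Returns {domain: (weights, bias)}.
    """
    models = {}
    for domain in sorted({d for _, d in samples}):
        weights, bias = defaultdict(float), 0.0
        for _ in range(epochs):
            for text, label in samples:
                feats = featurize(text)
                p = sigmoid(bias + sum(weights[f] for f in feats))
                error = (1.0 if label == domain else 0.0) - p
                # Stochastic gradient step on the log-likelihood.
                bias += lr * error
                for f in feats:
                    weights[f] += lr * error
        models[domain] = (weights, bias)
    return models


def score(models, text):
    """Score of a corpus in every domain (probability from each submodel)."""
    feats = featurize(text)
    return {d: sigmoid(b + sum(w.get(f, 0.0) for f in feats))
            for d, (w, b) in models.items()}


samples = [
    ("will it rain tomorrow", "weather"),
    ("how hot is it today", "weather"),
    ("play a song", "music"),
    ("i want to listen to music", "music"),
    ("where is the nearest river", "geography"),
    ("which country is north of here", "geography"),
]
models = train_one_vs_rest(samples)
scores = score(models, "play some music")
print(max(scores, key=scores.get))
```

In practice a library implementation of logistic regression would replace the hand-written training loop; the loop is spelled out here only to keep the sketch self-contained.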
S203, obtaining at least one corpus sample;
In the embodiment of the present application, at least one corpus sample can be obtained. A corpus sample can be regarded as a corpus that a user enters into an intelligent product while using it.
S204, detecting whether the score that the initial domain prediction model gives the corpus sample in a domain lies within a second threshold range, the second threshold range being related to the threshold of the initial domain prediction model for that domain;
In the embodiment of the present application, training the logistic regression algorithm to be trained on the at least one training sample yields the initial domain prediction model, and also yields the threshold of the initial domain prediction model for each domain. For example, when the multiple domains are the weather domain, the music domain, and the geography domain, the at least one training sample includes corpora of the weather domain, corpora of the music domain, and corpora of the geography domain. Training the logistic regression algorithm on these samples yields the initial domain prediction model, which provides a threshold for the weather domain, a threshold for the music domain, and a threshold for the geography domain. For example, the threshold of the weather domain is 0.6, the threshold of the music domain is 0.7, and the threshold of the geography domain is 0.4.
The extension corpus mining method provided by the embodiment of the present application aims to find the extension corpora of a domain, that is, corpora that belong to a domain but are uncommon in it. In which score interval are such corpora likely to be distributed? Researchers have found that the interval is related to the threshold of the initial domain prediction model for the domain: it is the interval near the threshold of the domain. For example, the threshold of the weather domain is 0.6, and corpora whose probability in the weather domain is near 0.6 are all very fuzzy and hard to tell apart. They may belong to the weather domain, or they may not. The corpora in this interval may be the extension corpora we need, so we need to extract them. For example, we can preset an interval threshold of 0.1 for floating above and below the domain threshold. The second threshold range related to the threshold of the weather domain is then 0.5-0.7, the second threshold range related to the threshold of the music domain is 0.6-0.8, and the second threshold range related to the threshold of the geography domain is 0.3-0.5. The above is only a preferred value of the interval threshold provided by the embodiment of the present application. The inventor can set the specific value of the interval threshold as needed, for example 0.11, 0.2, or 0.25, and it is not limited here.
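The second threshold range described above is the domain threshold plus or minus the interval threshold. The following sketch computes it for the three example domains; the function name `second_threshold_range` is hypothetical, and clipping the range to [0, 1] is an added assumption that the original text does not state.

```python
def second_threshold_range(threshold, interval=0.1):
    """Band of fuzzy scores around a domain threshold.

    Clipping to [0, 1] is an assumption; the source only gives threshold +/- interval.
    """
    return max(0.0, threshold - interval), min(1.0, threshold + interval)


# Domain thresholds from the example in the text.
thresholds = {"weather": 0.6, "music": 0.7, "geography": 0.4}
ranges = {d: second_threshold_range(t) for d, t in thresholds.items()}

for domain, (low, high) in ranges.items():
    print(f"{domain}: {low:.1f}-{high:.1f}")
```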
The corpus sample is input to the initial domain prediction model to obtain the score that the model gives the corpus sample in the weather domain (that is, the probability that the corpus sample belongs to the weather domain, for example 0.55), the score in the music domain (that is, the probability that the corpus sample belongs to the music domain, for example 0.9), and the score in the geography domain (that is, the probability that the corpus sample belongs to the geography domain, for example 0.45).
S205, if the score that the initial domain prediction model gives the corpus sample in the domain lies within the second threshold range, determining the corpus sample to be a target corpus sample of the domain;
From the detailed description of step S204 above: the score that the initial domain prediction model gives the corpus sample in the weather domain is 0.55, and the second threshold range related to the threshold of the weather domain is 0.5-0.7. The score lies within this range, so the corpus sample is determined to be a target corpus sample of the weather domain. The score in the music domain is 0.9, and the second threshold range related to the threshold of the music domain is 0.6-0.8. The score does not lie within this range, so the corpus sample is determined not to be a target corpus sample of the music domain. The score in the geography domain is 0.45, and the second threshold range related to the threshold of the geography domain is 0.3-0.5. The score lies within this range, so the corpus sample is determined to be a target corpus sample of the geography domain.
Further, in the embodiment of the present application, if the score that the initial domain prediction model gives the corpus sample in a domain does not lie within the second threshold range, the corpus sample is determined not to be a target corpus sample of that domain, and no training sample corresponding to the corpus sample is generated.
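The check in steps S204-S205 can be written as a filter over the per-domain scores. The sketch below reuses the numbers from the worked example; the function name `target_domains` is hypothetical, and treating both ends of the range as inclusive is an assumption.

```python
def target_domains(scores, ranges):
    """Domains in which the score lies within that domain's second threshold range.

    Both bounds are treated as inclusive, which is an assumption.
    """
    return [d for d, s in scores.items() if ranges[d][0] <= s <= ranges[d][1]]


# Second threshold ranges and scores from the worked example.
ranges = {"weather": (0.5, 0.7), "music": (0.6, 0.8), "geography": (0.3, 0.5)}
scores = {"weather": 0.55, "music": 0.9, "geography": 0.45}

print(target_domains(scores, ranges))  # ['weather', 'geography']
```

An empty result means the corpus sample is not a target corpus sample of any domain, so no training sample is generated for it.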
S206, in response to a user's labeling operation on the domain of the target corpus sample, generating a training sample corresponding to the target corpus sample;
From the detailed description of step S205 above, the corpus sample is determined to be a target corpus sample of the weather domain and a target corpus sample of the geography domain. This result can be displayed so that the user decides whether the target corpus sample really is a target corpus of the weather domain. If so, the user labels the target corpus sample as belonging to the weather domain. In response to the user's labeling operation on the target corpus sample, a training sample corresponding to the target corpus sample is generated; this training sample is a corpus sample labeled as belonging to the weather domain. Likewise, the user can decide whether the target corpus sample really is a target corpus of the geography domain. If so, the user labels the target corpus sample as belonging to the geography domain. In response to the user's labeling operation on the target corpus sample, a training sample corresponding to the target corpus sample is generated; this training sample is a corpus sample labeled as belonging to the geography domain.
In the embodiment of the present application, if the user determines that the target corpus sample belongs to both the weather domain and the geography domain, one training sample corresponding to the weather domain and one training sample corresponding to the geography domain can be generated from the target corpus sample.
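Step S206 can be sketched as turning the user's confirmations into labeled samples, one per confirmed domain. The function name `make_training_samples` and the example sentence are illustrative assumptions.

```python
def make_training_samples(corpus, candidate_domains, confirmed_domains):
    """One (corpus, domain) training sample for each candidate domain the user confirmed."""
    return [(corpus, d) for d in candidate_domains if d in confirmed_domains]


# The corpus sample was a target corpus sample of two domains; the user confirmed both.
print(make_training_samples(
    "is it cold up in the mountains",
    ["weather", "geography"],
    {"weather", "geography"},
))
```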
S207, performing update training on the initial domain prediction model based on the generated training samples to obtain the pre-trained domain prediction model.
After generating the training samples, the domain prediction model generation method provided by the embodiment of the present application can further perform update training on the initial domain prediction model with the generated training samples to obtain the pre-trained domain prediction model.
Further, to improve the efficiency with which the pre-trained domain prediction model provided by the embodiment of the present application processes corpora, techniques such as memory optimization and multiprocessing can also be used.
The domain prediction model pre-trained in the embodiment of the present application makes it possible to mine the extension corpora of a target domain. After the extension corpora of the target domain are mined, they can also be used as training samples to perform further update training on the domain prediction model.
In the embodiment of the present application, the target domain may be the weather domain, the music domain, the geography domain, and so on. After the extension corpora of the music domain are mined, for example, they can be used as training samples to perform further update training on the domain prediction model.
Further, after generating the pre-trained domain prediction model, the embodiment of the present application can also verify the generated domain prediction model to check whether its output is accurate.
Fig. 3 is a flowchart of a method for verifying a domain prediction model provided by the embodiment of the present application.
As shown in Fig. 3, the method includes:
S301, obtaining at least one test corpus, the test corpus carrying domain information;
In the embodiment of the present application, the extension corpora determined for the target domain are used as test corpora to verify the domain prediction model. In this case, the second domain indicated by the domain information that an extension corpus of the target domain carries is the target domain.
S302, predicting the first domain to which the test corpus belongs according to the scores that the pre-trained domain prediction model gives the test corpus in each domain;
In the embodiment of the present application, the test corpus can be input to the domain prediction model to obtain its score in each domain, and the domain with the highest score is determined to be the first domain to which the test corpus belongs.
For example, suppose the pre-trained domain prediction model was obtained by training a logistic regression algorithm on corpora of the music domain, corpora of the geography domain, and corpora of the weather domain. After a test corpus (whose domain information indicates that the second domain is the music domain) is input to the pre-trained domain prediction model, the result includes score 1 of the test corpus in the music domain, score 2 in the weather domain, and score 3 in the geography domain. If score 2 is the highest of the three, the first domain to which the test corpus belongs is taken to be the weather domain. Comparison shows that the first domain (the weather domain) differs from the second domain (the music domain), which means that the output of the pre-trained domain prediction model is inaccurate and the model needs further training.
Conversely, suppose the pre-trained domain prediction model was obtained by training a logistic
regression algorithm on corpora of the music domain, the geography domain and the weather domain, and
the same test corpus (whose domain information indicates that the second domain is the music domain)
is input into it, yielding score 1 in the music domain, score 2 in the weather domain, and score 3 in
the geography domain. If score 1 is the highest of the three, the first domain to which the test
corpus belongs is considered to be the music domain. Comparison shows that the first domain (music) and
the second domain (music) are the same, which indicates that the output of the pre-trained domain
prediction model is accurate.
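The two cases above can be sketched in a few lines of Python. This is only an illustration: the
function names and the numeric scores are hypothetical, since the text does not give concrete values
for score 1, score 2 and score 3.

```python
def predict_first_domain(scores):
    """Return the domain with the highest score (the 'first domain')."""
    return max(scores, key=scores.get)


def prediction_matches_label(scores, second_domain):
    """Compare the predicted first domain with the labeled second domain."""
    return predict_first_domain(scores) == second_domain


# Hypothetical scores for a test corpus labeled as belonging to the music domain.
case_inaccurate = {"music": 0.20, "weather": 0.70, "geography": 0.10}
case_accurate = {"music": 0.80, "weather": 0.15, "geography": 0.05}

print(predict_first_domain(case_inaccurate), prediction_matches_label(case_inaccurate, "music"))
print(predict_first_domain(case_accurate), prediction_matches_label(case_accurate, "music"))
```

In the first case the model would be flagged for further training; in the second its output agrees
with the label.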
S303: verify the domain prediction model based on the predicted first domain to which the test corpus
belongs and the second domain indicated by the domain information carried by the test corpus.
In this embodiment of the application, the pre-trained domain prediction model can be verified with at
least one test sentence, so that problems with the model are discovered promptly and the accuracy of
its output is ensured, which in turn improves the accuracy of the extension corpus mining method
provided by this embodiment.
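Verification over several test sentences might look like the following sketch. The keyword-based
`toy_score_fn` is a hypothetical stand-in for the pre-trained domain prediction model, and reporting
an accuracy figure plus the list of mismatches is an assumption about how the check could be
summarized, not something the text prescribes.

```python
def toy_score_fn(text):
    """Hypothetical stand-in for the domain prediction model: keyword counts per domain."""
    keywords = {
        "music": ("song", "play"),
        "weather": ("rain", "sunny"),
        "geography": ("river", "capital"),
    }
    words = text.lower().split()
    return {domain: sum(w in kws for w in words) for domain, kws in keywords.items()}


def verify_domain_model(score_fn, test_corpora):
    """Return (accuracy, mismatches) for labeled test corpora given as (text, second_domain)."""
    if not test_corpora:
        raise ValueError("at least one test corpus is required")
    mismatches = []
    for text, second_domain in test_corpora:
        scores = score_fn(text)
        first_domain = max(scores, key=scores.get)  # highest-scoring domain
        if first_domain != second_domain:
            mismatches.append((text, second_domain, first_domain))
    correct = len(test_corpora) - len(mismatches)
    return correct / len(test_corpora), mismatches


test_set = [
    ("play a song", "music"),
    ("will it rain tomorrow", "weather"),
    ("is it sunny", "music"),  # deliberately mislabeled to show a mismatch
]
accuracy, mismatches = verify_domain_model(toy_score_fn, test_set)
```

A non-empty mismatch list is the signal that the model may need further training.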
The above embodiments describe how the domain prediction model is generated. The method for generating
the binary classification model of the target domain is now described in detail.
Fig. 4 is a flowchart of a method for generating a binary classification model of a target domain
provided by an embodiment of the application.
As shown in Fig. 4, the method includes:
S401, acquisition belong to the corpus of target domain and are not belonging to the corpus of target domain;
In the embodiment of the present application, target domain can be music field, can be weather field, or geography neck
Domain etc..The embodiment of the present application can generate two disaggregated models corresponding with the target domain for different target domains, that is,
Two disaggregated models of the target domain.For example, two disaggregated models of music field can be generated, the two of weather field can be generated
Two disaggregated models of geographic territory etc. can be generated in disaggregated model.
When generating two disaggregated model of target domain, it is necessary first to obtain training sample, the training sample is to belong at this time
In target domain corpus and be not belonging to the corpus of target domain.
S402: train a classification algorithm with the corpora belonging to the target domain as positive
samples and the corpora not belonging to the target domain as negative samples, to obtain the binary
classification model of the target domain.
In this embodiment of the application, when the binary classification model of the target domain is
generated, corpora belonging to the target domain are treated as positive samples and corpora not
belonging to it as negative samples, and the classification algorithm is trained on these positive and
negative samples to obtain the binary classification model of the target domain.
The classification algorithm may be the XGBoost (eXtreme Gradient Boosting) algorithm. This is only the
preferred classification algorithm of this embodiment; the inventor can choose the specific
classification algorithm as needed, and no limitation is imposed here. For example, the classification
algorithm may be the BERT algorithm, the SVM (Support Vector Machine) algorithm, the LR (Logistic
Regression) algorithm, the LSTM (Long Short-Term Memory network) algorithm, and so on.
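As a rough illustration of S401 and S402, the sketch below trains a binary classifier from positive
and negative samples. To stay dependency-free it uses logistic regression (LR), one of the
alternatives listed above, rather than the preferred XGBoost. The bag-of-words features, the
hyperparameters and the sample sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def featurize(text):
    """Bag-of-words features: lowercase whitespace tokens with their counts."""
    return Counter(text.lower().split())


def sigmoid(z):
    # Clamp z so that math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def score(model, text):
    """Probability-like score that `text` belongs to the target domain."""
    weights, bias = model
    z = bias + sum(weights.get(w, 0.0) * c for w, c in featurize(text).items())
    return sigmoid(z)


def train_binary_classifier(positives, negatives, epochs=200, learning_rate=0.5):
    """Train logistic regression with stochastic gradient descent on log loss."""
    samples = [(featurize(t), 1.0) for t in positives]   # target-domain corpora
    samples += [(featurize(t), 0.0) for t in negatives]  # out-of-domain corpora
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in samples:
            z = bias + sum(weights.get(w, 0.0) * c for w, c in features.items())
            error = sigmoid(z) - label  # gradient of the log loss with respect to z
            bias -= learning_rate * error
            for w, c in features.items():
                weights[w] = weights.get(w, 0.0) - learning_rate * error * c
    return weights, bias


# Hypothetical training data for a music-domain model.
music_positives = ["play a song", "play some jazz music", "next song please"]
music_negatives = ["will it rain tomorrow", "what is the capital of france", "is it sunny today"]
music_model = train_binary_classifier(music_positives, music_negatives)
```

Calling `score(music_model, text)` then yields a value between 0 and 1 that can be compared with the
model's gate threshold for the target domain.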
Further, the extension corpus mining method provided by this embodiment of the application can use the
binary classification model of the target domain to mine extension corpora of the target domain. After
an extension corpus of the target domain has been mined, it can also be used as a positive sample to
further update the training of the current binary classification model of the target domain.
The above embodiments describe in detail how the pre-trained domain prediction model and the binary
classification model of the target domain are generated. The extension corpus mining method provided
by this embodiment is now described in detail from the perspective of mining extension corpora of the
target domain with the pre-trained domain prediction model and the binary classification model of the
target domain.
Fig. 5 is a flowchart of an extension corpus mining method provided by an embodiment of the
application.
As shown in Fig. 5, the method includes:
S501: determine whether a corpus belongs to the first candidate corpus of the target domain according
to the score that the pre-trained domain prediction model assigns to the corpus in the target domain;
In this embodiment of the application, the corpus may be a corpus that a user inputs into a smart
product while using that product.
When extension corpora of the target domain are mined, the corpus can be input into the pre-trained
domain prediction model to obtain the model's score for the corpus in the target domain. In other
words, the domain prediction model can output the probability that the corpus belongs to the target
domain. For example, when the target domain is the music domain, the corpus is input into the
pre-trained domain prediction model to obtain its score in the music domain, that is, the probability
that the corpus belongs to the music domain. Based on this probability, it can then be determined
whether the corpus belongs to the first candidate corpus of the music domain.
In this embodiment of the application, whether the corpus belongs to the first candidate corpus of the
music domain may be determined as follows: determine the gate threshold of the pre-trained domain
prediction model for the music domain, and generate a first threshold range around that gate threshold
using a preset interval threshold that floats above and below it; detect whether the score that the
pre-trained domain prediction model assigns to the corpus in the music domain lies within the first
threshold range; if so, determine that the corpus belongs to the first candidate corpus of the music
domain; if not, determine that the corpus does not belong to the first candidate corpus of the music
domain.
For example, suppose the gate threshold of the pre-trained domain prediction model for the music domain
is 0.5 and the model's score for the corpus in the music domain is 0.45. If the preset interval
threshold that floats above and below is 0.1, the first threshold range generated around the gate
threshold of the music domain is 0.4 to 0.6. The score of 0.45 lies within this range of 0.4 to 0.6,
so the corpus is a first candidate corpus of the music domain.
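The worked example above (gate threshold 0.5, floating interval 0.1, score 0.45) can be expressed as
the following sketch. The function names are hypothetical, and treating the range boundaries as
inclusive is an assumption, since the text does not say how boundary scores are handled.

```python
def first_threshold_range(gate_threshold, interval):
    """Build the range that floats `interval` above and below the gate threshold."""
    return gate_threshold - interval, gate_threshold + interval


def is_first_candidate(score, gate_threshold=0.5, interval=0.1):
    """A corpus is a first candidate when its score lies within the range."""
    low, high = first_threshold_range(gate_threshold, interval)
    return low <= score <= high  # inclusive bounds are an assumption


print(is_first_candidate(0.45))  # score from the example in the text
print(is_first_candidate(0.90))  # clearly in the domain, so not an ambiguous candidate
```

Scores far from the gate threshold on either side fall outside the range, so only corpora the model
is unsure about are passed on to the next step.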
S502: if the corpus belongs to the first candidate corpus of the target domain, determine, from at
least one corpus in an everyday-language corpus library, the second candidate corpus that has the
highest similarity to the first candidate corpus;
To increase the depth of extension corpus mining, after the corpus has been determined to be a first
candidate corpus of the target domain, more everyday-language corpora can be recalled based on the
first candidate corpus, and these everyday-language corpora are then used to deepen the mining of
extension corpora.
Specifically, an everyday-language corpus library can be set up in this embodiment of the application.
The corpora in this library lean toward everyday language, and the library contains at least one
corpus. In this embodiment, the corpora in the library may come from corpora crawled from Sogou
question-and-answer pairs, corpora crawled from Baidu question-and-answer pairs, and everyday chat
sentences provided by some open-source platforms. The library can be updated periodically or in real
time so that it stays close to current everyday speech.
After the corpus has been determined to be a first candidate corpus of the target domain, the second
candidate corpus with the highest similarity to the first candidate corpus can be determined from the
at least one corpus in the everyday-language corpus library by retrieval through ES (ElasticSearch, a
search server).
ES: ElasticSearch is a search server based on Lucene. It provides a distributed, multi-user-capable
full-text search engine with a RESTful web interface, and offers real-time search that is stable,
reliable and fast, as well as easy to install and use.
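The retrieval step can be illustrated without a running search server. The sketch below is not
ElasticSearch: it substitutes a character-level similarity ratio from Python's standard `difflib`
module for the search engine's ranking, purely to show the idea of picking the single most similar
everyday-language corpus. The function name and the sample library are hypothetical.

```python
from difflib import SequenceMatcher


def most_similar_corpus(first_candidate, everyday_library):
    """Return the library entry most similar to the first candidate corpus."""
    if not everyday_library:
        raise ValueError("the everyday-language corpus library must not be empty")

    def similarity(entry):
        # Ratio in [0, 1]; a stand-in for the relevance score of a search engine.
        return SequenceMatcher(None, first_candidate.lower(), entry.lower()).ratio()

    return max(everyday_library, key=similarity)


everyday_library = [
    "put on something relaxing to listen to",
    "do I need an umbrella today",
    "how far is the nearest river",
]
second_candidate = most_similar_corpus("play something relaxing", everyday_library)
```

In a deployment along the lines the text describes, the library would be indexed in the search server
and the first candidate corpus would be sent as a query, with the top hit taken as the second
candidate corpus.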
S503: determine, using the pre-trained binary classification model of the target domain, whether a
candidate corpus is an extension corpus of the target domain, where the binary classification model is
obtained by training a classification algorithm with corpora belonging to the target domain as
positive samples and corpora not belonging to the target domain as negative samples, and the candidate
corpus includes the second candidate corpus.
In this embodiment of the application, after the corpus has been determined to be a first candidate
corpus of the target domain and the second candidate corpus with the highest similarity to it has been
determined from the everyday-language corpus library, the pre-trained binary classification model of
the target domain can be used to determine whether the second candidate corpus is an extension corpus
of the target domain.
Specifically, the pre-trained binary classification model of the target domain provides a gate
threshold for that domain. The second candidate corpus is input into the model to obtain the model's
score for the second candidate corpus in the target domain (that is, the probability that the second
candidate corpus belongs to the target domain). When the score is greater than the gate threshold, the
second candidate corpus is considered an extension corpus of the target domain; when the score is not
greater than the gate threshold, the second candidate corpus is not considered an extension corpus of
the target domain.
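The decision rule is a strict comparison against the gate threshold, as sketched below. The
dictionary of scores stands in for the binary classification model and its values are hypothetical;
only the "greater than" versus "not greater than" rule comes from the text.

```python
def select_extension_corpora(candidates, score_fn, gate_threshold):
    """Keep the candidates whose score is strictly greater than the gate threshold."""
    return [c for c in candidates if score_fn(c) > gate_threshold]


# Hypothetical classifier scores for three candidate corpora.
hypothetical_scores = {
    "put on something relaxing": 0.8,
    "borderline sentence": 0.5,
    "do I need an umbrella": 0.2,
}
extensions = select_extension_corpora(
    list(hypothetical_scores), hypothetical_scores.get, gate_threshold=0.5
)
```

Note that a score exactly equal to the gate threshold is rejected, because the text accepts a
candidate only when its score is greater than the threshold.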
In this embodiment of the application, after the binary classification model of the target domain has
determined that the second candidate corpus is an extension corpus of the target domain, a user can
further confirm whether the second candidate corpus really is an extension corpus of the target
domain, which further ensures the accuracy of the mined extension corpora.
In this embodiment of the application, the pre-trained binary classification model of the target
domain provides a gate threshold for that domain. Further, in the extension corpus mining method
provided by this embodiment, the first candidate corpus can also be input into the pre-trained binary
classification model of the target domain to obtain the model's score for the first candidate corpus
in the target domain (that is, the probability that the first candidate corpus belongs to the target
domain). When the score is greater than the gate threshold, the first candidate corpus is considered
an extension corpus of the target domain; when the score is not greater than the gate threshold, the
first candidate corpus is not considered an extension corpus of the target domain.
In this embodiment of the application, after the binary classification model of the target domain has
determined that the first candidate corpus is an extension corpus of the target domain, a user can
further confirm whether the first candidate corpus really is an extension corpus of the target domain,
which further ensures the accuracy of the mined extension corpora.
The application can use the pre-trained binary classification model of the target domain to determine
whether a candidate corpus is an extension corpus of the target domain. Here the candidate corpus
includes the second candidate corpus (that is, a second candidate corpus is treated as a candidate
corpus), or the candidate corpus includes both the first candidate corpus and the second candidate
corpus (that is, a first candidate corpus is treated as a candidate corpus, and a second candidate
corpus is also treated as a candidate corpus).
In this embodiment of the application, when the candidate corpus includes only the second candidate
corpus and the binary classification model of the target domain has determined the second candidate
corpus to be an extension corpus, a user can further confirm whether that second candidate corpus
really is an extension corpus of the target domain, and can also confirm whether the first candidate
corpus really is an extension corpus of the target domain, which further ensures the accuracy of the
mined extension corpora.
To explain the extension corpus mining method provided by this embodiment more clearly, the step of
determining whether a corpus belongs to the first candidate corpus of the target domain, according to
the score that the pre-trained domain prediction model assigns to the corpus in the target domain, is
now described in detail.
Fig. 6 is a flowchart of a method, provided by an embodiment of the application, for determining
whether a corpus belongs to the first candidate corpus of the target domain according to the score
that the pre-trained domain prediction model assigns to the corpus in the target domain.
As shown in Fig. 6, the method includes:
S601: input the corpus into the pre-trained domain prediction model to obtain the model's score for
the corpus in the target domain;
S602: detect whether the score that the domain prediction model assigns to the corpus in the target
domain lies within the first threshold range; if it does, perform step S603; if it does not, perform
step S604;
In this embodiment of the application, the first threshold range is related to the gate threshold of
the domain prediction model for the target domain.
S603: determine that the corpus belongs to the first candidate corpus of the target domain;
S604: determine that the corpus does not belong to the first candidate corpus of the target domain.
To explain the extension corpus mining method provided by this embodiment more clearly, the step of
determining whether a candidate corpus is an extension corpus of the target domain, using the
pre-trained binary classification model of the target domain, is now described in detail.
Fig. 7 is a flowchart of a method, provided by an embodiment of the application, for determining
whether a candidate corpus is an extension corpus of the target domain using the pre-trained binary
classification model of the target domain.
As shown in Fig. 7, the method includes:
S701: input the candidate corpus into the pre-trained binary classification model of the target domain
to obtain the model's score for the candidate corpus;
S702: detect whether the score that the binary classification model assigns to the candidate corpus is
greater than the model's gate threshold for the target domain; if the score is greater than the gate
threshold, perform step S703; if the score is not greater than the gate threshold, perform step S704;
S703: determine that the candidate corpus is an extension corpus of the target domain;
S704: determine that the candidate corpus is not an extension corpus of the target domain.
The application provides a corpus mining method. Based on the score that the pre-trained domain
prediction model assigns to a corpus in the target domain, the method determines whether the corpus is
an ambiguous corpus of the target domain (that is, a first candidate corpus, which may or may not
belong to the target domain). If the corpus is a first candidate corpus of the target domain, the
first candidate corpus is expanded through the everyday-language corpus library to obtain the
everyday-language second candidate corpus with the highest similarity to the first candidate corpus.
The binary classification model then determines whether the candidate corpus (which includes the
second candidate corpus) really is an extension corpus of the target domain. Because the application
does not need to match keywords, standard corpora or standard templates one by one, it takes less time
than the prior art and improves the efficiency of extension corpus mining. In addition, expanding the
first candidate corpus with the most similar everyday-language second candidate corpus achieves deep
mining of extension corpora.
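The overall flow summarized above (S501 to S503) can be sketched as a small pipeline. The three
models are passed in as plain callables, so every function name, parameter and stub value below is an
assumption for illustration; the ordering of the steps and the two threshold checks follow the text.

```python
def mine_extension_corpora(
    corpora,
    domain_score_fn,       # stand-in for the pre-trained domain prediction model
    retrieve_fn,           # stand-in for retrieval from the everyday-language library
    classifier_score_fn,   # stand-in for the binary classification model
    domain_gate=0.5,
    interval=0.1,
    classifier_gate=0.5,
    include_first_candidate=False,
):
    """Return the extension corpora mined from `corpora` for one target domain."""
    extensions = []
    for corpus in corpora:
        # S501: keep only ambiguous corpora whose score is near the gate threshold.
        score = domain_score_fn(corpus)
        if not (domain_gate - interval <= score <= domain_gate + interval):
            continue
        # S502: fetch the most similar everyday-language corpus.
        candidates = [retrieve_fn(corpus)]
        if include_first_candidate:
            candidates.append(corpus)  # optional variant described in the text
        # S503: accept candidates that the binary classifier scores above its gate.
        for candidate in candidates:
            if classifier_score_fn(candidate) > classifier_gate and candidate not in extensions:
                extensions.append(candidate)
    return extensions


# Hypothetical stubs standing in for the three models.
domain_scores = {"u1": 0.45, "u2": 0.95, "u3": 0.55}
nearest = {"u1": "d1", "u3": "d3"}
classifier_scores = {"d1": 0.9, "d3": 0.3, "u1": 0.7, "u3": 0.2}

second_only = mine_extension_corpora(
    ["u1", "u2", "u3"], domain_scores.get, nearest.get, classifier_scores.get
)
first_and_second = mine_extension_corpora(
    ["u1", "u2", "u3"], domain_scores.get, nearest.get, classifier_scores.get,
    include_first_candidate=True,
)
```

The optional user confirmation described earlier would sit after this function, reviewing each
returned corpus before it is added to the training data.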
Fig. 8 is a schematic structural diagram of an extension corpus mining device provided by an embodiment
of the application.
As shown in Fig. 8, the device includes:
a first candidate corpus determination unit 81, configured to determine whether a corpus belongs to
the first candidate corpus of the target domain according to the score that the pre-trained domain
prediction model assigns to the corpus in the target domain;
a second candidate corpus determination unit 82, configured to, if the corpus belongs to the first
candidate corpus of the target domain, determine from at least one corpus in the everyday-language
corpus library the second candidate corpus with the highest similarity to the first candidate corpus;
an extension corpus determination unit 83, configured to determine, using the pre-trained binary
classification model of the target domain, whether a candidate corpus is an extension corpus of the
target domain, where the binary classification model is obtained by training a classification
algorithm with corpora belonging to the target domain as positive samples and corpora not belonging to
the target domain as negative samples, and the candidate corpus includes the second candidate corpus.
In this embodiment of the application, preferably, the first candidate corpus determination unit
includes:
a first scoring unit, configured to input the corpus into the pre-trained domain prediction model to
obtain the model's score for the corpus in the target domain;
a first detection unit, configured to detect whether the score that the domain prediction model
assigns to the corpus in the target domain lies within a first threshold range, the first threshold
range being related to the gate threshold of the domain prediction model for the target domain;
a first determination unit, configured to determine that the corpus belongs to the first candidate
corpus of the target domain if the score lies within the first threshold range;
a second determination unit, configured to determine that the corpus does not belong to the first
candidate corpus of the target domain if the score does not lie within the first threshold range.
In this embodiment of the application, preferably, the extension corpus determination unit includes:
a second scoring unit, configured to input the candidate corpus into the pre-trained binary
classification model of the target domain to obtain the model's score for the candidate corpus;
a second detection unit, configured to detect whether the score that the binary classification model
assigns to the candidate corpus is greater than the model's gate threshold for the target domain;
a third determination unit, configured to determine that the candidate corpus is an extension corpus
of the target domain if the score is greater than the gate threshold;
a fourth determination unit, configured to determine that the candidate corpus is not an extension
corpus of the target domain if the score is not greater than the gate threshold.
Further, the extension corpus mining device provided by this embodiment of the application also includes
a domain prediction model generation unit, which includes:
a first acquisition unit, configured to obtain at least one training sample, the at least one training
sample including corpora that respectively belong to each of multiple domains;
an initial domain prediction model generation unit, configured to train a to-be-trained logistic
regression algorithm on the at least one training sample to obtain an initial domain prediction model;
a second acquisition unit, configured to obtain at least one corpus sample;
a third detection unit, configured to detect whether the score that the initial domain prediction
model assigns to the corpus sample in a domain lies within a second threshold range, the second
threshold range being related to the gate threshold of the initial domain prediction model for that
domain;
a fifth determination unit, configured to determine the corpus sample to be a target corpus sample of
the domain if the score lies within the second threshold range;
a training sample generation unit, configured to generate a training sample corresponding to the
target corpus sample in response to a user's operation of verifying the domain to which the target
corpus sample belongs;
a domain prediction model generation subunit, configured to update the training of the initial domain
prediction model based on the generated training sample to obtain the pre-trained domain prediction
model.
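The sample-selection part of this generation process might be sketched as follows. Everything named
here is hypothetical: in particular, building the second threshold range by floating an interval
around the gate threshold mirrors the first threshold range and is an assumption, because the text
only says the range is related to the gate threshold. The `confirm_domain_fn` callback stands in for
the user's verification operation.

```python
def select_target_samples(corpus_samples, score_fn, gate_threshold=0.5, interval=0.1):
    """Pick corpus samples whose score falls within the second threshold range."""
    low, high = gate_threshold - interval, gate_threshold + interval
    return [s for s in corpus_samples if low <= score_fn(s) <= high]


def build_training_samples(target_samples, confirm_domain_fn):
    """Pair each selected sample with the domain confirmed by the user."""
    return [(sample, confirm_domain_fn(sample)) for sample in target_samples]


# Hypothetical scores from the initial domain prediction model for one domain.
initial_scores = {"s1": 0.48, "s2": 0.05, "s3": 0.57, "s4": 0.99}
# Hypothetical user decisions recorded during verification.
user_labels = {"s1": "music", "s3": "weather"}

targets = select_target_samples(list(initial_scores), initial_scores.get)
new_training_samples = build_training_samples(targets, user_labels.get)
```

The resulting labeled pairs would then be fed back into training to update the initial model, which
concentrates the user's effort on the samples the model is least certain about.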
Further, the extension corpus mining device provided by this embodiment of the application also
includes:
a domain prediction model update unit, configured to determine the extension corpus to be a training
sample and to update the training of the domain prediction model based on the determined training
sample.
Further, the extension corpus mining device provided by this embodiment of the application also includes
a domain prediction model verification unit, which includes:
a third acquisition unit, configured to obtain at least one test corpus, the test corpus carrying
domain information;
a prediction unit, configured to predict the first domain to which the test corpus belongs according
to the scores that the pre-trained domain prediction model assigns to the test corpus in each domain;
a verification unit, configured to verify the domain prediction model based on the predicted first
domain to which the test corpus belongs and the second domain indicated by the domain information
carried by the test corpus.
In this embodiment of the application, preferably, the second candidate corpus determination unit is
specifically configured to determine, by retrieval through a search server, the second candidate
corpus with the highest similarity to the first candidate corpus from at least one corpus in the
everyday-language corpus library.
Further, an embodiment of the application also provides a computer-readable storage medium that stores
computer-executable instructions, the computer-executable instructions being used to execute the
extension corpus mining method described in the above embodiments.
For a detailed description of the program stored in the storage medium provided by this embodiment,
refer to the above embodiments; it is not repeated here.
The application provides a corpus mining method, apparatus, server and storage medium. Based on the
score that the pre-trained domain prediction model assigns to a corpus in the target domain, it is
determined whether the corpus is an ambiguous corpus of the target domain (that is, a first candidate
corpus, which may or may not belong to the target domain). If the corpus is a first candidate corpus
of the target domain, the first candidate corpus is expanded through the everyday-language corpus
library to obtain the everyday-language second candidate corpus with the highest similarity to the
first candidate corpus. The binary classification model then determines whether the candidate corpus
(which includes the second candidate corpus) really is an extension corpus of the target domain.
Because the application does not need to match keywords, standard corpora or standard templates one by
one, it takes less time than the prior art and improves the efficiency of extension corpus mining. In
addition, expanding the first candidate corpus with the most similar everyday-language second
candidate corpus achieves deep mining of extension corpora.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on
its differences from the other embodiments, and for the same or similar parts the embodiments may be
referred to one another. Because the device disclosed in the embodiments corresponds to the method
disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the
description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples
described in connection with the embodiments disclosed herein can be implemented in electronic
hardware, computer software, or a combination of the two. To clearly illustrate the
interchangeability of hardware and software, the composition and steps of each example have been
described above in general terms of their functions. Whether these functions are implemented in
hardware or software depends on the specific application and the design constraints of the technical
solution. Those skilled in the art may use different methods to implement the described functions for
each specific application, but such implementations should not be considered to go beyond the scope of
the invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein can
be implemented directly in hardware, in a software module executed by a processor, or in a combination
of the two. The software module may reside in random access memory (RAM), memory, read-only memory
(ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk,
a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention.
Various modifications to these embodiments will be readily apparent to those skilled in the art, and
the general principles defined herein may be implemented in other embodiments without departing from
the spirit or scope of the invention. Therefore, the invention is not intended to be limited to the
embodiments shown herein, but is to be accorded the widest scope consistent with the principles and
novel features disclosed herein.
Claims (10)
1. An extension corpus mining method, comprising:
determining whether a corpus belongs to a first candidate corpus of a target domain according to the
score that a pre-trained domain prediction model assigns to the corpus in the target domain;
if the corpus belongs to the first candidate corpus of the target domain, determining, from at least
one corpus in an everyday-language corpus library, a second candidate corpus with the highest
similarity to the first candidate corpus;
determining, using a pre-trained binary classification model of the target domain, whether a candidate
corpus is an extension corpus of the target domain, wherein the binary classification model is
obtained by training a classification algorithm with corpora belonging to the target domain as
positive samples and corpora not belonging to the target domain as negative samples, and the candidate
corpus includes the second candidate corpus.
2. The method according to claim 1, wherein determining whether the corpus belongs to the first
candidate corpus of the target domain according to the score that the pre-trained domain prediction
model assigns to the corpus in the target domain comprises:
inputting the corpus into the pre-trained domain prediction model to obtain the score that the domain
prediction model assigns to the corpus in the target domain;
detecting whether the score that the domain prediction model assigns to the corpus in the target
domain lies within a first threshold range, the first threshold range being related to the gate
threshold of the domain prediction model for the target domain;
if the score lies within the first threshold range, determining that the corpus belongs to the first
candidate corpus of the target domain;
if the score does not lie within the first threshold range, determining that the corpus does not
belong to the first candidate corpus of the target domain.
3. The method according to claim 1, wherein determining, using the pre-trained binary classification
model of the target domain, whether the candidate corpus is an extension corpus of the target domain
comprises:
inputting the candidate corpus into the pre-trained binary classification model of the target domain
to obtain the score that the binary classification model assigns to the candidate corpus;
detecting whether the score that the binary classification model assigns to the candidate corpus is
greater than the gate threshold of the binary classification model for the target domain;
if the score is greater than the gate threshold of the binary classification model for the target
domain, determining that the candidate corpus is an extension corpus of the target domain;
if the score is not greater than the gate threshold of the binary classification model for the target
domain, determining that the candidate corpus is not an extension corpus of the target domain.
4. The method according to claim 1, further comprising a domain prediction model generation process,
the process comprising:
obtaining at least one training sample, the at least one training sample including corpora that
respectively belong to each of multiple domains;
training a to-be-trained logistic regression algorithm on the at least one training sample to obtain
an initial domain prediction model;
obtaining at least one corpus sample;
detecting whether the score that the initial domain prediction model assigns to the corpus sample in a
domain lies within a second threshold range, the second threshold range being related to the gate
threshold of the initial domain prediction model for the domain;
if the score that the initial domain prediction model assigns to the corpus sample in the domain lies
within the second threshold range, determining the corpus sample to be a target corpus sample of the
domain;
in response to a user's operation of verifying the domain to which the target corpus sample belongs,
generating a training sample corresponding to the target corpus sample;
updating the training of the initial domain prediction model based on the generated training sample to
obtain the pre-trained domain prediction model.
5. The method according to claim 4, further comprising:
determining the extension corpus to be a training sample, and updating the training of the domain
prediction model based on the determined training sample.
6. The method according to any one of claims 1-5, characterized in that it further comprises:
Obtaining at least one test corpus, the test corpus carrying domain information;
Predicting a first domain to which the test corpus belongs according to the pre-trained domain prediction model's scores for the test corpus in each domain;
Verifying the domain prediction model based on the predicted first domain to which the test corpus belongs and a second domain indicated by the domain information carried by the test corpus.
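One way to read the verification step above is as an accuracy check: take the highest-scoring domain as the predicted first domain and compare it with the labelled second domain. The claim does not name a metric, so the use of accuracy, the function names, and the keyword-based `toy_scores` stand-in for the domain prediction model are all assumptions of this sketch.

```python
def predict_domain(scores):
    """Pick the domain with the highest score as the predicted (first) domain."""
    return max(scores, key=scores.get)


def verify_model(score_fn, test_corpora):
    """test_corpora: (text, labelled_domain) pairs. Returns the fraction of
    test corpora whose predicted domain matches the labelled (second) domain."""
    if not test_corpora:
        raise ValueError("at least one test corpus is required")
    correct = sum(1 for text, labelled in test_corpora
                  if predict_domain(score_fn(text)) == labelled)
    return correct / len(test_corpora)


def toy_scores(text):
    """Stand-in for the domain prediction model: one score per domain."""
    return {"music": 0.9 if "song" in text else 0.1,
            "weather": 0.9 if "rain" in text else 0.2}


tests = [("play a song", "music"),
         ("will it rain", "weather"),
         ("sing in the rain", "music")]  # the toy model gets this one wrong
print(verify_model(toy_scores, tests))
```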
7. The method according to claim 1, characterized in that determining, from at least one corpus of the everyday-life corpus, a second candidate corpus having the highest similarity to the first candidate corpus comprises: retrieving, by means of a search server, from the at least one corpus of the everyday-life corpus, the second candidate corpus having the highest similarity to the first candidate corpus.
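The retrieval step above relies on a search server, which the claim does not specify further. As a self-contained stand-in, the sketch below ranks everyday-life corpora by Jaccard token overlap with the first candidate corpus and returns the best match. The similarity measure and the function names are assumptions chosen for illustration; a production system would delegate this ranking to the search server.

```python
def jaccard(a, b):
    """Token-set overlap between two utterances, in the range [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def most_similar(first_candidate, life_corpora):
    """Return the everyday-life corpus most similar to the first candidate,
    or None when there is nothing to search."""
    if not life_corpora:
        return None
    return max(life_corpora, key=lambda c: jaccard(first_candidate, c))


life_corpora = ["put on a song by the beatles please",
                "is it going to rain",
                "play chess"]
print(most_similar("play a song by the beatles", life_corpora))
```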
8. An extension corpus mining apparatus, characterized by comprising:
A first candidate corpus determination unit, configured to determine, according to a pre-trained domain prediction model's score for a corpus in a target domain, whether the corpus is a first candidate corpus of the target domain;
A second candidate corpus determination unit, configured to, if the corpus is a first candidate corpus of the target domain, determine, from at least one corpus of the everyday-life corpus, a second candidate corpus having the highest similarity to the first candidate corpus;
An extension corpus determination unit, configured to determine, using a pre-trained binary classification model of the target domain, whether a candidate corpus is an extension corpus of the target domain, the binary classification model being obtained by training a classification algorithm with corpora belonging to the target domain as positive samples and corpora not belonging to the target domain as negative samples, the candidate corpus including the second candidate corpus.
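The three units of the apparatus can be pictured as one small pipeline. The sketch below is a hedged illustration, not the claimed implementation: the class name `ExtensionCorpusMiner`, the rule that a corpus becomes a first candidate when its domain score exceeds `candidate_threshold`, and the toy scoring lambdas are all assumptions. The models and the retrieval function are injected as plain callables so that any real implementation could be substituted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ExtensionCorpusMiner:
    domain_score: Callable[[str], float]    # domain prediction model, target-domain score
    retrieve_similar: Callable[[str, List[str]], Optional[str]]  # similarity retrieval
    binary_score: Callable[[str], float]    # target-domain binary classification model
    candidate_threshold: float              # hypothetical first-candidate cut-off
    binary_threshold: float                 # binary classifier threshold for the domain

    def mine(self, corpus: str, life_corpora: List[str]) -> Optional[str]:
        # Unit 1: is the corpus a first candidate corpus of the target domain?
        if self.domain_score(corpus) <= self.candidate_threshold:
            return None
        # Unit 2: find the most similar everyday-life corpus (second candidate).
        second = self.retrieve_similar(corpus, life_corpora)
        if second is None:
            return None
        # Unit 3: the binary classifier decides whether it is an extension corpus.
        return second if self.binary_score(second) > self.binary_threshold else None


miner = ExtensionCorpusMiner(
    domain_score=lambda t: 0.9 if "song" in t else 0.1,
    retrieve_similar=lambda t, pool: pool[0] if pool else None,
    binary_score=lambda t: 0.95 if "tune" in t else 0.2,
    candidate_threshold=0.5,
    binary_threshold=0.8,
)
print(miner.mine("play a song", ["put on a tune"]))
```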
9. A server, characterized by comprising: at least one memory and at least one processor; the memory stores a program, the processor calls the program stored in the memory, and the program is used to implement the extension corpus mining method according to any one of claims 1-7.
10. A storage medium, characterized in that computer-executable instructions are stored in the storage medium, the computer-executable instructions being used to perform the extension corpus mining method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910501365.5A CN110188204B (en) | 2019-06-11 | 2019-06-11 | Extended corpus mining method and device, server and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910501365.5A CN110188204B (en) | 2019-06-11 | 2019-06-11 | Extended corpus mining method and device, server and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110188204A (en) | 2019-08-30 |
CN110188204B (en) | 2022-10-04 |
Family
ID=67721230
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910501365.5A Active CN110188204B (en) | 2019-06-11 | 2019-06-11 | Extended corpus mining method and device, server and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110188204B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110852109A (en) * | 2019-11-11 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Corpus generating method, corpus generating device, and storage medium |
CN111091011A (en) * | 2019-12-20 | 2020-05-01 | 科大讯飞股份有限公司 | Domain prediction method, domain prediction device and electronic equipment |
CN111339309A (en) * | 2020-05-22 | 2020-06-26 | 支付宝(杭州)信息技术有限公司 | Corpus expansion method and system for user intention |
CN111599349A (en) * | 2020-04-01 | 2020-08-28 | 云知声智能科技股份有限公司 | Method and system for training language model |
CN112052320A (en) * | 2020-09-01 | 2020-12-08 | 腾讯科技(深圳)有限公司 | Information processing method and device and computer readable storage medium |
CN112487810A (en) * | 2020-12-17 | 2021-03-12 | 税友软件集团股份有限公司 | Intelligent customer service method, device, equipment and storage medium |
CN113392647A (en) * | 2020-11-25 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Corpus generation method, related device, computer equipment and storage medium |
CN113538075A (en) * | 2020-04-14 | 2021-10-22 | 阿里巴巴集团控股有限公司 | Data processing method, model training method, device and equipment |
CN114822483A (en) * | 2021-01-19 | 2022-07-29 | 美的集团(上海)有限公司 | Data enhancement method, device, equipment and storage medium |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070073534A1 (en) * | 2005-09-29 | 2007-03-29 | International Business Machines Corporation | Corpus expansion system and method thereof |
CN104346406A (en) * | 2013-08-08 | 2015-02-11 | 北大方正集团有限公司 | Training corpus expanding device and training corpus expanding method |
CN104391942A (en) * | 2014-11-25 | 2015-03-04 | 中国科学院自动化研究所 | Short text characteristic expanding method based on semantic atlas |
CN104484411A (en) * | 2014-12-16 | 2015-04-01 | 中国科学院自动化研究所 | Building method for semantic knowledge base based on a dictionary |
US20170004224A1 (en) * | 2015-07-02 | 2017-01-05 | International Business Machines Corporation | Log-aided automatic query expansion approach based on topic modeling |
CN106777274A (en) * | 2016-06-16 | 2017-05-31 | 北京理工大学 | Chinese tourism domain knowledge graph construction method and system |
CN107526725A (en) * | 2017-09-04 | 2017-12-29 | 北京百度网讯科技有限公司 | Method and apparatus for generating text based on artificial intelligence |
CN108491462A (en) * | 2018-03-05 | 2018-09-04 | 昆明理工大学 | Semantic query expansion method and device based on word2vec |
CN108804512A (en) * | 2018-04-20 | 2018-11-13 | 平安科技(深圳)有限公司 | Text classification model generation device, method, and computer-readable storage medium |
CN109284397A (en) * | 2018-09-27 | 2019-01-29 | 深圳大学 | Domain lexicon construction method, device, equipment and storage medium |
WO2019019860A1 (en) * | 2017-07-24 | 2019-01-31 | 华为技术有限公司 | Method and apparatus for training classification model |
2019-06-11: CN application CN201910501365.5A (patent CN110188204B, en), status Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070073534A1 (en) * | 2005-09-29 | 2007-03-29 | International Business Machines Corporation | Corpus expansion system and method thereof |
CN104346406A (en) * | 2013-08-08 | 2015-02-11 | 北大方正集团有限公司 | Training corpus expanding device and training corpus expanding method |
CN104391942A (en) * | 2014-11-25 | 2015-03-04 | 中国科学院自动化研究所 | Short text characteristic expanding method based on semantic atlas |
CN104484411A (en) * | 2014-12-16 | 2015-04-01 | 中国科学院自动化研究所 | Building method for semantic knowledge base based on a dictionary |
US20170004224A1 (en) * | 2015-07-02 | 2017-01-05 | International Business Machines Corporation | Log-aided automatic query expansion approach based on topic modeling |
CN106777274A (en) * | 2016-06-16 | 2017-05-31 | 北京理工大学 | Chinese tourism domain knowledge graph construction method and system |
WO2019019860A1 (en) * | 2017-07-24 | 2019-01-31 | 华为技术有限公司 | Method and apparatus for training classification model |
CN107526725A (en) * | 2017-09-04 | 2017-12-29 | 北京百度网讯科技有限公司 | Method and apparatus for generating text based on artificial intelligence |
CN108491462A (en) * | 2018-03-05 | 2018-09-04 | 昆明理工大学 | Semantic query expansion method and device based on word2vec |
CN108804512A (en) * | 2018-04-20 | 2018-11-13 | 平安科技(深圳)有限公司 | Text classification model generation device, method, and computer-readable storage medium |
CN109284397A (en) * | 2018-09-27 | 2019-01-29 | 深圳大学 | Domain lexicon construction method, device, equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
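STEFAN ULTES et al.: "Analysis of an Extended Interaction Quality Corpus", NATURAL LANGUAGE DIALOG SYSTEMS AND INTELLIGENT ASSISTANTS * |
PANG Wei: "Research on Web-Based Construction Techniques for a Tibetan-Chinese Bilingual Comparable Corpus", China Master's Theses Full-text Database, Information Science and Technology * |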
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110852109A (en) * | 2019-11-11 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Corpus generating method, corpus generating device, and storage medium |
CN111091011A (en) * | 2019-12-20 | 2020-05-01 | 科大讯飞股份有限公司 | Domain prediction method, domain prediction device and electronic equipment |
CN111599349A (en) * | 2020-04-01 | 2020-08-28 | 云知声智能科技股份有限公司 | Method and system for training language model |
CN113538075A (en) * | 2020-04-14 | 2021-10-22 | 阿里巴巴集团控股有限公司 | Data processing method, model training method, device and equipment |
CN111339309A (en) * | 2020-05-22 | 2020-06-26 | 支付宝(杭州)信息技术有限公司 | Corpus expansion method and system for user intention |
CN111339309B (en) * | 2020-05-22 | 2020-09-04 | 支付宝(杭州)信息技术有限公司 | Corpus expansion method and system for user intention |
CN112052320B (en) * | 2020-09-01 | 2023-09-29 | 腾讯科技(深圳)有限公司 | Information processing method, device and computer readable storage medium |
CN112052320A (en) * | 2020-09-01 | 2020-12-08 | 腾讯科技(深圳)有限公司 | Information processing method and device and computer readable storage medium |
CN113392647A (en) * | 2020-11-25 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Corpus generation method, related device, computer equipment and storage medium |
CN113392647B (en) * | 2020-11-25 | 2024-04-26 | 腾讯科技(深圳)有限公司 | Corpus generation method, related device, computer equipment and storage medium |
CN112487810A (en) * | 2020-12-17 | 2021-03-12 | 税友软件集团股份有限公司 | Intelligent customer service method, device, equipment and storage medium |
CN114822483A (en) * | 2021-01-19 | 2022-07-29 | 美的集团(上海)有限公司 | Data enhancement method, device, equipment and storage medium |
CN114822483B (en) * | 2021-01-19 | 2024-07-12 | 美的集团(上海)有限公司 | Data enhancement method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110188204B (en) | 2022-10-04 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110188204A (en) | A kind of extension corpora mining method, apparatus, server and storage medium | |
US20210334624A1 (en) | Neural architecture search using a performance prediction neural network | |
WO2018072663A1 (en) | Data processing method and device, classifier training method and system, and storage medium | |
US10776715B2 (en) | Artificial intelligent cognition threshold | |
US11226968B2 (en) | Providing search result content tailored to stage of project and user proficiency and role on given topic | |
GB2595088A (en) | Security systems and methods | |
Bozkurt et al. | Computational analysis of Turkish makam music: Review of state-of-the-art and challenges | |
EP1840764A1 (en) | Hybrid audio-visual categorization system and method | |
CN105389349A (en) | Dictionary updating method and apparatus | |
US11232144B2 (en) | Computer-implemented method and system for competency information management | |
WO2024011813A1 (en) | Text expansion method and apparatus, device, and medium | |
Mesaros et al. | Datasets and evaluation | |
Pannetier et al. | Branching patterns in phylogenies cannot distinguish diversity-dependent diversification from time-dependent diversification | |
US20240185734A1 (en) | Methods, Systems, Devices, and Software for Managing and Conveying Knowledge | |
Arthur et al. | SongExplorer: A deep learning workflow for discovery and segmentation of animal acoustic communication signals | |
Deepika et al. | Relief-F and Budget Tree Random Forest Based Feature Selection for Student Academic Performance Prediction. | |
CN112765398A (en) | Information recommendation method and device and storage medium | |
CN116959393B (en) | Training data generation method, device, equipment and medium of music generation model | |
CN110334112A (en) | A kind of biographic information search method and device | |
CN109582874B (en) | Bidirectional LSTM-based related news mining method and system | |
Wang et al. | VIBRANT: A brainstorming agent for computer supported creative problem solving | |
Kheng et al. | Centroid-based memetic algorithm–adaptive Lamarckian and Baldwinian learning | |
Yang | Chinese contemporary music diffusion strategy based on public opinion maximization | |
Zhang | Integration of art teaching resources in vertical social network | |
Chakraborty et al. | Comparative analysis of deep learning models for bird song classification | |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |