CN107688583A - Method and apparatus for creating training data for a natural language processing device - Google Patents
Method and apparatus for creating training data for a natural language processing device
- Publication number
- CN107688583A CN107688583A CN201610640647.XA CN201610640647A CN107688583A CN 107688583 A CN107688583 A CN 107688583A CN 201610640647 A CN201610640647 A CN 201610640647A CN 107688583 A CN107688583 A CN 107688583A
- Authority
- CN
- China
- Prior art keywords
- training data
- natural language
- bag
- module
- subpackage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000012549 training Methods 0.000 title claims abstract description 183
- 238000003058 natural language processing Methods 0.000 title claims abstract description 48
- 238000000034 method Methods 0.000 title claims abstract description 47
- 238000010276 construction Methods 0.000 claims description 18
- 230000007935 neutral effect Effects 0.000 claims description 16
- 238000013528 artificial neural network Methods 0.000 claims description 15
- 230000006870 function Effects 0.000 claims description 11
- 238000000605 extraction Methods 0.000 claims description 10
- 239000000284 extract Substances 0.000 claims description 7
- 238000010586 diagram Methods 0.000 description 19
- 238000012545 processing Methods 0.000 description 12
- 238000011282 treatment Methods 0.000 description 7
- 238000013527 convolutional neural network Methods 0.000 description 5
- 238000005516 engineering process Methods 0.000 description 5
- 230000008569 process Effects 0.000 description 5
- 238000004364 calculation method Methods 0.000 description 4
- 238000013473 artificial intelligence Methods 0.000 description 3
- 239000011159 matrix material Substances 0.000 description 3
- 230000004044 response Effects 0.000 description 3
- 241001269238 Data Species 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 230000001419 dependent effect Effects 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000007689 inspection Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 210000004218 nerve net Anatomy 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 241000894007 species Species 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a method and apparatus for creating training data for a natural language processing device, and to a natural language processing device using the training data. A method for creating training data for a natural language processing system includes: receiving a request to create the training data; obtaining a natural language corpus input for creating the training data; determining the bagging parameters required for the training data; dividing, based on the bagging parameters, the natural language corpus input into multiple bags, each of which contains multiple examples; and, for each of the multiple examples, automatically extracting a sentence-level feature vector, where the multiple bags with their sentence-level feature vectors serve as the training data.
Description
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a method and apparatus for creating training data for a natural language processing device, and to a natural language processing device using the training data.
Background art
In recent years, with the continuous development of computer technology, computer-based artificial intelligence has come to simulate human consciousness and thinking in many applications. Because language is the fundamental trait that distinguishes humans from other species, natural language processing, in which a computer handles human language, embodies the highest goal and frontier of artificial intelligence. A natural language processing system such as a question answering (QA) system aims to answer questions posed by human users in natural language with accurate, concise natural language.
In a question answering system, a pre-trained neural-network-based classifier is typically used to extract structured features from natural language sentences, and the corresponding answer is then retrieved or inferred from a pre-established knowledge base based on those structured features. Both the training of the neural-network-based classifier and the construction of the knowledge base require large amounts of training data annotated with structured features for the deep learning of the classifier. In one existing question answering system, performing the pre-training of the classifier requires training data whose features have been annotated manually in advance, and such manual annotation is time-consuming and expensive. Another existing question answering system relies on traditional natural language parsing (NLP) tools to analyze text and extract features; this requires careful feature engineering and may introduce error propagation, and therefore lacks generality and consumes substantial manpower. In addition, existing question answering systems often also depend on an existing knowledge base, so that their performance and application scenarios are severely constrained.
Accordingly, it is desirable to provide a method and apparatus for creating training data for a natural language processing device, and a natural language processing device using the training data, which automatically generate training data from unannotated text in a natural language corpus for the training of the classifier and the construction of the knowledge base, and which flexibly tune the noise in the training data according to the purpose of the training data, thereby improving the precision of classifier model training and reducing overall computational complexity.
Summary of the invention
In view of the above problems, the present invention provides a method and apparatus for creating training data for a natural language processing device, and a natural language processing device using the training data.
According to one embodiment of the present invention, there is provided a method for creating training data for a natural language processing system, including: receiving a request to create the training data; obtaining a natural language corpus input for creating the training data; determining the bagging parameters required for the training data; dividing, based on the bagging parameters, the natural language corpus input into multiple bags, each of which contains multiple examples; and, for each of the multiple examples, automatically extracting a sentence-level feature vector, where the multiple bags with their sentence-level feature vectors serve as the training data.
In addition, in the method according to an embodiment of the present invention, determining the bagging parameters required for the training data includes: determining the bagging parameters based on the source of the request to create the training data and/or of the natural language corpus input.
In addition, in the method according to an embodiment of the present invention, automatically extracting the sentence-level feature vector for each of the multiple examples includes: for each lexical element in each example of the multiple examples, extracting the multiple words within a predetermined window as word features and extracting its distances to the target words as position features; and performing max pooling on the feature vector composed of the word features and the position features to obtain the sentence-level feature vector.
In addition, the method according to an embodiment of the present invention further includes: training a classifier or constructing a knowledge base using the training data.
In addition, in the method according to an embodiment of the present invention, training the classifier using the training data includes: initializing the neural network parameters of the classifier; randomly selecting one bag among the multiple bags; determining the example in the selected bag that maximizes the objective function; and updating the neural network parameters of the classifier based on the gradient of that example, until the neural network converges.
According to another embodiment of the present invention, there is provided an apparatus for creating training data for a natural language processing system, including: a request receiving module for receiving a request to create the training data; an input module for obtaining a natural language corpus input for creating the training data; a bagging parameter determination module for determining the bagging parameters required for the training data; a bagging module for dividing, based on the bagging parameters, the natural language corpus input into multiple bags, each of which contains multiple examples; and a feature vector extraction module for automatically extracting a sentence-level feature vector for each of the multiple examples, where the multiple bags with their sentence-level feature vectors serve as the training data.
In addition, in the apparatus according to another embodiment of the present invention, the bagging parameter determination module determines the bagging parameters based on the source of the request to create the training data and/or of the natural language corpus input.
In addition, in the apparatus according to another embodiment of the present invention, the feature vector extraction module, for each lexical element in each example of the multiple examples, extracts the multiple words within a predetermined window as word features and its distances to the target words as position features, and performs max pooling on the feature vector composed of the word features and the position features to obtain the sentence-level feature vector.
In addition, in the apparatus according to another embodiment of the present invention, the training data is used to train a classifier or to construct a knowledge base.
In addition, the apparatus according to another embodiment of the present invention further includes a classifier training module for: initializing the neural network parameters of the classifier; randomly selecting one bag among the multiple bags; determining the example in the selected bag that maximizes the objective function; and updating the neural network parameters of the classifier based on the gradient of that example, until the neural network converges.
According to still another embodiment of the present invention, there is provided a natural language processing device, including: user interface equipment for receiving a natural language question input by a user and outputting an answer; classifier equipment for extracting features from the natural language question input and classifying the relation of the features to obtain a structured question; knowledge base equipment for retrieving, based on the structured question, knowledge base data stored therein to obtain structured information corresponding to the structured question; answer reasoning equipment for determining the answer by reasoning based on the structured information; and training data creation equipment for creating training data used to train the classifier equipment or to construct the knowledge base data in the knowledge base equipment, where the training data creation equipment further includes: a request receiving module for receiving a request to create the training data; an input module for obtaining a natural language corpus input for creating the training data; a bagging parameter determination module for determining the bagging parameters required for the training data; a bagging module for dividing, based on the bagging parameters, the natural language corpus input into multiple bags, each of which contains multiple examples; and a feature vector extraction module for automatically extracting a sentence-level feature vector for each of the multiple examples, where the multiple bags with their sentence-level feature vectors serve as the training data.
In addition, in the natural language processing device according to still another embodiment of the present invention, the bagging parameter determination module determines the bagging parameters based on the source of the request to create the training data and/or of the natural language corpus input.
In addition, in the natural language processing device according to still another embodiment of the present invention, the feature vector extraction module, for each lexical element in each example of the multiple examples, extracts the multiple words within a predetermined window as word features and its distances to the target words as position features, and performs max pooling on the feature vector composed of the word features and the position features to obtain the sentence-level feature vector.
In addition, in the natural language processing device according to still another embodiment of the present invention, the training data creation equipment further includes a classifier training module for: initializing the neural network parameters of the classifier equipment; randomly selecting one bag among the multiple bags; determining the example in the selected bag that maximizes the objective function; and updating the neural network parameters of the classifier equipment based on the gradient of that example, until the neural network converges.
The method and apparatus for creating training data for a natural language processing device according to embodiments of the present invention, and the natural language processing device using the training data, automatically generate training data from unannotated text in a natural language corpus for the training of the classifier and the construction of the knowledge base, and flexibly tune the noise in the training data according to the purpose of the training data, thereby improving the precision of classifier model training and reducing overall computational complexity.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and are intended to provide further explanation of the claimed technology.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will become more apparent from the following detailed description of embodiments of the present invention taken in conjunction with the accompanying drawings. The accompanying drawings are provided for further understanding of the embodiments of the present invention and constitute a part of the specification; together with the embodiments of the present invention, they serve to explain the present invention and are not to be construed as limiting the present invention. In the accompanying drawings, identical reference numbers generally represent the same components or steps.
Fig. 1 is a flowchart illustrating a method for creating training data for a natural language processing system according to an embodiment of the present invention.

Fig. 2 is a block diagram illustrating an apparatus for creating training data for a natural language processing system according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of sentence-level feature extraction in the method for creating training data for a natural language processing system according to an embodiment of the present invention.

Figs. 4A and 4B are schematic diagrams further illustrating window processing in the method for creating training data for a natural language processing system according to an embodiment of the present invention.

Fig. 5 is a flowchart illustrating a method of training a classifier using the training data according to an embodiment of the present invention.

Fig. 6 is a block diagram illustrating a natural language processing device according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention more apparent, example embodiments of the present invention are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention, and it should be understood that the present invention is not limited to the example embodiments described herein. Based on the embodiments of the present invention described herein, all other embodiments obtained by those skilled in the art without creative effort shall fall within the scope of the present invention.
In a question answering system, a pre-trained neural-network-based classifier is typically used to extract structured features from natural language sentences, and the corresponding answer is then retrieved or inferred from a pre-established knowledge base based on those structured features. Therefore, both the training of the neural-network-based classifier and the construction of the knowledge base require large amounts of training data annotated with structured features for the deep learning of the classifier. In an existing question answering system, performing the pre-training of the classifier requires training data whose features have been annotated manually in advance, and such manual annotation is time-consuming and expensive. In addition, another existing question answering system relies on traditional natural language parsing (NLP) tools to analyze text and extract features; this requires careful feature engineering and may introduce error propagation, and therefore lacks generality and consumes substantial manpower. Moreover, existing question answering systems often also depend on an existing knowledge base, so that their performance and application scenarios are severely limited.
The present invention provides a method and apparatus for creating training data for a natural language processing device, which automatically generate training data from unannotated text in a natural language corpus for the training of a classifier and the construction of a knowledge base, thereby avoiding time-consuming and expensive manual annotation. Further, the noise in the training data is flexibly tuned according to the purpose of the training data: different bagging parameters are flexibly set according to whether the training data is for training the classifier or for constructing the knowledge base, and according to the source of the natural language corpus input, thereby improving the precision of classifier model training and reducing overall computational complexity.
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a method for creating training data for a natural language processing system according to an embodiment of the present invention. As shown in Fig. 1, the method for creating training data for a natural language processing system according to an embodiment of the present invention includes the following steps.
In step S101, a request to create training data is received. In one embodiment of the present invention, training data requests for different entities in the question answering system may be received. For example, a request to create training data for constructing the knowledge base may be received from a knowledge base server in the question answering system. Alternatively, a request to create training data for training the classifier may be received from a semantic analysis entity in the question answering system. That is, the purpose of creating the training data can be determined from the source of the request to create the training data. Thereafter, the processing proceeds to step S102.
In step S102, a natural language corpus input for creating the training data is obtained. In one embodiment of the present invention, the natural language corpus input may be obtained from different sources, for example from websites such as Dianping or Wikipedia. That is, in one embodiment of the present invention, the natural language corpus input obtained for creating the training data is a corpus input that has not been annotated with features. Thereafter, the processing proceeds to step S103.
In step S103, the bagging parameters required for the training data are determined. As discussed further below, knowledge base construction and classifier training according to embodiments of the present invention utilize multi-instance learning based on convolutional neural networks. In multi-instance learning, a "bag" is defined as a set of multiple examples. In one embodiment of the present invention, the bagging parameters are determined based on the request to create the training data and/or the source of the natural language corpus input. For example, stricter bagging parameters will be set for training data from Dianping than for training data from Wikipedia, because the natural language corpus input from Dianping contains more noisy data than the natural language corpus input from Wikipedia. In addition, stricter bagging parameters will be set for training data used to train the classifier than for training data used to construct the knowledge base. Thereafter, the processing proceeds to step S104.
In step S104, based on the bagging parameters, the natural language corpus input is divided into multiple bags, each of which contains multiple examples. In one embodiment of the present invention, for example, there are T bags M_1, M_2, …, M_T, where the i-th bag contains q_i examples m_i^1, m_i^2, …, m_i^{q_i}. Thereafter, the processing proceeds to step S105.
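The bag structure just described, T bags with the i-th holding q_i examples, can be sketched directly. Fixed-size chunking is one simple realization; the patent leaves the exact partitioning scheme open.

```python
def divide_into_bags(examples, bag_size):
    """Partition a list of corpus examples into bags of up to `bag_size`.

    Returns bags M_1..M_T, where bag i holds examples m_i^1..m_i^{q_i}.
    Fixed-size chunking is just one possible division scheme.
    """
    return [examples[i:i + bag_size] for i in range(0, len(examples), bag_size)]

sentences = [f"sentence {k}" for k in range(10)]
bags = divide_into_bags(sentences, bag_size=4)  # three bags of sizes 4, 4, 2
```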
In step S105, for each of the multiple examples, a sentence-level feature vector is automatically extracted, and the multiple bags with their sentence-level feature vectors serve as the training data. As will be described in detail below, the method for creating training data for a natural language processing system according to an embodiment of the present invention is based on convolutional neural networks, and performs automatic extraction of sentence-level feature vectors for the examples of the unannotated natural language corpus input after the above bagging, for training the classifier or constructing the knowledge base. The automatic extraction of sentence-level feature vectors based on convolutional neural networks according to an embodiment of the present invention is described in further detail below with reference to the accompanying drawings.
Fig. 2 is a block diagram illustrating an apparatus for creating training data for a natural language processing system according to an embodiment of the present invention. The training data creation apparatus 20 according to an embodiment of the present invention may be configured in a question answering system.
As shown in Fig. 2, the training data creation apparatus 20 according to an embodiment of the present invention includes a request receiving module 201, an input module 202, a bagging parameter determination module 203, a bagging module 204, a feature vector extraction module 205, and a classifier training module 206.
The request receiving module 201 is used to receive a request to create the training data. Specifically, the request receiving module 201 may receive training data requests for different entities in the question answering system. For example, a request to create training data for constructing the knowledge base may be received from a knowledge base server in the question answering system. Alternatively, a request to create training data for training the classifier may be received from a semantic analysis entity in the question answering system.
The input module 202 is used to obtain a natural language corpus input for creating the training data. Specifically, the natural language corpus input may be obtained from different sources, for example from websites such as Dianping or Wikipedia. That is, in one embodiment of the present invention, the natural language corpus input obtained for creating the training data is a corpus input that has not been annotated with features.
The bagging parameter determination module 203 is used to determine the bagging parameters required for the training data. Specifically, the bagging parameter determination module 203 determines the bagging parameters based on the request to create the training data and/or the source of the natural language corpus input. For example, stricter bagging parameters will be set for training data from Dianping than for training data from Wikipedia, because the natural language corpus input from Dianping contains more noisy data than the natural language corpus input from Wikipedia. In addition, stricter bagging parameters will be set for training data used to train the classifier than for training data used to construct the knowledge base.
The bagging module 204 is used to divide, based on the bagging parameters, the natural language corpus input into multiple bags, each of which contains multiple examples.
The feature vector extraction module 205 is used to automatically extract a sentence-level feature vector for each of the multiple examples; the multiple bags with their sentence-level feature vectors serve as the training data for training the classifier or constructing the knowledge base.
The classifier training module 206 is used to perform classifier training using the training data. In one embodiment of the present invention, the classifier training module 206 initializes the neural network parameters of the classifier, randomly selects one bag among the multiple bags, determines the example in the selected bag that maximizes the objective function, and updates the neural network parameters of the classifier based on the gradient of that example, until the neural network converges.
Above, the method and apparatus for creating training data for a natural language processing system according to embodiments of the present invention have been described with reference to Figs. 1 and 2. In the following, the sentence-level feature extraction in the method for creating training data for a natural language processing system according to embodiments of the present invention, and the window processing in the sentence-level feature extraction, are further described with reference to Figs. 3 and 4.
Fig. 3 is a schematic diagram of sentence-level feature extraction in the method for creating training data for a natural language processing system according to an embodiment of the present invention. Figs. 4A and 4B are schematic diagrams further illustrating window processing in the method for creating training data for a natural language processing system according to an embodiment of the present invention.
Because a single-word vector model is severely limited, sentence-level feature vector extraction is used according to the present invention. As shown in Fig. 3, window processing is first performed for each example of the multiple examples contained in the multiple bags, that is, for each sentence from the natural language corpus input. To deal with the problem that the word sequences corresponding to different sentences in the window processing differ in length, position features of words are introduced.
Specifically, the word feature vector matrix of a context sliding window is obtained. That is, for an input sentence, a sliding window of size w is considered. For example, as shown in Fig. 4A, for the input sentence "People have been moving back into downtown", the word feature representation is:

WF = {[X_s, X_0, X_1], [X_0, X_1, X_2], …, [X_5, X_6, X_e]}.
Further, the position of a word is described by its distances to the two target words, so as to obtain the position matrix PF of the words. For example, as shown in Fig. 4B, for the word "been", its distances to "People" and "downtown" are "2" and "-4" respectively, i.e., its position feature vector representation is:

PF = [2, -4]^T.

In this way, after the window processing, the sentence vector is represented by the matrix composed of SF = [WF, PF]^T.
Referring back to Fig. 3, max pooling is then performed on the sentence vector obtained by the window processing, selecting the maximum value in each region as the value of that region after pooling. Finally, non-linearity is obtained using tanh as the activation function, and the sentence-level features are extracted.
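The window processing, position features, max pooling, and tanh activation just described can be sketched end to end. The one-dimensional toy embeddings below are an assumption, and the plain pooling here stands in for the patent's convolutional network; the position features do, however, reproduce the [2, -4] example for the word "been".

```python
import math

def sentence_feature_vector(tokens, emb, entity_a, entity_b, window=3):
    """Sentence-level feature extraction, sketched from the description above.

    Per token: word features WF are the embeddings of a size-`window`
    context (zero-padded at sentence edges); position features PF are the
    signed distances to the two target words. Max pooling over all token
    positions followed by tanh gives a fixed-length vector regardless of
    sentence length. The toy embeddings are an assumption.
    """
    dim = len(next(iter(emb.values())))
    pad = [0.0] * dim
    vecs = [emb.get(t, pad) for t in tokens]
    ia, ib = tokens.index(entity_a), tokens.index(entity_b)
    half = window // 2
    rows = []
    for i in range(len(tokens)):
        wf = []                                   # word features WF
        for j in range(i - half, i + half + 1):
            wf.extend(vecs[j] if 0 <= j < len(tokens) else pad)
        pf = [float(i - ia), float(i - ib)]       # position features PF
        rows.append(wf + pf)
    pooled = [max(row[k] for row in rows) for k in range(len(rows[0]))]
    return [math.tanh(x) for x in pooled]         # tanh non-linearity

tokens = "People have been moving back into downtown".split()
emb = {t: [0.1 * i] for i, t in enumerate(tokens)}
features = sentence_feature_vector(tokens, emb, "People", "downtown")
```

For the word "been" (index 2), the computed position features are [2.0, -4.0], matching the PF = [2, -4]^T example above, and `features` has the same length for any input sentence.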
Above, the sentence-level feature extraction of the training data according to embodiments of the present invention has been described with reference to Figs. 3 to 4B. In the following, a method of training a classifier using the training data from which sentence-level features have been extracted is described with reference to Fig. 5.
Fig. 5 is a flowchart illustrating a method of training a classifier using the training data according to an embodiment of the present invention. As shown in Fig. 5, the method of training a classifier according to an embodiment of the present invention includes the following steps.
In step S501, the neural network parameters of the classifier are initialized. In the multi-instance learning used by the present invention, the classification relation model based on a convolutional neural network (CNN) can be denoted θ. As assumed before, there are T bags M_1, M_2, …, M_T, where the i-th bag contains q_i examples m_i^1, m_i^2, …, m_i^{q_i}. Then, denoting the relation label of the i-th bag y_i, the objective function is:

J(θ) = Σ_{i=1}^{T} log p(y_i | m_i^j; θ),

where the selection of j is:

j = argmax_{j'} p(y_i | m_i^{j'}; θ), 1 ≤ j' ≤ q_i.

Thereafter, the processing proceeds to step S502.
In step S502, one of the multiple bags is randomly selected. The examples in the randomly selected bag are fed one by one into the neural network. Thereafter, processing proceeds to step S503.
In step S503, the example in the selected bag that maximizes the objective function is determined. For example, the j-th example m_i^j that maximizes the objective function is determined. Thereafter, processing proceeds to step S504.
In step S504, the neural network parameters of the classifier are updated based on the gradient of that example. In one embodiment of the present invention, θ is updated based on the gradient of m_i^j, for example via the Adadelta algorithm. Steps S502 to S504 are then iterated until the neural network converges.
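The loop of steps S501 to S504 can be sketched as follows. This is an illustrative simplification under stated assumptions: `TinyModel` is a hypothetical logistic stand-in for the CNN relation classifier, and plain stochastic gradient ascent stands in for the Adadelta update.

```python
import math
import random

class TinyModel:
    """Hypothetical stand-in for the CNN classifier: one weight per
    feature, scored with a sigmoid (S501 corresponds to __init__)."""
    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr

    def score(self, x, y):
        # p(y | x; w) for a binary label y in {0, 1}
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        return p if y == 1 else 1.0 - p

    def update(self, x, y):
        # One gradient-ascent step on the log-likelihood of (x, y).
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        g = y - p
        self.w = [wi + self.lr * g * xi for wi, xi in zip(self.w, x)]

def train_multi_instance(bags, labels, model, epochs=20):
    """Steps S502-S504: pick a bag at random, take its max-scoring
    example, update on that example only, and iterate."""
    for _ in range(epochs):
        for _ in range(len(bags)):
            i = random.randrange(len(bags))                             # S502
            best = max(bags[i], key=lambda ex: model.score(ex, labels[i]))  # S503
            model.update(best, labels[i])                               # S504
    return model
```

Only the highest-scoring example in each sampled bag contributes a gradient step, which is what keeps the per-iteration cost low.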
In the method of training a classifier using the training data described with reference to Fig. 5, the sentence-level features extracted, according to the process described with reference to Fig. 3, from sample sentences containing two or more word entities are taken as input, and the similarity of features is calculated using a vector space model. By setting up multiple bags and selecting the example with the greatest contribution to maximizing the objective function, a more accurate training model is obtained. In addition, the computational complexity of each iteration is low.
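The vector-space similarity mentioned above is commonly computed as cosine similarity; a minimal sketch follows. The choice of cosine is an assumption for illustration, since the text does not name the exact measure.

```python
import math

def cosine_similarity(a, b):
    """Vector-space-model similarity between two feature vectors:
    the cosine of the angle between them, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # ≈ 1.0 (parallel)
```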
Above, the method and apparatus for creating training data for a natural language processing system according to embodiments of the present invention, and the training of a classifier using the training data, have been described with reference to Fig. 1 to Fig. 5. Below, a natural language processing device configured with the apparatus for creating training data is further described with reference to Fig. 6.
Fig. 6 is a block diagram illustrating a natural language processing device according to an embodiment of the present invention. As shown in Fig. 6, the natural language processing device 60 according to an embodiment of the present invention includes a user interface device 601, a classifier device 602, a knowledge base device 603, an answer reasoning device 604, and a training data creation device 605.
The user interface device 601 is configured to receive a user's natural language question and to output the answer. In one embodiment of the present invention, the user interface device 601 realizes the interaction between the natural language processing device 60 and the user. For example, the user interface device 601 receives a question input by the user, checks the expression of the input question, and submits the checked question to the subsequent device components. After the subsequent device components classify the question and obtain an answer by reasoning, the user interface device 601 presents the obtained answer to the user, thereby responding to the user's input question.
The classifier device 602 is configured to extract features from the natural language question input and to classify the relations of the features, so as to obtain a structured question. In one embodiment of the present invention, the classifier device 602 is obtained by the classifier training method described with reference to Fig. 5.
The knowledge base device 603 is configured to retrieve, based on the structured question, the knowledge base data stored therein, and to obtain structured information corresponding to the structured question. In one embodiment of the present invention, the knowledge base device 603 performs retrieval for the question based on the structured question input from the classifier device 602 and on the pre-stored knowledge base data. For example, the knowledge base device 603 may pre-store an index file of known question-answer pairs; the index file records the semantic-chunk sequences of known questions and the position information of answers, and provides the knowledge source for answering the user's input question. The knowledge base device 603 is constructed in advance using the training data created for the natural language processing system according to embodiments of the present invention.
The answer reasoning device 604 is configured to determine the answer by reasoning based on the structured information. In one embodiment of the present invention, the answer reasoning device 604 finds, based on the structured information provided by the knowledge base device 603, related questions that share the same or similar keywords with the user's input question, computes the similarity between each related question and the user's input question, selects the related question to respond with according to the similarity, extracts the answer of the selected related question according to the position information recorded in the index file, and presents it to the user through the user interface device 601.
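A toy sketch of the retrieval-and-ranking flow just described for the answer reasoning device. The index format, the `jaccard` similarity, and all data below are hypothetical illustrations, not the patented index-file layout.

```python
def jaccard(a, b):
    """Keyword-overlap similarity between two word sets."""
    return len(a & b) / len(a | b)

def answer_question(query, index, similarity):
    """Find stored questions sharing keywords with the query, rank them
    by similarity, and return the answer of the best match.
    `index` is a hypothetical list of (question, answer) pairs."""
    qwords = set(query.lower().split())
    related = [(q, a) for q, a in index
               if qwords & set(q.lower().split())]
    if not related:
        return None
    _, best_answer = max(
        related,
        key=lambda qa: similarity(qwords, set(qa[0].lower().split())))
    return best_answer

index = [("where is downtown", "city center"),
         ("who moved back", "people")]
print(answer_question("where is the downtown area", index, jaccard))  # → 'city center'
```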
The training data creation device 605 is configured to create training data, which is used to train the classifier device 602 or to construct the knowledge base data in the knowledge base device 603. In one embodiment of the present invention, the training data creation device 605 includes the request receiving module 201, input module 202, bagging parameter determination module 203, bagging module 204, feature vector extraction module 205, and classifier training module 206 described above with reference to Fig. 2; repeated description of these modules is omitted here.
Above, the method and apparatus for creating training data for a natural language processing device, and a natural language processing device using the training data, have been described with reference to Fig. 1 to Fig. 6. They automatically generate training data from unlabeled text in a natural language corpus for the training of the classifier and the construction of the knowledge base, and flexibly optimize the noise of the training data according to its purpose, thereby improving the precision of classifier model training and reducing overall computational complexity.
It should be noted that, in this specification, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. In the absence of further limitation, an element defined by the phrase "including a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
Finally, it should also be noted that the series of processing described above includes not only processing performed in time sequence in the order described here, but also processing performed in parallel or individually rather than in chronological order.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary hardware platform, or, of course, entirely in hardware. Based on this understanding, all or part of the contribution of the technical solution of the present invention over the background art can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, magnetic disk, or optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.
The present invention has been described in detail above. Specific examples have been used herein to set forth the principles and implementations of the present invention, and the description of the above embodiments is intended only to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementations and scope of application according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (14)
1. A method of creating training data for a natural language processing system, comprising:
receiving a request to create the training data;
obtaining a natural language corpus input for creating the training data;
determining bagging parameters required for the training data;
dividing, based on the bagging parameters, the natural language corpus input into multiple bags, each of the multiple bags including multiple examples; and
automatically extracting, for each of the multiple examples, a sentence-level feature vector,
wherein the multiple bags with the sentence-level feature vectors serve as the training data.
2. The method of claim 1, wherein determining the bagging parameters required for the training data comprises:
determining the bagging parameters based on the source of the request to create the training data and/or of the natural language corpus input.
3. The method of claim 1, wherein automatically extracting a sentence-level feature vector for each of the multiple examples comprises:
for each word element in each example of the multiple examples, extracting the multiple words within a predetermined window as word features, and extracting its distances to the target words as position features; and
performing max pooling on the feature vector formed from the word features and the position features to obtain the sentence-level feature vector.
4. The method of any one of claims 1 to 3, further comprising: training a classifier or constructing a knowledge base using the training data.
5. The method of claim 4, wherein training a classifier using the training data comprises:
initializing neural network parameters of the classifier;
randomly selecting one of the multiple bags;
determining the example in the selected bag that maximizes an objective function; and
updating the neural network parameters of the classifier based on the gradient of that example, until the neural network converges.
6. An apparatus for creating training data for a natural language processing system, comprising:
a request receiving module, configured to receive a request to create the training data;
an input module, configured to obtain a natural language corpus input for creating the training data;
a bagging parameter determination module, configured to determine bagging parameters required for the training data;
a bagging module, configured to divide, based on the bagging parameters, the natural language corpus input into multiple bags, each of the multiple bags including multiple examples; and
a feature vector extraction module, configured to automatically extract, for each of the multiple examples, a sentence-level feature vector,
wherein the multiple bags with the sentence-level feature vectors serve as the training data.
7. The apparatus of claim 6, wherein the bagging parameter determination module determines the bagging parameters based on the source of the request to create the training data and/or of the natural language corpus input.
8. The apparatus of claim 6, wherein the feature vector extraction module, for each word element in each example of the multiple examples, extracts the multiple words within a predetermined window as word features and extracts its distances to the target words as position features; and
performs max pooling on the feature vector formed from the word features and the position features to obtain the sentence-level feature vector.
9. The apparatus of any one of claims 6 to 8, wherein the training data is used to train a classifier or to construct a knowledge base.
10. The apparatus of claim 9, further comprising a classifier training module configured to:
initialize neural network parameters of the classifier;
randomly select one of the multiple bags;
determine the example in the selected bag that maximizes an objective function; and
update the neural network parameters of the classifier based on the gradient of that example, until the neural network converges.
11. A natural language processing device, comprising:
a user interface device, configured to receive a user's natural language question input and to output an answer;
a classifier device, configured to extract features from the natural language question input and to classify relations of the features, so as to obtain a structured question;
a knowledge base device, configured to retrieve, based on the structured question, knowledge base data stored therein, and to obtain structured information corresponding to the structured question;
an answer reasoning device, configured to determine the answer by reasoning based on the structured information; and
a training data creation device, configured to create training data, the training data being used to train the classifier device or to construct the knowledge base data in the knowledge base device,
wherein the training data creation device further comprises:
a request receiving module, configured to receive a request to create the training data;
an input module, configured to obtain a natural language corpus input for creating the training data;
a bagging parameter determination module, configured to determine bagging parameters required for the training data;
a bagging module, configured to divide, based on the bagging parameters, the natural language corpus input into multiple bags, each of the multiple bags including multiple examples; and
a feature vector extraction module, configured to automatically extract, for each of the multiple examples, a sentence-level feature vector,
wherein the multiple bags with the sentence-level feature vectors serve as the training data.
12. The natural language processing device of claim 11, wherein the bagging parameter determination module determines the bagging parameters based on the source of the request to create the training data and/or of the natural language corpus input.
13. The natural language processing device of claim 11, wherein the feature vector extraction module, for each word element in each example of the multiple examples, extracts the multiple words within a predetermined window as word features and extracts its distances to the target words as position features; and
performs max pooling on the feature vector formed from the word features and the position features to obtain the sentence-level feature vector.
14. The natural language processing device of any one of claims 11 to 13, wherein the training data creation device further comprises a classifier device training module configured to:
initialize neural network parameters of the classifier device;
randomly select one of the multiple bags;
determine the example in the selected bag that maximizes an objective function; and
update the neural network parameters of the classifier device based on the gradient of that example, until the neural network converges.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610640647.XA CN107688583A (en) | 2016-08-05 | 2016-08-05 | The method and apparatus for creating the training data for natural language processing device |
JP2017151426A JP2018022496A (en) | 2016-08-05 | 2017-08-04 | Method and equipment for creating training data to be used for natural language processing device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610640647.XA CN107688583A (en) | 2016-08-05 | 2016-08-05 | The method and apparatus for creating the training data for natural language processing device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107688583A true CN107688583A (en) | 2018-02-13 |
Family
ID=61152105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610640647.XA Pending CN107688583A (en) | 2016-08-05 | 2016-08-05 | The method and apparatus for creating the training data for natural language processing device |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP2018022496A (en) |
CN (1) | CN107688583A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108875045A (en) * | 2018-06-28 | 2018-11-23 | 第四范式(北京)技术有限公司 | The method and its system of machine-learning process are executed for text classification |
CN109933602A (en) * | 2019-02-28 | 2019-06-25 | 武汉大学 | A kind of conversion method and device of natural language and structured query language |
CN110298372A (en) * | 2018-03-23 | 2019-10-01 | 鼎捷软件股份有限公司 | The method and system of automatic training virtual assistant |
CN110781294A (en) * | 2018-07-26 | 2020-02-11 | 国际商业机器公司 | Training corpus refinement and incremental update |
CN113806489A (en) * | 2021-09-26 | 2021-12-17 | 北京有竹居网络技术有限公司 | Method, electronic device and computer program product for dataset creation |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109766994A (en) * | 2018-12-25 | 2019-05-17 | 华东师范大学 | A kind of neural network framework of natural language inference |
CN110110327B (en) * | 2019-04-26 | 2021-06-22 | 网宿科技股份有限公司 | Text labeling method and equipment based on counterstudy |
2016-08-05: CN application CN201610640647.XA published as CN107688583A (en), status Pending
2017-08-04: JP application JP2017151426A published as JP2018022496A (en), status Pending
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298372A (en) * | 2018-03-23 | 2019-10-01 | 鼎捷软件股份有限公司 | The method and system of automatic training virtual assistant |
CN110298372B (en) * | 2018-03-23 | 2023-06-09 | 鼎捷软件股份有限公司 | Method and system for automatically training virtual assistant |
CN108875045A (en) * | 2018-06-28 | 2018-11-23 | 第四范式(北京)技术有限公司 | The method and its system of machine-learning process are executed for text classification |
CN108875045B (en) * | 2018-06-28 | 2021-06-04 | 第四范式(北京)技术有限公司 | Method of performing machine learning process for text classification and system thereof |
CN110781294A (en) * | 2018-07-26 | 2020-02-11 | 国际商业机器公司 | Training corpus refinement and incremental update |
CN110781294B (en) * | 2018-07-26 | 2024-02-02 | 国际商业机器公司 | Training corpus refinement and incremental update |
CN109933602A (en) * | 2019-02-28 | 2019-06-25 | 武汉大学 | A kind of conversion method and device of natural language and structured query language |
CN113806489A (en) * | 2021-09-26 | 2021-12-17 | 北京有竹居网络技术有限公司 | Method, electronic device and computer program product for dataset creation |
Also Published As
Publication number | Publication date |
---|---|
JP2018022496A (en) | 2018-02-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107688583A (en) | The method and apparatus for creating the training data for natural language processing device | |
CN106528845B (en) | Retrieval error correction method and device based on artificial intelligence | |
CN110427463B (en) | Search statement response method and device, server and storage medium | |
WO2018207723A1 (en) | Abstract generation device, abstract generation method, and computer program | |
CN106844658A (en) | A kind of Chinese text knowledge mapping method for auto constructing and system | |
CN112990296B (en) | Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation | |
CN108536681A (en) | Intelligent answer method, apparatus, equipment and storage medium based on sentiment analysis | |
CN106919655A (en) | A kind of answer provides method and apparatus | |
US20200183928A1 (en) | System and Method for Rule-Based Conversational User Interface | |
CN110096711A (en) | The natural language semantic matching method of the concern of the sequence overall situation and local dynamic station concern | |
US11417339B1 (en) | Detection of plagiarized spoken responses using machine learning | |
CN112685550B (en) | Intelligent question-answering method, intelligent question-answering device, intelligent question-answering server and computer readable storage medium | |
Cui et al. | Dataset for the first evaluation on Chinese machine reading comprehension | |
Faizan et al. | Automatic generation of multiple choice questions from slide content using linked data | |
Rahman et al. | NLP-based automatic answer script evaluation | |
CN109033073A (en) | Text contains recognition methods and device | |
Alshammari et al. | TAQS: an Arabic question similarity system using transfer learning of BERT with BILSTM | |
Singh et al. | Encoder-decoder architectures for generating questions | |
CN116860947A (en) | Text reading and understanding oriented selection question generation method, system and storage medium | |
CN116362331A (en) | Knowledge point filling method based on man-machine cooperation construction knowledge graph | |
Karpagam et al. | Deep learning approaches for answer selection in question answering system for conversation agents | |
CN107818078B (en) | Semantic association and matching method for Chinese natural language dialogue | |
Zhang et al. | Sliding-Bert: Striding Towards Conversational Machine Comprehension in Long Contex | |
Su et al. | Automatic ontology population using deep learning for triple extraction | |
CN113011141A (en) | Buddha note model training method, Buddha note generation method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20180213 |
|
WD01 | Invention patent application deemed withdrawn after publication |