CN107688583A - Method and apparatus for creating training data for a natural language processing device - Google Patents

Method and apparatus for creating training data for a natural language processing device

Info

Publication number
CN107688583A
CN107688583A (application CN201610640647.XA)
Authority
CN
China
Prior art keywords
training data
natural language
bag
module
bagging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610640647.XA
Other languages
Chinese (zh)
Inventor
王晓利
张永生
刘康
王炳宁
陈玉博
魏琢钰
赵军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
NTT Korea Co Ltd
Original Assignee
Institute of Automation of Chinese Academy of Science
NTT Korea Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science and NTT Korea Co Ltd
Priority to CN201610640647.XA (published as CN107688583A)
Priority to JP2017151426A (published as JP2018022496A)
Publication of CN107688583A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data; Database structures therefor; File system structures therefor
    • G06F 16/35: Clustering; Classification
    • G06F 16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a method and apparatus for creating training data for a natural language processing device, and to a natural language processing device using the training data. A method for creating training data for a natural language processing system includes: receiving a request to create the training data; obtaining a natural language corpus input for creating the training data; determining the bagging parameters required for the training data; dividing the natural language corpus input into multiple bags based on the bagging parameters, each of the multiple bags including multiple instances; and, for each of the multiple instances, automatically extracting a sentence-level feature vector, wherein the multiple bags with the sentence-level feature vectors serve as the training data.

Description

Method and apparatus for creating training data for a natural language processing device
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a method and apparatus for creating training data for a natural language processing device, and to a natural language processing device using the training data.
Background art
In recent years, with the continuous development of computer technology, computer-based artificial intelligence has come to simulate human consciousness and thought processes in many applications. Because language is the fundamental trait that distinguishes humans from other species, natural language processing, in which a computer processes human language, embodies the ultimate goals and frontiers of artificial intelligence. In a natural language processing system such as a question answering (QA) system, the aim is to answer questions posed by human users in natural language with accurate and concise natural language.
In a question answering system, a pre-trained neural-network-based classifier is typically used to extract structured features from natural language sentences, and answers are then retrieved or inferred from a pre-built knowledge base based on those structured features. Both the training of the neural-network-based classifier and the construction of the knowledge base require large amounts of training data annotated with structured features for deep learning by the classifier. In one existing question answering system, the pre-training of the classifier requires training data whose features have been annotated manually in advance, and such manual annotation is time-consuming and expensive. Another existing question answering system relies on traditional natural language parsing (NLP) tools to analyze text and extract features; this requires careful feature design and may introduce error propagation, so it lacks generality and consumes substantial manpower. In addition, existing question answering systems often also depend on an existing knowledge base, which severely limits their performance and application scenarios.
Accordingly, it is desirable to provide a method and apparatus for creating training data for a natural language processing device, and a natural language processing device using the training data, which automatically generate training data from unlabeled text in a natural language corpus for classifier training and knowledge base construction, and which flexibly control the noise in the training data according to its intended use, thereby improving the precision of classifier model training and reducing overall computational complexity.
Summary of the invention
In view of the above problems, the present invention provides a method and apparatus for creating training data for a natural language processing device, and a natural language processing device using the training data.
According to one embodiment of the present invention, there is provided a method for creating training data for a natural language processing system, including: receiving a request to create the training data; obtaining a natural language corpus input for creating the training data; determining the bagging parameters required for the training data; dividing the natural language corpus input into multiple bags based on the bagging parameters, each of the multiple bags including multiple instances; and, for each of the multiple instances, automatically extracting a sentence-level feature vector, wherein the multiple bags with the sentence-level feature vectors serve as the training data.
Furthermore, in the method according to an embodiment of the present invention, determining the bagging parameters required for the training data includes: determining the bagging parameters based on the request to create the training data and/or the source of the natural language corpus input.
Furthermore, in the method according to an embodiment of the present invention, automatically extracting a sentence-level feature vector for each of the multiple instances includes: for each lexical element in each instance of the multiple instances, extracting the multiple words within a predetermined window range as word features, and extracting its distances to the target words as position features; and performing max pooling on the feature vector formed from the word features and the position features to obtain the sentence-level feature vector.
Furthermore, the method according to an embodiment of the present invention also includes: training a classifier or constructing a knowledge base using the training data.
Furthermore, in the method according to an embodiment of the present invention, training a classifier using the training data includes: initializing the neural network parameters of the classifier; randomly selecting one bag of the multiple bags; determining the instance in that bag that maximizes the objective function; and updating the neural network parameters of the classifier based on the gradient of that one instance, until the neural network converges.
According to another embodiment of the present invention, there is provided an apparatus for creating training data for a natural language processing system, including: a request receiving module for receiving a request to create the training data; an input module for obtaining a natural language corpus input for creating the training data; a bagging parameter determination module for determining the bagging parameters required for the training data; a bagging module for dividing the natural language corpus input into multiple bags based on the bagging parameters, each of the multiple bags including multiple instances; and a feature vector extraction module for automatically extracting, for each of the multiple instances, a sentence-level feature vector, wherein the multiple bags with the sentence-level feature vectors serve as the training data.
Furthermore, in the apparatus according to another embodiment of the present invention, the bagging parameter determination module determines the bagging parameters based on the request to create the training data and/or the source of the natural language corpus input.
Furthermore, in the apparatus according to another embodiment of the present invention, the feature vector extraction module, for each lexical element in each instance of the multiple instances, extracts the multiple words within a predetermined window range as word features and extracts its distances to the target words as position features, and performs max pooling on the feature vector formed from the word features and the position features to obtain the sentence-level feature vector.
Furthermore, in the apparatus according to another embodiment of the present invention, the training data is used to train a classifier or to construct a knowledge base.
Furthermore, the apparatus according to another embodiment of the present invention also includes a classifier training module, the classifier training module being used to: initialize the neural network parameters of the classifier; randomly select one bag of the multiple bags; determine the instance in that bag that maximizes the objective function; and update the neural network parameters of the classifier based on the gradient of that one instance, until the neural network converges.
According to still another embodiment of the present invention, there is provided a natural language processing device, including: a user interface device for receiving a user's input of a natural language question and performing output of the answer; a classifier device for extracting features from the natural language question input and classifying the relations of the features to obtain a structured question; a knowledge base device for retrieving, based on the structured question, the knowledge base data stored therein to obtain structured information corresponding to the structured question; an answer reasoning device for determining the answer by reasoning based on the structured information; and a training data creation device for creating training data used to train the classifier device or to construct the knowledge base data in the knowledge base device, wherein the training data creation device further includes: a request receiving module for receiving a request to create the training data; an input module for obtaining a natural language corpus input for creating the training data; a bagging parameter determination module for determining the bagging parameters required for the training data; a bagging module for dividing the natural language corpus input into multiple bags based on the bagging parameters, each of the multiple bags including multiple instances; and a feature vector extraction module for automatically extracting, for each of the multiple instances, a sentence-level feature vector, wherein the multiple bags with the sentence-level feature vectors serve as the training data.
Furthermore, in the natural language processing device according to still another embodiment of the present invention, the bagging parameter determination module determines the bagging parameters based on the request to create the training data and/or the source of the natural language corpus input.
Furthermore, in the natural language processing device according to still another embodiment of the present invention, the feature vector extraction module, for each lexical element in each instance of the multiple instances, extracts the multiple words within a predetermined window range as word features and extracts its distances to the target words as position features, and performs max pooling on the feature vector formed from the word features and the position features to obtain the sentence-level feature vector.
Furthermore, in the natural language processing device according to still another embodiment of the present invention, the training data creation device also includes a classifier device training module, the classifier device training module being used to: initialize the neural network parameters of the classifier device; randomly select one bag of the multiple bags; determine the instance in that bag that maximizes the objective function; and update the neural network parameters of the classifier device based on the gradient of that one instance, until the neural network converges.
The method and apparatus for creating training data for a natural language processing device according to embodiments of the present invention, and the natural language processing device using the training data, automatically generate training data from unlabeled text in a natural language corpus for classifier training and knowledge base construction, and flexibly control the noise of the training data according to its intended use, thereby improving the precision of classifier model training and reducing overall computational complexity.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and are intended to provide further explanation of the claimed technology.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will become more apparent through the following more detailed description of the embodiments of the present invention in conjunction with the accompanying drawings. The accompanying drawings are provided for further understanding of the embodiments of the present invention, form a part of the specification, and serve, together with the embodiments of the present invention, to explain the present invention; they are not to be construed as limiting the invention. In the drawings, identical reference numerals typically denote the same components or steps.
Fig. 1 is a flowchart illustrating the method for creating training data for a natural language processing system according to an embodiment of the present invention.
Fig. 2 is a block diagram illustrating the apparatus for creating training data for a natural language processing system according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the sentence-level feature extraction in the method for creating training data for a natural language processing system according to an embodiment of the present invention.
Figs. 4A and 4B are schematic diagrams further illustrating the window processing in the method for creating training data for a natural language processing system according to an embodiment of the present invention.
Fig. 5 is a flowchart illustrating a method of training a classifier using the training data according to an embodiment of the present invention.
Fig. 6 is a block diagram illustrating a natural language processing device according to an embodiment of the present invention.
Detailed description of embodiments
In order to make the objects, technical solutions, and advantages of the present invention more apparent, example embodiments according to the present invention are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein. Based on the embodiments of the present invention described herein, all other embodiments obtained by those skilled in the art without creative labor shall fall within the scope of protection of the present invention.
In a question answering system, a pre-trained neural-network-based classifier is typically used to extract structured features from natural language sentences, and answers are then retrieved or inferred from a pre-built knowledge base based on those structured features. Therefore, both the training of the neural-network-based classifier and the construction of the knowledge base require large amounts of training data annotated with structured features for deep learning by the classifier. In one existing question answering system, the pre-training of the classifier requires training data whose features have been annotated manually in advance, and such manual annotation is time-consuming and expensive. In addition, another existing question answering system relies on traditional natural language parsing (NLP) tools to analyze text and extract features; this requires careful feature design and may introduce error propagation, so it lacks generality and consumes a large amount of manpower. Moreover, existing question answering systems often also depend on an existing knowledge base, which severely limits their performance and application scenarios.
The present invention provides a method and apparatus for creating training data for a natural language processing device, which automatically generate training data from unlabeled text in a natural language corpus for classifier training and knowledge base construction, thereby avoiding time-consuming and expensive manual annotation. Further, the noise of the training data is flexibly controlled according to its intended use: different bagging parameters are flexibly set according to whether the training data is for training a classifier or for constructing a knowledge base, and according to the source of the natural language corpus input, thereby improving the precision of classifier model training and reducing overall computational complexity.
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating the method for creating training data for a natural language processing system according to an embodiment of the present invention. As shown in Fig. 1, the method for creating training data for a natural language processing system according to an embodiment of the present invention includes the following steps.
In step S101, a request to create training data is received. In one embodiment of the present invention, training data requests may be received for different entities in a question answering system. For example, a request to create training data for knowledge base construction may be received from a knowledge base server in the question answering system. Alternatively, a request to create training data for classifier training may be received from a semantic analysis entity in the question answering system. That is, the purpose of creating the training data can be determined from the source of the request. Thereafter, processing proceeds to step S102.
In step S102, a natural language corpus input for creating the training data is obtained. In one embodiment of the present invention, the natural language corpus input may be obtained from different sources, for example from websites such as Dianping or Wikipedia. That is, in one embodiment of the present invention, the natural language corpus input obtained for creating the training data is a corpus input that has not been annotated with features. Thereafter, processing proceeds to step S103.
In step S103, the bagging parameters required for the training data are determined. As discussed further below, the knowledge base construction and classifier training according to embodiments of the present invention use multi-instance learning based on convolutional neural networks. In multi-instance learning, a "bag" is defined as a set of multiple instances. In one embodiment of the present invention, the bagging parameters are determined based on the request to create the training data and/or the source of the natural language corpus input. For example, stricter bagging parameters are set for training data from Dianping than for training data from Wikipedia, because the natural language corpus input from Dianping contains more noisy data than that from Wikipedia. In addition, stricter bagging parameters are set for training data used to train a classifier than for training data used to construct a knowledge base. Thereafter, processing proceeds to step S104.
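The source- and purpose-dependent choice of bagging (subpackage) parameters in step S103 can be sketched as follows; the parameter names, numeric values, and source labels are illustrative assumptions, not taken from the patent:

```python
def choose_bagging_params(request_purpose, corpus_source):
    """Pick bagging parameters from the request purpose and corpus source.

    Stricter parameters (smaller bags, more required instances) are used
    for noisier sources and for classifier training; the concrete numbers
    here are illustrative assumptions only.
    """
    # Baseline: permissive parameters for clean sources / knowledge base use
    params = {"max_bag_size": 50, "min_instances": 1}
    if corpus_source == "user_reviews":        # noisier than an encyclopedia
        params["max_bag_size"] = 20
    if request_purpose == "train_classifier":  # stricter than KB construction
        params["max_bag_size"] //= 2
        params["min_instances"] = 2
    return params
```

A request from the semantic analysis entity over review-site text would thus receive the tightest bags, while knowledge base construction over encyclopedia text keeps the permissive defaults.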
In step S104, the natural language corpus input is divided into multiple bags based on the bagging parameters, each of the multiple bags including multiple instances. In one embodiment of the present invention, for example, there are T bags M_1, M_2, ..., M_T, where the i-th bag contains q_i instances {m_i^1, m_i^2, ..., m_i^(q_i)}. Thereafter, processing proceeds to step S105.
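The division into bags M_1, ..., M_T can be sketched as below. Grouping sentences by the entity pair they mention follows the usual multi-instance learning convention; the grouping key and the size cap are assumptions, since the text only specifies that each bag is a set of instances:

```python
from collections import defaultdict

def build_bags(instances, key_fn, max_bag_size):
    """Group instances into bags of at most max_bag_size.

    instances: iterable of sentences (any objects); key_fn maps an
    instance to its bag key, e.g. the entity pair it mentions.
    Returns a list of bags M_1..M_T, each a list of instances.
    """
    grouped = defaultdict(list)
    for inst in instances:
        grouped[key_fn(inst)].append(inst)
    bags = []
    for members in grouped.values():
        # Split oversized groups so every bag respects the bagging parameter
        for start in range(0, len(members), max_bag_size):
            bags.append(members[start:start + max_bag_size])
    return bags
```

A stricter (smaller) `max_bag_size` yields more, smaller bags, which is one way the bagging parameters of step S103 could control how much noise a single bag can absorb.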
In step S105, for each of the multiple instances, a sentence-level feature vector is automatically extracted, and the multiple bags with the sentence-level feature vectors serve as the training data. As described in detail below, the method for creating training data for a natural language processing system according to an embodiment of the present invention is based on convolutional neural networks, and performs automatic extraction of sentence-level feature vectors on the instances of the unlabeled natural language corpus input after the above bagging, for training a classifier or constructing a knowledge base. Hereinafter, the automatic extraction of sentence-level feature vectors based on convolutional neural networks according to embodiments of the present invention is described in further detail with reference to the accompanying drawings.
Fig. 2 is a block diagram illustrating the apparatus for creating training data for a natural language processing system according to an embodiment of the present invention. The training data creation apparatus 20 according to an embodiment of the present invention may be configured within a question answering system.
As shown in Fig. 2 training data according to embodiments of the present invention, which creates equipment 20, includes request receiving module 201, defeated Enter module 202, subpackage parameter determination module 203, subpackage module 204, characteristic vector pickup module 205 and classifier training module 206。
The request receiving module 201 is used to receive a request to create the training data. Specifically, the request receiving module 201 may receive training data requests for different entities in the question answering system. For example, a request to create training data for knowledge base construction may be received from a knowledge base server in the question answering system. Alternatively, a request to create training data for classifier training may be received from a semantic analysis entity in the question answering system.
The input module 202 is used to obtain a natural language corpus input for creating the training data. Specifically, the natural language corpus input may be obtained from different sources, for example from websites such as Dianping or Wikipedia. That is, in one embodiment of the present invention, the natural language corpus input obtained for creating the training data is a corpus input that has not been annotated with features.
The bagging parameter determination module 203 is used to determine the bagging parameters required for the training data. Specifically, the bagging parameter determination module 203 determines the bagging parameters based on the request to create the training data and/or the source of the natural language corpus input. For example, stricter bagging parameters are set for training data from Dianping than for training data from Wikipedia, because the natural language corpus input from Dianping contains more noisy data than that from Wikipedia. In addition, stricter bagging parameters are set for training data used to train a classifier than for training data used to construct a knowledge base.
The bagging module 204 is used to divide the natural language corpus input into multiple bags based on the bagging parameters, each of the multiple bags including multiple instances.
The feature vector extraction module 205 is used to automatically extract, for each of the multiple instances, a sentence-level feature vector; the multiple bags with the sentence-level feature vectors serve as the training data for training a classifier or constructing a knowledge base.
The classifier training module 206 is used to perform classifier training using the training data. In one embodiment of the present invention, the classifier training module 206 initializes the neural network parameters of the classifier, randomly selects one bag of the multiple bags, determines the instance in that bag that maximizes the objective function, and updates the neural network parameters of the classifier based on the gradient of that instance, until the neural network converges.
Above, the method and apparatus for creating training data for a natural language processing system according to embodiments of the present invention have been described with reference to Figs. 1 and 2. Hereinafter, the sentence-level feature extraction in the method for creating training data for a natural language processing system according to embodiments of the present invention, and the window processing within it, are further described with reference to Figs. 3 and 4.
Fig. 3 is a schematic diagram of the sentence-level feature extraction in the method for creating training data for a natural language processing system according to an embodiment of the present invention. Figs. 4A and 4B are schematic diagrams further illustrating the window processing in the method for creating training data for a natural language processing system according to an embodiment of the present invention.
Because a model based on single word vectors is severely limited, the present invention uses sentence-level feature vector extraction. As shown in Fig. 3, window processing is first performed on each instance of the multiple instances included in each of the multiple bags, i.e., on each sentence from the natural language corpus input. In order to use window processing to solve the problem that the word sequences corresponding to different sentences differ in length, position features of words are introduced.
Specifically, a word feature vector matrix over a context sliding window is obtained. That is, for an input sentence, a sliding window of size w is considered. For example, as shown in Fig. 4A, for the input sentence "People have been moving back into downtown", the word feature representation is:
WF = {[X_s, X_0, X_1], [X_0, X_1, X_2], ..., [X_5, X_6, X_e]},
where X_0 to X_6 are the word vectors of the sentence and X_s and X_e denote the sentence start and end markers.
Further, the position of a word is described by its distances to two target words, yielding the position matrix PF of the word. For example, as shown in Fig. 4B, for the word "been", its distances to "People" and "downtown" are "2" and "-4" respectively, i.e., its position feature vector representation is:
PF = [2, -4]^T.
In this way, after window processing, the sentence vector is represented by the matrix SF = [WF, PF]^T.
Referring back to Fig. 3, max pooling is then performed on the sentence vector obtained by window processing, selecting the maximum value within each region as the value of that region after pooling. Finally, nonlinearity is introduced using tanh as the activation function, and the sentence-level feature is extracted.
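A minimal sketch of this extraction pipeline (sliding-window word features WF, position features PF, max pooling, tanh activation) follows; the vector dimensionality, the zero-padding used for the boundary markers X_s and X_e, and the handling of unknown words are assumptions:

```python
import math

def sentence_features(tokens, word_vecs, e1_idx, e2_idx, window=3):
    """Sketch of the sentence-level feature extraction of Fig. 3.

    For each position, the word features WF are the concatenated vectors
    of the words in a size-`window` sliding window (with start/end
    padding for X_s and X_e), and the position features PF are the
    signed distances to the two target words at e1_idx and e2_idx.
    Max pooling over all window positions followed by tanh yields a
    fixed-length sentence-level feature vector.
    """
    dim = len(next(iter(word_vecs.values())))
    pad = [0.0] * dim                      # stands in for X_s / X_e
    vecs = [pad] + [word_vecs.get(t, pad) for t in tokens] + [pad]
    columns = []
    for i in range(len(tokens)):           # one column per window position
        wf = []
        for j in range(window):            # concatenate the window's word vectors
            wf.extend(vecs[i + j])
        pf = [float(i - e1_idx), float(i - e2_idx)]  # PF: distances to targets
        columns.append(wf + pf)
    # Max pooling: per dimension, keep the maximum over all columns
    pooled = [max(col[d] for col in columns) for d in range(len(columns[0]))]
    return [math.tanh(x) for x in pooled]  # tanh nonlinearity
```

Because pooling runs over window positions, the output length depends only on the window size and vector dimension, not on the sentence length, which is exactly the variable-length problem the window processing is meant to solve.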
Above, the sentence-level feature extraction of training data according to embodiments of the present invention has been described with reference to Figs. 3 to 4B. Hereinafter, the method of training a classifier using the training data from which sentence-level features have been extracted is described with reference to Fig. 5.
Fig. 5 is a flowchart illustrating a method of training a classifier using the training data according to an embodiment of the present invention. As shown in Fig. 5, the method of training a classifier according to an embodiment of the present invention includes the following steps.
In step S501, the neural network parameters of the classifier are initialized. In the multi-instance learning used by the present invention, the relation classification model based on a convolutional neural network (CNN) can be denoted by its parameters θ. As assumed before, there are T bags M_1, M_2, ..., M_T, where the i-th bag contains q_i instances {m_i^1, ..., m_i^(q_i)}. The objective function is then:
J(θ) = Σ_{i=1}^{T} log p(y_i | m_i^j; θ),
where the selection of j is:
j = argmax_{j'} p(y_i | m_i^(j'); θ), 1 ≤ j' ≤ q_i,
with y_i denoting the relation label of the i-th bag. Thereafter, processing proceeds to step S502.
In step S502, one bag of the multiple bags is randomly selected. The instances in the randomly selected bag are fed one by one into the neural network. Thereafter, processing proceeds to step S503.
In step S503, the instance in the bag that maximizes the objective function is determined. For example, the j-th instance m_i^j is determined such that the objective function is maximized. Thereafter, processing proceeds to step S504.
In step S504, the neural network parameters of the classifier are updated based on the gradient of that one instance, until the neural network converges. In one embodiment of the present invention, θ is updated based on the gradient with respect to m_i^j, for example via the Adadelta algorithm. Steps S502 to S504 are then iterated until the neural network converges.
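The training loop of steps S501 to S504 can be sketched as follows; the model interface (init_params, prob, step) is a placeholder assumption, since the text does not specify the network internals, and a fixed iteration count stands in for the convergence test:

```python
import random

def train_mil(bags, labels, model, epochs=100, seed=0):
    """Multi-instance learning loop of Fig. 5 (steps S501 to S504).

    bags[i] is a list of instances, labels[i] the i-th bag's relation
    label. `model` must provide prob(instance, label) -> float and
    step(instance, label) -> None (one gradient update); both are
    assumed interfaces for this sketch.
    """
    rng = random.Random(seed)
    model.init_params()                         # S501: initialize parameters
    for _ in range(epochs):                     # iterate until convergence
        i = rng.randrange(len(bags))            # S502: pick a random bag
        # S503: instance maximizing the objective p(y_i | m_i^j; theta)
        best = max(bags[i], key=lambda m: model.prob(m, labels[i]))
        model.step(best, labels[i])             # S504: gradient update on it
    return model
```

Updating on only the highest-scoring instance of each selected bag is what limits the effect of noisy sentences within a bag, and it keeps the per-iteration cost low since each update touches a single instance.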
In the method for the utilization that reference picture 5 describes training data training grader according to embodiments of the present invention, it will join The sentence comprising two or more vocabulary entities obtained according to the process of Fig. 3 descriptions is made as the Sentence-level feature acquired in sample To input, the similarity of feature is calculated using vector space model.By setting multiple bags, maximize for object function Example selection with maximum contribution, so as to obtain more accurate training pattern.In addition, the calculating every time in iteration is complicated Spend low.
Above, the method and apparatus for creating training data for a natural language processing system, and the training of a classifier using the training data, according to embodiments of the present invention have been described with reference to Figs. 1 to 5. A natural language processing device incorporating the apparatus for creating training data will be further described below with reference to Fig. 6.
Fig. 6 is a block diagram illustrating a natural language processing device according to an embodiment of the present invention. As shown in Fig. 6, the natural language processing device 60 according to the embodiment of the present invention includes a user interface device 601, a classifier device 602, a knowledge base device 603, an answer reasoning device 604, and a training data creation device 605.
The user interface device 601 is configured to receive a natural language question input by a user and to output an answer. In one embodiment of the invention, the user interface device 601 realizes the interaction between the natural language processing device 60 and the user. For example, the user interface device 601 receives the question input by the user, checks the expression of the input question, and submits the checked question to subsequent device components. After the subsequent components classify the question and obtain an answer by reasoning, the user interface device 601 presents the obtained answer to the user as the response to the input question.
The classifier device 602 is configured to extract features from the natural language question input and to classify the relation of the features, so as to obtain a structured question. In one embodiment of the invention, the classifier device 602 is obtained by training with the classifier training method described with reference to Fig. 5.
The knowledge base device 603 is configured to retrieve, based on the structured question, knowledge base data stored therein in advance, and to obtain structured information corresponding to the structured question. In one embodiment of the invention, the knowledge base device 603 performs a retrieval for the question based on the structured question input from the classifier device 602 and on the pre-stored knowledge base data. For example, the knowledge base device 603 may store in advance an index file of known question-answer pairs, which records the semantic chunk sequence of each known question and the positional information of its answer, and which provides the knowledge source for answering the user's question. The knowledge base device 603 is constructed in advance using the training data created for the natural language processing system according to embodiments of the present invention.
The answer reasoning device 604 is configured to determine the answer by reasoning based on the structured information. In one embodiment of the invention, the answer reasoning device 604 finds, based on the structured information provided by the knowledge base device 603, related questions that share the same or similar keywords with the user's question, computes the similarity between each related question and the user's question, selects the related question to respond with according to the similarity, extracts its answer according to the positional information recorded in the index file, and presents the answer to the user through the user interface device 601.
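The similarity-based selection step above can be sketched as follows. The cosine-similarity measure and the toy two-entry index are illustrative assumptions; the patent only requires that a vector space model be used to rank related questions.

```python
import numpy as np

def best_answer(question_vec, index):
    """Sketch of the answer reasoning step: rank known questions by
    vector-space similarity to the user's question and return the
    answer of the most similar one.

    index: (list of known-question feature vectors, list of their answers).
    """
    def cos(a, b):
        # Cosine similarity in a vector space model.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    vecs, answers = index
    sims = [cos(question_vec, v) for v in vecs]
    return answers[int(np.argmax(sims))]

# Toy index of two known questions and their answers.
index = ([np.array([1.0, 0.0]), np.array([0.0, 1.0])], ["Answer A", "Answer B"])
ans = best_answer(np.array([0.9, 0.1]), index)  # closest to the first known question
```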
The training data creation device 605 is configured to create training data used to train the classifier device 602 or to construct the knowledge base data in the knowledge base device 603. In one embodiment of the invention, the training data creation device 605 includes the request receiving module 201, the input module 202, the subpackage parameter determination module 203, the subpackage module 204, the feature vector extraction module 205, and the classifier training module 206 described above with reference to Fig. 2; repeated description of each module is omitted here.
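The core step of the subpackage module — dividing the natural language input into bags of instances — can be sketched as follows. Grouping sentences by the entity pair they mention is an illustrative choice of subpackage parameter; the toy sentences are assumptions.

```python
from collections import defaultdict

def make_bags(sentences, entity_pairs):
    """Sketch of the subpackage module: sentences mentioning the same
    (head, tail) entity pair are collected into one bag of instances."""
    bags = defaultdict(list)
    for sent, pair in zip(sentences, entity_pairs):
        bags[pair].append(sent)
    return dict(bags)

bags = make_bags(
    ["A founded B.", "A is the founder of B.", "C works at D."],
    [("A", "B"), ("A", "B"), ("C", "D")],
)
# bags[("A", "B")] holds two instances; bags[("C", "D")] holds one.
```

Each bag would then be passed to the feature vector extraction module, which extracts a sentence-level feature vector per instance.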
Above, the method and apparatus for creating training data for a natural language processing device, and the natural language processing device using the training data, have been described with reference to Figs. 1 to 6. They automatically generate training data from unlabeled text in a natural language corpus for the training of a classifier and the construction of a knowledge base, and flexibly suppress the noise in the training data according to its intended use, thereby improving the precision of classifier model training and reducing the overall computational complexity.
It should be noted that, in this specification, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitations, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.
Finally, it should be noted that the above series of processes includes not only processes performed in time order as described herein, but also processes performed in parallel or separately rather than in chronological order.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary hardware platform, and of course can also be implemented entirely by hardware. Based on this understanding, all or part of the contribution of the technical solution of the present invention over the background art can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, magnetic disk, or optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the method described in each embodiment of the present invention or in some parts of the embodiments.
The present invention has been described in detail above. The principle and implementation of the present invention are set forth herein with specific examples, and the explanation of the above embodiments is only intended to help understand the method and core concept of the present invention. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific embodiments and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (14)

1. A method of creating training data for a natural language processing system, comprising:
receiving a request to create the training data;
obtaining a natural language database input for creating the training data;
determining subpackage parameters required for the training data;
dividing the natural language database input into multiple bags based on the subpackage parameters, each of the multiple bags including multiple instances; and
automatically extracting a sentence-level feature vector for each of the multiple instances,
wherein the multiple bags with the sentence-level feature vectors serve as the training data.
2. The method of claim 1, wherein determining the subpackage parameters required for the training data comprises:
determining the subpackage parameters based on the request to create the training data and/or the source of the natural language database input.
3. The method of claim 1, wherein automatically extracting the sentence-level feature vector for each of the multiple instances comprises:
for each vocabulary element in each instance of the multiple instances, extracting multiple words within a predetermined window range as word features, and extracting its distance from the target word as a position feature; and
performing max pooling on the feature vectors formed from the word features and the position features to obtain the sentence-level feature vector.
4. The method of any one of claims 1 to 3, further comprising: training a classifier or constructing a knowledge base using the training data.
5. The method of claim 4, wherein training a classifier using the training data comprises:
initializing neural network parameters of the classifier;
randomly selecting one bag among the multiple bags;
determining the instance in the one bag that maximizes an objective function; and
updating the neural network parameters of the classifier based on the gradient of the one instance until the neural network converges.
6. An apparatus for creating training data for a natural language processing system, comprising:
a request receiving module for receiving a request to create the training data;
an input module for obtaining a natural language database input for creating the training data;
a subpackage parameter determination module for determining subpackage parameters required for the training data;
a subpackage module for dividing the natural language database input into multiple bags based on the subpackage parameters, each of the multiple bags including multiple instances; and
a feature vector extraction module for automatically extracting a sentence-level feature vector for each of the multiple instances,
wherein the multiple bags with the sentence-level feature vectors serve as the training data.
7. The apparatus of claim 6, wherein the subpackage parameter determination module determines the subpackage parameters based on the request to create the training data and/or the source of the natural language database input.
8. The apparatus of claim 6, wherein, for each vocabulary element in each instance of the multiple instances, the feature vector extraction module extracts multiple words within a predetermined window range as word features and extracts its distance from the target word as a position feature; and
performs max pooling on the feature vectors formed from the word features and the position features to obtain the sentence-level feature vector.
9. The apparatus of any one of claims 6 to 8, wherein the training data is used to train a classifier or to construct a knowledge base.
10. The apparatus of claim 9, further comprising a classifier training module, the classifier training module being configured to:
initialize neural network parameters of the classifier;
randomly select one bag among the multiple bags;
determine the instance in the one bag that maximizes an objective function; and
update the neural network parameters of the classifier based on the gradient of the one instance until the neural network converges.
11. A natural language processing device, comprising:
a user interface device for receiving a natural language question input by a user and outputting an answer;
a classifier device for extracting features from the natural language question input and classifying the relation of the features to obtain a structured question;
a knowledge base device for retrieving, based on the structured question, knowledge base data stored therein in advance and obtaining structured information corresponding to the structured question;
an answer reasoning device for determining the answer by reasoning based on the structured information; and
a training data creation device for creating training data used to train the classifier device or to construct the knowledge base data in the knowledge base device,
wherein the training data creation device further comprises:
a request receiving module for receiving a request to create the training data;
an input module for obtaining a natural language database input for creating the training data;
a subpackage parameter determination module for determining subpackage parameters required for the training data;
a subpackage module for dividing the natural language database input into multiple bags based on the subpackage parameters, each of the multiple bags including multiple instances; and
a feature vector extraction module for automatically extracting a sentence-level feature vector for each of the multiple instances,
wherein the multiple bags with the sentence-level feature vectors serve as the training data.
12. The natural language processing device of claim 11, wherein the subpackage parameter determination module determines the subpackage parameters based on the request to create the training data and/or the source of the natural language database input.
13. The natural language processing device of claim 11, wherein, for each vocabulary element in each instance of the multiple instances, the feature vector extraction module extracts multiple words within a predetermined window range as word features and extracts its distance from the target word as a position feature; and
performs max pooling on the feature vectors formed from the word features and the position features to obtain the sentence-level feature vector.
14. The natural language processing device of any one of claims 11 to 13, wherein the training data creation device further comprises a classifier device training module, the classifier device training module being configured to:
initialize neural network parameters of the classifier device;
randomly select one bag among the multiple bags;
determine the instance in the one bag that maximizes an objective function; and
update the neural network parameters of the classifier device based on the gradient of the one instance until the neural network converges.
CN201610640647.XA 2016-08-05 2016-08-05 The method and apparatus for creating the training data for natural language processing device Pending CN107688583A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610640647.XA CN107688583A (en) 2016-08-05 2016-08-05 The method and apparatus for creating the training data for natural language processing device
JP2017151426A JP2018022496A (en) 2016-08-05 2017-08-04 Method and equipment for creating training data to be used for natural language processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610640647.XA CN107688583A (en) 2016-08-05 2016-08-05 The method and apparatus for creating the training data for natural language processing device

Publications (1)

Publication Number Publication Date
CN107688583A true CN107688583A (en) 2018-02-13

Family

ID=61152105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610640647.XA Pending CN107688583A (en) 2016-08-05 2016-08-05 The method and apparatus for creating the training data for natural language processing device

Country Status (2)

Country Link
JP (1) JP2018022496A (en)
CN (1) CN107688583A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875045A (en) * 2018-06-28 2018-11-23 第四范式(北京)技术有限公司 The method and its system of machine-learning process are executed for text classification
CN109933602A (en) * 2019-02-28 2019-06-25 武汉大学 A kind of conversion method and device of natural language and structured query language
CN110298372A (en) * 2018-03-23 2019-10-01 鼎捷软件股份有限公司 The method and system of automatic training virtual assistant
CN110781294A (en) * 2018-07-26 2020-02-11 国际商业机器公司 Training corpus refinement and incremental update
CN113806489A (en) * 2021-09-26 2021-12-17 北京有竹居网络技术有限公司 Method, electronic device and computer program product for dataset creation

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766994A (en) * 2018-12-25 2019-05-17 华东师范大学 A kind of neural network framework of natural language inference
CN110110327B (en) * 2019-04-26 2021-06-22 网宿科技股份有限公司 Text labeling method and equipment based on counterstudy

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298372A (en) * 2018-03-23 2019-10-01 鼎捷软件股份有限公司 The method and system of automatic training virtual assistant
CN110298372B (en) * 2018-03-23 2023-06-09 鼎捷软件股份有限公司 Method and system for automatically training virtual assistant
CN108875045A (en) * 2018-06-28 2018-11-23 第四范式(北京)技术有限公司 The method and its system of machine-learning process are executed for text classification
CN108875045B (en) * 2018-06-28 2021-06-04 第四范式(北京)技术有限公司 Method of performing machine learning process for text classification and system thereof
CN110781294A (en) * 2018-07-26 2020-02-11 国际商业机器公司 Training corpus refinement and incremental update
CN110781294B (en) * 2018-07-26 2024-02-02 国际商业机器公司 Training corpus refinement and incremental update
CN109933602A (en) * 2019-02-28 2019-06-25 武汉大学 A kind of conversion method and device of natural language and structured query language
CN113806489A (en) * 2021-09-26 2021-12-17 北京有竹居网络技术有限公司 Method, electronic device and computer program product for dataset creation

Also Published As

Publication number Publication date
JP2018022496A (en) 2018-02-08

Similar Documents

Publication Publication Date Title
CN107688583A (en) The method and apparatus for creating the training data for natural language processing device
CN106528845B (en) Retrieval error correction method and device based on artificial intelligence
CN110427463B (en) Search statement response method and device, server and storage medium
WO2018207723A1 (en) Abstract generation device, abstract generation method, and computer program
CN106844658A (en) A kind of Chinese text knowledge mapping method for auto constructing and system
CN112990296B (en) Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation
CN108536681A (en) Intelligent answer method, apparatus, equipment and storage medium based on sentiment analysis
CN106919655A (en) A kind of answer provides method and apparatus
US20200183928A1 (en) System and Method for Rule-Based Conversational User Interface
CN110096711A (en) The natural language semantic matching method of the concern of the sequence overall situation and local dynamic station concern
US11417339B1 (en) Detection of plagiarized spoken responses using machine learning
CN112685550B (en) Intelligent question-answering method, intelligent question-answering device, intelligent question-answering server and computer readable storage medium
Cui et al. Dataset for the first evaluation on Chinese machine reading comprehension
Faizan et al. Automatic generation of multiple choice questions from slide content using linked data
Rahman et al. NLP-based automatic answer script evaluation
CN109033073A (en) Text contains recognition methods and device
Alshammari et al. TAQS: an Arabic question similarity system using transfer learning of BERT with BILSTM
Singh et al. Encoder-decoder architectures for generating questions
CN116860947A (en) Text reading and understanding oriented selection question generation method, system and storage medium
CN116362331A (en) Knowledge point filling method based on man-machine cooperation construction knowledge graph
Karpagam et al. Deep learning approaches for answer selection in question answering system for conversation agents
CN107818078B (en) Semantic association and matching method for Chinese natural language dialogue
Zhang et al. Sliding-Bert: Striding Towards Conversational Machine Comprehension in Long Contex
Su et al. Automatic ontology population using deep learning for triple extraction
CN113011141A (en) Buddha note model training method, Buddha note generation method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180213
