CN110019648A - Method, apparatus and storage medium for training data - Google Patents

Method, apparatus and storage medium for training data

Info

Publication number
CN110019648A
Authority
CN
China
Prior art keywords
word
candidate
hash
vector
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711269292.9A
Other languages
Chinese (zh)
Other versions
CN110019648B (en)
Inventor
李潇
郑孙聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Shenzhen Tencent Computer Systems Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Tencent Computer Systems Co Ltd filed Critical Shenzhen Tencent Computer Systems Co Ltd
Priority to CN201711269292.9A
Publication of CN110019648A
Application granted
Publication of CN110019648B
Legal status: Active (current)
Anticipated expiration: not stated

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/316 Indexing structures
    • G06F 16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 Distances to prototypes
    • G06F 18/24137 Distances to cluster centroïds
    • G06F 18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A method, apparatus and storage medium for training data. The method includes: obtaining a corpus set to be processed; extracting an entity set from the corpus set and extracting a candidate hypernym set from the entity set; combining each entity in the entity set with each hypernym in the candidate hypernym set to obtain a candidate pair set, the candidate pair set comprising multiple candidate pairs, where a candidate pair is a combination of an entity and a hypernym that have an association; configuring each candidate pair and its associated sentences as one piece of prediction data, and generalizing the sentences associated with the candidate pair in the prediction data; performing word segmentation on the sentences associated with each candidate pair to obtain a word set; feeding each word in the word set into a generalization layer for conversion to obtain a vector set; and training and predicting on the vector set according to the prediction data and a long short-term memory (LSTM) artificial neural network. By adopting this scheme, the efficiency of producing training data can be improved.

Description

Method, apparatus and storage medium for training data
Technical field
This application relates to the field of big data processing, and in particular to a method, apparatus and storage medium for training data.
Background technique
In the field of recurrent neural networks, a long short-term memory network (English: long short-term memory, abbreviated LSTM) is commonly used to process time series and to predict events with long intervals and long delays in the sequence. Before an LSTM is used for prediction, hypernyms need to be mined from the corpus set and the problem converted into a classification problem: given a candidate entity-hypernym pair, predict whether it is a real entity-hypernym pair. Typical prediction methods perform word segmentation, extract features, and then classify the candidate entity-hypernym pairs with a traditional classifier. However, this approach demands considerable domain knowledge, the final classification results may not generalize well, and the range of what can be predicted is small.
Current methods based mainly on deep learning classify candidate entity-hypernym pairs by automatically extracting features from the corpus set and generating batches of training data, then predicting on those batches, which improves classification performance. However, because the deep network is complex, each additional named entity requires more training data to be generated, and generating large amounts of training data takes a long time, so efficiency is low.
Summary of the invention
This application provides a method, apparatus and storage medium for training data, which can solve the problem in the prior art that the efficiency of producing training data is low.
A first aspect of this application provides a method of training data, the method comprising:
obtaining a corpus set to be processed;
extracting an entity set from the corpus set, the entity set comprising multiple named entities;
extracting a candidate hypernym set from the entity set;
combining each entity in the entity set with each hypernym in the candidate hypernym set to obtain a candidate pair set, the candidate pair set comprising multiple candidate pairs, where a candidate pair is a combination of an entity and a hypernym that have an association;
configuring each candidate pair and its associated sentences as one piece of prediction data, and generalizing the sentences associated with the candidate pair in the prediction data;
performing word segmentation on the sentences associated with each candidate pair to obtain a word set;
feeding each word in the word set into a generalization layer for conversion to obtain a vector set;
training and predicting on the vector set according to the prediction data and a long short-term memory (LSTM) artificial neural network.
A second aspect of this application provides an apparatus for training data, which has the function of implementing the method of training data provided in the first aspect above. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software comprises one or more modules corresponding to the above function, and the modules may be software and/or hardware.
In one possible design, the apparatus comprises:
an obtaining module, configured to obtain a corpus set to be processed;
a processing module, configured to extract an entity set from the corpus set, the entity set comprising multiple named entities;
extract a candidate hypernym set from the entity set;
combine each entity in the entity set with each hypernym in the candidate hypernym set to obtain a candidate pair set, the candidate pair set comprising multiple candidate pairs, where a candidate pair is a combination of an entity and a hypernym that have an association;
configure each candidate pair and its associated sentences as one piece of prediction data, and generalize the sentences associated with the candidate pair in the prediction data;
perform word segmentation on the sentences associated with each candidate pair to obtain a word set;
feed each word in the word set into a generalization layer for conversion to obtain a vector set;
train and predict on the vector set according to the prediction data and a long short-term memory (LSTM) artificial neural network.
Another aspect of this application provides an apparatus for training data comprising at least one connected processor, a memory and a transceiver, wherein the memory is configured to store program code and the processor is configured to call the program code in the memory to execute the method described in the first aspect above.
Another aspect of this application provides a computer storage medium comprising instructions which, when run on a computer, cause the computer to execute the method described in the first aspect above.
Compared with the prior art, in the scheme provided by this application, after the entity set and the candidate hypernym set are extracted, each entity in the entity set is combined with each hypernym in the candidate hypernym set to obtain a candidate pair set; each candidate pair and its associated sentences are configured as one piece of prediction data, and the sentences associated with the candidate pair in the prediction data are generalized; word segmentation is performed on the sentences associated with each candidate pair to obtain a word set; each word in the word set is fed into a generalization layer for conversion to obtain a vector set. Processing with the generalization layer reduces the order of magnitude of the data, so fast convergence is achieved on a small amount of prediction data and the number of parameters needed for training and prediction is reduced, thereby improving the efficiency of producing training data.
Detailed description of the invention
Fig. 1 is a flow diagram of a method of training data in an embodiment of this application;
Fig. 2 is another flow diagram of a method of training data in an embodiment of this application;
Fig. 3 is a schematic diagram of the LSTM network structure in an embodiment of this application;
Fig. 4 is a schematic diagram of converting a word in the char layer of the LSTM in an embodiment of this application;
Fig. 5 is a schematic diagram of converting a word in the hash layer of the LSTM in an embodiment of this application;
Fig. 6 is a structural schematic diagram of an apparatus for training data in an embodiment of this application;
Fig. 7 is another structural schematic diagram of an apparatus for training data in an embodiment of this application;
Fig. 8 is a structural schematic diagram of a terminal device in an embodiment of this application;
Fig. 9 is a structural schematic diagram of a server in an embodiment of this application.
Specific embodiment
The terms "first", "second", etc. in the description, claims and drawings of this application are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so termed may be interchanged where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules that are not explicitly listed or that are inherent to the process, method, product or device. The division of modules appearing in this application is only a logical division; other divisions are possible in actual implementation, for example multiple modules may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be through some interfaces, and indirect couplings or communication connections between modules may be electrical or of other similar forms, none of which is limited in this application. Modules or sub-modules described as separate components may or may not be physically separate, may or may not be physical modules, and may be distributed over multiple circuit modules; some or all of them may be selected according to actual needs to achieve the purpose of the scheme of this application.
This application provides a method, apparatus and storage medium for training data, used with an artificial neural network. An artificial neural network is an algorithmic and mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. It is an operational model consisting of a large number of interconnected nodes (also called neurons or processing units), forming a non-linear, adaptive information processing system. Each node represents a specific output function, called an activation function. Each connection between two nodes carries a weight for the signal passing through that connection, which corresponds to the memory of the artificial neural network. The output of the network differs according to its connection topology, weight values and activation functions. The network itself approximates some algorithm or function in nature, or may express a logical strategy. Depending on the complexity of the system, an artificial neural network can process information by adjusting the interconnections among its large number of internal nodes.
Artificial neural networks have the abilities of self-learning, associative storage, high-speed search for optimal solutions, self-organization, adaptation and real-time learning.
It should be particularly noted that the terminal device involved in this application may be a device that provides voice and/or data connectivity to a user, a handheld device with a wireless connection function, or another processing device connected to a wireless modem. A wireless terminal may communicate with one or more core networks via a radio access network (English: radio access network, abbreviated RAN). The wireless terminal may be a mobile terminal, such as a mobile phone (also called a "cellular" phone) or a computer with a mobile terminal, for example a portable, pocket-sized, handheld, computer-built-in or vehicle-mounted mobile device, which exchanges voice and/or data with the radio access network. Examples include personal communication service (PCS) phones, cordless phones, session initiation protocol (SIP) phones, wireless local loop (WLL) stations, and personal digital assistants (PDA). A wireless terminal may also be called a system, subscriber unit, subscriber station, mobile station, mobile, remote station, access point, remote terminal, access terminal, user terminal, terminal device, user agent, user device or user equipment.
Referring to Fig. 1, a method of training data provided by this application is introduced below. The embodiment of this application mainly includes:
101. Obtain a corpus set to be processed.
The corpus set refers to a set of corpora collected within a statistical period, and each corpus may come from at least one platform. The corpus set comprises multiple corpora, each corpus comprises multiple words, and those words constitute a word set. For example, the corpus set may be data from a forum post or from news, and it may be crawled by means such as a web crawler; the specific manner is not limited in this application. The corpus set may also be data from an enterprise, which may include employee information, company information, intellectual property, legal information, employee promotion/demotion relationships, employee attendance, employee appraisal, enterprise news, product sales information of the enterprise, production data of the enterprise, and so on. In addition, to facilitate subsequent data processing, the corpus set may be denoised; the specific manner is not limited in this application.
102. Extract an entity set from the corpus set.
The entity set comprises multiple named entities. An entity can be any noun, for example a person name, a place name, the name of a thing, an organization name, a term, etc.
103. Extract a candidate hypernym set from the entity set.
For example, the entity set includes entities such as Liu Dehua, Yao Chen, party, attend, famous star, album, release, eat, apple and lichee. From this entity set it can be inferred that "famous star" is a hypernym of Liu Dehua and Yao Chen, and that "fruit" is a hypernym of apple and lichee.
104. Combine each entity in the entity set with each hypernym in the candidate hypernym set to obtain a candidate pair set.
The candidate pair set comprises multiple candidate pairs; a candidate pair is a combination of an entity and a hypernym that have an association.
After the candidate hypernyms are inferred in step 103, it can be inferred that "famous star" is a hypernym of Liu Dehua and Yao Chen, so (Liu Dehua, famous star) and (Yao Chen, famous star) can each be taken as a candidate pair. Likewise, (apple, fruit) and (lichee, fruit) can each be taken as a candidate pair; a minimal sketch of this combination step follows below.
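A minimal sketch of the combination in step 104, assuming toy entity and hypernym lists; the function and variable names are illustrative and not taken from the patent:

```python
from itertools import product

def build_candidate_pairs(entities, candidate_hypernyms):
    """Combine every named entity with every candidate hypernym.

    Each (entity, hypernym) tuple is a candidate pair whose validity the
    LSTM classifier will later judge.
    """
    return list(product(entities, candidate_hypernyms))

entities = ["Liu Dehua", "Yao Chen", "apple", "lichee"]
candidate_hypernyms = ["famous star", "fruit"]
pairs = build_candidate_pairs(entities, candidate_hypernyms)
# e.g. ("Liu Dehua", "famous star"), ("apple", "fruit"), ...
```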
105. Configure each candidate pair and its associated sentences as one piece of prediction data, and generalize the sentences associated with the candidate pair in the prediction data.
In some embodiments, a piece of prediction data can be denoted as (pair, generalized sentence), where the generalized sentence is the sentence obtained after generalizing the sentence associated with the candidate pair, and pair denotes the candidate pair formed by the entity and the candidate hypernym.
For example, the candidate entity of pair 1 is Liu Dehua and its candidate hypernym is famous star, while the candidate entity of pair 2 is Yao Chen and its candidate hypernym is famous star. The sentences associated with candidate pair 1 may include:
"Famous stars such as Liu Dehua and Yao Chen attended the party.", "Famous stars such as Liu Dehua and Yao Chen starred in a film together.", "Famous stars such as Liu Dehua and Fan Bingbing sang a song together." ...
After the sentences associated with candidate pair 1 are generalized, the following generalized sentences are obtained respectively:
"Tag such as Nr and Yao Chen attended the party", "Tag such as Nr and Yao Chen starred in a film together", "Tag such as Nr and Fan Bingbing sang a song together" ...
Here Nr denotes the generalized named entity. Taking "Famous stars such as Liu Dehua and Yao Chen attended the party." as an example, if the pair is for "Liu Dehua", then "Liu Dehua" in the sentence is generalized to Nr; if the pair is for "Yao Chen", then "Yao Chen" in the sentence is generalized to Nr.
Tag denotes the label that generalizes the hypernym of the entity attribute; for example, in "famous stars such as Liu Dehua and Yao Chen", "famous star" is the hypernym of the person entities "Liu Dehua, Yao Chen, etc.".
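A rough illustration of this generalization step, assuming for simplicity that the entity and hypernym strings occur verbatim in a toy sentence; the markers Nr and Tag follow the example above, and the helper name is made up for this sketch (a real system would replace NER spans rather than raw strings):

```python
def generalize_sentence(sentence, pair, entity_mark="Nr", hypernym_mark="Tag"):
    """Replace the candidate entity and candidate hypernym of one pair with
    generic placeholders, yielding the generalized sentence stored in a piece
    of prediction data (pair, generalized sentence)."""
    entity, hypernym = pair
    return sentence.replace(entity, entity_mark).replace(hypernym, hypernym_mark)

pair = ("Liu Dehua", "famous star")
sentence = "Liu Dehua and Yao Chen, both famous star figures, attended the party."
prediction_data = (pair, generalize_sentence(sentence, pair))
# -> (("Liu Dehua", "famous star"),
#     "Nr and Yao Chen, both Tag figures, attended the party.")
```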
106. Perform word segmentation on the sentences associated with each candidate pair to obtain a word set.
The word set comprises N words. For example, after word segmentation the sentence "Famous stars such as Liu Dehua and Yao Chen attended the party" yields: Liu Dehua, and, Yao Chen, such as, famous, star, attended, party.
107. Feed each word in the word set into the generalization layer for conversion to obtain a vector set.
Optionally, in some embodiments of this application, the generalization layer comprises a character layer (char level) and a hash layer (hash level). Feeding each word in the word set into the generalization layer for conversion to obtain the converted word set comprises:
1. Feed each word in the word set into the character layer, where the character layer converts each input word into a word vector, yielding a word-vector set.
In some embodiments, a first word may be matched against the characters in a character lookup table to obtain the n vectors corresponding to its n characters, and a word vector is generated from the n vectors and the first word by a bidirectional LSTM; the first word refers to a word in the word set to be trained on and predicted.
For example, as shown in Fig. 4, the word in Fig. 4 is the first word. After the word enters the char layer, it is matched and combined with the character lookup table (char lookup table) of the char layer. For instance, the word is combined with char1 to charN respectively; combining the word with char1 gives output1, and the other combinations similarly give N outputs in total, i.e. output1 to outputN.
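A minimal PyTorch sketch of this char-level conversion, under assumed character-vocabulary and embedding sizes; the module and variable names are illustrative, and this sketch only runs the bidirectional LSTM over the character embeddings, omitting the combination with the word itself described above:

```python
import torch
import torch.nn as nn

class CharLevelEmbedder(nn.Module):
    """Turn one word into a vector by running a bidirectional LSTM over the
    embeddings of its characters (the char lookup table)."""
    def __init__(self, num_chars=20000, char_dim=30, hidden_dim=30):
        super().__init__()
        self.char_lookup = nn.Embedding(num_chars, char_dim)   # char lookup table
        self.bilstm = nn.LSTM(char_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, char_ids):                 # char_ids: (1, n) character indices
        char_vecs = self.char_lookup(char_ids)   # (1, n, char_dim)
        _, (h_n, _) = self.bilstm(char_vecs)     # h_n: (2, 1, hidden_dim)
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (1, 2*hidden_dim) word vector

embedder = CharLevelEmbedder()
word_vec = embedder(torch.tensor([[3, 17, 42]]))  # a word spelled by 3 character ids
```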
2. Feed each word in the word set into the hash layer, where the hash layer converts each input word into a hash vector (hash vector), yielding a hash-vector set.
In some embodiments, a hash function may be used to map the N words into K hash buckets and compress the N words into the respective buckets, obtaining K hash vectors, where each hash vector corresponds to the N words mapped to its bucket; N and K are positive integers and N > K.
For example, as shown in Fig. 5, after word1 to wordN enter the hash layer, the hash function of the hash layer maps word1 to wordN into hash1 to hashK, where hash1 to hashK denote the hash buckets. For instance, the words mapped to hash1 finally obtain one hash vector, i.e. hash1 vector; the others are similar, and the hash layer finally outputs K hash vectors, i.e. hash1 vector to hashK vector.
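A small sketch of the hash-bucket mapping, assuming Python's CRC32 hash over word strings and one shared embedding row per bucket; the names and sizes are illustrative, not from the patent:

```python
import zlib
import torch
import torch.nn as nn

K = 1000                               # number of hash buckets, far smaller than N words
hash_lookup = nn.Embedding(K, 30)      # hash lookup table: one vector per bucket

def hash_bucket(word, num_buckets=K):
    """Deterministically map a word to one of K buckets; every word that lands
    in the same bucket shares that bucket's hash vector."""
    return zlib.crc32(word.encode("utf-8")) % num_buckets

words = ["Liu Dehua", "Yao Chen", "party", "famous", "star"]
bucket_ids = torch.tensor([hash_bucket(w) for w in words])
hash_vectors = hash_lookup(bucket_ids)  # (5, 30); rows are shared across collisions
```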
3. Obtain the vector set according to the word-vector set and the hash-vector set.
In some embodiments, the word vectors and the K hash vectors may be spliced (concatenated) to obtain the vector set.
In some embodiments, after each word in the word set is fed into the generalization layer for conversion to obtain the vector set, the vector of each word has been obtained. Since a word can appear in two dimensions, in the sentence and in the candidate pair, the final vector of each word can correspond to two matrices: a sentence matrix and a pair matrix. Taking the first sentence in the corpus set and the first candidate pair in the candidate pair set as examples, they are introduced below:
1. The sentence matrix
For example, a first matrix is obtained for the first sentence. The first matrix is determined by the number of words of the first sentence after segmentation, the vector dimension output after generalization by the character layer, and the vector dimension set for generalization by the hash layer.
In some embodiments, the first matrix can be written as L1 * (char_N + hash_N), where L1 is the number of words after segmenting the sentence, char_N is the vector dimension output after char-level generalization, and hash_N is the vector dimension set for the hash lookup table.
2. The pair matrix
For example, a second matrix is obtained for the first candidate pair. The second matrix is determined by the number of words of the candidate pair after segmentation, the vector dimension output after generalization by the character layer, and the vector dimension set for generalization by the hash layer.
In some embodiments, the second matrix can be written as L2 * (char_N + hash_N), where L2 is the number of words after separately segmenting the candidate entity and candidate hypernym of the first candidate pair, char_N is the vector dimension output after char-level generalization, and hash_N is the hash-vector dimension.
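For instance, a quick shape check under assumed dimensions (the numbers are illustrative only, not values given by the patent):

```python
# Suppose char_N = 60 (bidirectional char-LSTM output) and hash_N = 30.
char_N, hash_N = 60, 30
L1 = 8   # words in the segmented sentence
L2 = 3   # words in the segmented candidate entity plus candidate hypernym

sentence_matrix_shape = (L1, char_N + hash_N)   # (8, 90)
pair_matrix_shape     = (L2, char_N + hash_N)   # (3, 90)
```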
108. Train and predict on the vector set according to the prediction data and the long short-term memory (LSTM) artificial neural network.
Compared with current mechanisms, in the embodiments of this application, after the entity set and candidate hypernym set are extracted, each entity in the entity set is combined with each hypernym in the candidate hypernym set to obtain the candidate pair set; each candidate pair and its associated sentences are configured as one piece of prediction data, and the sentences associated with the candidate pair in the prediction data are generalized; word segmentation is performed on the sentences associated with each candidate pair to obtain a word set; each word in the word set is fed into the generalization layer for conversion to obtain a vector set. Training on the vector set obtained through the generalization layer allows fast convergence on a small amount of prediction data and reduces the number of parameters needed for training and prediction, thereby improving the efficiency of producing training data and reducing its production cost and the training time. Moreover, processing with the generalization layer also reduces deep learning's over-reliance on large numbers of training samples and its slow convergence, and achieves fairly good performance automatically through training on a small amount of data, without manual feature engineering.
For ease of understanding, the method of training data provided in the embodiments of this application is introduced below with a concrete application scenario. As shown in Fig. 2, the embodiment of this application may include:
Step 1: Segment the sentences in the corpus set, obtain candidate pairs based on the corpus set, and generalize the sentences using the candidate pairs.
For each sentence in the corpus set, named entity recognition is first used to obtain the entity set it contains; then all possible nouns, noun phrases, etc. are taken as the candidate hypernym set, and any pairwise combination of an entity from the entity set and a hypernym from the candidate hypernym set is treated as a candidate pair. Then each candidate pair, together with the sentence corresponding to the candidate pair, is configured as one piece of prediction data, while the sentence is generalized.
Named entity recognition (English: named entities recognition, abbreviated NER) is a basic task of natural language processing whose purpose is to identify named entities such as person names, place names and organization names in the corpus set. Since the number of such named entities keeps growing, it is usually impossible to list them exhaustively in a dictionary, and their formation follows its own rules; therefore the recognition of these words is usually handled separately from lexical and morphological processing tasks (such as Chinese word segmentation) and is called named entity recognition. Named entity recognition is an indispensable component of natural language processing technologies such as information extraction, information retrieval, machine translation and question answering systems.
Considering that the entity and candidate hypernym of a candidate pair may come from different sentences, suppose for example that the corpus set includes the following two sentences:
(1) Famous stars such as Liu Dehua and Yao Chen attended the party.
(2) Famous stars such as Liu Dehua and Yao Chen starred in a film together.
Then, when constructing prediction data here, multiple generalized sentences will appear, but they all correspond to the same candidate pair.
Take the sentence "Famous stars such as Liu Dehua and Yao Chen attended the party." as an example. Liu Dehua and Yao Chen are person-name entities, and "famous star" is the corresponding candidate hypernym. By combining the entities with the candidate hypernym, 2 candidate pairs are obtained: (Liu Dehua, famous star) and (Yao Chen, famous star). Based on these two candidate pairs, and combining the sentence in which the entity and candidate hypernym of each pair appear, the corresponding prediction data can be constructed:
(1) Prediction data 1: pair (Liu Dehua, famous star), generalized sentence (Tag such as Nr and Yao Chen attended the party).
(2) Prediction data 2: pair (Yao Chen, famous star), generalized sentence (Tag such as Liu Dehua and Nr attended the party).
Step 2: Each word obtained after segmenting the sentence passes through a generalization layer to generate a word vector. Converting each word through the generalization layer effectively reduces the number of parameters and achieves fast convergence on a small amount of training data.
Step 3: Use the LSTM network to train and predict on the data processed by the generalization layer; the input is the candidate pair and the sentence corresponding to the candidate pair.
In some embodiments of this application, the LSTM network structure and the process of generalization-layer processing with it are described below.
As shown in Fig. 3, the LSTM network structure includes a softmax classifier, a sentence (clause) template, a pair constraint template and a generalization layer.
The generalization layer includes a character layer (char level) and a hash layer (hash level). The char level includes a bidirectional LSTM and a character lookup table (char lookup table); the char lookup table may contain N different chars, where N may be 1 to 20,000, and this application does not limit the value of N.
The hash level includes a hash function and a hash lookup table; the hash lookup table contains K hash buckets, K can be set empirically, and this application does not limit the value of K.
The vector of each char and each hash entry in its respective lookup table may be M (20 to 50) dimensional.
The softmax classifier models the output with a multinomial distribution, can separate multiple mutually exclusive classes, and can map (compress) any K-dimensional real vector into another K-dimensional real vector. The softmax classifier refers to the output layer of the artificial neural network.
The sentence (clause) template refers to the LSTM that processes the sentence matrix.
The pair constraint template refers to the LSTM that processes the pair matrix.
1. Processing principle of the generalization layer:
The generalization-layer pipeline includes: replacing the word level with the char level, and mapping words to hash vectors with a hash function.
1) Char level replaces word level.
As shown in Fig. 4, for each word in the word lookup table after segmentation, the vector of each of its chars is obtained through the char lookup table (char1 ... charN); a bidirectional LSTM then combines the result (n vectors) with the word to generate a new word vector. This retains the information of the word itself while greatly reducing the parameter-explosion problem caused by using the original word lookup table.
2) Map words to hash vectors with a hash function.
As shown in Fig. 5, the N words in the word lookup table are mapped into K hash buckets using the hash function, where K can be far smaller than N, guaranteeing a reduction in the order of magnitude of the number of parameters. Multiple words are compressed together and share one hash vector. By sharing hash vectors, training can be greatly accelerated and fairly good results can be obtained on a smaller training dataset.
Each hash bucket yields one corresponding hash vector; "sharing one hash vector" here means that when the N words are mapped to buckets hash1 to hashK, the words mapped to bucket hash1 all share hash1 vector.
As can be seen, the LSTM network structure obtains a vector representation of the sentence corresponding to the pair and a vector representation of the pair itself, and then classifies these two kinds of vectors together, using both the pair information and the sentence information. With data from these two dimensions, convergence can be reached quickly and the parameter-explosion phenomenon is significantly reduced; a compact sketch of this two-branch architecture follows below.
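A compact PyTorch sketch of the two-branch architecture, assuming the char/hash embeddings above have already produced the sentence matrix and the pair matrix; the layer sizes and names are assumptions for illustration, not the patent's exact configuration:

```python
import torch
import torch.nn as nn

class HypernymClassifier(nn.Module):
    """Sentence-template LSTM plus pair-constraint LSTM, concatenated and fed
    to a softmax output layer (2 classes: real / not a real entity-hypernym pair)."""
    def __init__(self, in_dim=90, h1=64, h2=32, num_classes=2):
        super().__init__()
        self.clause_lstm = nn.LSTM(in_dim, h1, batch_first=True)  # sentence template
        self.pair_lstm = nn.LSTM(in_dim, h2, batch_first=True)    # pair constraint template
        self.out = nn.Linear(h1 + h2, num_classes)                # softmax classifier

    def forward(self, sentence_matrix, pair_matrix):
        # sentence_matrix: (batch, L1, in_dim), pair_matrix: (batch, L2, in_dim)
        _, (h_sent, _) = self.clause_lstm(sentence_matrix)
        _, (h_pair, _) = self.pair_lstm(pair_matrix)
        feats = torch.cat([h_sent[-1], h_pair[-1]], dim=-1)       # (batch, h1 + h2)
        return self.out(feats)                                    # logits; softmax applied at loss time
```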
2. Generalization processing based on the LSTM network structure.
The process of generalization processing with this LSTM network structure is described below (including step 1 to step 4):
1) Initialize the char lookup table matrix and the hash lookup table matrix.
In some embodiments, random initialization may be used.
2) For a candidate pair and the segmented sentence, the input vector of each word can be obtained through the generalization layer:
(a) For the sentence, a sentence matrix of size L1 * (char_N + hash_N) is obtained.
Here L1 is the number of words after segmenting the sentence, char_N is the vector dimension output after char-level generalization, and hash_N is the vector dimension set for the hash lookup table.
The char_N and hash_N parts obtained in step (a) are concatenated, yielding the sentence-matrix output of the generalization layer.
(b) For the pair, a pair matrix of size L2 * (char_N + hash_N) is obtained.
Here L2 is the number of words after separately segmenting the candidate entity and the candidate hypernym of the pair, char_N is the vector dimension output after char-level generalization, and hash_N is the vector dimension set for the hash lookup table.
The char_N and hash_N parts obtained in step (b) are appended together, yielding the pair-matrix output of the generalization layer.
(c) Feed the sentence matrix into the clause model and the pair matrix into the pair constraint model.
3) The two results output after processing by the clause model and the pair constraint model in step (c) are appended together, finally yielding a (clause h1 + pair-constraint h2)-dimensional vector.
Here clause h1 is the h1-dimensional vector output by the clause model, pair-constraint h2 is the h2-dimensional vector output by the pair constraint model, and "append" means directly splicing multiple vectors together.
4) Classify the (clause h1 + pair-constraint h2)-dimensional vector obtained by the splicing in step 3) with the softmax classifier.
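Under the same assumptions as the model sketch above, training on the prediction data could look roughly like this; the loss and optimizer choices are illustrative, not specified by the patent:

```python
import torch
import torch.nn as nn

model = HypernymClassifier()                       # from the sketch above
criterion = nn.CrossEntropyLoss()                  # applies log-softmax internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy batch: 4 pieces of prediction data, L1 = 8, L2 = 3, in_dim = 90.
sentence_batch = torch.randn(4, 8, 90)             # generalized-sentence matrices
pair_batch = torch.randn(4, 3, 90)                 # pair matrices
labels = torch.tensor([1, 0, 1, 0])                # 1 = real entity-hypernym pair

logits = model(sentence_batch, pair_batch)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```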
Any technical feature in the embodiments corresponding to any of Fig. 1 to Fig. 5 applies equally to the embodiments corresponding to Fig. 6 to Fig. 8 of this application; similar parts are not repeated below.
The method of training data in this application has been described above; the apparatus that executes the above method of training data is described below. The apparatus may be a functional module installed on a terminal device or server, it may be the terminal device or server itself, or it may combine functional modules with hardware modules; this application does not limit this.
Referring to Fig. 6, the apparatus includes:
an obtaining module, configured to obtain a corpus set to be processed;
a processing module, configured to extract an entity set from the corpus set, the entity set comprising multiple named entities;
extract a candidate hypernym set from the entity set;
combine each entity in the entity set with each hypernym in the candidate hypernym set to obtain a candidate pair set, the candidate pair set comprising multiple candidate pairs, where a candidate pair is a combination of an entity and a hypernym that have an association;
configure each candidate pair and its associated sentences as one piece of prediction data, and generalize the sentences associated with the candidate pair in the prediction data;
perform word segmentation on the sentences associated with each candidate pair to obtain a word set;
feed each word in the word set into a generalization layer for conversion to obtain a vector set;
train and predict on the vector set according to the prediction data and a long short-term memory (LSTM) artificial neural network.
In the embodiments of this application, after the processing module extracts the entity set and the candidate hypernym set, it combines each entity in the entity set with each hypernym in the candidate hypernym set to obtain a candidate pair set; it configures each candidate pair and its associated sentences as one piece of prediction data and generalizes the sentences associated with the candidate pair in the prediction data; it performs word segmentation on the sentences associated with each candidate pair to obtain a word set; and it feeds each word in the word set into the generalization layer for conversion to obtain a vector set. Processing with the generalization layer reduces the order of magnitude of the data, so fast convergence is achieved on a small amount of prediction data and the number of parameters needed for training and prediction is reduced, thereby improving the efficiency of producing training data.
Optionally, in some embodiments of this application, the generalization layer includes a character layer and a hash layer, and the processing module is specifically configured to:
feed each word in the word set into the character layer, where the character layer converts each input word into a word vector, yielding a word-vector set;
feed each word in the word set into the hash layer, where the hash layer converts each input word into a hash vector, yielding a hash-vector set;
obtain the vector set according to the word-vector set and the hash-vector set.
Optionally, in some embodiments of this application, the word set comprises N words, and the processing module is specifically configured to:
match a first word against the characters in a character lookup table to obtain the n vectors corresponding to its n characters, and generate a word vector from the n vectors and the first word by a bidirectional LSTM, the first word referring to a word in the word set to be trained on and predicted.
Optionally, in some embodiments of this application, the processing module is specifically configured to:
map the N words into K hash buckets using a hash function and compress the N words in each hash bucket, obtaining K hash vectors, each hash vector corresponding to the N words mapped to it, where N and K are positive integers and N > K.
Optionally, in some embodiments of this application, the processing module is specifically configured to:
splice the word vectors and the K hash vectors to obtain the vector set.
Optionally, in some embodiments of this application, a first matrix is obtained for a first sentence in the corpus set, the first matrix being determined by the number of words of the first sentence after segmentation, the vector dimension output after generalization by the character layer, and the vector dimension set for generalization by the hash layer;
a second matrix is obtained for a first candidate pair in the candidate pair set, the second matrix being determined by the number of words of the candidate pair after segmentation, the vector dimension output after generalization by the character layer, and the vector dimension set for generalization by the hash layer.
The server and terminal device in the embodiments of this application have been described above from the perspective of modular functional entities; they are described below from the perspective of hardware processing. It should be noted that the entity device corresponding to the obtaining module in the embodiment corresponding to Fig. 6 of this application may be an input/output unit, and the entity device corresponding to the processing module may be a processor. The apparatus shown in Fig. 6 may have the structure shown in Fig. 7; when the apparatus has the structure shown in Fig. 7, the processor and input/output unit in Fig. 7 implement the same or similar functions as the processing module and obtaining module provided in the foregoing apparatus embodiment, and the memory in Fig. 7 stores the program code that the processor needs to call when executing the above method of training data.
The embodiments of this application also provide a terminal device. As shown in Fig. 8, for ease of description only the parts relevant to the embodiments of this application are shown; for specific technical details not disclosed, please refer to the method part of the embodiments of this application. The terminal device may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (English: personal digital assistant, abbreviated PDA), a point-of-sale terminal (English: point of sales, abbreviated POS), a vehicle-mounted computer, etc.
Fig. 8 shows a block diagram of part of the structure of a terminal device related to the apparatus for training data provided by the embodiments of this application. Referring to Fig. 8, the terminal device includes components such as a radio frequency (RF) circuit 88, a memory 820, an input unit 830, a display unit 840, a sensor 850, an audio circuit 860, a wireless fidelity (WiFi) module 870, a processor 880 and a power supply 890. Those skilled in the art will understand that the terminal device structure shown in Fig. 8 does not constitute a limitation on the terminal device, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
Each component of the terminal device is introduced below with reference to Fig. 8:
The RF circuit 88 may be used to receive and send signals during information transmission and reception or during a call; in particular, after downlink information from a base station is received, it is handed to the processor 880 for processing, and uplink data is sent to the base station. Generally, the RF circuit 88 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, etc. In addition, the RF circuit 88 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), e-mail, short messaging service (SMS), etc.
The memory 820 may be used to store software programs and modules, and the processor 880 executes the various functional applications and data processing of the terminal device by running the software programs and modules stored in the memory 820. The memory 820 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the applications required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the terminal device (such as audio data, a phone book, etc.). In addition, the memory 820 may include a high-speed random access memory and may also include a non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The input unit 830 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the terminal device. Specifically, the input unit 830 may include a touch panel 831 and other input devices 832. The touch panel 831, also called a touch screen, can collect touch operations of the user on or near it (such as operations of the user on or near the touch panel 831 using a finger, a stylus or any other suitable object or accessory) and drive the corresponding connection apparatus according to a preset program. Optionally, the touch panel 831 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the touch orientation of the user, detects the signal brought by the touch operation and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection apparatus, converts it into contact coordinates, sends them to the processor 880, and can receive and execute commands sent by the processor 880. Furthermore, the touch panel 831 may be implemented in multiple types such as resistive, capacitive, infrared and surface acoustic wave. Besides the touch panel 831, the input unit 830 may also include other input devices 832, which may include but are not limited to one or more of a physical keyboard, function keys (such as a volume control key, a switch key, etc.), a trackball, a mouse and a joystick.
The display unit 840 may be used to display information input by the user or information provided to the user, as well as the various menus of the terminal device. The display unit 840 may include a display panel 841, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), etc. Furthermore, the touch panel 831 may cover the display panel 841; after the touch panel 831 detects a touch operation on or near it, it transmits the operation to the processor 880 to determine the type of the touch event, and the processor 880 then provides a corresponding visual output on the display panel 841 according to the type of the touch event. Although in Fig. 8 the touch panel 831 and the display panel 841 are two independent components implementing the input and output functions of the terminal device, in some embodiments the touch panel 831 and the display panel 841 may be integrated to implement the input and output functions of the terminal device.
The terminal device may also include at least one sensor 850, such as a light sensor, a motion sensor and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 841 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 841 and/or the backlight when the terminal device moves to the ear. As a kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that identify the posture of the terminal device (such as portrait/landscape switching, related games, magnetometer posture calibration) and for vibration-recognition-related functions (such as a pedometer, tapping). Other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer and an infrared sensor may also be configured on the terminal device, which are not repeated here.
The audio circuit 860, a loudspeaker 861 and a microphone 862 can provide an audio interface between the user and the terminal device. The audio circuit 860 can transmit the electrical signal converted from the received audio data to the loudspeaker 861, which converts it into a sound signal for output; on the other hand, the microphone 862 converts the collected sound signal into an electrical signal, which is received by the audio circuit 860 and converted into audio data; the audio data is then output to the processor 880 for processing and, for example, sent via the RF circuit 88 to another terminal device, or the audio data is output to the memory 820 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 870, the terminal device can help the user send and receive e-mail, browse web pages, access streaming media, etc.; it provides the user with wireless broadband Internet access. Although Fig. 8 shows the WiFi module 870, it should be understood that it is not a necessary component of the terminal device and may be omitted as needed without changing the essence of the application.
The processor 880 is the control center of the terminal device. It uses various interfaces and lines to connect all parts of the entire terminal device, and performs the various functions and data processing of the terminal device by running or executing the software programs and/or modules stored in the memory 820 and calling the data stored in the memory 820, thereby monitoring the terminal device as a whole. Optionally, the processor 880 may include one or more processing units; preferably, the processor 880 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, applications, etc., and the modem processor mainly handles wireless communication. It can be understood that the above modem processor may also not be integrated into the processor 880.
The terminal device also includes a power supply 890 (such as a battery) that supplies power to all components. Preferably, the power supply may be logically connected to the processor 880 through a power management system, so that functions such as charging management, discharging management and power consumption management are implemented through the power management system.
Although not shown, the terminal device may also include a camera, a Bluetooth module, etc., which are not described here.
In the embodiments of this application, the processor 880 included in the terminal device also has the function of controlling the execution of the method flow performed by the apparatus shown in Fig. 6 above. For example, by calling the instructions in the memory 820, the processor 880 performs the following operations:
obtaining a corpus set to be processed;
extracting an entity set from the corpus set, the entity set comprising multiple named entities;
extracting a candidate hypernym set from the entity set;
combining each entity in the entity set with each hypernym in the candidate hypernym set to obtain a candidate pair set, the candidate pair set comprising multiple candidate pairs, where a candidate pair is a combination of an entity and a hypernym that have an association;
configuring each candidate pair and its associated sentences as one piece of prediction data, and generalizing the sentences associated with the candidate pair in the prediction data;
performing word segmentation on the sentences associated with each candidate pair to obtain a word set;
feeding each word in the word set into a generalization layer for conversion to obtain a vector set;
training and predicting on the vector set according to the prediction data and a long short-term memory (LSTM) artificial neural network.
Fig. 9 is a schematic diagram of a server structure provided by the embodiments of this application. The server 920 may vary considerably with configuration or performance and may include one or more central processing units (English: central processing units, abbreviated CPU) 922 (for example, one or more processors), a memory 932, and one or more storage media 930 (such as one or more mass storage devices) storing application programs 942 or data 944. The memory 932 and the storage medium 930 may be transient or persistent storage. The program stored in the storage medium 930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the central processing unit 922 may be configured to communicate with the storage medium 930 and execute, on the server 920, the series of instruction operations in the storage medium 930.
The server 920 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input/output interfaces 958, and/or one or more operating systems 941, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
The steps performed by the device in the embodiment shown in Fig. 6 above may be based on the server structure shown in Fig. 9. For example, by calling the instructions in the memory 932, the processor 922 performs the following operations:
Obtain a corpus set to be processed;
Extract an entity set from the corpus set, the entity set including multiple named entities;
Extract a candidate hypernym set from the entity set;
Combine each entity in the entity set with each hypernym in the candidate hypernym set to obtain a candidate pair set, the candidate pair set including multiple candidate pairs, where a candidate pair refers to a combination of an entity and a hypernym that have an association relationship;
Configure one piece of prediction data for each sentence associated with a candidate pair, and perform generalization processing on the sentences in the prediction data that are associated with the candidate pairs;
Perform word segmentation on each sentence associated with a candidate pair to obtain a word set;
Input each word in the word set into a generalization processing layer for conversion to obtain a vector set;
Train and predict on the vector set according to the prediction data and a long short-term memory (LSTM) artificial neural network.
In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, devices, and modules described above may refer to the corresponding processes in the foregoing method embodiments and are not described in detail here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely exemplary. The division into modules is only a division by logical function; in actual implementation there may be other division manners, for example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or modules may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing module, each module may exist alone physically, or two or more modules may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may be stored in a computer-readable storage medium.
The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), among others.
The technical solutions provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the descriptions of the above embodiments are only intended to help understand the methods of this application and their core ideas. At the same time, those skilled in the art will make changes to the specific implementations and the application scope according to the ideas of this application. In conclusion, the contents of this specification should not be construed as limiting this application.

Claims (14)

1. A method of training data, characterized in that the method comprises:
obtaining a corpus set to be processed;
extracting an entity set from the corpus set, the entity set including multiple named entities;
extracting a candidate hypernym set from the entity set;
combining each entity in the entity set with each hypernym in the candidate hypernym set to obtain a candidate pair set, the candidate pair set including multiple candidate pairs, a candidate pair referring to a combination of an entity and a hypernym that have an association relationship;
configuring one piece of prediction data for each sentence associated with a candidate pair, and performing generalization processing on the sentences in the prediction data that are associated with the candidate pairs;
performing word segmentation on each sentence associated with a candidate pair to obtain a word set;
inputting each word in the word set into a generalization processing layer for conversion to obtain a vector set;
training and predicting on the vector set according to the prediction data and a long short-term memory (LSTM) artificial neural network.
2. The method according to claim 1, characterized in that the generalization processing layer includes a character layer and a hash layer, and inputting each word in the word set into the generalization processing layer for conversion to obtain the vector set comprises:
inputting each word in the word set into the character layer, the character layer converting each word input into the character layer into a word vector, to obtain a word vector set;
inputting each word in the word set into the hash layer, the hash layer converting each word input into the hash layer into a hash vector, to obtain a hash vector set;
obtaining the vector set according to the word vector set and the hash vector set.
3. The method according to claim 2, characterized in that the word set includes N words, and inputting each word in the word set into the character layer, converting each word input into the character layer into a word vector in the character layer, and obtaining the word vector set comprises:
matching a first word against the characters in a character lookup table to obtain n vectors corresponding to its n characters, and generating a word vector from the n vectors and the first word according to a bidirectional LSTM, the first word referring to a word in the word set to be trained and predicted.
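As a minimal sketch of the character layer in this claim, the snippet below (assuming PyTorch, a toy character lookup table, and hypothetical vector dimensions) looks up one vector per character and runs a bidirectional LSTM over those character vectors to produce a single word vector; this is only one plausible realization, not the patented implementation itself.

```python
import torch
import torch.nn as nn

class CharLayer(nn.Module):
    """Character layer sketch: characters of a word -> n character vectors -> one word vector."""
    def __init__(self, char_vocab, char_dim=32, word_dim=64):
        super().__init__()
        self.char_to_id = {c: i for i, c in enumerate(char_vocab)}   # character lookup table
        self.char_emb = nn.Embedding(len(char_vocab), char_dim)
        # Bidirectional LSTM over the word's character vectors.
        self.bilstm = nn.LSTM(char_dim, word_dim // 2, bidirectional=True, batch_first=True)

    def forward(self, word):
        ids = torch.tensor([[self.char_to_id[c] for c in word]])     # shape (1, n)
        char_vectors = self.char_emb(ids)                            # shape (1, n, char_dim)
        _, (h_n, _) = self.bilstm(char_vectors)                      # h_n: (2, 1, word_dim // 2)
        return torch.cat([h_n[0], h_n[1]], dim=-1)                   # word vector, shape (1, word_dim)

# Usage with a hypothetical alphabet and word:
layer = CharLayer(char_vocab=list("abcdefghijklmnopqrstuvwxyz"))
word_vector = layer("apple")    # one word vector built from the word's character vectors
```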
4. The method according to claim 2 or 3, characterized in that inputting each word in the word set into the hash layer, converting each word input into the hash layer into a hash vector in the hash layer, and obtaining the hash vector set comprises:
mapping the N words into K hash buckets using a hash function, compressing the N words in each hash bucket, and obtaining K hash vectors, each hash vector corresponding to the N words, where N and K are positive integers and N > K.
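A minimal sketch of the hash layer in this claim: N word embeddings are mapped into K buckets by an ordinary hash function and each bucket is compressed into one hash vector. The MD5-based bucket assignment and the mean pooling used as the compression are assumptions made for illustration; the claim only requires a hash function and some form of compression.

```python
import hashlib
import numpy as np

def hash_layer(words, word_embeddings, K=8):
    """Map N words into K hash buckets and compress each bucket into one hash vector (N > K)."""
    dim = next(iter(word_embeddings.values())).shape[0]
    buckets = [[] for _ in range(K)]
    for w in words:
        # Stable hash so the same word always falls into the same bucket.
        bucket = int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % K
        buckets[bucket].append(word_embeddings[w])
    # Compression: mean of the vectors in each bucket (zeros for an empty bucket).
    return [np.mean(b, axis=0) if b else np.zeros(dim) for b in buckets]

# Usage with toy data (hypothetical 16-dimensional embeddings, N = 20 words, K = 8 buckets):
words = [f"word{i}" for i in range(20)]
embeddings = {w: np.random.rand(16) for w in words}
hash_vectors = hash_layer(words, embeddings, K=8)    # K hash vectors for the whole word set
```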
5. The method according to claim 4, characterized in that obtaining the vector set according to the word vector set and the hash vector set comprises:
concatenating the word vectors and the K hash vectors to obtain the vector set.
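The concatenation in this claim can be shown in a few lines with hypothetical dimensions: a word's character-layer vector is simply joined with the K hash vectors to form its entry in the vector set.

```python
import numpy as np

word_vector = np.random.rand(64)                         # character-layer output (hypothetical dim)
hash_vectors = [np.random.rand(16) for _ in range(8)]    # K = 8 hash vectors (hypothetical dim)

# One entry of the vector set: the word vector concatenated with the K hash vectors.
combined = np.concatenate([word_vector] + hash_vectors)
print(combined.shape)    # (64 + 8 * 16,) = (192,)
```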
6. The method according to claim 5, characterized in that, after each word in the word set is input into the generalization processing layer for conversion and the vector set is obtained, a first sentence in the corpus set corresponds to a first matrix, the first matrix being obtained according to the number of words after the first sentence is segmented, the vector dimension output after generalization processing by the character layer, and the vector dimension set during generalization processing by the hash layer;
a first candidate pair in the candidate pair set corresponds to a second matrix, the second matrix being obtained according to the number of words after the candidate pair is segmented, the vector dimension output after generalization processing by the character layer, and the vector dimension set during generalization processing by the hash layer.
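Read this way, each sentence (or candidate pair) yields a matrix with one row per segmented word and one column per dimension contributed by the character layer plus the hash layer. A small illustration with hypothetical numbers:

```python
import numpy as np

num_words = 12          # words in the first sentence after segmentation (hypothetical)
char_dim = 64           # vector dimension output by the character layer (hypothetical)
hash_dim = 8 * 16       # K = 8 hash vectors of 16 dimensions each (hypothetical)

# First matrix: one row per word, columns = character-layer dimension + hash-layer dimension.
first_matrix = np.zeros((num_words, char_dim + hash_dim))
print(first_matrix.shape)    # (12, 192)
```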
7. A device for training data, characterized in that the device comprises:
an obtaining module, configured to obtain a corpus set to be processed;
a processing module, configured to extract an entity set from the corpus set, the entity set including multiple named entities;
extract a candidate hypernym set from the entity set;
combine each entity in the entity set with each hypernym in the candidate hypernym set to obtain a candidate pair set, the candidate pair set including multiple candidate pairs, a candidate pair referring to a combination of an entity and a hypernym that have an association relationship;
configure one piece of prediction data for each sentence associated with a candidate pair, and perform generalization processing on the sentences in the prediction data that are associated with the candidate pairs;
perform word segmentation on each sentence associated with a candidate pair to obtain a word set;
input each word in the word set into a generalization processing layer for conversion to obtain a vector set;
train and predict on the vector set according to the prediction data and a long short-term memory (LSTM) artificial neural network.
8. The device according to claim 7, characterized in that the generalization processing layer includes a character layer and a hash layer, and the processing module is specifically configured to:
input each word in the word set into the character layer, the character layer converting each word input into the character layer into a word vector, to obtain a word vector set;
input each word in the word set into the hash layer, the hash layer converting each word input into the hash layer into a hash vector, to obtain a hash vector set;
obtain the vector set according to the word vector set and the hash vector set.
9. The device according to claim 8, characterized in that the word set includes N words, and the processing module is specifically configured to:
match a first word against the characters in a character lookup table to obtain n vectors corresponding to its n characters, and generate a word vector from the n vectors and the first word according to a bidirectional LSTM, the first word referring to a word in the word set to be trained and predicted.
10. The device according to claim 8 or 9, characterized in that the processing module is specifically configured to:
map the N words into K hash buckets using a hash function, compress the N words in each hash bucket, and obtain K hash vectors, each hash vector corresponding to the N words, where N and K are positive integers and N > K.
11. The device according to claim 10, characterized in that the processing module is specifically configured to:
concatenate the word vectors and the K hash vectors to obtain the vector set.
12. The device according to claim 11, characterized in that a first sentence in the corpus set corresponds to a first matrix, the first matrix being obtained according to the number of words after the first sentence is segmented, the vector dimension output after generalization processing by the character layer, and the vector dimension set during generalization processing by the hash layer;
a first candidate pair in the candidate pair set corresponds to a second matrix, the second matrix being obtained according to the number of words after the candidate pair is segmented, the vector dimension output after generalization processing by the character layer, and the vector dimension set during generalization processing by the hash layer.
13. A computer storage medium, characterized in that it comprises instructions which, when run on a computer, cause the computer to execute the method according to any one of claims 1 to 6.
14. A computer program product comprising instructions, characterized in that, when run on a computer, it causes the computer to execute the method according to any one of claims 1 to 6.
CN201711269292.9A 2017-12-05 2017-12-05 Method and device for training data and storage medium Active CN110019648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711269292.9A CN110019648B (en) 2017-12-05 2017-12-05 Method and device for training data and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711269292.9A CN110019648B (en) 2017-12-05 2017-12-05 Method and device for training data and storage medium

Publications (2)

Publication Number Publication Date
CN110019648A (en) 2019-07-16
CN110019648B CN110019648B (en) 2021-02-02

Family

ID=67185955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711269292.9A Active CN110019648B (en) 2017-12-05 2017-12-05 Method and device for training data and storage medium

Country Status (1)

Country Link
CN (1) CN110019648B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8086549B2 (en) * 2007-11-09 2011-12-27 Microsoft Corporation Multi-label active learning
CN106407211A (en) * 2015-07-30 2017-02-15 富士通株式会社 Method and device for classifying semantic relationships among entity words
CN106919977A (en) * 2015-12-25 2017-07-04 科大讯飞股份有限公司 A kind of feedforward sequence Memory Neural Networks and its construction method and system
WO2017130089A1 (en) * 2016-01-26 2017-08-03 Koninklijke Philips N.V. Systems and methods for neural clinical paraphrase generation
US20170221474A1 (en) * 2016-02-02 2017-08-03 Mitsubishi Electric Research Laboratories, Inc. Method and System for Training Language Models to Reduce Recognition Errors
CN105808525A (en) * 2016-03-29 2016-07-27 国家计算机网络与信息安全管理中心 Domain concept hypernym-hyponym relation extraction method based on similar concept pairs
CN106570179A (en) * 2016-11-10 2017-04-19 中国科学院信息工程研究所 Evaluative text-oriented kernel entity identification method and apparatus
CN106649819A (en) * 2016-12-29 2017-05-10 北京奇虎科技有限公司 Method and device for extracting entity words and hypernyms
CN106980608A (en) * 2017-03-16 2017-07-25 四川大学 A kind of Chinese electronic health record participle and name entity recognition method and system
CN107203511A (en) * 2017-05-27 2017-09-26 中国矿业大学 A kind of network text name entity recognition method based on neutral net probability disambiguation
CN107273357A (en) * 2017-06-14 2017-10-20 北京百度网讯科技有限公司 Modification method, device, equipment and the medium of participle model based on artificial intelligence

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CHAO MA et al.: "Unsupervised Video Hashing by Exploiting Spatio-Temporal Feature", International Conference on Neural Information Processing *
JULIAN GEORG ZILLY et al.: "Recurrent Highway Networks", arXiv:1607.03474v5 *
ZIMING ZHANG et al.: "Efficient Training of Very Deep Neural Networks for Supervised Hashing", arXiv:1511.04524v2 *
张俊驰 (ZHANG Junchi): "Research on Dependency Parsing Models Based on Recurrent Neural Networks", China Master's Theses Full-text Database, Information Science and Technology *
李彦鹏 (LI Yanpeng): "Feature Coupling Generalization and Its Applications in Text Mining", China Doctoral Dissertations Full-text Database, Information Science and Technology *
胡新辰 (HU Xinchen): "Research on Semantic Relation Classification Based on LSTM", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765244A (en) * 2019-09-18 2020-02-07 平安科技(深圳)有限公司 Method and device for acquiring answering, computer equipment and storage medium
CN110765244B (en) * 2019-09-18 2023-06-06 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for obtaining answering operation
US11501070B2 (en) 2020-07-01 2022-11-15 International Business Machines Corporation Taxonomy generation to insert out of vocabulary terms and hypernym-hyponym pair induction

Also Published As

Publication number Publication date
CN110019648B (en) 2021-02-02

Similar Documents

Publication Publication Date Title
KR102646667B1 (en) Methods for finding image regions, model training methods, and related devices
WO2020108483A1 (en) Model training method, machine translation method, computer device and storage medium
CN111428516B (en) Information processing method and device
CN111553162B (en) Intention recognition method and related device
CN108280458B (en) Group relation type identification method and device
CN111046227B (en) Video duplicate checking method and device
CN108304388A (en) Machine translation method and device
WO2019062413A1 (en) Method and apparatus for managing and controlling application program, storage medium, and electronic device
WO2020147369A1 (en) Natural language processing method, training method, and data processing device
CN111816159B (en) Language identification method and related device
CN108228270A (en) Start resource loading method and device
JP2017514204A (en) Contact grouping method and apparatus
CN110019825B (en) Method and device for analyzing data semantics
CN110069715A (en) A kind of method of information recommendation model training, the method and device of information recommendation
CN114444579B (en) General disturbance acquisition method and device, storage medium and computer equipment
CN113821589A (en) Text label determination method and device, computer equipment and storage medium
CN113723378A (en) Model training method and device, computer equipment and storage medium
CN108846051A (en) Data processing method, device and computer readable storage medium
CN110019648A (en) A kind of method, apparatus and storage medium of training data
CN112862021B (en) Content labeling method and related device
CN114840563B (en) Method, device, equipment and storage medium for generating field description information
CN113569043A (en) Text category determination method and related device
CN114840499A (en) Table description information generation method, related device, equipment and storage medium
CN111062198A (en) Big data-based enterprise category analysis method and related equipment
CN110781274A (en) Question-answer pair generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant