CN110019648B - Method and device for training data and storage medium - Google Patents

Method and device for training data and storage medium Download PDF

Info

Publication number
CN110019648B
CN110019648B (application CN201711269292.9A)
Authority
CN
China
Prior art keywords
word
hash
layer
candidate
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711269292.9A
Other languages
Chinese (zh)
Other versions
CN110019648A
Inventor
李潇
郑孙聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Shenzhen Tencent Computer Systems Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Tencent Computer Systems Co Ltd filed Critical Shenzhen Tencent Computer Systems Co Ltd
Priority to CN201711269292.9A priority Critical patent/CN110019648B/en
Publication of CN110019648A publication Critical patent/CN110019648A/en
Application granted granted Critical
Publication of CN110019648B publication Critical patent/CN110019648B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/325 — Information retrieval of unstructured textual data; indexing; indexing structures; hash tables
    • G06F 16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06F 18/214 — Pattern recognition; design or setup of recognition systems; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2414 — Pattern recognition; classification techniques based on distances to training or reference patterns; smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06F 40/295 — Handling natural language data; natural language analysis; recognition of textual entities; named entity recognition
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/045 — Neural networks; architecture; combinations of networks
    • G06N 3/08 — Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A method, an apparatus, and a storage medium for training data. The method includes: acquiring a corpus set to be processed; extracting an entity set from the corpus set, and extracting a candidate hypernym set from the entity set; combining the entities in the entity set with the hypernyms in the candidate hypernym set respectively to obtain a candidate pair set, where the candidate pair set includes a plurality of candidate pairs and a candidate pair is a combination of an entity and a hypernym that have an association relation; constructing each candidate pair together with its associated sentences into a piece of prediction data, and generalizing the sentences associated with the candidate pair in the prediction data; performing word segmentation on the associated sentences of each candidate pair to obtain a word set; inputting each word in the word set into a generalization processing layer for conversion to obtain a vector set; and training and predicting on the vector set according to the prediction data and a long short-term memory (LSTM) neural network. With this scheme, the efficiency of training data can be improved.

Description

Method and device for training data and storage medium
Technical Field
The present application relates to the field of big data processing technologies, and in particular, to a method and an apparatus for training data, and a storage medium.
Background
In the field of recurrent neural networks, long short-term memory (LSTM) networks are commonly used to process and predict important events with long intervals and delays in a time series. Before LSTM prediction is applied, hypernyms need to be mined from a corpus set, and the problem is cast as a classification problem: given a candidate entity-hypernym pair, predict whether it is a true entity-hypernym pair. Conventionally, word segmentation and feature extraction are performed first, and a traditional classifier is then used to classify the candidate entity-hypernym pair. However, this approach requires considerable domain knowledge for feature engineering, and the resulting classifier generalizes poorly, so its predictions cover only a limited range.
At present, deep-learning-based methods are mainly used to classify candidate entity-hypernym pairs: features are extracted automatically from the corpus set, batch training data are generated, and prediction is performed on the batch training data, which improves classification performance. However, such methods rely heavily on large numbers of training samples and converge slowly, so generating and training the data is inefficient.
Disclosure of Invention
The application provides a method, a device and a storage medium for training data, which can solve the problem of low efficiency of training data in the prior art.
A first aspect of the present application provides a method of training data, the method comprising:
acquiring a corpus set to be processed;
extracting an entity set from the corpus set, wherein the entity set comprises a plurality of named entities;
extracting a candidate hypernym set from the entity set;
combining the entities in the entity set with the hypernyms in the candidate hypernym set respectively to obtain a candidate pair set, wherein the candidate pair set comprises a plurality of candidate pairs, and a candidate pair refers to a combination of an entity and a hypernym that have an association relation;
constructing each candidate pair and the sentences associated with the candidate pair into a piece of prediction data respectively, and generalizing the sentences associated with the candidate pair in the prediction data;
performing word segmentation processing on the associated sentences of each candidate pair to obtain a word set;
inputting each word in the word set into a generalization processing layer for conversion to obtain a vector set;
and training and predicting on the vector set according to the prediction data and a long short-term memory (LSTM) neural network.
A second aspect of the present application provides an apparatus for training data, which has functions for implementing the method of training data provided in the first aspect above. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware.
In one possible design, the apparatus includes:
the acquisition module is used for acquiring a corpus set to be processed;
the processing module is used for extracting an entity set from the corpus set, and the entity set comprises a plurality of named entities;
extracting a candidate hypernym set from the entity set;
combining the entities in the entity set with the hypernyms in the candidate hypernym set respectively to obtain a candidate pair set, wherein the candidate pair set comprises a plurality of candidate pairs, and a candidate pair refers to a combination of an entity and a hypernym that have an association relation;
constructing each candidate pair and the sentences associated with the candidate pair into a piece of prediction data respectively, and generalizing the sentences associated with the candidate pair in the prediction data;
performing word segmentation processing on the associated sentences of each candidate pair to obtain a word set;
inputting each word in the word set into a generalization processing layer for conversion to obtain a vector set;
and training and predicting on the vector set according to the prediction data and a long short-term memory (LSTM) neural network.
A further aspect of the application provides an apparatus for training data comprising at least one processor, a memory, and a transceiver that are connected to one another, wherein the memory is configured to store program code and the processor is configured to invoke the program code in the memory to perform the method of the first aspect.
A further aspect of the present application provides a computer storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect described above.
Compared with the prior art, in the scheme provided by the present application, after the entity set and the candidate hypernym set are extracted, the entities in the entity set are combined with the hypernyms in the candidate hypernym set to obtain a candidate pair set; each candidate pair and the sentences associated with it are constructed into a piece of prediction data, and the sentences associated with the candidate pair in the prediction data are generalized; word segmentation is performed on the associated sentences of each candidate pair to obtain a word set; and each word in the word set is input into a generalization processing layer for conversion to obtain a vector set. Processing by the generalization layer reduces the magnitude of the data, so the model can converge quickly on a small amount of prediction data, the number of parameters required for training and prediction is reduced, and the efficiency of training data is improved.
Drawings
FIG. 1 is a schematic flow chart of a method for training data according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for training data according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an LSTM network structure in the embodiment of the present application;
FIG. 4 is a schematic diagram of the char layer converting a word in the LSTM in an embodiment of the present application;
FIG. 5 is a schematic diagram of the hash layer converting words in the LSTM in an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of an apparatus for training data;
FIG. 7 is a schematic diagram of another structure of an apparatus for training data according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a terminal device in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a server in an embodiment of the present application.
Detailed Description
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprise," "include," and "have," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus, the division of modules presented herein is merely a logical division that may be implemented in a practical application in a further manner, such that a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not implemented, and such that couplings or direct couplings or communicative coupling between each other as shown or discussed may be through some interfaces, indirect couplings or communicative coupling between modules may be electrical or other similar forms, this application is not intended to be limiting. The modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed in a plurality of circuit modules, and some or all of the modules may be selected according to actual needs to achieve the purpose of the present disclosure.
The present application provides a method, an apparatus, and a storage medium for training data, which are used with an artificial neural network. An artificial neural network is a computational model that simulates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. It is an operational model consisting of a large number of interconnected nodes (also called neurons or processing units) that forms a nonlinear, adaptive information processing system. Each node represents a particular output function, called the activation function. Every connection between two nodes carries a weighted value, called a weight, for the signal passing through the connection, which is equivalent to the memory of the artificial neural network. The output of the artificial neural network differs according to the connection mode of the network, the weight values, and the activation function. The artificial neural network itself is usually an approximation of some algorithm or function in nature, and may also be an expression of a logic strategy. By adjusting the interconnections among its large number of internal nodes according to the complexity of the system, the artificial neural network achieves the purpose of processing information.
The artificial neural network has a self-learning function, an associative storage function, the ability to search for optimized solutions at high speed, and self-organizing, self-adaptive, and real-time learning capabilities.
It should be noted that the terminal device referred to in this application may be a device providing voice and/or data connectivity to a user, a handheld device having a wireless connection function, or another processing device connected to a wireless modem. A wireless terminal may communicate with one or more core networks via a Radio Access Network (RAN); it may be a mobile terminal such as a mobile telephone (or "cellular" telephone) or a computer with a mobile terminal, for example a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile device. Examples include Personal Communication Service (PCS) phones, cordless phones, Session Initiation Protocol (SIP) phones, Wireless Local Loop (WLL) stations, and Personal Digital Assistants (PDA). A wireless terminal may also be referred to as a system, a Subscriber Unit, a Subscriber Station, a Mobile Station, a Remote Station, an Access Point, a Remote Terminal, an Access Terminal, a User Terminal, a Terminal Device, a User Agent, a User Device, or User Equipment.
Referring to fig. 1, a method for training data provided in the present application is described below, where an embodiment of the present application mainly includes:
101. and acquiring a corpus set to be processed.
The corpus set refers to a collection of corpora collected within a statistical period, and each corpus may come from at least one platform. The corpus set includes a plurality of corpora, each corpus includes a plurality of words, and these words can form a word set. For example, a corpus may be derived from posts or news. The corpus set can be captured by means such as web crawlers; the specific manner is not limited in this application. A corpus may also be data from an enterprise, which may include employee information, enterprise information, intellectual property, legal information, employee reporting relationships, employee attendance, employee reviews, enterprise news, product sales information of the enterprise, and production data of the enterprise. In addition, to facilitate subsequent data processing, denoising may be performed on the corpus set, which is not limited in this application.
102. And extracting an entity set from the corpus set.
The entity set includes a plurality of named entities, and an entity may be any noun, such as a person's name, a place name, a thing name, an organization, or a term.
103. And extracting a candidate hypernym set from the entity set.
For example, the entity set includes entities such as Liu Dehua, Yao Chen, evening party, attend, famous star, album, release, eat, apple, and litchi. From this entity set it can then be inferred that famous star is a hypernym of Liu Dehua and Yao Chen, and fruit is a hypernym of apple and litchi.
104. And combining the entities in the entity set with the hypernyms in the candidate hypernym set respectively to obtain a candidate pair set.
The candidate pair set includes a plurality of candidate pairs, and a candidate pair refers to a combination of an entity and a hypernym that have an association relation.
After the candidate hypernyms are deduced in step 103, i.e., that famous star is a hypernym of Liu Dehua and Yao Chen, (Liu Dehua, famous star) and (Yao Chen, famous star) can each be taken as a candidate pair. Likewise, (apple, fruit) and (litchi, fruit) can each be taken as a candidate pair.
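As an illustration of step 104 only, the following Python sketch (not part of the application; the entity list, hypernym list, and all names are assumed for the example) pairs every entity with every candidate hypernym to form the candidate pair set:
```python
from itertools import product

# Illustrative outputs of steps 102 and 103.
entities = ["Liu Dehua", "Yao Chen", "apple", "litchi"]
candidate_hypernyms = ["famous star", "fruit"]

# Step 104: pair every entity with every candidate hypernym.
# A real system would keep only pairs whose entity and hypernym
# co-occur in the corpus (i.e., have an association relation).
candidate_pairs = list(product(entities, candidate_hypernyms))
print(candidate_pairs[:2])
# [('Liu Dehua', 'famous star'), ('Liu Dehua', 'fruit')]
```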
105. And constructing each candidate pair and the sentences associated with the candidate pair into a piece of prediction data respectively, and generalizing the sentences associated with the candidate pair in the prediction data.
In some embodiments, the prediction data may be represented as (pair, generalized sentence), where the generalized sentence is a sentence obtained by generalizing a sentence associated with the candidate pair, and pair denotes a candidate pair composed of an entity and a candidate hypernym.
For example, the entity in candidate pair 1 is Liu Dehua and the candidate hypernym is famous star; the entity in candidate pair 2 is Yao Chen and the candidate hypernym is famous star. The sentences associated with candidate pair 1 may include:
"Famous stars such as Liu Dehua and Yao Chen attended the evening party." "Famous stars such as Liu Dehua and Yao Chen starred in a movie together." "Famous stars such as Liu Dehua and Fan Bingbing sang a song." …
After the sentences associated with candidate pair 1 are generalized, the following generalized sentences can be obtained respectively:
"Nr and Yao Chen and other Tag attended the evening party", "Nr and Yao Chen and other Tag starred in a movie together", "Nr and Fan Bingbing and other Tag sang a song" …
Here Nr represents the generalized named entity. Taking the sentence "Famous stars such as Liu Dehua and Yao Chen attended the evening party." as an example: if the pair is for "Liu Dehua", then "Liu Dehua" in that sentence is generalized to Nr; if the pair is for "Yao Chen", then "Yao Chen" in that sentence is generalized to Nr.
Tag denotes the label of the hypernym of the generalized entity's attribute; for example, "famous star" in "famous stars such as Liu Dehua and Yao Chen" is a hypernym of the person entities "Liu Dehua" and "Yao Chen", and is generalized to Tag.
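A minimal sketch of the generalization in step 105, assuming a simple exact-string replacement (the application does not prescribe a specific implementation); the sentence and pair below are illustrative:
```python
def generalize(sentence: str, entity: str, hypernym: str) -> str:
    # Replace the paired entity with Nr and the candidate hypernym with Tag.
    return sentence.replace(entity, "Nr").replace(hypernym, "Tag")

pair = ("Liu Dehua", "Famous stars")
sentence = "Famous stars such as Liu Dehua and Yao Chen attended the evening party."
print(generalize(sentence, *pair))
# "Tag such as Nr and Yao Chen attended the evening party."
```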
106. And performing word segmentation processing on the associated sentences of each candidate pair to obtain a word set.
The word set includes N words. For example, after word segmentation, the sentence "Famous stars such as Liu Dehua and Yao Chen attended the evening party" yields the words: Liu Dehua / and / Yao Chen / etc. / famous / star / attended / evening party.
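For illustration, the word segmentation of step 106 could be performed with an off-the-shelf Chinese segmenter such as jieba (an assumption; the application does not name a segmentation tool):
```python
import jieba  # one possible Chinese word segmenter; not specified by the application

sentence = "刘德华和姚晨等著名明星出席了晚会"  # "Famous stars such as Liu Dehua and Yao Chen attended the evening party"
words = jieba.lcut(sentence)
print(words)
# e.g. ['刘德华', '和', '姚晨', '等', '著名', '明星', '出席', '了', '晚会']
```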
107. And inputting each word in the word set into a generalization processing layer for conversion to obtain a vector set.
Optionally, in some embodiments of the present application, the generalization processing layer includes a character layer (char level) and a hash layer (hash level), and inputting each word in the word set into the generalization processing layer for conversion to obtain the vector set includes:
1. Each word in the word set is input into the character layer, and the character layer converts each input word into a word vector, yielding a word vector set.
In some embodiments, a first word may be matched against the characters in a character lookup table to obtain n vectors corresponding to its n characters, and the n vectors are combined with the first word by a bidirectional LSTM to generate a word vector, where the first word refers to a word in the word set to be trained and predicted.
For example, as shown in FIG. 4, word in FIG. 4 is the first word. After entering the char layer, the word is matched against the character lookup table (char lookup table) in the char layer and combined with each character, i.e., with char1 through charN; combining word with char1 yields output1, and in the same way N outputs, output1 through outputN, are finally produced.
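The character layer can be sketched as follows in PyTorch (an assumed framework; all dimensions are illustrative): each character of a word is looked up in the char lookup table, and a bidirectional LSTM merges the character vectors into a word vector.
```python
import torch
import torch.nn as nn

class CharLevelEncoder(nn.Module):
    """Char-layer sketch: look up a vector for each character of a word and
    merge them with a bidirectional LSTM to produce the word vector."""
    def __init__(self, num_chars=20000, char_dim=30, out_dim=50):
        super().__init__()
        self.char_lookup = nn.Embedding(num_chars, char_dim)  # char lookup table
        self.bilstm = nn.LSTM(char_dim, out_dim // 2,
                              bidirectional=True, batch_first=True)

    def forward(self, char_ids):                # char_ids: (1, n) character indices of one word
        char_vecs = self.char_lookup(char_ids)  # (1, n, char_dim)
        _, (h, _) = self.bilstm(char_vecs)      # h: (2, 1, out_dim // 2), one state per direction
        return torch.cat([h[0], h[1]], dim=-1)  # (1, out_dim) word vector

encoder = CharLevelEncoder()
word_vec = encoder(torch.tensor([[3, 17, 256]]))  # a 3-character word, illustrative indices
print(word_vec.shape)                             # torch.Size([1, 50])
```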
2. Each word in the word set is input into the hash layer, and the hash layer converts each input word into a hash vector, yielding a hash vector set.
In some embodiments, a hash function may be used to map the N words into K hash buckets, and the words in each hash bucket are compressed together to obtain K hash vectors, where the words mapped to the same bucket share that bucket's hash vector; N and K are positive integers, and N > K.
For example, as shown in FIG. 5, after word1 through wordN enter the hash layer, the hash function in the hash layer maps word1 through wordN into the hash1 bucket through the hashK bucket. For example, the words mapped to the hash1 bucket yield one hash vector, namely the hash1 vector; in the same way, the hash layer finally outputs K hash vectors, namely the hash1 vector through the hashK vector.
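A sketch of the hash layer under the same assumptions (PyTorch; K, the hash function, and the dimensions are illustrative): each word is hashed into one of K buckets, and all words falling in the same bucket share that bucket's vector from the hash lookup table.
```python
import hashlib
import torch
import torch.nn as nn

K = 1000        # number of hash buckets, far smaller than the vocabulary size N
hash_N = 30     # dimension of each hash vector

hash_lookup = nn.Embedding(K, hash_N)  # hash lookup table: one vector per bucket

def hash_bucket(word: str) -> int:
    # Any deterministic hash function works; a stable digest is used here
    # because Python's built-in hash() is randomized per process.
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % K

words = ["刘德华", "姚晨", "明星", "晚会"]
bucket_ids = torch.tensor([hash_bucket(w) for w in words])
hash_vectors = hash_lookup(bucket_ids)  # words in the same bucket share one hash vector
print(hash_vectors.shape)               # torch.Size([4, 30])
```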
3. And obtaining the vector set according to the word vector set and the hash vector set.
In some embodiments, each word's word vector may be concatenated (pasted together) with its hash vector, and the concatenated representations of all the words form the vector set.
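Continuing the sketch, each word's representation is the concatenation of its character-layer output and its hash vector, consistent with the (char_N + hash_N) dimension described below (values illustrative):
```python
import torch

char_vec = torch.randn(1, 50)   # char_N-dimensional output of the character layer
hash_vec = torch.randn(1, 30)   # hash_N-dimensional vector from the hash layer
word_repr = torch.cat([char_vec, hash_vec], dim=-1)
print(word_repr.shape)          # torch.Size([1, 80]), i.e., (1, char_N + hash_N)
```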
In some embodiments, after each word in the word set is input into the generalization processing layer for conversion and the vector set is obtained, a vector is obtained for each word. Each word is represented in two dimensions, the sentence and the candidate pair, so the resulting vector of each word corresponds to two matrices: a sentence matrix and a candidate pair matrix. The following takes a first sentence in the corpus set and a first candidate pair in the candidate pair set as examples:
1. The sentence matrix
For example, the first sentence is mapped to a first matrix, which is determined by the number of words of the first sentence after word segmentation, the vector dimension output after generalization by the character layer, and the vector dimension set for generalization by the hash layer.
In some embodiments, the first matrix may be represented as L1 × (char_N + hash_N), where L1 is the number of words after the sentence is segmented, char_N is the vector dimension output after char-level generalization, and hash_N is the vector dimension set by the hash lookup table.
2. The candidate pair matrix
For example, the first candidate pair corresponds to a second matrix, which is determined by the number of words of the candidate pair after word segmentation, the vector dimension output after generalization by the character layer, and the vector dimension set for generalization by the hash layer.
In some embodiments, the second matrix may be represented as L2 × (char_N + hash_N), where L2 is the number of words after the candidate entity and the candidate hypernym in the first candidate pair are each segmented, char_N is the vector dimension output after char-level generalization, and hash_N is the hash vector dimension.
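With illustrative dimensions, the two matrices therefore have the following shapes (a sketch only; L1, L2, char_N, and hash_N depend on the sentence, the pair, and the chosen configuration):
```python
import torch

char_N, hash_N = 50, 30
L1 = 9   # words in the generalized sentence after segmentation
L2 = 2   # words in the candidate pair (candidate entity + candidate hypernym)

sentence_matrix = torch.randn(L1, char_N + hash_N)  # L1 x (char_N + hash_N)
pair_matrix = torch.randn(L2, char_N + hash_N)      # L2 x (char_N + hash_N)
print(sentence_matrix.shape, pair_matrix.shape)
# torch.Size([9, 80]) torch.Size([2, 80])
```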
108. And training and predicting on the vector set according to the prediction data and the long short-term memory (LSTM) neural network.
Compared with the existing mechanism, in the embodiment of the present application, after the entity set and the candidate hypernym set are extracted, the entities in the entity set are combined with the hypernyms in the candidate hypernym set to obtain a candidate pair set; each candidate pair and the sentences associated with it are constructed into a piece of prediction data, and the sentences associated with the candidate pair in the prediction data are generalized; word segmentation is performed on the associated sentences of each candidate pair to obtain a word set; and each word in the word set is input into the generalization layer for conversion to obtain a vector set. Because the vector set is obtained through generalization-layer processing, the model converges quickly on a small amount of prediction data; training on this vector set reduces the number of parameters required for training and prediction, improves the efficiency of training data, and lowers the cost and time of generating training data. In addition, the generalization-layer processing in the embodiment of the present application mitigates deep learning's heavy dependence on the number of training samples and its slow convergence, so good performance can be achieved by training directly on a small amount of data, without manually extracting features.
For ease of understanding, the method for training data provided in the embodiments of the present application is described below by taking a specific application scenario as an example. As shown in fig. 2, the embodiment of the present application may include:
step 1: and performing word segmentation on the sentences in the corpus set, obtaining candidate pairs based on the corpus set, and performing sentence generalization by using the candidate pairs.
For each sentence in the corpus set, named entity recognition is first used to obtain the entity set contained in the sentence; then all possible nouns, noun phrases, and the like are taken as the candidate hypernym set, and every pairwise combination of an entity in the entity set with a hypernym in the candidate hypernym set is taken as a candidate pair. Then, for each candidate pair, the sentences corresponding to the candidate pair are constructed into a piece of prediction data, and the sentences are generalized.
Named Entity Recognition (NER) is a basic task of natural language processing that aims to recognize named entities such as person names, place names, and organization names in a corpus set. Because these named entities keep growing in number, they usually cannot be listed exhaustively in a dictionary, and their construction follows its own regularities, so their recognition is usually handled separately from lexical morphological processing tasks (such as Chinese word segmentation) and is called named entity recognition. Named entity recognition is an indispensable component of many natural language processing technologies such as information extraction, information retrieval, machine translation, and question answering systems.
Considering that the entity and the candidate hypernym in a candidate pair may come from different sentences, suppose the corpus set includes the following two sentences:
(1) Famous stars such as Liu Dehua and Yao Chen attended the evening party.
(2) Famous stars such as Liu Dehua and Yao Chen starred in a movie together.
Then, when the prediction data are constructed here, several generalized sentences appear, but they correspond to only one candidate pair.
The following takes the sentence "Famous stars such as Liu Dehua and Yao Chen attended the evening party." as an example. Liu Dehua and Yao Chen are person-named entities, and famous star is the corresponding candidate hypernym. By combining the entities with the candidate hypernym, 2 candidate pairs can be obtained: (Liu Dehua, famous star) and (Yao Chen, famous star). Corresponding prediction data are then constructed based on these two candidate pairs by combining each candidate pair with the sentence in which the entity and the candidate hypernym appear:
(1) Prediction data 1: pair (Liu Dehua, famous star), generalized sentence (Tag such as Nr and Yao Chen attended the evening party).
(2) Prediction data 2: pair (Yao Chen, famous star), generalized sentence (Tag such as Liu Dehua and Nr attended the evening party).
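The two pieces of prediction data above can be represented, for illustration, as (pair, generalized sentence) tuples; grouping them by pair also shows that one candidate pair may collect several generalized sentences when its entity and hypernym co-occur in more than one sentence (all strings below are illustrative):
```python
from collections import defaultdict

prediction_data = [
    # (candidate pair, generalized sentence)
    (("Liu Dehua", "famous star"),
     "Tag such as Nr and Yao Chen attended the evening party."),
    (("Yao Chen", "famous star"),
     "Tag such as Liu Dehua and Nr attended the evening party."),
]

# One candidate pair may accumulate several generalized sentences.
by_pair = defaultdict(list)
for pair, generalized_sentence in prediction_data:
    by_pair[pair].append(generalized_sentence)
print(dict(by_pair))
```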
Step 2: Each word obtained by segmenting the sentence is processed by the generalization processing layer to generate a word vector; that is, each word is converted by the generalization processing layer, which effectively reduces the number of parameters and enables fast convergence on a small amount of training data.
Step 3: The data processed by the generalization layer are trained and predicted with an LSTM network, whose input is a candidate pair and the sentence corresponding to the candidate pair.
In some embodiments of the present application, the LSTM network structure is described below, and the generalization-layer processing flow is performed using this LSTM network structure.
As shown in FIG. 3, the LSTM network structure includes a softmax classifier, a sentence model, a pair constraint model, and a generalization layer.
The generalization layer includes a character layer (char level) and a hash layer (hash level). The char level includes a bidirectional LSTM and a character lookup table (char lookup table); the char lookup table may contain N different chars, where N may be on the order of 10,000 to 20,000, and the value of N is not limited in this application.
The hash level includes a hash function and a hash lookup table; the hash lookup table includes K hash buckets, K may be set empirically, and the value of K is not limited in this application.
The vector of each char and each hash bucket in its respective lookup table may have M dimensions (for example, 20 to 50).
The softmax classifier is modeled by a multinomial distribution and can separate multiple mutually exclusive categories; it maps (compresses) an arbitrary K-dimensional real vector into another K-dimensional real vector. The softmax classifier serves as the output layer of the artificial neural network.
The sentence model refers to the LSTM that processes the sentence matrix.
The pair constraint model refers to the LSTM that processes the pair matrix.
Firstly, a processing principle of a generalization processing layer:
The generalization-layer processing flow includes the following steps: the char level replaces the word level, and hash mapping is used to obtain hash vectors.
1. The char level replaces the word level.
As shown in FIG. 4, for each word in the word lookup table after word segmentation, the vector of each char is obtained through the char lookup table (char1 … charN), and the resulting n vectors are then combined with the word through a bidirectional LSTM to generate a new word vector. In this way the information of the word itself is retained, and the parameter explosion that arises from using a word lookup table alone is greatly reduced.
2. And using the hash mapping to obtain a hash vector.
As shown in FIG. 5, a hash function is used to map the N words in the word lookup table into K hash buckets, where K can be much smaller than N, which guarantees a reduction in the order of magnitude of the parameters. Multiple words are forcibly compressed together to share one hash vector. Through this shared-hash-vector mechanism, training speed can be greatly increased and better results can be obtained on smaller training data sets.
Here, a corresponding hash vector is obtained through each hash bucket. A shared hash vector means that when the N words are mapped to the hash1 bucket through the hashK bucket, the words mapped to the hash1 bucket all share the hash1 vector.
Therefore, with the LSTM network structure, the sentence corresponding to a pair is represented by a sentence vector, the pair itself is represented by its own vector, and the two kinds of vectors are classified together. By using both the pair information and the sentence information, the data obtained from these two dimensions allow convergence to be completed quickly, and the parameter explosion phenomenon is significantly reduced.
And secondly, carrying out generalization processing based on an LSTM network structure.
The following describes a process of generalization processing using the LSTM network structure (including steps 1 to 4):
1. initializing a char lookup table matrix and a hash lookup table matrix.
In some embodiments, a random initialization may be used.
2. For a candidate pair and a segmented sentence, the input vector of each word can be obtained through generalization-layer processing:
(a) For the sentence, a sentence matrix of size L1 × (char_N + hash_N) can be obtained.
L1 is the number of words after the sentence is segmented, char_N is the vector dimension output after char-level generalization, and hash_N is the vector dimension set by the hash lookup table.
The char_N-dimensional and hash_N-dimensional vectors obtained in (a) are pasted together, which gives the sentence matrix output of the generalization layer.
(b) For the pair, a pair matrix of size L2 × (char_N + hash_N) can be obtained.
L2 is the number of words after the candidate entity and the candidate hypernym in the pair are segmented, char_N is the vector dimension output after char-level generalization, and hash_N is the vector dimension set by the hash lookup table.
The char_N-dimensional and hash_N-dimensional vectors obtained in (b) are pasted together to obtain the pair matrix output of the generalization layer.
(c) The sentence matrix is input into the sentence model, and the pair matrix is input into the pair constraint model.
3. The two results obtained after processing by the sentence model and the pair constraint model in (c) are appended, finally yielding an (h1 + h2)-dimensional vector as output.
Here h1 is the dimension of the vector output by the sentence model, h2 is the dimension of the vector output by the pair constraint model, and append means that the vectors are directly spliced together.
4. The (h1 + h2)-dimensional vector spliced in step 3 is classified by the softmax classifier.
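The overall structure of FIG. 3 can be sketched as follows (PyTorch, an assumed framework; the hidden sizes, input dimension, and binary class set are illustrative assumptions, not values fixed by the application): the sentence matrix passes through the sentence-model LSTM, the pair matrix passes through the pair-constraint-model LSTM, the two outputs are appended, and a softmax classifier produces the prediction.
```python
import torch
import torch.nn as nn

class HypernymClassifier(nn.Module):
    """Sketch of the LSTM network of FIG. 3: a sentence-model LSTM, a pair
    constraint model LSTM, concatenation of their outputs, then softmax."""
    def __init__(self, in_dim=80, h1=64, h2=32, num_classes=2):
        super().__init__()
        self.sentence_lstm = nn.LSTM(in_dim, h1, batch_first=True)
        self.pair_lstm = nn.LSTM(in_dim, h2, batch_first=True)
        self.classifier = nn.Linear(h1 + h2, num_classes)

    def forward(self, sentence_matrix, pair_matrix):
        _, (hs, _) = self.sentence_lstm(sentence_matrix)  # h1-dimensional sentence vector
        _, (hp, _) = self.pair_lstm(pair_matrix)          # h2-dimensional pair vector
        merged = torch.cat([hs[-1], hp[-1]], dim=-1)      # append the two vectors
        return self.classifier(merged)                    # logits over {hypernym, not hypernym}

model = HypernymClassifier()
logits = model(torch.randn(1, 9, 80),   # sentence matrix, batch of 1
               torch.randn(1, 2, 80))   # pair matrix, batch of 1
probs = torch.softmax(logits, dim=-1)   # softmax classifier output
print(probs.shape)                      # torch.Size([1, 2])
```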
Any technical feature in the embodiment corresponding to any one of fig. 1 to 5 is also applicable to the embodiment corresponding to fig. 6 to 8 in the present application, and the subsequent similarities are not repeated.
A method of training data in the present application is described above; an apparatus for performing the method of training data is described below. The apparatus may be a functional module installed on a terminal device or a server, or a combination of a functional module and a hardware module, which is not specifically limited in this application.
Referring to fig. 6, the apparatus includes:
the acquisition module is used for acquiring a corpus set to be processed;
the processing module is used for extracting an entity set from the corpus set, and the entity set comprises a plurality of named entities;
extracting a candidate hypernym set from the entity set;
combining the entities in the entity set with the hypernyms in the candidate hypernym set respectively to obtain a candidate pair set, wherein the candidate pair set comprises a plurality of candidate pairs, and a candidate pair refers to a combination of an entity and a hypernym that have an association relation;
constructing each candidate pair and the sentences associated with the candidate pair into a piece of prediction data respectively, and generalizing the sentences associated with the candidate pair in the prediction data;
performing word segmentation processing on the associated sentences of each candidate pair to obtain a word set;
inputting each word in the word set into a generalization processing layer for conversion to obtain a vector set;
and training and predicting on the vector set according to the prediction data and a long short-term memory (LSTM) neural network.
In the embodiment of the present application, after the processing module extracts the entity set and the candidate hypernym set, the entities in the entity set are combined with the hypernyms in the candidate hypernym set to obtain a candidate pair set; each candidate pair and the sentences associated with it are constructed into a piece of prediction data, and the sentences associated with the candidate pair in the prediction data are generalized; word segmentation is performed on the associated sentences of each candidate pair to obtain a word set; and each word in the word set is input into the generalization processing layer for conversion to obtain a vector set. Processing by the generalization layer reduces the magnitude of the data, so the model can converge quickly on a small amount of prediction data, the number of parameters required for training and prediction is reduced, and the efficiency of training data is improved.
Optionally, in some embodiments of the present application, the generalization processing layer includes a character layer and a hash layer, and the processing module is specifically configured to:
respectively inputting each word in the word set into the character layer, and respectively converting the words input into the character layer into word vectors in the character layer to obtain a word vector set;
respectively inputting all words in the word set into the hash layer, and respectively converting the words input into the hash layer into hash vectors on the hash layer to obtain a hash vector set;
and obtaining the vector set according to the word vector set and the hash vector set.
Optionally, in some embodiments of the present application, the word set includes N words, and the processing module is specifically configured to:
matching the first word with the characters in the character lookup table to obtain n vectors corresponding to the n characters, and combining the n vectors with the first word through a bidirectional LSTM to generate a word vector, wherein the first word refers to a word in the word set to be trained and predicted.
Optionally, in some embodiments of the present application, the processing module is specifically configured to:
and respectively mapping the N words to K hash buckets by using a hash function, and compressing the words in each hash bucket to obtain K hash vectors, wherein the words mapped to the same hash bucket share that bucket's hash vector, N and K are positive integers, and N > K.
Optionally, in some embodiments of the present application, the processing module is specifically configured to:
and splicing the word vectors and the K hash vectors to obtain the vector set.
Optionally, in some embodiments of the present application, a first matrix is obtained from a first sentence in the corpus set, wherein the first matrix is determined by the number of words of the first sentence after word segmentation, the vector dimension output after generalization by the character layer, and the vector dimension set for generalization by the hash layer;
and a second matrix is obtained from a first candidate pair in the candidate pair set, wherein the second matrix is determined by the number of words of the candidate pair after word segmentation, the vector dimension output after generalization by the character layer, and the vector dimension set for generalization by the hash layer.
The apparatus in the embodiment of the present application is described above from the perspective of modular functional entities; the terminal device and the server in the embodiment of the present application are described below from the perspective of hardware processing. It should be noted that, in the embodiment corresponding to fig. 6 of this application, the physical device corresponding to the obtaining module may be an input/output unit, and the physical device corresponding to the processing module may be a processor. The apparatus shown in fig. 6 may have the structure shown in fig. 7; when the apparatus has the structure shown in fig. 7, the processor and the input/output unit in fig. 7 implement the same or similar functions as the processing module and the obtaining module provided in the foregoing apparatus embodiment, and the memory in fig. 7 stores the program code that the processor needs to call when executing the above method of training data.
As shown in fig. 8, for convenience of description, only the portions related to the embodiments of the present application are shown; for specific technical details that are not disclosed, refer to the method part of the embodiments of the present application. The terminal device may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sale (POS) terminal, a vehicle-mounted computer, and the like.
Fig. 8 is a block diagram illustrating a partial structure of a terminal device related to the apparatus for training data provided in the embodiment of the present application. Referring to fig. 8, the terminal device includes: radio Frequency (RF) circuit 88, memory 820, input unit 830, display unit 840, sensor 850, audio circuit 860, wireless fidelity (WiFi) module 870, processor 880, and power supply 890. Those skilled in the art will appreciate that the terminal device configuration shown in fig. 8 does not constitute a limitation of the terminal device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following specifically describes each constituent component of the terminal device with reference to fig. 8:
the RF circuit 88 may be used for receiving and transmitting signals during information transmission and reception or during a call, and in particular, for processing downlink information of a base station after receiving the downlink information to the processor 880; in addition, the data for designing uplink is transmitted to the base station. In general, RF circuitry 88 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 88 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email), Short Message Service (SMS), etc.
The memory 820 may be used to store software programs and modules, and the processor 880 executes various functional applications of the terminal device and data processing by operating the software programs and modules stored in the memory 820. The memory 820 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal device, and the like. Further, the memory 820 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 830 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the terminal device. Specifically, the input unit 830 may include a touch panel 831 and other input devices 832. The touch panel 831, also referred to as a touch screen, can collect touch operations performed by a user on or near the touch panel 831 (e.g., operations performed by the user on the touch panel 831 or near the touch panel 831 using any suitable object or accessory such as a finger, a stylus, etc.) and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 831 may include two portions, i.e., a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts it to touch point coordinates, and sends the touch point coordinates to the processor 880, and can receive and execute commands from the processor 880. In addition, the touch panel 831 may be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 830 may include other input devices 832 in addition to the touch panel 831. In particular, other input devices 832 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 840 may be used to display information input by a user or information provided to the user and various menus of the terminal device. The Display unit 840 may include a Display panel 841, and the Display panel 841 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, touch panel 831 can overlay display panel 841, and when touch panel 831 detects a touch operation thereon or nearby, communicate to processor 880 to determine the type of touch event, and processor 880 can then provide a corresponding visual output on display panel 841 based on the type of touch event. Although in fig. 8, the touch panel 831 and the display panel 841 are two separate components to implement the input and output functions of the terminal device, in some embodiments, the touch panel 831 and the display panel 841 may be integrated to implement the input and output functions of the terminal device.
The terminal device may also include at least one sensor 850, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 841 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 841 and/or backlight when the terminal device is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of the terminal device, and related functions (such as pedometer and tapping) for vibration recognition; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal device, detailed description is omitted here.
Audio circuitry 860, speaker 861, microphone 862 may provide an audio interface between the user and the terminal device. The audio circuit 860 can transmit the electrical signal converted from the received audio data to the speaker 861, and the electrical signal is converted into a sound signal by the speaker 861 and output; on the other hand, the microphone 862 converts the collected sound signal into an electric signal, converts the electric signal into audio data after being received by the audio circuit 860, and outputs the audio data to the processor 880 for processing, and then transmits the audio data to, for example, another terminal device via the RF circuit 88, or outputs the audio data to the memory 820 for further processing.
WiFi belongs to short distance wireless transmission technology, and the terminal device can help the user send and receive e-mail, browse web page and access streaming media, etc. through WiFi module 870, which provides wireless broadband internet access for the user. Although fig. 8 shows WiFi module 870, it is understood that it does not belong to the essential constitution of the terminal device, and may be omitted entirely as needed within the scope not changing the essence of the application.
The processor 880 is a control center of the terminal device, connects various parts of the entire terminal device using various interfaces and lines, and performs various functions of the terminal device and processes data by operating or executing software programs and/or modules stored in the memory 820 and calling data stored in the memory 820, thereby performing overall monitoring of the terminal device. Optionally, processor 880 may include one or more processing units; preferably, the processor 880 may integrate an application processor, which mainly handles operating systems, user interfaces, applications, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 880.
The terminal device also includes a power supply 890 (e.g., a battery) for powering the various components, which may be logically coupled to the processor 880 via a power management system to manage charging, discharging, and power consumption management functions via the power management system.
Although not shown, the terminal device may further include a camera, a bluetooth module, and the like, which are not described herein.
In the embodiment of the present application, the processor 880 included in the terminal device further has a function of controlling and executing the method flow executed by the apparatus shown in fig. 6. For example, the processor 880, by invoking instructions in the memory 820, performs the following:
acquiring a corpus set to be processed;
extracting an entity set from the corpus set, wherein the entity set comprises a plurality of named entities;
extracting a candidate hypernym set from the entity set;
combining the entities in the entity set with the hypernyms in the candidate hypernym set respectively to obtain a candidate pair set, wherein the candidate pair set comprises a plurality of candidate pairs, and a candidate pair refers to a combination of an entity and a hypernym that have an association relation;
constructing each candidate pair and the sentences associated with the candidate pair into a piece of prediction data respectively, and generalizing the sentences associated with the candidate pair in the prediction data;
performing word segmentation processing on the associated sentences of each candidate pair to obtain a word set;
inputting each word in the word set into a generalization processing layer for conversion to obtain a vector set;
and training and predicting on the vector set according to the prediction data and a long short-term memory (LSTM) neural network.
Fig. 9 is a schematic diagram of a server 920 according to an embodiment of the present disclosure. The server 920 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 922 (e.g., one or more processors), a memory 932, and one or more storage media 930 (e.g., one or more mass storage devices) for storing applications 942 or data 944. The memory 932 and the storage media 930 may be transient storage or persistent storage. The program stored on a storage medium 930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processing unit 922 may be configured to communicate with the storage medium 930 to execute, on the server 920, the series of instruction operations in the storage medium 930.
The Server 920 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input-output interfaces 958, and/or one or more operating systems 941, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
The steps performed by the apparatus shown in fig. 6 in the above-described embodiment may be based on the server structure shown in fig. 9. For example, by invoking instructions in the memory 932, the processor 922 performs the following steps (an illustrative sketch of the generalization processing layer and the LSTM used in the final steps follows this list):
acquiring a corpus set to be processed;
extracting an entity set from the corpus set, wherein the entity set comprises a plurality of named entities;
extracting a candidate hypernym set from the entity set;
combining the entities in the entity set with the hypernyms in the candidate hypernym set respectively to obtain a candidate pair set, wherein the candidate pair set comprises a plurality of candidate pairs, and a candidate pair refers to a combination of an entity and a hypernym that have an association relation;
constructing each candidate pair and the sentences associated with that candidate pair into a set of prediction data, and generalizing the sentences associated with the candidate pairs in the prediction data;
performing word segmentation on the sentences associated with each candidate pair to obtain a word set;
inputting each word in the word set into a generalization processing layer for conversion to obtain a vector set;
and performing training and prediction on the vector set according to the prediction data and the long short-term memory (LSTM) artificial neural network.
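As a purely illustrative companion to the last three steps (word vectorization by the generalization processing layer and LSTM training/prediction), here is a minimal PyTorch sketch of a character layer plus a hash layer feeding a sentence-level LSTM classifier. All dimensions, the bucket count, the character vocabulary size, and the binary output are assumptions rather than values taken from this application, and the stable MD5-based bucket assignment is only one possible choice of hash function.

```python
import hashlib
import torch
import torch.nn as nn

class GeneralizationLayer(nn.Module):
    """Character layer + hash layer; each word becomes one concatenated vector."""
    def __init__(self, n_chars=5000, char_dim=32, char_hidden=32,
                 n_buckets=1000, hash_dim=16):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Character-level bidirectional LSTM whose final states form the word vector.
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)
        # Hash layer: words falling into the same bucket share one embedding vector.
        self.hash_emb = nn.Embedding(n_buckets, hash_dim)
        self.n_buckets = n_buckets
        self.out_dim = 2 * char_hidden + hash_dim

    def word_vector(self, char_ids):
        # char_ids: LongTensor of shape (1, n) with the character indices of one word.
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        return torch.cat([h_n[0], h_n[1]], dim=-1)            # (1, 2 * char_hidden)

    def hash_vector(self, word):
        # Stable hash so the bucket assignment does not change between runs.
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % self.n_buckets
        return self.hash_emb(torch.tensor([bucket]))           # (1, hash_dim)

    def forward(self, words, char_ids_per_word):
        vecs = [torch.cat([self.word_vector(c), self.hash_vector(w)], dim=-1)
                for w, c in zip(words, char_ids_per_word)]
        return torch.stack(vecs, dim=1)                        # (1, n_words, out_dim)

class RelationClassifier(nn.Module):
    """Sentence-level LSTM over the generalized word vectors, binary output."""
    def __init__(self, gen_layer, hidden=64):
        super().__init__()
        self.gen = gen_layer
        self.lstm = nn.LSTM(gen_layer.out_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)   # hypernym relation holds / does not hold

    def forward(self, words, char_ids_per_word):
        x = self.gen(words, char_ids_per_word)
        _, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1])          # logits, shape (1, 2)
```

In use, each generalized sentence from the prediction data would be segmented into words, every word mapped to its character indices, and the resulting logits trained against the candidate pair's label with an ordinary cross-entropy loss.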
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented wholly or partially in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that the computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (e.g., a floppy disk, hard disk, or magnetic tape), optical media (e.g., a DVD), or semiconductor media (e.g., a solid state disk (SSD)).
The technical solutions provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method and core ideas of the present application. Meanwhile, a person skilled in the art may, based on the ideas of the present application, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (13)

1. A method of training data, the method comprising:
acquiring a corpus set to be processed;
extracting an entity set from the corpus set, wherein the entity set comprises a plurality of named entities;
extracting a candidate hypernym set from the entity set;
combining the entities in the entity set with the hypernyms in the candidate hypernym set respectively to obtain a candidate pair set, wherein the candidate pair set comprises a plurality of candidate pairs, and a candidate pair refers to a combination of an entity and a hypernym that have an association relation;
constructing the candidate pairs and sentences associated with the candidate pairs into a set of prediction data respectively, and generalizing the sentences associated with the candidate pairs in the prediction data to obtain the prediction data represented by the candidate pairs and the generalized sentences;
performing word segmentation on the sentences associated with each candidate pair to obtain a word set;
inputting each word in the word set into a generalization processing layer for conversion to obtain a vector set;
and performing training and prediction on the vector set according to the prediction data and the long short-term memory (LSTM) artificial neural network.
2. The method of claim 1, wherein the generalization processing layer comprises a character layer and a hash layer, and the inputting each word in the word set into the generalization processing layer for conversion to obtain the vector set comprises:
respectively inputting each word in the word set into the character layer, and respectively converting the words input into the character layer into word vectors in the character layer to obtain a word vector set;
respectively inputting all words in the word set into the hash layer, and respectively converting the words input into the hash layer into hash vectors on the hash layer to obtain a hash vector set;
and obtaining the vector set according to the word vector set and the hash vector set.
3. The method of claim 2, wherein the word set comprises N words, and the respectively inputting each word in the word set into the character layer, converting the words input into the character layer into word vectors in the character layer, and obtaining the word vector set comprises:
matching a first word with the characters in a character lookup table to obtain n vectors corresponding to n characters, and generating a word vector for the first word from the n vectors according to a bidirectional LSTM, wherein the first word refers to a word to be trained and predicted in the word set.
4. The method according to claim 2 or 3, wherein the respectively inputting the words in the word set into the hash layer, respectively converting the words input into the hash layer into hash vectors at the hash layer, and obtaining a hash vector set comprises:
the method comprises the steps of using a Hash function to map N words to K hash buckets respectively, compressing the N words in each hash bucket respectively to obtain K hash vectors, wherein each hash vector corresponds to the N words, N and K are positive integers, and N is larger than K.
5. The method of claim 4, wherein the deriving the set of vectors from the set of word vectors and the set of hash vectors comprises:
and concatenating the word vectors and the K hash vectors to obtain the vector set.
6. The method according to claim 5, wherein after the words in the word set are input into the generalization processing layer and converted to obtain the vector set, a first matrix is obtained for a first sentence in the corpus set and a second matrix is obtained for the candidate pair;
the first matrix is obtained according to the number of words corresponding to the first sentence after word segmentation, the vector dimension output after generalization processing by the character layer, and the vector dimension set during generalization processing by the hash layer;
and the second matrix is obtained according to the number of words corresponding to the candidate pair after word segmentation, the vector dimension output after generalization processing by the character layer, and the vector dimension set during generalization processing by the hash layer.
7. An apparatus for training data, the apparatus comprising:
the acquisition module is used for acquiring a corpus set to be processed;
the processing module is used for extracting an entity set from the corpus set, and the entity set comprises a plurality of named entities;
extracting a candidate hypernym set from the entity set;
combining the entities in the entity set with the hypernyms in the candidate hypernym set respectively to obtain a candidate pair set, wherein the candidate pair set comprises a plurality of candidate pairs, and a candidate pair refers to a combination of an entity and a hypernym that have an association relation;
constructing the candidate pairs and sentences associated with the candidate pairs into a set of prediction data respectively, and generalizing the sentences associated with the candidate pairs in the prediction data to obtain the prediction data represented by the candidate pairs and the generalized sentences;
performing word segmentation on the sentences associated with each candidate pair to obtain a word set;
inputting each word in the word set into a generalization processing layer for conversion to obtain a vector set;
and performing training and prediction on the vector set according to the prediction data and the long short-term memory (LSTM) artificial neural network.
8. The apparatus according to claim 7, wherein the generalization processing layer includes a character layer and a hash layer, and the processing module is specifically configured to:
respectively inputting each word in the word set into the character layer, and respectively converting the words input into the character layer into word vectors in the character layer to obtain a word vector set;
respectively inputting all words in the word set into the hash layer, and respectively converting the words input into the hash layer into hash vectors on the hash layer to obtain a hash vector set;
and obtaining the vector set according to the word vector set and the hash vector set.
9. The apparatus of claim 8, wherein the set of words comprises N words, and wherein the processing module is specifically configured to:
matching a first word with the characters in a character lookup table to obtain n vectors corresponding to n characters, and generating a word vector for the first word from the n vectors according to a bidirectional LSTM, wherein the first word refers to a word to be trained and predicted in the word set.
10. The apparatus according to claim 8 or 9, wherein the processing module is specifically configured to:
the method comprises the steps of using a Hash function to map N words to K hash buckets respectively, compressing the N words in each hash bucket respectively to obtain K hash vectors, wherein each hash vector corresponds to the N words, N and K are positive integers, and N is larger than K.
11. The apparatus of claim 10, wherein the processing module is specifically configured to:
and concatenating the word vectors and the K hash vectors to obtain the vector set.
12. The apparatus according to claim 11, wherein a first matrix is obtained for a first sentence in the corpus set and a second matrix is obtained for the candidate pair; the first matrix is obtained according to the number of words corresponding to the first sentence after word segmentation, the vector dimension output after generalization processing by the character layer, and the vector dimension set during generalization processing by the hash layer;
and the second matrix is obtained according to the number of words corresponding to the candidate pair after word segmentation, the vector dimension output after generalization processing by the character layer, and the vector dimension set during generalization processing by the hash layer.
13. A computer storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 6.
CN201711269292.9A 2017-12-05 2017-12-05 Method and device for training data and storage medium Active CN110019648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711269292.9A CN110019648B (en) 2017-12-05 2017-12-05 Method and device for training data and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711269292.9A CN110019648B (en) 2017-12-05 2017-12-05 Method and device for training data and storage medium

Publications (2)

Publication Number Publication Date
CN110019648A CN110019648A (en) 2019-07-16
CN110019648B true CN110019648B (en) 2021-02-02

Family

ID=67185955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711269292.9A Active CN110019648B (en) 2017-12-05 2017-12-05 Method and device for training data and storage medium

Country Status (1)

Country Link
CN (1) CN110019648B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765244B (en) * 2019-09-18 2023-06-06 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for obtaining answering operation
US11501070B2 (en) 2020-07-01 2022-11-15 International Business Machines Corporation Taxonomy generation to insert out of vocabulary terms and hypernym-hyponym pair induction

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8086549B2 (en) * 2007-11-09 2011-12-27 Microsoft Corporation Multi-label active learning
CN105808525A (en) * 2016-03-29 2016-07-27 国家计算机网络与信息安全管理中心 Domain concept hypernym-hyponym relation extraction method based on similar concept pairs
CN106407211A (en) * 2015-07-30 2017-02-15 富士通株式会社 Method and device for classifying semantic relationships among entity words
CN106570179A (en) * 2016-11-10 2017-04-19 中国科学院信息工程研究所 Evaluative text-oriented kernel entity identification method and apparatus
CN106649819A (en) * 2016-12-29 2017-05-10 北京奇虎科技有限公司 Method and device for extracting entity words and hypernyms
CN106919977A (en) * 2015-12-25 2017-07-04 科大讯飞股份有限公司 A kind of feedforward sequence Memory Neural Networks and its construction method and system
CN106980608A (en) * 2017-03-16 2017-07-25 四川大学 A kind of Chinese electronic health record participle and name entity recognition method and system
WO2017130089A1 (en) * 2016-01-26 2017-08-03 Koninklijke Philips N.V. Systems and methods for neural clinical paraphrase generation
CN107203511A (en) * 2017-05-27 2017-09-26 中国矿业大学 A kind of network text name entity recognition method based on neutral net probability disambiguation
CN107273357A (en) * 2017-06-14 2017-10-20 北京百度网讯科技有限公司 Modification method, device, equipment and the medium of participle model based on artificial intelligence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10176799B2 (en) * 2016-02-02 2019-01-08 Mitsubishi Electric Research Laboratories, Inc. Method and system for training language models to reduce recognition errors

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8086549B2 (en) * 2007-11-09 2011-12-27 Microsoft Corporation Multi-label active learning
CN106407211A (en) * 2015-07-30 2017-02-15 富士通株式会社 Method and device for classifying semantic relationships among entity words
CN106919977A (en) * 2015-12-25 2017-07-04 科大讯飞股份有限公司 A kind of feedforward sequence Memory Neural Networks and its construction method and system
WO2017130089A1 (en) * 2016-01-26 2017-08-03 Koninklijke Philips N.V. Systems and methods for neural clinical paraphrase generation
CN105808525A (en) * 2016-03-29 2016-07-27 国家计算机网络与信息安全管理中心 Domain concept hypernym-hyponym relation extraction method based on similar concept pairs
CN106570179A (en) * 2016-11-10 2017-04-19 中国科学院信息工程研究所 Evaluative text-oriented kernel entity identification method and apparatus
CN106649819A (en) * 2016-12-29 2017-05-10 北京奇虎科技有限公司 Method and device for extracting entity words and hypernyms
CN106980608A (en) * 2017-03-16 2017-07-25 四川大学 A kind of Chinese electronic health record participle and name entity recognition method and system
CN107203511A (en) * 2017-05-27 2017-09-26 中国矿业大学 A kind of network text name entity recognition method based on neutral net probability disambiguation
CN107273357A (en) * 2017-06-14 2017-10-20 北京百度网讯科技有限公司 Modification method, device, equipment and the medium of participle model based on artificial intelligence

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Efficient Training of Very Deep Neural Networks for Supervised Hashing; Ziming Zhang et al.; arXiv:1511.04524v2; 2016-04-21; pp. 1-9 *
Recurrent Highway Networks; Julian Georg Zilly et al.; arXiv:1607.03474v5; 2017-07-04 *
Unsupervised Video Hashing by Exploiting Spatio-Temporal Feature; Chao Ma et al.; International Conference on Neural Information Processing; 2016-09-29; pp. 511-518 *
Research on Semantic Relation Classification Based on LSTM; Hu Xinchen; China Master's Theses Full-text Database, Information Science and Technology; 2016-02-15; I138-2096 *
Research on a Dependency Parsing Model Based on Recurrent Neural Networks; Zhang Junchi; China Master's Theses Full-text Database, Information Science and Technology; 2017-06-15; I138-1573 *
Feature Coupling Generalization and Its Application in Text Mining; Li Yanpeng; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2012-05-15; I138-126 *

Also Published As

Publication number Publication date
CN110019648A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
TWI684148B (en) Grouping processing method and device of contact person
CN111553162B (en) Intention recognition method and related device
CN108228270B (en) Starting resource loading method and device
CN108280458B (en) Group relation type identification method and device
CN104239535A (en) Method and system for matching pictures with characters, server and terminal
CN110019825B (en) Method and device for analyzing data semantics
CN114444579B (en) General disturbance acquisition method and device, storage medium and computer equipment
CN111597804B (en) Method and related device for training entity recognition model
CN108279904A (en) Code compiling method and terminal
CN107633051A (en) Desktop searching method, mobile terminal and computer-readable recording medium
WO2021159877A1 (en) Question answering method and apparatus
CN110852109A (en) Corpus generating method, corpus generating device, and storage medium
CN109241079A (en) Method, mobile terminal and the computer storage medium of problem precise search
CN110019648B (en) Method and device for training data and storage medium
CN110597957B (en) Text information retrieval method and related device
CN115022098A (en) Artificial intelligence safety target range content recommendation method, device and storage medium
CN111241815A (en) Text increment method and device and terminal equipment
CN114428842A (en) Method and device for expanding question-answer library, electronic equipment and readable storage medium
WO2021073434A1 (en) Object behavior recognition method and apparatus, and terminal device
CN107317930A (en) A kind of layout method of desktop icons, device and computer-readable recording medium
CN117093766A (en) Information recommendation method, related device and storage medium of inquiry platform
CN110781274A (en) Question-answer pair generation method and device
CN111723783A (en) Content identification method and related device
CN112750427B (en) Image processing method, device and storage medium
CN116257657B (en) Data processing method, data query method, related device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant