CN111950264B - Text data enhancement method and knowledge element extraction method - Google Patents

Text data enhancement method and knowledge element extraction method

Info

Publication number: CN111950264B
Authority: CN (China)
Prior art keywords: words, word, entity, similarity, text
Legal status: Active
Application number: CN202010777706.4A
Other languages: Chinese (zh)
Other versions: CN111950264A
Inventors: 程良伦, 牛伟才, 王德培, 张伟文
Assignee (original and current): Guangdong University of Technology
Application filed by Guangdong University of Technology; published as application CN111950264A and granted as CN111950264B.

Classifications

    • G06F40/295 Named entity recognition
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/30 Semantic analysis
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods


Abstract

The invention discloses a text data enhancement method and a knowledge element extraction method. The enhancement method screens similar texts from a first supplementary database and a second supplementary database, where the first supplementary database is derived from a knowledge base in a field close to the basic data set and the second supplementary database is derived from synonyms of entity words in the basic data set. The method can efficiently supplement a large amount of basic data from few sources, and a knowledge element extraction model trained on a data set enhanced in this way has higher generalization capability and extraction accuracy.

Description

Text data enhancement method and knowledge element extraction method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a knowledge element extraction technology.
Background
With the rapid development of internet technology, building an industrial domain knowledge base supports domain-specific intelligent question answering and decision-making and promotes the intellectualization of industrial manufacturing. A large amount of electronic text is generated during industrial production and is dispersed across workers' maintenance diagnosis tables, internet communities, and factory databases. If this unstructured and semi-structured electronic text can be organized into a knowledge base of high knowledge density, the utilization of domain knowledge can be greatly improved.
How to process this text quickly and efficiently is a central concern of natural language processing, where named entity recognition is particularly critical. Recognition of domain knowledge element entities extracts important knowledge units, usually the most representative words of a specific domain, from structured and semi-structured text data. Once entities are correctly recognized, relationship extraction, event extraction, and knowledge base construction can proceed. The quality of named entity recognition therefore directly affects subsequent information extraction tasks.
Existing named entity recognition methods fall into three general categories: methods based on rules and dictionaries, methods based on statistical machine learning, and methods based on deep learning. Rule- and dictionary-based methods require enormous manual annotation effort to formulate the rules and dictionaries, are limited by professional knowledge (in certain fields only experts can formulate them), and thus have high recognition cost and low efficiency. Statistical machine learning methods mainly comprise hidden Markov models, maximum entropy models, support vector machines, and conditional random fields; their recognition quality depends chiefly on the feature combinations selected for the model, such as part-of-speech, position, and context features of words, and entity recognition requires a large-scale training corpus. Deep-learning-based entity recognition is currently the mainstream approach: pre-trained word vectors are fed into a neural network, the network layers extract semantics from the text, and the extracted sentence features pass through a global normalization function (softmax) layer or a conditional random field to predict the label of each word. Although deep learning far outperforms statistical machine learning and rule-based methods on named entity recognition, realizing its predictive and generalization capability requires enough high-quality labeled data; otherwise overfitting occurs and the expected recognition accuracy is hard to reach, and the industrial field often lacks sufficient labeled data sets for optimizing the parameters of a training model.
Disclosure of Invention
The invention aims to provide a text data enhancement method that can efficiently supplement basic data from few sources on a large scale, overcomes the accuracy problems caused when supplemental data is too closely correlated with the basic data, and remarkably improves the generalization capability and extraction accuracy of a model.
The invention also aims to provide a method for obtaining accurate knowledge element extraction based on the enhanced text data.
The invention firstly discloses the following technical scheme:
A text data enhancement method comprising a process of screening similar texts from a first supplementary database and a second supplementary database, wherein the first supplementary database is derived from a knowledge base in a field close to the basic data set and the second supplementary database is derived from synonyms of entity words in the basic data set.
An entity word in this scheme is a word that denotes an entity.
The basic data set is a data set that contains some text data and needs data enhancement; preferably it is a data set whose labeling has been completed.
The similar fields are fields whose entity words are the same or similar in terms of products, functions, technical processes, and the like.
One example is the power grid field and the electronics field: the three-phase transformer of the power grid field corresponds, for instance, to the toroidal transformer used in loudspeaker electronics.
Another example, within inorganic nonmetallic materials, is ceramic production and refractory materials: the mullite raw materials required for ceramic production are called kyanite, mullite, sillimanite, and the like in refractory materials, and the series of mullite-forming reactions is the same process in both fields under different names.
Expanding the corpus with texts containing the entities of such similar fields increases the data volume for the entity words and improves the generalization capability of the model.
Such similar-field knowledge bases may come from the internet, raw material recipes, worker operating manuals, and the like.
It will be appreciated that the data in the first and second supplemental databases should be in the form of text.
In some embodiments, the first supplementary database is built by crawling web pages with the entity words it contains as queries, and the second supplementary database by crawling web pages with the synonyms of those entity words.
The web pages in these embodiments are preferably knowledge-rich pages such as Wikipedia.
In some embodiments, the similar text is determined by:
S51: Perform word segmentation and part-of-speech tagging on the short texts from the first supplementary database and the short texts from the second supplementary database, and compute the word-vector cosine similarity between the separated entity words; this is the entity word similarity.
S52: Compute the word-vector cosine similarity between the remaining words (those other than the separated entity words), pair same-part-of-speech words whose similarity exceeds a threshold as overlapped words, and compute the weighted similarity of the overlapped words under part-of-speech features; this is the overlapped word similarity.
Preferably, the threshold in S52 is set to 0.5, i.e. word pairs with similarity greater than 0.5 are overlapped words.
S53: Take a weighted average of the entity word similarity and the overlapped word similarity to obtain the text similarity.
Text similarity is computed iteratively over the texts in the first and second supplementary databases; the two texts with the greatest text similarity in each iteration are similar texts.
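As a concrete illustration, the cosine similarity used in S51/S52 and the weighted average of S53 can be sketched in Python as follows. The equal weighting of the two similarity scores is an assumption; the patent does not specify the weights.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def text_similarity(entity_sim, overlap_sim, w_entity=0.5, w_overlap=0.5):
    """Weighted average of the entity word similarity (S51) and the
    overlapped word similarity (S52); the 0.5/0.5 weights are assumed."""
    return w_entity * entity_sim + w_overlap * overlap_sim
```

Identical vectors score 1.0, orthogonal vectors score 0.0, so the 0.5 threshold of S52 sits midway between unrelated and identical words.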
In some embodiments, the synonyms are obtained by synonym fission, which comprises: obtaining from the corpus words whose word vectors are close in cosine distance to those of the entity words in the basic data set; these are the synonyms of the entity words.
In some embodiments, the number of synonyms per fission is set to 1-4, preferably 3.
In some embodiments, the Word vector is obtained by Word2Vec model conversion.
In some embodiments, the synonym fission is realized by a Word2Vec model.
The Word2Vec model used may be trained on encyclopedia, Baidu, and/or Weibo (microblog) corpora.
Word vectors trained by such a model carry a degree of prior knowledge; synonyms are semantically similar, which shows up concretely as a small cosine distance between their vectors.
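A minimal sketch of synonym fission over an in-memory vector table follows; the toy vectors and words are illustrative, not from the patent, and in practice the vectors would come from a pre-trained Word2Vec model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def synonym_fission(entity, vectors, threshold=0.5, top_k=3):
    """Return up to top_k words whose word-vector cosine similarity to
    `entity` exceeds `threshold`, most similar first (one fission)."""
    query = vectors[entity]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != entity]
    hits = sorted((ws for ws in scored if ws[1] > threshold),
                  key=lambda ws: -ws[1])
    return [w for w, _ in hits[:top_k]]
```

With a real model this amounts to taking the entity vector's nearest neighbours; gensim users would reach for `model.wv.most_similar`, for instance.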
The invention further discloses a knowledge element extraction method realized through a trained extraction model, where the training of the model is based on a labeled data set enhanced by the above data enhancement method.
In some embodiments, the extraction model is a bidirectional long-short-term memory network model.
In some embodiments, the extraction model includes an input layer, a word embedding layer, a bi-directional LSTM layer, and a normalized exponential function layer.
In some embodiments, the input layer is an index of each word in the sentence in a vocabulary, the vocabulary being obtained by traversing all the data.
In addition, to enrich the representation of words, in some specific embodiments the word embedding layer uses pre-trained Chinese word vectors; the training corpus for these vectors is preferably Chinese encyclopedia and Weibo data, and the word-vector dimension is preferably 300.
Also to enrich the representation, in some embodiments the character embeddings and the word embedding of each word are spliced together, where a character means each Chinese character in the word.
The character embeddings are preferably randomly initialized 100-dimensional vectors that are updated during training.
In some embodiments, the hidden layer dimension of the bi-directional LSTM layer is set to 256; the forward LSTM and backward LSTM outputs are spliced together to obtain a 512-dimensional sentence representation.
In some embodiments, the bidirectional LSTM output of each time step is fed into a normalized exponential function (softmax) layer to obtain values between 0 and 1; the label with the largest value is the entity label of that position.
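The per-position label decision described above can be sketched in plain Python; this is an illustrative sketch of the softmax decoding step, not the patent's implementation.

```python
import math

def softmax(logits):
    """Normalized exponential: maps raw scores to values in (0, 1) summing to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, labels):
    """The label with the largest softmax value is the entity label for this position."""
    probs = softmax(logits)
    return labels[probs.index(max(probs))]
```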
The method effectively addresses the lack of sufficient structured knowledge bases in the industrial field. By expanding the training data set through text similarity, it makes it possible to supplement basic data both from prior knowledge bases of similar industrial scenes and through synonyms. By screening and integrating data from these two sources, it not only significantly enlarges the data set but also mitigates the poor model generalization caused by overly strong correlation among entities from a single source, significantly improving model accuracy.
Applying the disclosed enhancement method to manually crawled marine industry news text, with a military industry database as the similar-field knowledge base, an original data set of 1000 samples can be expanded to 1300 samples, and the entity recognition effect improves by 3%.
Drawings
Fig. 1 is a schematic flow chart of a data enhancement method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the extraction model used in embodiment 1 of the present invention.
Detailed Description
The present invention will be described in detail with reference to the following examples and drawings, but it should be understood that the examples and drawings are only for illustrative purposes and are not intended to limit the scope of the present invention in any way. All reasonable variations and combinations that are included within the scope of the inventive concept fall within the scope of the present invention.
Knowledge element extraction is performed through a flow as shown in fig. 1.
Specifically, the text data enhancement is performed first, that is, the data expansion is performed on the basis of the existing text data set.
An existing text data set, i.e. the basic data set, may be obtained by collecting electronic text generated during industrial production, such as text scattered among workers' maintenance diagnosis lists, internet communities, factory databases, and so on.
The method operates on the entity words in the basic data set, so the entity words in its samples must first be labeled; that is, a labeled data set is obtained first.
Based on the annotation dataset, data enhancement is performed by:
S1 selecting entity word library of one of the supplementary data sources
The entity word library may come from an existing knowledge base whose industrial field is similar to that of the basic data set, and it should contain multiple entity words under different entity types. For example, one may select an existing knowledge base that, under entity types 1, 2, …, k, contains entity words 1_0, 1_1, …, 1_n; entity words 2_0, 2_1, …, 2_m; …; and entity words k_1, k_2, …, k_l, respectively. This entity word library is the first supplementary data source.
S2, selecting entity words to be expanded in annotation data set
The labeled data set is a set of short texts in which the non-entity words of each text are labeled 0 and the entity words are labeled Yn, where n denotes the entity type of the word.
The entity types in S1 and S2 can be determined or adjusted per situation. For entity recognition in a general scene, the types may be time, place, person, organization, and so on; when the labeled data set has few samples, the types can be further fine-tuned for the application field.
The entity words to be expanded are then selected from the labeled data set.
S3, carrying out word vector conversion on the entity words to be expanded
The conversion may be implemented with the Word2Vec model proposed by Google in 2013.
Trained at scale on massive data sets, the model can quickly and effectively express a word in vector form and supports word clustering and related functions well.
Word vectors obtained through the word embedding operation of Word2Vec can be understood as distributed, low-dimensional, dense real-valued vectors; vectors representing similar semantics lie closer in cosine distance, so similarity between words can be calculated by comparing their word vectors.
S3, performing synonym fission on the entity word to be expanded
Synonym fission can be realized directly through Word2Vec computation, generating several synonyms similar to each entity word; the synonym library composed of these synonyms is the second supplementary data source.
During synonym fission, the number of synonyms per fission should not be set too large; otherwise the fissioned words lose their semantic relevance to the original entity words.
The Word2Vec model here is jointly trained on 256 GB of encyclopedia, Baidu, and Weibo (microblog) corpora.
The similarity threshold may be set to 0.5, i.e., the cosine similarity of different Word vectors is calculated by Word2Vec model, and when it is greater than 0.5, the words are considered similar words.
In each fission, the first 3 similar words with the highest similarity are preferably taken.
Words obtained through synonym fission are often strongly correlated with the original entity words (the entity words to be expanded); sending them directly into the model as a supplementary data source for training would reduce the model's generalization.
Therefore, after synonyms are obtained, the invention further screens the alternative supplementary data generated from the different supplementary data sources, yielding a better supplementation effect and noticeably improving the generalization capability of the model. The specific process is as follows:
S4: obtaining alternative supplementary data
This may further comprise:
S41: Select one entity word in the second supplementary data source and k entity words belonging to the k entity types in the first supplementary data source, and crawl web page texts for each entity word independently. The web pages are preferably knowledge-rich pages such as Wikipedia, and the crawled content format is set to short texts; the resulting short texts form the second supplementary database and the first supplementary database according to the source of the entity word used for crawling.
To reduce text noise and improve recognition, the length of the crawled short texts can be fine-tuned to suit the field being trained.
S42: The short texts in the first and second supplementary databases are each subjected to word segmentation, stop-word removal, and part-of-speech tagging to reduce the influence of text noise. This can be done with the LTP Chinese natural language processing toolkit, which provides word segmentation, part-of-speech tagging, dependency parsing, and other functions.
S43: Store the words appearing in all texts of the first and second supplementary databases into a word list, build a word list index, and convert each word in the list into its corresponding word vector through the pre-trained Word2Vec.
S5: obtaining expansion data from the alternative supplementary data and adding it to the labeled data set
Specifically, the text similarity between the word vectors in the word list corresponding to the first supplementary database and those corresponding to the second supplementary database is calculated; the texts with the greatest text similarity between the two databases are retained and added to the labeled data set, thereby expanding it.
The text similarity can be calculated by:
S51: the separation of the entity words from the short text a from the first supplementary database and the short text B from the second supplementary database may also be achieved by finding the vector matrix corresponding to the entity words directly from the pre-trained word vectors. And then calculating the cosine similarity of the word vectors of the separated entity words, wherein the cosine similarity is as follows:
Where a k represents the kth alternative entity term in short text A, B represents the entity term in short text B from the synonym pool of fission, and t represents the dimension.
S52: all words in A, B text except entity words are extracted, and the process can be realized through the word segmentation tool of the LTP toolkit. And then, respectively calculating the word vector cosine similarity of all words except the entity through the formula (1), using the same part-of-speech word pairs with the similarity larger than a threshold value as an overlapped word list, and then carrying out part-of-speech tagging weighted calculation on the overlapped word list, wherein the part-of-speech word similarity calculation formula is as follows:
wherein W is a keyword which represents the same part of speech obtained after cosine similarity calculation is carried out on two texts, m and n represent lengths of the two texts, a i and b i represent keywords with the same part of speech, and p t.w represents a part of speech tagging weight value.
To reduce the impact of irrelevant words on the A, B text similarity score, words with cosine similarity below a certain threshold between words will not be calculated as weighted scores.
The similarity threshold in step S52 may be set to 0.5.
S53: the entity term similarity obtained through S51 and the overlapping term similarity score obtained through S52 are weighted averaged as follows:
namely, the text similarity.
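Because the image of formula (2) did not survive extraction, the following is only a hedged sketch of the overlapped-word score of S52: the normalization by combined text length and the part-of-speech weight values are assumptions, not the patent's exact formula.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def overlap_similarity(words_a, words_b, vectors, pos_weights, threshold=0.5):
    """Overlapped-word similarity sketch (S52).

    words_a / words_b: lists of (word, pos) pairs with entity words removed.
    pos_weights: part-of-speech tagging weight p_t.w per POS tag (assumed values).
    Same-POS pairs whose cosine similarity exceeds the threshold contribute
    their POS-weighted similarity; pairs below the threshold are excluded.
    The normalization by the combined text length (m + n) is an assumption.
    """
    score = 0.0
    for wa, pa in words_a:
        for wb, pb in words_b:
            if pa != pb:
                continue
            sim = cosine(vectors[wa], vectors[wb])
            if sim > threshold:
                score += pos_weights.get(pa, 1.0) * sim
    m, n = len(words_a), len(words_b)
    return 2.0 * score / (m + n) if (m + n) else 0.0
```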
The iterative calculation proceeds as follows:
Fix one short text in the second supplementary database, iterate over the different short texts in the first supplementary database, calculate the text similarity for each in turn, and retain the text with the highest similarity score.
Then fix the next short text in the second supplementary database and repeat, again retaining the text with the highest similarity score.
And so on.
Each retained pair, i.e. the first-database text and second-database text with the greatest text similarity score, is assigned to the same type in the labeled data set, and all such texts are added to the labeled data set.
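The iteration above amounts to a greedy best-match loop; a sketch with an injected similarity function (the function names are illustrative):

```python
def match_similar_texts(first_db, second_db, similarity):
    """For each short text in the second supplementary database, retain the
    first-database text with the highest text-similarity score."""
    retained = []
    for text_b in second_db:
        best = max(first_db, key=lambda text_a: similarity(text_a, text_b))
        retained.append((text_b, best))
    return retained
```

In the patent's pipeline, `similarity` would be the weighted S51/S52/S53 score, and both members of each retained pair are added to the labeled data set.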
S6 model training
After the text data enhancement is completed, a training model is built based on the extended data set as follows:
S61
Feed the expanded data set into a bidirectional long short-term memory network (BiLSTM) model to extract the semantic information of the short texts.
LSTM is an improved variant of the recurrent neural network (RNN); it effectively alleviates the loss of information over long sequences during RNN training and can extract the textual features of an input sequence and the implicit associations between its words.
In particular, the BiLSTM model may include an input layer, a word embedding layer, a bi-directional LSTM layer, and a normalized exponential function layer.
The input layer is the index of each word of the sentence in a word list, obtained by traversing all the data. The word embedding layer uses pre-trained Chinese word vectors; the training corpus is Chinese encyclopedia and Weibo data, and the word-vector dimension is 300. After the input layer, each word's character embeddings and word embedding are spliced together; the character embeddings are randomly initialized 100-dimensional vectors that are updated during training. The spliced vector matrix serves as the final input representation of the word. The hidden layer dimension of the bidirectional LSTM layer is set to 256, and splicing the forward and backward LSTM outputs yields a 512-dimensional sentence representation. Feeding the bidirectional LSTM output of each time step into the normalized exponential function layer yields values between 0 and 1; the label with the largest value is the entity label of that position.
The bidirectional LSTM network layer contains three control gates, namely a forget gate, a memory gate, and an output gate, through which the information flow is processed.
Specifically, it includes:
Forget gate control:
The forget gate selectively forgets input information in combination with the current input, i.e. it forgets unimportant information and keeps important information. It is realized by the following formula:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f),
where W_f is a weight matrix to be learned, h_{t-1} is the hidden state of the last time step, x_t is the input of the current step, σ is the sigmoid function, and b_f is the bias.
Memory gate control:
The memory gate selectively retains information from the input x_t of the current time step, i.e. it memorizes the important information in the current input and discards the unimportant. It is realized by the following formulas:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
where W_i and b_i are weight parameters to be learned, and c̃_t is the temporary cell state of the current time step, used to update the current cell state.
Output gate control:
The output gate determines which information up to the current time step is output. The cell state of the current time step is first computed as the sum of the product of the last step's cell state with the current forget gate f_t and the product of the current memory gate i_t with the temporary cell state:
C_t = f_t · C_{t-1} + i_t · c̃_t
The output gate itself is:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o),
where W_o and b_o are weight parameters to be learned.
The hidden state of the current step is then calculated as:
h_t = o_t · tanh(C_t).
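The gate equations above can be traced with a scalar toy implementation; real layers use weight matrices, but scalars keep the arithmetic visible. The parameter names in `p` are illustrative stand-ins for W_f, W_i, W_c, W_o and their biases.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate equations above (scalar sketch)."""
    f_t = sigmoid(p["w_f"] * x_t + p["u_f"] * h_prev + p["b_f"])        # forget gate f_t
    i_t = sigmoid(p["w_i"] * x_t + p["u_i"] * h_prev + p["b_i"])        # memory gate i_t
    c_tilde = math.tanh(p["w_c"] * x_t + p["u_c"] * h_prev + p["b_c"])  # temporary cell state
    c_t = f_t * c_prev + i_t * c_tilde                                  # cell state C_t
    o_t = sigmoid(p["w_o"] * x_t + p["u_o"] * h_prev + p["b_o"])        # output gate o_t
    h_t = o_t * math.tanh(c_t)                                          # hidden state h_t
    return h_t, c_t
```

Running the step forward over a sequence and then backward over its reversal, and splicing the two hidden states per position, gives the bidirectional representation described above.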
The forward and backward hidden states of each word obtained through the bidirectional LSTM network are spliced together as input to the normalized exponential function (softmax) layer. After softmax, sequence prediction is performed on the input short text and the label of each position is output; that is, for each word of the input sequence, the entity type is output if the word is an entity, and 0 otherwise.
S7 knowledge element extraction
And (5) extracting knowledge elements through the model after the training is completed in the step S6.
Example 1
Data expansion is performed based on the following noted data sets:
Table 1: training data sample
The entity word "transformer" is selected from the above samples and converted into a word vector with the Word2Vec tool.
The vector matrix corresponding to the entity word is found via the pre-trained word vectors, and similar words are obtained with a cosine similarity algorithm using the HIT LTP tool, realizing word fission of the entity word "transformer". Words with similarity greater than 0.5 are considered synonyms, and the number of synonyms per fission is set to 3; one fission can yield the synonyms "three-phase transformer", "transformer coil", and "oil-immersed transformer".
The electronic device database is selected as the third-party entity library, and the entity words "toroidal transformer", "voltage transformer" and "thermal relay" of the entity type "equipment" are selected in turn as alternative words.
Wikipedia pages are crawled using the synonym "three-phase transformer" and the alternative words "toroidal transformer", "voltage transformer" and "thermal relay" as queries, yielding the following short texts:
"The three-phase power supply system in China mostly adopts three-phase power transformers to control the voltage change requirement in the long-distance transmission process, but the three-phase power transformers often fail due to the asymmetry of three-phase loads."
"Sony Corporation employs the most advanced toroidal transformers to handle different wave frequency sources to prevent unpredictable failures."
"In order to save the cost of the voltage transformer, the primary-side and secondary-side windings reduce the voltage level of the primary side, so that the conversion between strong and weak current can be realized."
"If the starting state of the electric appliance is frequently changed during use, a thermal relay with larger power is generally selected; otherwise, faults are easily caused."
The crawled short texts are processed with the HIT LTP toolkit as follows: word segmentation, stop-word removal, and part-of-speech tagging based on the segmentation; all resulting words are stored in a vocabulary, and indexes are established as follows:
Most of the power supply systems in China adopt three-phase electric transformers to control the voltage change requirement in the long-distance transmission process, but the three-phase electric transformers are often in failure due to the asymmetry of three-phase loads. ]
[index1,index2,……,indexN]
[ Sony Corp uses the most advanced toroidal transformers to handle different wave frequency sources to prevent unpredictable failures ]
[index1,index2,……,indexN]
If the starting state of the electric appliance is frequently changed in the using process, a thermal relay with larger power is generally selected, otherwise, faults are easily caused.
[index1,index2,……,indexN]
……。
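The segmentation-and-indexing step above can be sketched as follows. Whitespace splitting stands in for the LTP segmenter (this sketch does not call LTP), and the stop-word list is illustrative; each sentence is mapped to the index list [index1, index2, ..., indexN] shown above.

```python
def build_index(sentences, stop_words):
    """Tokenize each sentence (whitespace stands in for LTP segmentation),
    drop stop words, store words in a shared vocabulary, and return each
    sentence as a list of vocabulary indices."""
    vocab = {}
    indexed = []
    for sent in sentences:
        ids = []
        for word in sent.split():
            if word in stop_words:
                continue
            if word not in vocab:
                vocab[word] = len(vocab)   # assign the next free index
            ids.append(vocab[word])
        indexed.append(ids)
    return indexed, vocab
```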
All words in the vocabulary are converted into word vectors by the pre-trained Word2Vec model.
Text similarity is calculated between the texts crawled for the synonym "three-phase transformer" and the texts crawled for the alternative words "toroidal transformer", "voltage transformer" and "thermal relay", and the texts with the maximum similarity are retained, for example:
"sony corporation employs the most advanced toroidal transformers to handle different wave frequency sources to prevent unpredictable failures. And if the starting state of the electric appliance is frequently changed in the using process, a thermal relay with larger power is generally selected, otherwise, faults are easy to cause. "
The retained text, in the annotation format of the dataset, is added to the labeled samples as a supplementary sample to obtain the expanded dataset.
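The retain-the-most-similar-pair step can be sketched as below. A simple word-overlap (Jaccard) similarity is used here as a stand-in for the weighted word-vector similarity of steps S51-S53, which requires pre-trained vectors; the helper names are illustrative.

```python
def jaccard(a, b):
    """Word-overlap similarity; a stand-in for the weighted
    word-vector text similarity of steps S51-S53."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def most_similar_pair(texts_a, texts_b, sim=jaccard):
    """Compare every text crawled for a synonym against every text
    crawled for the alternative words and keep the pair with the
    largest similarity."""
    return max(((a, b) for a in texts_a for b in texts_b),
               key=lambda pair: sim(*pair))
```

In the embodiment, the retained member of the best pair is then added to the labeled samples as a supplementary sample.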
Similarly, the same processing is performed with the synonym "transformer coil" and the alternative words "Tesla coil", "inductance coil" and "contactor coil" to obtain a supplementary sample:
"The transformer coil has high requirements on the insulation performance of the winding, the most important being sufficient electric strength; the principle of the inductance coil is electromagnetic induction, which places certain requirements on the frequency of the signal passing through the coil, i.e., it passes low frequencies and blocks high frequencies."
Similarly, crawling is completed for all synonyms and all alternative words in turn, and through the above iterative process the following supplementary samples are obtained:
"The three-phase power supply system in China mostly adopts three-phase power transformers to control the voltage change requirement in the long-distance transmission process, but the three-phase power transformers often fail due to the asymmetry of three-phase loads."
"Sony Corporation employs the most advanced toroidal transformers to handle different wave frequency sources to prevent unpredictable failures."
"If the starting state of the electric appliance is frequently changed during use, a thermal relay with larger power is generally selected; otherwise, faults are easily caused."
"The transformer coil has high requirements on the insulation properties of the windings, the most important point being sufficient electric strength."
"The principle of the inductance coil is electromagnetic induction, which places certain requirements on the frequency of the signal passing through the coil, i.e., it passes low frequencies and blocks high frequencies."
The expanded dataset, including the original text "the transformer will fail at high temperature in summer" and the supplementary samples, is input into the BiLSTM model shown in Figure 2 for training. After the original text is input, the entity "transformer" is labeled Yn in the model output and the other words are labeled 0, as shown in Table 2, which is consistent with the actual situation.
Table 2: prediction output sample
The sentence "the transformer can fail at high temperature in summer" is input into the trained model, and the output is 0000Yn00000000, where Yn denotes the entity type "equipment", which shows that the knowledge element extraction method is accurate and effective.
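Decoding such a model output back into knowledge elements can be sketched as follows: each position whose tag is not "0" marks an entity word of that type (e.g. Yn for equipment). The English token list is a hypothetical stand-in for the segmented Chinese input.

```python
def extract_entities(words, tags):
    """Map per-word model tags back to knowledge elements: any tag
    other than "0" marks an entity of that type."""
    return [(word, tag) for word, tag in zip(words, tags) if tag != "0"]
```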
Further, data enhancement is performed on manually crawled marine-industry news texts using the process of this embodiment, with a military-industry database selected as the third-party entity database. The results show that the dataset can be expanded from 1000 original samples to 1300 samples, and the entity recognition performance of the model is improved by 3%.
The above examples are only preferred embodiments of the present invention, and the scope of the present invention is not limited to them. All technical solutions falling under the concept of the invention belong to the protection scope of the invention. It should be noted that modifications and adaptations that do not depart from the principles of the present invention will occur to those skilled in the art and are also intended to be within the scope of the present invention.

Claims (8)

1. A text data enhancement method, characterized in that it comprises a process of screening similar text from a first supplementary database derived from a knowledge base in a domain close to the basic dataset and a second supplementary database derived from synonyms of entity words in the basic dataset;
wherein the first supplementary database is obtained by web crawling with the entity words contained therein, and the second supplementary database is obtained by web crawling with the synonyms of the entity words contained therein; the similar text is determined by the following process:
S51: word vector cosine similarity between the separated entity words, namely entity word similarity, is calculated by dividing and marking short texts from the first supplementary database and short texts from the second supplementary database;
S52: calculating the word-vector cosine similarity among the other words besides the separated entity words, pairing words of the same part of speech whose similarity is greater than a threshold as overlapped words, and calculating the weighted similarity of the overlapped words under part-of-speech features, i.e., the overlapped-word similarity;
S53: carrying out weighted average on the entity word similarity and the overlapped word similarity to obtain text similarity;
performing iterative text-similarity calculation on the texts in the first supplementary database and the second supplementary database, wherein the two texts with the largest text similarity obtained in each iteration are the similar texts;
wherein the synonyms are obtained through synonym fission, the synonym fission comprising: obtaining from a corpus words whose word vectors are close in cosine similarity to those of the entity words in the basic dataset, i.e., the synonyms of the entity words; the number of synonyms per fission is set to 1-4.
2. The text data enhancement method according to claim 1, wherein: the number of synonyms per fission is set to 3.
3. The text data enhancement method according to claim 1 or 2, characterized in that: the Word vector is obtained through Word2Vec model conversion.
4. A knowledge element extraction method is characterized in that: the extraction method is implemented by a trained extraction model, the training of which is based on a labeled dataset enhanced by the text data enhancement method of any of claims 1-3.
5. The knowledge element extraction method according to claim 4, wherein: the extraction model is a bidirectional long-short-term memory network model.
6. The knowledge element extraction method according to claim 5, wherein: the extraction model comprises an input layer, a word embedding layer, a bidirectional LSTM layer and a normalized exponential function layer, wherein the input layer is an index of each word in a sentence in a vocabulary, and the word embedding layer uses a pre-trained Chinese word vector.
7. The knowledge element extraction method according to claim 6, wherein: the dimension of the word vector is set to 300 dimensions, and the hidden layer dimension of the bi-directional LSTM layer is set to 256 dimensions.
8. The knowledge element extraction method according to claim 6, wherein: the input form of the input layer is the combination of characters and words.
CN202010777706.4A 2020-08-05 2020-08-05 Text data enhancement method and knowledge element extraction method Active CN111950264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010777706.4A CN111950264B (en) 2020-08-05 2020-08-05 Text data enhancement method and knowledge element extraction method

Publications (2)

Publication Number Publication Date
CN111950264A CN111950264A (en) 2020-11-17
CN111950264B true CN111950264B (en) 2024-04-26

Family

ID=73339486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010777706.4A Active CN111950264B (en) 2020-08-05 2020-08-05 Text data enhancement method and knowledge element extraction method

Country Status (1)

Country Link
CN (1) CN111950264B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632993A (en) * 2020-11-27 2021-04-09 浙江工业大学 Electric power measurement entity recognition model classification method based on convolution attention network
CN113158648A (en) * 2020-12-09 2021-07-23 中科讯飞互联(北京)信息科技有限公司 Text completion method, electronic device and storage device
CN113221574A (en) * 2021-05-31 2021-08-06 云南锡业集团(控股)有限责任公司研发中心 Named entity recognition method, device, equipment and computer readable storage medium
CN113779959B (en) * 2021-08-31 2023-06-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Small sample text data mixing enhancement method
CN113901207B (en) * 2021-09-15 2024-04-26 昆明理工大学 Adverse drug reaction detection method based on data enhancement and semi-supervised learning
CN114706975A (en) * 2022-01-19 2022-07-05 天津大学 Text classification method for power failure news by introducing data enhancement SA-LSTM
CN116541535A (en) * 2023-05-19 2023-08-04 北京理工大学 Automatic knowledge graph construction method, system, equipment and medium

Citations (7)

Publication number Priority date Publication date Assignee Title
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
CN108052593A (en) * 2017-12-12 2018-05-18 山东科技大学 A kind of subject key words extracting method based on descriptor vector sum network structure
CN109284396A (en) * 2018-09-27 2019-01-29 北京大学深圳研究生院 Medical knowledge map construction method, apparatus, server and storage medium
CN109376352A (en) * 2018-08-28 2019-02-22 中山大学 A kind of patent text modeling method based on word2vec and semantic similarity
CN109408642A (en) * 2018-08-30 2019-03-01 昆明理工大学 A kind of domain entities relation on attributes abstracting method based on distance supervision
CN110502644A (en) * 2019-08-28 2019-11-26 同方知网(北京)技术有限公司 A kind of field level dictionary excavates the Active Learning Method of building
CN111143574A (en) * 2019-12-05 2020-05-12 大连民族大学 Query and visualization system construction method based on minority culture knowledge graph

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US11003638B2 (en) * 2018-10-29 2021-05-11 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for building an evolving ontology from user-generated content
WO2020154529A1 (en) * 2019-01-23 2020-07-30 Keeeb Inc. Data processing system for data search and retrieval augmentation and enhanced data storage

Non-Patent Citations (2)

Title
"Sentence semantic similarity method based on the synonym thesaurus Cilin and its application in question answering systems"; Zhou Yanping et al.; Computer Applications and Software; 2019-08-12; Vol. 36, No. 08; pp. 65-68, 81 *
"Identification of identical product features based on multi-dimensional similarity and sentiment word expansion"; Hu Longmao et al.; Journal of Shandong University (Engineering Science); 2020-03-23; Vol. 50, No. 02; pp. 50-59 *

Also Published As

Publication number Publication date
CN111950264A (en) 2020-11-17

Similar Documents

Publication Publication Date Title
CN111950264B (en) Text data enhancement method and knowledge element extraction method
CN110222160B (en) Intelligent semantic document recommendation method and device and computer readable storage medium
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
US8868469B2 (en) System and method for phrase identification
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN110334178B (en) Data retrieval method, device, equipment and readable storage medium
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN106126620A (en) Method of Chinese Text Automatic Abstraction based on machine learning
CN113221559B (en) Method and system for extracting Chinese key phrase in scientific and technological innovation field by utilizing semantic features
CN112507109A (en) Retrieval method and device based on semantic analysis and keyword recognition
CN115858758A (en) Intelligent customer service knowledge graph system with multiple unstructured data identification
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN113609844A (en) Electric power professional word bank construction method based on hybrid model and clustering algorithm
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114997288A (en) Design resource association method
CN115859980A (en) Semi-supervised named entity identification method, system and electronic equipment
CN110929518A (en) Text sequence labeling algorithm using overlapping splitting rule
Arora et al. Artificial Intelligence as Legal Research Assistant.
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN109189820A (en) A kind of mine safety accidents Ontological concept abstracting method
CN115017425B (en) Location search method, location search device, electronic device, and storage medium
CN116797195A (en) Work order processing method, apparatus, computer device, and computer readable storage medium
CN113254586B (en) Unsupervised text retrieval method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant