CN110020438B - Sequence identification based enterprise or organization Chinese name entity disambiguation method and device


Info

Publication number: CN110020438B
Application number: CN201910297022.1A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN110020438A (application publication)
Inventor: 顾凌云
Assignee (original and current): Shanghai IceKredit Inc
Legal status: Active (application granted)
Prior art keywords: data, synonym, training, word, words

Classifications

    • G06F40/295 Named entity recognition (under G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F40/30 Semantic analysis (under G06F40/00 Handling natural language data)

Abstract

The invention provides a method and a device for disambiguating enterprise or organization Chinese name entities based on sequence identification, wherein the method comprises the following steps: crawling an open news data set and cleaning it to obtain cleaned data; extracting the entity words in the cleaned data to obtain preliminary normative data; setting semantic template rules and screening the preliminary normative data to obtain data to be normalized; determining the synonym standard words and synonym abbreviations in the data to be normalized, and thereby the synonym pairs it contains; setting a data labeling strategy, labeling the data to be normalized, and adding manually constructed data to obtain training data; pre-training character vectors and word vectors, and combining the two in the vertical direction to obtain new vectors; building a model on an Encoder-Decoder structure, training it on the preprocessed training data, and saving the model with the optimal metrics; and predicting samples to be predicted with the optimal-metric model.

Description

Sequence identification based enterprise or organization Chinese name entity disambiguation method and device
Technical Field
The invention relates to the technical field of entity disambiguation, and in particular to a method and a device for disambiguating enterprise or organization Chinese name entities based on sequence identification.
Background
Entity disambiguation aims to avoid confusion in semantic understanding caused by identical nouns that carry different meanings. In recent years, with the development of artificial intelligence technology, the market demand for accurately identifying Chinese synonyms in long texts has grown increasingly clear, and the demand is especially urgent in the legal and financial industries. With the development of natural language processing technology, more and more entity disambiguation methods have appeared in the Chinese field; the market currently offers entity disambiguation methods based on text classification and methods that combine knowledge bases with deep learning. However, these techniques share a drawback: they convert the entity disambiguation problem into a text classification problem, behind which lie the following issues: 1. Models from the machine learning field cannot extract text context features well. 2. When the task is converted into text classification, the ambiguity of every entity word must be judged, which requires a large and complex knowledge base as support. Such a situation complicates the engineering a project requires and, from the standpoints of cost control and performance, lacks good applicability.
In recent years, the rise of sequence models with an Encoder-Decoder structure for text language modeling has brought a new idea to Chinese entity disambiguation. This model structure treats the text as a sequence to be processed: the input is a pre-constructed short text containing several same-name entity words, and the output is a character label corresponding to each character in the short text. By extracting Chinese character features, the method incorporates text context information into model training well. Meanwhile, the position embedding over text tokens proposed with Google's Transformer, used as a positional feature of phrases, together with neural network training and embedded multi-layer attention (that is, the Transformer model structure), has made better results possible for entity disambiguation techniques based on the sequence recognition idea.
Disclosure of Invention
The invention aims to provide a sequence-identification-based method and device for disambiguating enterprise or organization Chinese name entities that overcome the above problems or at least partially solve any one of them.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
One aspect of the present invention provides a method for disambiguating enterprise or organization Chinese name entities based on sequence identification, comprising: crawling an open news data set and performing data cleaning on it to obtain cleaned data; extracting the entity words in the cleaned data to obtain preliminary normative data, wherein the entity words include at least one of: company names (COM) and organization names (ORG); setting semantic template rules and screening the preliminary normative data to obtain data to be normalized; determining the synonym standard words and synonym abbreviations in the data to be normalized, and thereby the synonym pairs it contains; setting a data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data enhancement to obtain training data; pre-training character vectors and word vectors with the preliminary normative data, and combining them in the vertical direction to obtain new vectors; preprocessing the training data; building a model on an Encoder-Decoder structure, training it on the preprocessed training data, and saving the model with the optimal metrics; and predicting samples to be predicted with the optimal-metric model, using a beam search strategy to select the highest-probability sequence as the output, thereby obtaining the synonym sequence of the sample to be predicted.
Crawling an open news data set and performing data cleaning on it to obtain cleaned data includes: crawling public domestic, economic, and scientific news data, removing special characters and meaningless symbols, checking for null values, and removing any records with null values, to obtain the cleaned data. Extracting the entity words in the cleaned data to obtain preliminary normative data includes: processing the cleaned data with a pre-trained Chinese named entity recognition model, extracting the company-name and organization-name entity words in each sentence as supplementary training corpora, and collecting and sorting long and short sentences, with the character count of each sentence kept within a preset limit. Setting semantic template rules and screening the preliminary normative data to obtain the data to be normalized includes: setting the template rules "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>", manually constructing a dictionary of high-frequency verbs in the field, and screening the data to obtain the data to be normalized. And/or, determining the synonym standard words and synonym abbreviations in the data to be normalized, and the synonym pairs it contains, includes: taking the entity word with the larger number of characters in a sentence of the data to be normalized as the synonym standard word; if another entity word in the sentence belongs to the same class as the synonym standard word, every character of that entity word is contained in the synonym standard word, and its character count is greater than 1, taking that entity word as a synonym abbreviation and determining that the synonym standard word and the synonym abbreviation in the sentence form a synonym pair.
Setting a data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data enhancement to obtain training data includes: in each sentence of the data to be normalized, labeling the first character of the synonym standard word E1 and its other characters I1, labeling the first character of the synonym abbreviation E2 and its other characters I2, and labeling all characters not in the synonym pair O; adding manually supplemented data that conforms to the semantic template rules to the preliminary normative data to obtain the training data, wherein the supplemented data and the data to be normalized are shuffled together by random arrangement. Pre-training character vectors and word vectors with the preliminary normative data and combining them in the vertical direction to obtain new vectors includes: training character vectors with a Skip-gram Word2vec model, adding the entity words to the dictionary formed after word segmentation and training word vectors, and combining the character vectors and the word vectors in the vertical direction to obtain the new vectors. And/or, preprocessing the training data includes: separating the training data into label sequences and Chinese sequences, filtering stop words from the Chinese sequences, building a dictionary, and encoding the text sequences by dictionary index.
Building a model on an Encoder-Decoder structure, training it on the preprocessed training data, and saving the optimal-metric model includes: adopting a model with an Encoder-Decoder structure; in the Encoder, extracting sequence features through convolutional neural networks with kernel sizes of 3, 4, and 5, serializing each through a bidirectional recurrent neural network, and adding self-attention to generate the corresponding attention weights as the intermediate state output by the Encoder; at the Decoder end, forming the Decoder from 2 layers of bidirectional recurrent neural networks, feeding the target sequence of the previous moment into the Decoder, and combining it with the intermediate state layer to generate the target sequence of the next time step.
Predicting the sample to be predicted with the optimal-metric model includes: using a beam search strategy with the beam size set to 3, and selecting the highest-probability sequence as the output, to obtain the synonym sequence of the sample to be predicted.
In another aspect, the present invention provides a device for disambiguating enterprise or organization Chinese name entities based on sequence identification, comprising: a data set construction module for crawling an open news data set and performing data cleaning on it to obtain cleaned data, extracting the entity words in the cleaned data to obtain preliminary normative data (wherein the entity words include at least one of: company names COM and organization names ORG), setting semantic template rules and screening the preliminary normative data to obtain data to be normalized, and determining the synonym standard words and synonym abbreviations in the data to be normalized and the synonym pairs it contains; a data labeling module for setting a data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data enhancement to obtain training data; a vector training module for pre-training character vectors and word vectors with the preliminary normative data and combining them in the vertical direction to obtain new vectors; a preprocessing module for preprocessing the training data; a model training module for building a model on an Encoder-Decoder structure, training it on the preprocessed training data, and saving the optimal-metric model; and a prediction module for predicting samples to be predicted with the optimal-metric model, using a beam search strategy to select the highest-probability sequence as the output and obtain the synonym sequence of the sample to be predicted.
The data set construction module crawls the open news data set and performs data cleaning on it as follows: the module crawls public domestic, economic, and scientific news data, removes special characters and meaningless symbols, checks for null values, and removes any records with null values, to obtain the cleaned data. The data set construction module extracts the entity words in the cleaned data as follows: the module processes the cleaned data with a pre-trained Chinese named entity recognition model, extracts the company-name and organization-name entity words in each sentence as supplementary training corpora, and collects and sorts long and short sentences, with the character count of each sentence kept within a preset limit. The data set construction module sets the semantic template rules and screens the preliminary normative data as follows: the module screens the data by setting the template rules "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>" and manually constructing a dictionary of high-frequency verbs in the field, to obtain the data to be normalized. And/or, the data set construction module determines the synonym standard words, synonym abbreviations, and synonym pairs in the data to be normalized as follows: the module takes the entity word with the larger number of characters in a sentence as the synonym standard word; if another entity word in the sentence belongs to the same class as the synonym standard word, every character of it is contained in the synonym standard word, and its character count is greater than 1, the module takes it as a synonym abbreviation and determines that the two form a synonym pair.
The data labeling module sets the data labeling strategy, labels the data to be normalized, and adds manually constructed data for data enhancement as follows: in each sentence of the data to be normalized, the module labels the first character of the synonym standard word E1 and its other characters I1, labels the first character of the synonym abbreviation E2 and its other characters I2, and labels all characters not in the synonym pair O; manually supplemented data conforming to the semantic template rules is added to the preliminary normative data to obtain the training data, with the supplemented data and the data to be normalized shuffled together by random arrangement. The vector training module pre-trains the character vectors and word vectors with the preliminary normative data and combines them as follows: the module trains character vectors with a Skip-gram Word2vec model, adds the entity words to the dictionary formed after word segmentation, trains word vectors, and combines the character vectors and the word vectors in the vertical direction to obtain the new vectors. And/or, the preprocessing module preprocesses the training data as follows: the module separates the training data into label sequences and Chinese sequences, filters stop words from the Chinese sequences, builds a dictionary, and encodes the text sequences by dictionary index.
The model training module builds and trains the Encoder-Decoder model and saves the optimal-metric model as follows: the module adopts a model with an Encoder-Decoder structure; in the Encoder, sequence features are extracted through convolutional neural networks with kernel sizes of 3, 4, and 5, each serialized through a bidirectional recurrent neural network, and self-attention is added to generate the corresponding attention weights as the intermediate state output by the Encoder; at the Decoder end, the Decoder is formed from 2 layers of bidirectional recurrent neural networks, and the target sequence of the previous moment is fed into the Decoder and combined with the intermediate state layer to generate the target sequence of the next time step.
The prediction module predicts the sample to be predicted with the optimal-metric model as follows: the module uses a beam search strategy with the beam size set to 3, and selects the highest-probability sequence as the output, to obtain the synonym sequence of the sample to be predicted.
Therefore, the method and device for disambiguating enterprise or organization Chinese name entities based on sequence identification provided by the embodiments of the invention build an effective data set, used as the basis for model training, by constructing simple semantic templates and a manual data labeling scheme and crawling news corpora; construct pre-trained character and word vectors from 5 million self-crawled news sentences, with the vector dimension kept at 500; and carry out model construction, providing a model structure that extracts the context information of text sequences well from the language model.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the sequence identification based enterprise or organization Chinese name entity disambiguation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the data set construction process provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of a data annotation strategy according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a model Encoder end according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a model Decoder according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of the sequence identification based enterprise or organization Chinese name entity disambiguation device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The method and device for disambiguating enterprise or organization Chinese name entities based on sequence identification provided by the embodiments of the invention aim to achieve the following:
Firstly, in a small and micro enterprise public opinion gathering system, when a user enters a sentence to search for an enterprise or organization name, the system identifies whether entity words of the same kind occur; if they do, the system regards all such names as the same object, thereby avoiding the slow searching and poor user experience caused by entity ambiguity.
Secondly, following the idea of converting the entity disambiguation problem into a sequence labeling problem, the invention provides a new Chinese name entity disambiguation method that uses a seq2seq model structure and requires no complex knowledge base.
Fig. 1 is a flowchart illustrating the method for disambiguating enterprise or organization Chinese name entities based on sequence identification according to an embodiment of the present invention. Referring to fig. 1, the method includes:
and S1, crawling the public news data set and performing data cleaning on the news data set to obtain cleaned data.
Specifically, the public news data is first crawled and the crawled data set is cleaned (for the specific process, refer to fig. 2). As an optional implementation of the embodiment of the present invention, crawling the open news data set and cleaning it includes: crawling public domestic, economic, and scientific news data, removing special characters and meaningless symbols, checking for null values, and removing any records with null values, to obtain the cleaned data.
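As an illustrative sketch of this cleaning step (the exact character whitelist in the regular expression is an assumption, not taken from the patent):

```python
import re

def clean_news(texts):
    cleaned = []
    for t in texts:
        if t is None or not t.strip():   # drop null or empty records
            continue
        # keep CJK characters, ASCII letters/digits and basic punctuation;
        # everything else counts as a "special character or meaningless symbol"
        t = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9，。、；：！？]", "", t)
        cleaned.append(t)
    return cleaned
```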
S2, extracting the entity words in the cleaned data to obtain preliminary normative data; wherein the entity words include at least one of: company names (COM) and organization names (ORG).
As an optional implementation of the embodiment of the present invention, extracting the entity words in the cleaned data to obtain the preliminary normative data includes: processing the cleaned data with a pre-trained Chinese named entity recognition model, extracting the company-name and organization-name entity words in each sentence as supplementary training corpora, and collecting and sorting long and short sentences, with the character count of each sentence kept within a preset limit.
In a specific implementation, the crawled news data set can be processed by a pre-built Chinese named entity recognition model (NER system); entity words carrying company names (COM) and organization names (ORG) are extracted from the texts as supplementary corpora for vector training, and long and short sentences are collected and sorted, with the length of each sentence kept within 500 characters.
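A minimal sketch of this extraction step follows; the `ner_model` object and its `predict` interface are assumptions standing in for the pre-trained Chinese NER model:

```python
MAX_LEN = 500   # sentence length kept within 500 characters, as stated

def extract_entities(sentences, ner_model):
    corpus, entities = [], []
    for s in sentences:
        if len(s) > MAX_LEN:
            continue
        tags = ner_model.predict(s)     # assumed API: [(word, label), ...]
        ents = [w for w, lab in tags if lab in ("COM", "ORG")]
        if ents:
            corpus.append(s)
            entities.extend(ents)
    return corpus, entities
```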
S3, setting semantic template rules and screening the preliminary normative data to obtain the data to be normalized.
Specifically, as an optional implementation of the embodiment of the present invention, setting semantic template rules and screening the preliminary normative data to obtain the data to be normalized includes: setting the template rules "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>", manually constructing a dictionary of high-frequency verbs in the field, and screening the data to obtain the data to be normalized.
In a specific implementation, sentences containing the two types of entity words are extracted from each news text and then screened with the simple semantic template rules above; simple sentences that match the rules, such as "China Construction Bank is abbreviated as Construction Bank and is located in Shanghai", are extracted as the data set to be normalized. The verbs in the middle of the template rules for company and organization names come from a manually selected dictionary of high-frequency verbs. After this processing, the resulting data set has the property that each sentence contains only ORG or COM entity words, and each sentence contains only 2 to 4 entity words.
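The template screening could look like the following sketch; the verb entries and the span-based representation of recognized entities are assumptions:

```python
HIGH_FREQ_VERBS = {"简称", "位于", "收购", "投资"}   # illustrative entries only

def matches_template(sentence, entity_spans, verbs=HIGH_FREQ_VERBS):
    # entity_spans: list of (start, end, label), label in {"ORG", "COM"}
    spans = sorted(entity_spans)
    for (_, e1, _), (s2, _, _) in zip(spans, spans[1:]):
        between = sentence[e1:s2]       # text between two adjacent entities
        if any(v in between for v in verbs):
            return True                 # fits "<entity> + verb + <entity>"
    return False
```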
S4, determining the synonym standard words and synonym abbreviations in the data to be normalized, and the synonym pairs it contains.
As an optional implementation of the embodiment of the present invention, determining the synonym standard words, synonym abbreviations, and synonym pairs in the data to be normalized includes: taking the entity word with the larger number of characters in a sentence as the synonym standard word; if another entity word in the sentence belongs to the same class as the synonym standard word, every character of it is contained in the synonym standard word, and its character count is greater than 1, taking it as a synonym abbreviation and determining that the two form a synonym pair.
In a specific implementation, after the preceding operations the entity words in each sentence have been extracted, and the entity word with the larger character count is chosen as the sentence's synonym standard word. If another entity word belongs to the same class as the standard word, every character of it is contained in the standard word, and its character count is greater than 1, it is called a synonym abbreviation, and the two entity words in the sentence form a synonym pair. For example, in the sentence "China Construction Bank, abbreviated as Construction Bank, is one of China's larger state-owned banks", the entity words "China Construction Bank" and "Construction Bank" belong to the same class (ORG), and every character of "Construction Bank" is contained in "China Construction Bank", so the two words form a synonym pair, with the longer word as the synonym standard word. If a synonym standard word and a synonym abbreviation both appear in a sentence, they have the same meaning semantically; that is, the standard word can represent the semantic information of the abbreviation, and the ambiguity of the abbreviation can be eliminated.
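The pair-determination rule translates almost directly into code; this sketch follows the rule as stated above:

```python
def synonym_pair(ent_a, ent_b):
    # the entity word with more characters is the synonym standard word
    std, sub = (ent_a, ent_b) if len(ent_a) >= len(ent_b) else (ent_b, ent_a)
    # abbreviation: longer than one character, every character in the standard
    if len(sub) > 1 and all(ch in std for ch in sub):
        return std, sub                 # (standard word, abbreviation)
    return None

print(synonym_pair("中国建设银行", "建设银行"))   # ('中国建设银行', '建设银行')
```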
S5, setting a data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data enhancement to obtain training data.
As an optional implementation of the embodiment of the present invention, setting the data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data enhancement to obtain the training data includes: in each sentence of the data to be normalized, labeling the first character of the synonym standard word E1 and its other characters I1, labeling the first character of the synonym abbreviation E2 and its other characters I2, and labeling all characters not in the synonym pair O; and adding manually supplemented data that conforms to the semantic template rules to the preliminary normative data to obtain the training data, with the supplemented data and the data to be normalized shuffled together by random arrangement.
In a specific implementation, the data labeling strategy used in the embodiment of the present invention (see fig. 3) is as follows: of the two words in a sentence that form a synonym pair, the one with the larger character count is the synonym standard word; its first character is labeled E1 and its other characters I1. The synonym abbreviation, the word with the smaller character count, is labeled in the same fashion: E2 for its first character and I2 for its others, and every further occurrence of an entity word identical to the abbreviation is labeled the same way. All characters in the sentence unrelated to the entity words are labeled O. For example, in the sentence "China Construction Bank, abbreviated as Construction Bank, is one of China's large banks", "China Construction Bank" is the synonym standard word and is labeled "E1 I1 I1 I1 I1 I1", while "Construction Bank" is the corresponding abbreviation labeled "E2 I2 I2 I2"; any repeated occurrence of "Construction Bank" is labeled the same way. Finally, the other characters of the sentence are labeled "O". Manually supplemented data conforming to the semantic template rules is added to the training set for data enhancement; it amounts to 20,000 texts, which are shuffled together with the previously crawled and processed texts by random arrangement.
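A sketch of this labeling scheme, assuming the synonym standard word and abbreviation have already been determined for the sentence:

```python
def tag_sentence(sentence, std_word, sub_word):
    tags = ["O"] * len(sentence)
    # tag the standard word first so that abbreviation matches inside it
    # (e.g. "建设银行" within "中国建设银行") are not re-labelled
    for word, first, rest in ((std_word, "E1", "I1"), (sub_word, "E2", "I2")):
        start = sentence.find(word)
        while start != -1:
            if all(t == "O" for t in tags[start:start + len(word)]):
                tags[start] = first
                for k in range(start + 1, start + len(word)):
                    tags[k] = rest
            start = sentence.find(word, start + 1)
    return tags
```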
S6, pre-training character vectors and word vectors with the preliminary normative data, and combining them in the vertical direction to obtain new vectors.
As an optional implementation of the embodiment of the present invention, pre-training the character vectors and word vectors with the preliminary normative data and combining them in the vertical direction to obtain new vectors includes: training character vectors with a Skip-gram Word2vec model, adding the entity words to the dictionary formed after word segmentation and training word vectors, and combining the character vectors and the word vectors in the vertical direction to obtain the new vectors.
In a specific implementation, the pre-trained character vectors used in the embodiment of the present invention are obtained by setting the window parameter to 5 and training on the sentences, drawn from the 5 million crawled sentences, that conform to the constructed semantic template rules. The embodiment of the present invention adopts a Skip-gram Word2vec model, and the resulting character vectors have 500 dimensions. Meanwhile, the text is segmented into words, and the company-name (COM) and organization-name (ORG) entity words collected in step S1 are added to the dictionary formed after segmentation so that word vectors can be trained; the resulting word vectors also have 500 dimensions. To better extract the background information contained in the text, the embodiment of the present invention combines the obtained character vectors and word vectors in the vertical direction to obtain new vectors, which are then embedded into the training model as pre-trained vectors.
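A sketch of the two pre-training runs using gensim and jieba (Skip-gram, window 5, 500 dimensions, as stated above); `corpus` and `entities` are assumed to come from the earlier steps, and since the patent does not spell out the exact merge operation, the "vertical" combination shown, stacking a character vector on top of the vector of the word containing it, is an assumption:

```python
import numpy as np
import jieba
from gensim.models import Word2Vec

char_sents = [list(s) for s in corpus]        # character sequences
for ent in entities:                          # entity words collected in S1
    jieba.add_word(ent)                       # keep entity words unsegmented
word_sents = [list(jieba.cut(s)) for s in corpus]

# sg=1 selects Skip-gram; window and dimension follow the text
char_vecs = Word2Vec(char_sents, vector_size=500, window=5, sg=1)
word_vecs = Word2Vec(word_sents, vector_size=500, window=5, sg=1)

def merged_vector(ch, containing_word):
    # assumed "vertical" merge: character vector stacked over the vector of
    # the word that contains the character, giving a 2 x 500 block
    return np.vstack([char_vecs.wv[ch], word_vecs.wv[containing_word]])
```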
S7, preprocessing the training data.
As an optional implementation of the embodiment of the present invention, preprocessing the training data includes: separating the training data into label sequences and Chinese sequences, filtering stop words from the Chinese sequences, building a dictionary, and encoding the text sequences by dictionary index.
In a specific implementation, text cleaning and preprocessing are performed on the training data. In the obtained text data set, special symbols are removed with regular expressions; then a pre-built stop-word list is introduced to remove auxiliary particles that are meaningless for training, such as "的" and "得". The labeled sentences are then split into Chinese characters and processed into the training input sequences, characterized as: X1, X2, …, Xn. At the same time, the label field of the corresponding sentence is processed into the target text output sequence at time T0, namely: Y1, Y2, …, Yn. An "<EOS>" identifier is then appended to the end of each target sequence to mark the predicted end position; the new target sequence thus obtained is called the text output sequence at time T1 and is characterized as: Y1, Y2, …, Yn, <EOS>.
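A sketch of the encoding step; the special tokens other than the <EOS> terminator described above, and the stop-word set, are assumptions:

```python
def build_vocab(sequences, specials=("<PAD>", "<GO>", "<EOS>")):
    vocab = {tok: i for i, tok in enumerate(specials)}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(seq, vocab, stop_words=frozenset()):
    ids = [vocab[t] for t in seq if t not in stop_words]
    return ids + [vocab["<EOS>"]]   # mark the predicted end position
```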
S8, building a model on an Encoder-Decoder structure, training it on the preprocessed training data, and saving the optimal-metric model.
As an optional implementation of the embodiment of the present invention, training the preprocessed training data with the Encoder-Decoder model and saving the optimal-metric model includes: adopting a model with an Encoder-Decoder structure; in the Encoder, extracting sequence features through convolutional neural networks with kernel sizes of 3, 4, and 5, serializing each through a bidirectional recurrent neural network, and adding self-attention to generate the corresponding attention weights as the intermediate state output by the Encoder; at the Decoder end, forming the Decoder from 2 layers of bidirectional recurrent neural networks, feeding the target sequence of the previous moment into the Decoder, and combining it with the intermediate state layer to generate the target sequence of the next time step.
In a specific implementation, the model structure used in the embodiment of the present invention is an Encoder-Decoder structure. Its principle is that a training data text sequence is input and processed by the Encoder to generate the intermediate hidden layer C; the matrix (Target | T = T0), generated by passing the target data text sequence at time T0 through the embedding layer, is then added to obtain the matrix [C, Target | T = T0]. The Decoder takes this matrix as input so as to predict the target data sequence that should be output in the next time period T1. A convolutional neural network (CNN) extracts local information of a text well, but because of its local receptive field it is not suited to extracting entity relations that span long distances in the text; connecting convolutional networks with different kernel sizes in parallel further improves the extraction of text sequence information. Meanwhile, models based on the recurrent neural network (RNN) structure, such as BiLSTM, can handle cases where the distance between two entity words in a sentence is long or where they contain a third entity word. In the Encoder structure of the model (as shown in fig. 4), the embodiment of the present invention therefore sets three parallel convolution branches over the text, and each CNN layer is followed by a BiLSTM layer to ease the problems caused by long-distance learning of entity words. Finally, the hidden state C of each character in the text sequence is obtained through a self-attention mechanism. Additionally, in the embedding stage, the character vectors and the word-segmentation vectors of each sentence are combined for the embedding operation, which improves the effectiveness of model learning.
In the Decoder structure of the model (as shown in fig. 5), the hidden layer tensor obtained from the Encoder is added to the target tensor at time T0 and input into the model for training; the data obtained after softmax is the data at time T1. Note that the target data at T0 and T1 are produced in the data preprocessing stage.
In the embodiment of the present invention, the label characters obtained for each sentence serve as the target data sequence, and the corresponding sentence is the training data sequence. The predicted end position of a sequence is represented by appending the "<EOS>" character to each row of target data; aligning this with the original target data yields the data at time T1, and correspondingly the data at time T0 is the original target data. Finally, the obtained Encoder and Decoder are joined for training, which constitutes the model training stage provided by the embodiment of the present invention.
The mathematical theory and detailed operation of the model are described below; model training can be performed through the following steps:
S81, setting convolution kernels of sizes 3, 4, and 5 to extract sequence features:
Let the character vectors set in the embodiment of the present invention have transverse dimension d. The tensor obtained through the embedding layer then has dimension x_i ∈ R^(batch×d), and the input can be characterized as

X_{1:n} = x_1 ⊕ x_2 ⊕ … ⊕ x_n

where ⊕ denotes splicing the matrices by rows and n represents the number of input instances. Sequence information is passed through word windows of different sizes h, and the feature obtained by each word-window layer is computed as

c_i = f(W_1 · x_{i:i+h-1} + b)

where the function f is the activation function of the convolutional network, i indexes the tensor x_i, b is the bias term of the network, and W_1 is the hyper-parameter to be trained, initialized to 0. After the convolution operation, the sequence {X_{1:h}, X_{2:h+1}, …, X_{n-h+1:n}} generates the corresponding feature set D, which can be represented as

D = [c_1, c_2, …, c_{n-h+1}], D ∈ R^(n-h+1)

On the obtained set D, a max-pooling operation selects d_j = max{D} as the feature vector obtained when the word window size is h.
S82, setting a Dropout layer with the neuron drop rate set to 50%:
To prevent overfitting during training, a Dropout layer is placed after the max-pooling layer, so that a randomly selected 50% of the neurons stop updating their parameters while retaining their weights, and the remaining 50% continue updating their parameters by gradient descent.
S83, setting a BiLSTM layer with the number of neurons equal to the character-vector transverse dimension d:
To extract the contextual features of the tensors between neural network nodes well and to reduce the influence of long-distance dependence, the embodiment of the present invention adds a bidirectional LSTM layer, so that the tensors retain well-preserved sequence features after the convolution processing. The feature matrix D_j screened by the Dropout layer is fed into this part of the network with the following update formulas:
Forward update: h_{t1} = f_1(w_{21} D_j + v_{21} h_{t-1} + b_1)
Backward update: h_{t2} = f_1(w_{22} D_j + v_{22} h_{t+1} + b_2)
Hidden layer output: G = g(U[h_{t1}; h_{t2}] + c)
Here h_{t1} is the hidden state produced in the forward pass of the LSTM and h_{t2} is the hidden state produced in the backward pass; w and v are training parameters and b is the corresponding bias term, all initialized to 0. The tensor G finally obtained is the resulting feature sequence. In the embodiment of the present invention, three identical branches run in parallel to extract text information, so the resulting feature sequences can be denoted G1, G2, G3.
S84, introducing a Concatenate layer to merge the tensors obtained from the three branches:
After the preceding processing, the sequence tensors G1, G2, and G3 are obtained. They are spliced along their last dimension to obtain a new sequence tensor Gn, in which each row represents the vector of one character.
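The encoder branches of S81 through S84 could be sketched in PyTorch as follows; hyperparameters stated in the text (kernel sizes 3/4/5, 50% dropout, BiLSTM width d) are kept, while the max-pooling of S81 is omitted so that the three branch outputs stay sequence-aligned for concatenation, which is a simplifying assumption:

```python
import torch
import torch.nn as nn

class EncoderBranches(nn.Module):
    def __init__(self, d=500):
        super().__init__()
        # one convolution per word-window size; "same" padding keeps lengths equal
        self.convs = nn.ModuleList(
            nn.Conv1d(d, d, kernel_size=k, padding="same") for k in (3, 4, 5)
        )
        self.dropout = nn.Dropout(0.5)          # S82: 50% of neurons dropped
        self.lstms = nn.ModuleList(
            nn.LSTM(d, d, batch_first=True, bidirectional=True) for _ in range(3)
        )

    def forward(self, x):                       # x: (batch, seq_len, d)
        outs = []
        for conv, lstm in zip(self.convs, self.lstms):
            h = conv(x.transpose(1, 2)).transpose(1, 2)   # back to (B, L, d)
            h = self.dropout(torch.relu(h))
            g, _ = lstm(h)                      # S83: (B, L, 2*d) per branch
            outs.append(g)
        return torch.cat(outs, dim=-1)          # S84: Gn = [G1; G2; G3]
```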
S85, introducing a self-attention mechanism to generate the corresponding hidden state layer C:
The sequence tensor X processed by the embedding layer is taken as the source sequence, characterized as (X_1, X_2, …, X_n) with size (batch, d). The sequence Gn is treated as a ⟨Key, Value⟩ key-value pair formation, and the sequence tensor X obtained from the embedding operation serves as the Query for computing the attention weights. Since Gn is derived from X, the attention mechanism is placed inside the Encoder structure for training. First, the Query, Key, and Value are defined as follows:

Query = W_Q X
Key = W_K Gn
Value = W_V Gn

where W_Q, W_K, and W_V are the parameters to be trained. Cosine similarity is then computed to represent the similarity between the Query and the i-th keyword Key_i in the sequence:

Sim_i = (Query × Key_i) / (||Query|| × ||Key_i||)

Normalizing this result with softmax yields the attention weight a_i of each Key, where i indexes each Key and Key_i as a whole represents the vector of one row of Gn:

a_i = softmax(Sim_i) = exp(Sim_i) / Σ_j exp(Sim_j)

Finally, the hidden state parameter C_j in the Encoder is computed as follows, where each keyword Key_i may produce a different state parameter C_j, a_i denotes the corresponding attention weight, and L is the sequence length:

C_j = Σ_{i=1}^{L} a_i · Value_i
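A PyTorch sketch of this attention step, with cosine similarity between the projected Query and Keys normalized by softmax; the projection dimensions are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineSelfAttention(nn.Module):
    def __init__(self, d_x, d_g, d_att):
        super().__init__()
        self.W_q = nn.Linear(d_x, d_att, bias=False)   # Query = W_Q X
        self.W_k = nn.Linear(d_g, d_att, bias=False)   # Key   = W_K Gn
        self.W_v = nn.Linear(d_g, d_att, bias=False)   # Value = W_V Gn

    def forward(self, x, gn):          # x: (B, L, d_x), gn: (B, L, d_g)
        q, k, v = self.W_q(x), self.W_k(gn), self.W_v(gn)
        # cosine similarity between every query position and every key
        sim = F.cosine_similarity(q.unsqueeze(2), k.unsqueeze(1), dim=-1)
        a = F.softmax(sim, dim=-1)     # attention weights a_i per position
        return a @ v                   # C: weighted sum of the Values
```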
S86, constructing two BiLSTM layers in series to form the Decoder end and complete the model training stage:
After the hidden state layer C_j generated at the Encoder end is obtained, the target sequence data (Y_1, Y_2, …, Y_n) at time T0, produced during data preprocessing, undergoes the embedding operation to give the matrix Y. C_j and Y together serve as the Decoder input, pass through softmax, and yield the target sequence data (Y_1, Y_2, …, Y_n, <EOS>) at time T1. The calculation is:

Y|_{T=T1} = f_1(C_j, Y|_{T=T0})
T1 = T0 + 1

where f_1 denotes the nonlinear transformation function at the decoder end and T1 is the time step after T0. During training, the cross-entropy loss function is chosen for the gradient iteration over the loss value. Finally, a suitable number of epochs is selected with the help of the loss curve, and after parameter tuning the model with the optimal metrics is saved.
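A sketch of the decoder step of S86; the way C_j and the embedded T0 target are combined (concatenation along the feature axis) is an assumption, as the patent only says they act together:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, d_c, d_emb, n_tags):
        super().__init__()
        # S86: two stacked bidirectional LSTM layers form the decoder
        self.rnn = nn.LSTM(d_c + d_emb, d_c, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * d_c, n_tags)

    def forward(self, c, y_prev_emb):  # c: (B, L, d_c), y_prev_emb: (B, L, d_emb)
        h, _ = self.rnn(torch.cat([c, y_prev_emb], dim=-1))
        # softmax over the tag vocabulary gives the T1-step target distribution
        return torch.log_softmax(self.out(h), dim=-1)
```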
S9, predicting the sample to be predicted with the optimal-metric model, using a beam search strategy to select the highest-probability sequence as the output, and obtaining the synonym sequence of the sample to be predicted.
As an optional implementation of the embodiment of the present invention, predicting the sample to be predicted with the optimal-metric model includes: using a beam search strategy with the beam size set to 3, and selecting the highest-probability sequence as the output, to obtain the synonym sequence of the sample to be predicted.
In a specific implementation, in the prediction stage of the model the training data is preprocessed and a "<GO>" character is spliced onto the front of each row of data to mark the start position of the predicted sequence. The prediction stage shares weights with the training stage, so prediction can be carried out once the obtained text sequence has been embedded and input as a tensor. In a predicted sample, the characters labeled "E1", "I1", "E2", and "I2" in the corresponding sentence form the Chinese synonyms.
In the model prediction phase the target data is missing. Therefore, after the sequence data to be predicted is processed into encoded form, an <EOS> terminator is appended to the end of each sequence. The beam search size is then set to 3, the sequence is input into the model, the sequence of the previous time step is used as the input of the next time step, a probability matrix over text sequences is output at each time step, and prediction stops once the algorithm reaches the <EOS> mark in the sequence. Finally, the text sequence with the highest output probability is selected as the predicted sequence.
Assume the corpus contains only the two words A and B. The process is as follows: the first time step outputs the probabilities P(A) and P(B) of generating A and B; then [A, B]^T is used as the input of the next time step, which outputs the sequence probability matrix P(AA|A), P(AB|A), P(BA|B), …, P(BB|B). Subsequent time steps proceed by analogy, except that each time the sequences reach 3 words, only the 3 highest-probability sequences among all permutations of the first 3 words are retained. Through this strategy the entity-word synonym identification sequence is finally generated. After the prediction operation, the synonym pairs of enterprise or organization entity words in a complete sentence are obtained, which avoids the semantic misunderstandings caused by different synonym abbreviations when a user searches and effectively reduces the ambiguity of name entity words. Synonym collections are also dynamically added in the background while the user uses the system; if the user enters a related enterprise or organization abbreviation, the system can quickly locate the search target, which greatly improves search efficiency.
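A generic beam-search sketch with beam width 3, matching the strategy described above; `step_log_probs`, which scores next tokens given a partial sequence, is a hypothetical stand-in for one decoder step:

```python
def beam_search(step_log_probs, go_id, eos_id, max_len, beam=3):
    beams = [([go_id], 0.0)]                    # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:               # finished beams carry over
                candidates.append((seq, score))
                continue
            for tok, lp in step_log_probs(seq): # assumed: [(token, logp), ...]
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
        if all(s[-1] == eos_id for s, _ in beams):
            break
    return beams[0][0]                          # highest-probability sequence
```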
Therefore, the method for disambiguating enterprise or organization Chinese name entities based on sequence identification provided by the embodiment of the present invention makes it possible to apply Chinese entity disambiguation to the business of enterprise public opinion analysis. The method provided by the embodiment of the present invention has the following advantages:
1. An effective data set, used as the basis for model training, is built by constructing simple semantic templates and a manual data labeling scheme and crawling news corpora.
2. Pre-trained character and word vectors are constructed from 5 million self-crawled news sentences, with the vector dimension kept at 500.
3. Model construction provides a model structure that extracts the context information of text sequences well from the language model.
Given that Chinese entity disambiguation suffers from a shortage of data sets and that existing methods cannot be quantified and require case-by-case analysis, the method first provides a scheme for constructing a labeled data set, turning unordered raw text data into training data for supervised learning. Second, the method provides a new model structure for text sequence processing. Compared with traditional methods, it has the following advantages:
1. It discards the traditional approach of converting Chinese entity disambiguation into a classification task, in which synonyms are distinguished by building large knowledge bases and performing rule matching; that approach is too costly and inconvenient to operate.
2. Compared with traditional statistical machine learning methods based on hidden Markov models, and with generating text vectors from word frequencies, the method extracts text sequence features better and improves the model's adaptability to Chinese entity disambiguation scenarios.
3. Compared with traditional methods, the method copes better with long-distance learning over text sequences.
Fig. 6 is a schematic structural diagram of the device for disambiguating enterprise or organization Chinese name entities based on sequence identification according to an embodiment of the present invention. The device applies the method described above, so its structure is only briefly described below; for anything else, refer to the related description of the method, which is not repeated here. Referring to fig. 6, the device includes:
the data set building module 601 is configured to crawl an open news data set and perform data cleaning on the news data set to obtain cleaned data; extracting entity words in the cleaned data to obtain preliminary standard data; wherein the entity words include at least one of: company name COM and organization name ORG; setting semantic template rules, and screening the preliminary standard data to obtain data to be standard; determining synonymy standard words and synonymy adverbs in the data to be normalized, and determining synonymy word pairs in the data to be normalized;
the data labeling module 602 is configured to set a data labeling strategy, label data to be normalized, add artificially constructed data, and perform data enhancement to obtain training data;
the vector training module 603 is configured to pre-train a word vector and a word vector using the preliminary normative data, and merge the word vector and the word vector in a direction perpendicular to the word vector to obtain a new vector;
a preprocessing module 604 for preprocessing training data;
the model training module 605, configured to build a model on an Encoder-Decoder structure, train it on the preprocessed training data, and save the optimal-metric model;
and the prediction module 606, configured to predict samples to be predicted with the optimal-metric model, using a beam search strategy to select the highest-probability sequence as the output and obtain the synonym sequence of the sample to be predicted.
Therefore, the sequence-identification-based enterprise or organization Chinese name entity disambiguation device makes it possible to apply Chinese entity disambiguation to the business of enterprise public opinion analysis.
As an optional implementation of the embodiment of the present invention, the data set construction module 601 crawls the open news data set and performs data cleaning on it as follows: the module crawls public domestic, economic, and scientific news data, removes special characters and meaningless symbols, checks for null values, and removes any records with null values, to obtain the cleaned data.
As an optional implementation of the embodiment of the present invention, the data set construction module 601 extracts the entity words in the cleaned data to obtain the preliminary normative data as follows: the module processes the cleaned data with a pre-trained Chinese named entity recognition model, extracts the company-name and organization-name entity words in each sentence as supplementary training corpora, and collects and sorts long and short sentences, with the character count of each sentence kept within a preset limit.
As an optional implementation of the embodiment of the present invention, the data set construction module 601 sets the semantic template rules and screens the preliminary normative data as follows: the module screens the data by setting the template rules "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>" and manually constructing a dictionary of high-frequency verbs in the field, to obtain the data to be normalized.
As an optional implementation of the embodiment of the present invention, the data set construction module 601 determines the synonym standard words, synonym abbreviations, and synonym pairs in the data to be normalized as follows: the module takes the entity word with the larger number of characters in a sentence as the synonym standard word; if another entity word in the sentence belongs to the same class as the synonym standard word, every character of it is contained in the synonym standard word, and its character count is greater than 1, the module takes it as a synonym abbreviation and determines that the two form a synonym pair.
As an optional implementation of the embodiment of the present invention, the data labeling module 602 sets the data labeling strategy, labels the data to be normalized, and adds manually constructed data for data enhancement as follows: in each sentence of the data to be normalized, the module labels the first character of the synonym standard word E1 and its other characters I1, labels the first character of the synonym abbreviation E2 and its other characters I2, and labels all characters not in the synonym pair O; manually supplemented data conforming to the semantic template rules is added to the preliminary normative data to obtain the training data, with the supplemented data and the data to be normalized shuffled together by random arrangement.
As an optional implementation manner of the embodiment of the present invention, the vector training module 603 pre-trains a character vector and a word vector using the preliminary specification data, and merges the character vector and the word vector in the vertical direction to obtain a new vector: the vector training module 603 is specifically configured to train a character vector using a Word2vec model with the Skip-gram architecture, add the entity words to the dictionary formed after word segmentation, train a word vector, and merge the character vector and the word vector in the vertical direction to obtain the new vector.
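A sketch with gensim (4.x API) and NumPy, reading "merging in the vertical direction" as stacking a word's character vectors together with its word vector into one matrix; the corpus contents and dimensions are illustrative assumptions:

```python
import numpy as np
from gensim.models import Word2Vec

char_corpus = [list("冰鉴科技完成融资")]          # character-level sentences
word_corpus = [["冰鉴科技", "完成", "融资"]]      # segmented, entity words included

# sg=1 selects the Skip-gram architecture
char_model = Word2Vec(char_corpus, vector_size=100, sg=1, min_count=1)
word_model = Word2Vec(word_corpus, vector_size=100, sg=1, min_count=1)

def merged_vector(word):
    """Stack the word's character vectors and its word vector vertically."""
    char_vecs = [char_model.wv[c] for c in word if c in char_model.wv]
    return np.vstack(char_vecs + [word_model.wv[word]])

print(merged_vector("冰鉴科技").shape)  # (5, 100): 4 character rows + 1 word row
```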
As an optional implementation manner of the embodiment of the present invention, the preprocessing module 604 preprocesses the training data as follows: the preprocessing module 604 is specifically configured to separate the labeling sequence from the Chinese character sequence in the training data, filter stop words out of the Chinese sequence, build a dictionary, and encode the text sequence according to the dictionary indexes.
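A sketch of the preprocessing, with an illustrative stop-word set and a dictionary that reserves indexes for padding and unknown characters (both assumptions):

```python
STOP_WORDS = {"的", "了", "在"}  # illustrative stop words

def preprocess(pairs):
    """pairs: list of (character sequence, label sequence).
    Filters stop words, builds an index dictionary, and encodes
    the Chinese sequence by dictionary index."""
    vocab = {"<PAD>": 0, "<UNK>": 1}
    encoded = []
    for chars, labels in pairs:
        kept = [(c, l) for c, l in zip(chars, labels) if c not in STOP_WORDS]
        ids = []
        for c, _ in kept:
            if c not in vocab:
                vocab[c] = len(vocab)
            ids.append(vocab[c])
        encoded.append((ids, [l for _, l in kept]))
    return encoded, vocab
```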
As an optional implementation manner of the embodiment of the present invention, the model training module 605 trains on the preprocessed training data with a model built on the Encoder-Decoder structure and stores an optimal index model, in the following manner: the model training module 605 is specifically configured to adopt a model with an Encoder-Decoder structure, in which the encoder extracts sequence features through convolutional neural networks with kernel sizes of 3, 4, and 5, serializes each branch through a bidirectional recurrent neural network, and adds self-attention to generate the corresponding attention weights as the intermediate state values output by the encoder; at the decoder end, a decoder is formed by 2 layers of bidirectional recurrent neural networks, and the target sequence of the previous time step is fed into the decoder and combined with the intermediate state layer to generate the target sequence of the next time step.
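A compact PyTorch sketch of such an Encoder-Decoder. Only the kernel sizes (3, 4, 5), the 2 decoder layers, the bidirectionality, and the self-attention step come from the text; the hidden sizes, the choice of GRU cells, and collapsing the three per-branch recurrent networks into a single one are simplifying assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Parallel CNNs over the merged embeddings, a bidirectional GRU,
    and additive self-attention producing the intermediate states."""
    def __init__(self, emb_dim=200, hid=128):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, hid, k, padding=k // 2) for k in (3, 4, 5))
        self.rnn = nn.GRU(hid * 3, hid, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(hid * 2, 1)

    def forward(self, x):                        # x: (batch, seq, emb_dim)
        c = torch.cat([conv(x.transpose(1, 2))[:, :, :x.size(1)]
                       for conv in self.convs], dim=1).transpose(1, 2)
        h, _ = self.rnn(c)                        # (batch, seq, 2*hid)
        w = torch.softmax(self.attn(h), dim=1)    # self-attention weights
        return h * w                              # weighted intermediate states

class Decoder(nn.Module):
    """Two-layer bidirectional GRU; the previous time step's target tags
    are combined with the encoder's intermediate states."""
    def __init__(self, n_tags, hid=128):
        super().__init__()
        self.emb = nn.Embedding(n_tags, hid * 2)
        self.rnn = nn.GRU(hid * 4, hid, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(hid * 2, n_tags)

    def forward(self, prev_tags, enc_states):
        x = torch.cat([self.emb(prev_tags), enc_states], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                        # per-step tag logits
```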
As an optional implementation manner of the embodiment of the present invention, the prediction module 606 predicts the sample to be predicted with the optimal index model in the following manner: the prediction module 606 is specifically configured to use a Beamsearch strategy with the beam size set to 3, and select the item with the highest probability as the output sequence, obtaining the synonym sequence of the sample to be predicted.
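A toy beam search over per-step tag distributions; in a real decoder each step's distribution would be conditioned on the tags kept in the beam, which is where a beam of 3 differs from greedy decoding:

```python
import numpy as np

def beam_search(step_probs, beam_size=3):
    """step_probs: array-like of shape (seq_len, n_tags) holding
    normalised per-step probabilities. Returns the tag sequence with
    the highest accumulated log-probability."""
    beams = [([], 0.0)]                           # (tag sequence, log-prob)
    for dist in step_probs:
        candidates = [(seq + [t], score + np.log(dist[t]))
                      for seq, score in beams
                      for t in range(len(dist))]
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]
```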
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for disambiguating an enterprise or organization Chinese name entity based on sequence identification is characterized by comprising the following steps:
crawling an open news data set and performing data cleaning on the news data set to obtain cleaned data;
extracting entity words from the cleaned data to obtain preliminary specification data; wherein the entity words include at least one of: a company name COM and an organization name ORG;
setting semantic template rules, and screening the preliminary specification data to obtain data to be normalized;
determining synonym standard words and synonym adverbs in the data to be normalized, and determining synonym pairs in the data to be normalized;
setting a data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data enhancement to obtain training data;
pre-training a character vector and a word vector by using the preliminary specification data, and merging the character vector and the word vector in the vertical direction to obtain a new vector;
preprocessing the training data;
training the preprocessed training data by using an Encoder-Decoder structure construction model, and storing an optimal index model;
predicting a sample to be predicted by using the optimal index model, selecting the item with the highest probability as an output sequence by using a Beamsearch strategy, and obtaining a synonym sequence of the sample to be predicted;
wherein determining the synonym standard words and synonym adverbs in the data to be normalized and determining the synonym pairs in the data to be normalized includes:
taking the entity word with the larger number of characters in a sentence of the data to be normalized as the synonym standard word; and when another entity word in the sentence belongs to the same class as the synonym standard word, every character of the other entity word is contained in the synonym standard word, and the number of characters of the other entity word is greater than 1, taking the other entity word as the synonym adverb, and determining that the synonym standard word and the synonym adverb in the sentence form a synonym pair.
2. The method of claim 1, wherein crawling and data-cleansing public news data sets comprises:
crawling publicly available domestic economic and scientific news data, removing special characters and meaningless symbols, checking whether null values exist, and removing the corresponding data if they do, to obtain the cleaned data;
extracting entity words from the cleaned data to obtain the preliminary specification data comprises:
processing the cleaned data by using a pre-trained Chinese named entity recognition model, extracting entity words of company names and organization names in each sentence as supplementary training corpora, and simultaneously collecting and sorting long and short sentences, wherein the number of words in each sentence is controlled within a preset limit; and/or
setting semantic template rules and screening the preliminary specification data to obtain the data to be normalized comprises:
setting a template rule of "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>", manually constructing a dictionary of high-frequency verbs in the field, and screening the data to obtain the data to be normalized.
3. The method of claim 1,
wherein setting a data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data enhancement to obtain training data comprises:
labeling the first character of the synonym standard word in each sentence of the data to be normalized as SEi and the remaining characters of the synonym standard word as Ii, labeling the first character of the synonym adverb as E2 and the remaining characters of the synonym adverb as I2, and labeling all characters in the sentence that do not belong to the synonym pair as O;
adding manually constructed data that conforms to the semantic template rules to the preliminary specification data to obtain the training data, wherein the manually constructed data and the data to be normalized are mixed together by random arrangement;
pre-training a character vector and a word vector by using the preliminary specification data, and merging the character vector and the word vector in the vertical direction to obtain a new vector comprises:
training to obtain the character vector by using a Word2vec model with the Skip-gram architecture, adding the entity words to the dictionary formed after word segmentation, training to obtain the word vector, and merging the character vector and the word vector in the vertical direction to obtain the new vector; and/or
preprocessing the training data comprises:
separating the labeling sequence from the Chinese character sequence in the training data, filtering stop words out of the Chinese sequence, building a dictionary, and encoding the text sequence according to the dictionary indexes.
4. The method of claim 1, wherein training the preprocessed training data by using an Encoder-Decoder structure construction model and storing an optimal index model comprises:
adopting a model with an Encoder-Decoder structure, wherein in the encoder, sequence features are extracted through convolutional neural networks with kernel sizes of 3, 4, and 5, each branch is serialized through a bidirectional recurrent neural network, and self-attention is added to generate the corresponding attention weights as the intermediate state values output by the encoder; at the decoder end, a decoder is formed by 2 layers of bidirectional recurrent neural networks, and the target sequence of the previous time step is fed into the decoder and combined with the intermediate state layer to generate the target sequence of the next time step.
5. The method of claim 4, wherein the predicting the sample to be predicted by using the optimal index model comprises:
using a Beamsearch strategy with the beam size set to 3, and selecting the item with the highest probability as the output sequence to obtain the synonym sequence of the sample to be predicted.
6. An enterprise or organization Chinese name entity disambiguation apparatus based on sequence identification, comprising:
the data set construction module is used for crawling a public news data set and performing data cleaning on the news data set to obtain cleaned data; extracting entity words from the cleaned data to obtain preliminary specification data, wherein the entity words include at least one of: a company name COM and an organization name ORG; setting semantic template rules and screening the preliminary specification data to obtain data to be normalized; and determining the synonym standard words and synonym adverbs in the data to be normalized, and determining the synonym pairs in the data to be normalized;
the data labeling module is used for setting a data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data enhancement to obtain training data;
the vector training module is used for pre-training a character vector and a word vector by using the preliminary specification data, and merging the character vector and the word vector in the vertical direction to obtain a new vector;
a preprocessing module for preprocessing the training data;
the model training module is used for training the preprocessed training data by using an Encoder-Decoder structure construction model, and storing an optimal index model;
the prediction module is used for predicting a sample to be predicted by using the optimal index model, selecting the item with the highest probability as an output sequence by using a Beamsearch strategy, and obtaining a synonym sequence of the sample to be predicted;
wherein: the data set construction module determines the synonym standard words and the synonym adverbs in the data to be normalized in the following mode, and determines the synonym pairs in the data to be normalized:
the data set building module is specifically configured to use, as an synonym standard, an entity word with a large number of words in a sentence in the data to be normalized, determine that another entity word in the sentence and the synonym standard belong to the same class, and each word in the another entity word is included in the synonym standard, and the number of words of the another entity word is greater than 1, then use the another entity word as a synonym, and determine that the synonym standard and the synonym in the sentence belong to a synonym pair.
7. The apparatus of claim 6,
the data set construction module crawls a public news data set and performs data cleaning on the news data set in the following manner to obtain cleaned data:
the data set building module is specifically used for crawling publicly available domestic economic and scientific news data, removing special characters and meaningless symbols, checking whether null values exist, and removing the corresponding data if they do, to obtain the cleaned data;
the data set construction module extracts entity words from the cleaned data in the following manner to obtain the preliminary specification data:
the data set construction module is specifically used for processing the cleaned data by using a pre-trained Chinese named entity recognition model, extracting entity words of company names and organization names in each sentence as supplementary training corpora, and simultaneously collecting and sorting long and short sentences, wherein the number of words in each sentence is controlled within a preset limit; and/or
the data set construction module sets semantic template rules in the following manner, and screens the preliminary specification data to obtain the data to be normalized:
the data set building module is specifically used for screening the data by setting the template rule "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>" and manually building a dictionary of high-frequency verbs in the field, to obtain the data to be normalized.
8. The apparatus of claim 6,
the data labeling module sets a data labeling strategy in the following manner, labels the data to be normalized, and adds manually constructed data for data enhancement to obtain training data:
the data labeling module is specifically configured to label the first character of the synonym standard word in each sentence of the data to be normalized as SEi and the remaining characters of the synonym standard word as Ii, label the first character of the synonym adverb as E2 and the remaining characters of the synonym adverb as I2, and label all characters in the sentence that do not belong to the synonym pair as O; and to add manually constructed data that conforms to the semantic template rules to the preliminary specification data to obtain the training data, wherein the manually constructed data and the data to be normalized are mixed together by random arrangement;
the vector training module pre-trains a character vector and a word vector by using the preliminary specification data in the following manner, and merges the character vector and the word vector in the vertical direction to obtain a new vector:
the vector training module is specifically configured to train the character vector by using a Word2vec model with the Skip-gram architecture, add the entity words to the dictionary formed after word segmentation, train the word vector, and merge the character vector and the word vector in the vertical direction to obtain the new vector; and/or
the preprocessing module preprocesses the training data in the following manner:
the preprocessing module is specifically configured to separate the labeling sequence from the Chinese character sequence in the training data, filter stop words out of the Chinese sequence, build a dictionary, and encode the text sequence according to the dictionary indexes.
9. The apparatus of claim 6, wherein the model training module trains the preprocessed training data with an Encoder-Decoder structure construction model and stores an optimal index model in the following manner:
the model training module is specifically configured to adopt a model with an Encoder-Decoder structure, in which the encoder extracts sequence features through convolutional neural networks with kernel sizes of 3, 4, and 5, serializes each branch through a bidirectional recurrent neural network, and adds self-attention to generate the corresponding attention weights as the intermediate state values output by the encoder; at the decoder end, a decoder is formed by 2 layers of bidirectional recurrent neural networks, and the target sequence of the previous time step is fed into the decoder and combined with the intermediate state layer to generate the target sequence of the next time step.
10. The apparatus of claim 9, wherein the prediction module predicts the sample to be predicted using the optimal index model by:
the prediction module is specifically configured to use a Beamsearch strategy with the beam size set to 3, and select the item with the highest probability as the output sequence to obtain the synonym sequence of the sample to be predicted.
CN201910297022.1A 2019-04-15 2019-04-15 Sequence identification based enterprise or organization Chinese name entity disambiguation method and device Active CN110020438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910297022.1A CN110020438B (en) 2019-04-15 2019-04-15 Sequence identification based enterprise or organization Chinese name entity disambiguation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910297022.1A CN110020438B (en) 2019-04-15 2019-04-15 Sequence identification based enterprise or organization Chinese name entity disambiguation method and device

Publications (2)

Publication Number Publication Date
CN110020438A CN110020438A (en) 2019-07-16
CN110020438B true CN110020438B (en) 2020-12-08

Family

ID=67191295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910297022.1A Active CN110020438B (en) 2019-04-15 2019-04-15 Sequence identification based enterprise or organization Chinese name entity disambiguation method and device

Country Status (1)

Country Link
CN (1) CN110020438B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7230622B2 (en) * 2019-03-25 2023-03-01 日本電信電話株式会社 Index value giving device, index value giving method and program
CN110516233B (en) * 2019-08-06 2023-08-01 深圳数联天下智能科技有限公司 Data processing method, device, terminal equipment and storage medium
CN111079418B (en) * 2019-11-06 2023-12-05 科大讯飞股份有限公司 Named entity recognition method, device, electronic equipment and storage medium
CN110991168A (en) * 2019-12-05 2020-04-10 京东方科技集团股份有限公司 Synonym mining method, synonym mining device, and storage medium
CN111079435B (en) * 2019-12-09 2021-04-06 深圳追一科技有限公司 Named entity disambiguation method, device, equipment and storage medium
CN111259087B (en) * 2020-01-10 2022-10-14 中国科学院软件研究所 Computer network protocol entity linking method and system based on domain knowledge base
CN111339319B (en) * 2020-03-02 2023-08-04 北京百度网讯科技有限公司 Enterprise name disambiguation method and device, electronic equipment and storage medium
CN111581335B (en) * 2020-05-14 2023-11-24 腾讯科技(深圳)有限公司 Text representation method and device
CN111832294B (en) * 2020-06-24 2022-08-16 平安科技(深圳)有限公司 Method and device for selecting marking data, computer equipment and storage medium
CN111814479B (en) * 2020-07-09 2023-08-25 上海明略人工智能(集团)有限公司 Method and device for generating enterprise abbreviations and training model thereof
CN112069826B (en) * 2020-07-15 2021-12-07 浙江工业大学 Vertical domain entity disambiguation method fusing topic model and convolutional neural network
CN112017643B (en) * 2020-08-24 2023-10-31 广州市百果园信息技术有限公司 Speech recognition model training method, speech recognition method and related device
CN111737407B (en) * 2020-08-25 2020-11-10 成都数联铭品科技有限公司 Event unique ID construction method based on event disambiguation
CN113326380B (en) * 2021-08-03 2021-11-02 国能大渡河大数据服务有限公司 Equipment measurement data processing method, system and terminal based on deep neural network
CN113761942B (en) * 2021-09-14 2023-12-05 合众新能源汽车股份有限公司 Semantic analysis method, device and storage medium based on deep learning model
CN113609825B (en) * 2021-10-11 2022-03-25 北京百炼智能科技有限公司 Intelligent customer attribute tag identification method and device
CN114398492B (en) * 2021-12-24 2022-08-30 森纵艾数(北京)科技有限公司 Knowledge graph construction method, terminal and medium in digital field

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268200A (en) * 2013-09-22 2015-01-07 中科嘉速(北京)并行软件有限公司 Unsupervised named entity semantic disambiguation method based on deep learning
CN107391613A (en) * 2017-07-04 2017-11-24 北京航空航天大学 A kind of automatic disambiguation method of more documents of industry security theme and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8112402B2 (en) * 2007-02-26 2012-02-07 Microsoft Corporation Automatic disambiguation based on a reference resource
US8856119B2 (en) * 2009-02-27 2014-10-07 International Business Machines Corporation Holistic disambiguation for entity name spotting
CN104111973B (en) * 2014-06-17 2017-10-27 中国科学院计算技术研究所 Disambiguation method and its system that a kind of scholar bears the same name
CN106407180B (en) * 2016-08-30 2021-01-01 北京奇艺世纪科技有限公司 Entity disambiguation method and device
CN108572960A (en) * 2017-03-08 2018-09-25 富士通株式会社 Place name disambiguation method and place name disambiguation device
CN107784125A (en) * 2017-11-24 2018-03-09 中国银行股份有限公司 A kind of entity relation extraction method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268200A (en) * 2013-09-22 2015-01-07 中科嘉速(北京)并行软件有限公司 Unsupervised named entity semantic disambiguation method based on deep learning
CN107391613A (en) * 2017-07-04 2017-11-24 北京航空航天大学 A kind of automatic disambiguation method of more documents of industry security theme and device

Also Published As

Publication number Publication date
CN110020438A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
CN110020438B (en) Sequence identification based enterprise or organization Chinese name entity disambiguation method and device
Wang et al. Learning to extract attribute value from product via question answering: A multi-task approach
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
Gasmi et al. LSTM recurrent neural networks for cybersecurity named entity recognition
CN108595708A (en) A kind of exception information file classification method of knowledge based collection of illustrative plates
CN111046179B (en) Text classification method for open network question in specific field
CN109800437A (en) A kind of name entity recognition method based on Fusion Features
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN113515632B (en) Text classification method based on graph path knowledge extraction
CN113128233B (en) Construction method and system of mental disease knowledge map
CN113239142A (en) Trigger-word-free event detection method fused with syntactic information
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN111709225B (en) Event causal relationship discriminating method, device and computer readable storage medium
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN116127090A (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
Tao et al. News text classification based on an improved convolutional neural network
CN113254602B (en) Knowledge graph construction method and system for science and technology policy field
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN117033423A (en) SQL generating method for injecting optimal mode item and historical interaction information
CN111666375A (en) Matching method of text similarity, electronic equipment and computer readable medium
Nouhaila et al. Arabic sentiment analysis based on 1-D convolutional neural network
CN115827871A (en) Internet enterprise classification method, device and system
CN113312903B (en) Method and system for constructing word stock of 5G mobile service product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant