CN116127971A - English tweet named entity extraction method and device based on subjective and objective word list - Google Patents


Info

Publication number
CN116127971A
Authority
CN
China
Prior art keywords: entity, word, noun, english, named
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211458427.7A
Other languages
Chinese (zh)
Inventor
林铄浩
高云鹏
高鑫
霍朦雨
万怀宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhipu Huazhang Technology Co ltd
Original Assignee
Beijing Zhipu Huazhang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhipu Huazhang Technology Co ltd filed Critical Beijing Zhipu Huazhang Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries

Abstract

The invention relates to an English tweet named entity extraction method and device based on a subjective and objective word list, belonging to the technical field of data processing; it solves the problem that the large number of subjective words in English tweets degrades subsequent named entity recognition performance. The named entity extraction method of the invention comprises the following steps: acquiring English texts in multiple fields and constructing a corpus; carrying out word segmentation and word-frequency statistics on the texts in the corpus, and constructing a subjective word list through screening; preprocessing the English tweet text to be recognized to obtain standard tweet text; extracting all noun phrases in the standard text using a grammatical dependency analysis model, preprocessing the noun phrases based on the subjective word list, and constructing a noun phrase set NP_p; based on the noun phrases in the set NP_p, constructing a tree-shaped parent-child hierarchy and extracting named entities to obtain the named entity recognition result of the English tweet.

Description

English tweet named entity extraction method and device based on subjective and objective word list
Technical Field
The invention relates to the technical field of text processing, and in particular to an English tweet named entity extraction method and device based on a subjective and objective word list.
Background
With the rapid development of the internet and the information industry, massive amounts of text data are continuously generated, and how to efficiently obtain useful information from them has become a current research hotspot, giving rise to information extraction technology. Named entity recognition is a subtask of information extraction that extracts specified entities from massive text data. In the field of natural language processing, named entity recognition is a basic task underlying applications such as information retrieval, machine translation, and sentiment analysis; research on named entity recognition is therefore of great significance and value.
At present, research techniques for named entity recognition mainly fall into four categories: rule-based methods, unsupervised learning methods, traditional supervised machine learning methods, and deep-learning-based methods. However, due to the informal nature of tweets, the presence of a large number of subjective words in them affects the performance of named entity (i.e., noun phrase) recognition.
Disclosure of Invention
In view of the above analysis, the present invention aims to provide a method and a device for extracting English tweet named entities based on a subjective and objective word list, solving the prior-art problem that the large number of subjective words in English tweets degrades subsequent named entity recognition performance.
The aim of the invention is mainly realized by the following technical scheme:
on the one hand, the invention provides an English tweet named entity extraction method based on a subjective and objective word list, which comprises the following steps:
acquiring English texts in multiple fields, and constructing a corpus;
carrying out word segmentation and word-frequency statistics on the texts in the corpus, and constructing a subjective word list through screening;
preprocessing the English tweet text to be recognized to obtain standard tweet text;
extracting all noun phrases and noun clauses in the standard text using a grammatical dependency analysis model, preprocessing the noun phrases and noun clauses based on the subjective word list, and constructing a set NP_p;
based on the noun phrases and noun clauses in the set NP_p, constructing a tree-shaped parent-child hierarchy of the English tweet text, and extracting named entities to obtain the named entity recognition result of the English tweet text.
Further, the constructing of the tree-shaped parent-child hierarchy and the extracting of named entities comprise the following steps:
based on the set NP_p, establishing the containment relations of noun phrases and noun clauses in tree form, taking a noun phrase or noun clause that contains at least one other noun phrase as a parent string and the noun phrases it contains as child strings;
extracting the core noun of each parent string based on its noun possessive structures, and saving the core nouns into a named entity set NE;
removing all child strings from each parent string and merging the remaining content into a new string s'; the set formed by all child strings is denoted cp;
if the string s' meets the corresponding preset condition, saving s' into the named entity set NE; otherwise, extracting all noun phrases and noun clauses from s' again using the grammatical dependency analysis model, and saving the extracted noun phrases and noun clauses into the named entity set NE;
merging the child-string set cp into the named entity set NE, and reconstructing the tree-shaped parent-child hierarchy to extract named entities and obtain the named entity recognition result.
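The parent/child containment step above can be sketched as follows. This is a minimal illustration assuming whole-word substring containment; the function and variable names are hypothetical, not the embodiment's code:

```python
import re

def build_containment_tree(phrases):
    """Group noun phrases into parent/child pairs by containment.

    A phrase that contains at least one other phrase (as a whole-word
    substring) becomes a parent string; the contained phrases are its
    child strings.
    """
    tree = {}
    for parent in phrases:
        children = [
            c for c in phrases
            if c != parent and re.search(r"\b" + re.escape(c) + r"\b", parent)
        ]
        if children:
            tree[parent] = children
    return tree

def residue(parent, children):
    """Remove all child strings from the parent and merge what remains."""
    rest = parent
    for c in children:
        rest = rest.replace(c, " ")
    return " ".join(rest.split())

tree = build_containment_tree(
    ["the president of the United States", "the president", "United States"]
)
```

Here the longest phrase becomes the parent of the two shorter ones, and `residue` yields the merged remainder ("of the") that the method then tests against the preset condition.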
Further, the tree-shaped parent-child hierarchy is reconstructed multiple times to extract named entities and obtain the named entity recognition result. Denote by NE_j the named entity set obtained by the j-th reconstruction of the tree-shaped parent-child hierarchy, where j is an integer greater than 1, and by NE_{j-1} the named entity set obtained by the (j-1)-th reconstruction; when the difference between the set NE_j and the set NE_{j-1} is empty, the set NE_j is output as the named entity recognition result of the English tweet.
Further, the preset condition corresponding to the string s' includes: the string s' contains only one word besides its child strings, and that word is a preposition or a conjunction; the number of child strings in s' is 2, the absolute value of the difference between the lengths of the two child strings is not more than 2, and the length of each child string is not more than 3.
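The preset condition on the merged string can be written as a predicate. In this sketch the preposition/conjunction list is illustrative rather than exhaustive, and "length" is read as word count, which the patent leaves unspecified:

```python
def meets_preset_condition(extra_words, substrings):
    """Check the stated condition on the merged string: besides its two
    child strings it contains exactly one word, which is a preposition
    or conjunction; the child strings differ in length by at most 2 and
    each is at most 3 words long."""
    PREP_CONJ = {"of", "in", "on", "at", "for", "and", "or", "but", "to"}
    if len(extra_words) != 1 or extra_words[0].lower() not in PREP_CONJ:
        return False
    if len(substrings) != 2:
        return False
    lens = [len(s.split()) for s in substrings]
    return abs(lens[0] - lens[1]) <= 2 and max(lens) <= 3
```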
Further, the extracting of the core noun based on the noun possessive structures of each parent string includes:
based on the noun possessive structure, traversing left from the last possessive marker "'" to the first capitalized word; for the XX YY's structure, extracting the content from the first capitalized word through YY; for the XX YYs' and XX YYz' structures (plural nouns and words ending in s or z, which take a bare apostrophe), extracting the content from the first capitalized word through YYs or YYz; the extracted span is the core noun of the possessive, where X and Y represent arbitrary words.
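A simplified reading of this possessive rule can be expressed with one regular expression: capture the contiguous run of capitalized words that ends at the possessive marker. The pattern and function are illustrative only; the embodiment's exact traversal is not disclosed:

```python
import re

def possessive_core(phrase):
    """Extract the core noun of a possessive structure, e.g.
    "President Biden's" -> "President Biden". Handles both the 's form
    and the bare-apostrophe form used after words ending in s or z."""
    m = re.search(r"((?:[A-Z][\w-]*\s*)+?)(?:'s|')(?=\s|$)", phrase)
    return m.group(1).strip() if m else None
```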
Further, the obtaining of English texts in multiple fields and the constructing of a corpus include:
acquiring named entities of multiple fields and merging them to obtain an initial entity list L_init;
acquiring the entity ID corresponding to each entity in the initial entity list L_init, to obtain an entity ID list L_id;
acquiring the entity IDs of the other entities that stand in a relationship with the entity corresponding to each entity ID in the list L_id, together with the entity IDs corresponding to the other entities in the same category as each entity, and merging them to obtain an entity ID list L_id^1;
replacing L_id with L_id^1, and acquiring in the same way the entity IDs of the other entities in a relationship with the entity corresponding to each entity ID in L_id^1, together with the entity IDs corresponding to the other entities in the same category as each entity, and merging them to obtain an entity ID list L_id^2;
acquiring the abstracts of the entities corresponding to all the entity IDs in L_id^2, and constructing the corpus from the abstracts obtained.
Further, the constructing of the subjective word list includes:
carrying out word segmentation on the abstracts in the corpus and counting word frequencies, then removing words with word frequency below a first preset threshold and words of length 1, to obtain an objective word list Objws;
acquiring the nouns, adverbs, adjectives, numerals, prepositions, conjunctions, and articles in the ECDICT dictionary whose word-frequency rank is smaller than a second preset threshold and whose Collins star rating is greater than 1, to obtain a word list Trws;
removing the words of the list Objws from the list Trws to obtain the subjective word list Subjws.
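The vocabulary construction above reduces to word-frequency counting plus a set difference. A minimal sketch, with a simple regex tokenizer standing in for the unspecified segmenter and the frequency threshold exposed as a parameter (the embodiment uses 300):

```python
from collections import Counter
import re

def build_vocab_lists(abstracts, trws, freq_threshold=300):
    """Build the objective list Objws from corpus word frequencies and
    derive the subjective list Subjws = Trws - Objws."""
    counts = Counter(
        w.lower() for a in abstracts for w in re.findall(r"[A-Za-z]+", a)
    )
    # keep words that are frequent enough and longer than one character
    objws = {w for w, c in counts.items() if c >= freq_threshold and len(w) > 1}
    subjws = set(trws) - objws
    return objws, subjws
```

With a toy corpus and a low threshold, frequent corpus words land in Objws, and Trws entries not attested as objective remain in Subjws.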
Further, the preprocessing of the noun phrases based on the subjective word list to construct the noun phrase set NP_p comprises: traversing all noun phrases and removing phrase-final articles, quantifiers, and stop words; then removing subjective words based on the subjective word list to obtain the preprocessed noun phrases, which form the noun phrase set NP_p.
Further, the phrase-final articles, quantifiers, and subjective words are each removed using corresponding regular expressions;
stop words are removed using an AC automaton, which is built from a pre-constructed stop word list.
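The stop-word step names an AC (Aho-Corasick) automaton. A minimal self-contained implementation is sketched below; the embodiment's actual automaton construction is not given in the patent:

```python
from collections import deque

class ACAutomaton:
    """Minimal Aho-Corasick automaton for matching a stop-word list."""

    def __init__(self, words):
        self.goto = [{}]   # per-node character transitions (trie)
        self.fail = [0]    # failure links
        self.out = [None]  # word ending at each node, if any
        for w in words:
            node = 0
            for ch in w:
                if ch not in self.goto[node]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append(None)
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.out[node] = w
        # breadth-first pass to fill in failure links
        queue = deque(self.goto[0].values())
        while queue:
            u = queue.popleft()
            for ch, v in self.goto[u].items():
                queue.append(v)
                f = self.fail[u]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[v] = self.goto[f].get(ch, 0)

    def find(self, text):
        """Yield (end_index, word) for every listed word found in text."""
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            f = node
            while f:
                if self.out[f]:
                    yield i, self.out[f]
                f = self.fail[f]

# matching stop words inside a phrase
ac = ACAutomaton(["the", "of", "a"])
hits = sorted({w for _, w in ac.find("the top of a hill")})
```

All stop words are found in a single pass over the phrase, which is the reason to prefer an automaton over per-word scanning when the stop-word list is large.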
In another aspect, the present invention also provides a computer device comprising at least one processor and at least one memory communicatively coupled to the processor;
the memory stores instructions executable by the processor, which are executed by the processor to implement the English tweet named entity extraction method of the invention.
The beneficial effects of this technical scheme:
1. For informal English text, considering that a large number of subjective words in the text affect downstream natural language processing tasks, the invention uses the multi-party-revised, well-normalized text of knowledge websites such as Wikipedia to construct two word lists, representing subjective and objective words respectively, facilitating subsequent content filtering.
2. Since the noun phrases extracted by the Transformer-based grammatical dependency analysis model may include unnecessary information (such as invalid conjunctions and prepositions) or a mixture of noun phrases and noun clauses, the invention filters invalid information through multiple word lists and builds a tree structure over the recognized noun phrase set to extract nested nouns recursively, improving the accuracy of named entity recognition.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
Fig. 1 is a flowchart of a method for extracting English tweet named entities according to an embodiment of the invention.
Detailed Description
Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and together with the description serve to explain the principles of the invention, and are not intended to limit the scope of the invention.
The method for extracting English tweet named entities based on a subjective and objective word list in this embodiment is shown in fig. 1 and comprises the following steps:
step S1: acquiring English texts in multiple fields, and constructing a corpus;
compared with the informal text of tweets, the text on knowledge websites such as Wikipedia is more formal, multi-party revised, and well-normalized, which facilitates the subsequent training of deep models; meanwhile, considering that a large number of subjective words may appear in text and affect downstream natural language processing tasks, two word lists are constructed, representing subjective and objective words respectively, to facilitate subsequent content filtering.
Specifically, when constructing the corpus, named entities in multiple fields are first obtained and merged to obtain an initial entity list L_init. In particular, entities from different domains, including politics, agriculture, information technology, religion, video entertainment, video games, education, medicine, etc., may be specified empirically, and several entities of recent public interest may be listed based on the Google index.
Further, the entity ID corresponding to each entity in the initial entity list L_init is acquired to obtain an entity ID list L_id; preferably, the SPARQL interface of DBpedia can be used to query the Wikipedia ID corresponding to each entity, forming the entity ID list L_id.
Further, using the SPARQL interface of DBpedia, the entity IDs of the other entities that stand in a relationship with the entity corresponding to each entity ID in L_id (such as the relation "capital" linking to an associated entity "Beijing"), together with the entity IDs corresponding to the other entities in the same category as each entity, are acquired and merged to obtain an entity ID list L_id^1. Specifically, the category may be the wiki Category to which each entity is assigned, an attribute of the wiki entity whose original name in Wikipedia is Category.
Replacing L_id with L_id^1, the entity IDs of the other entities in a relationship with the entity corresponding to each entity ID in L_id^1, together with the entity IDs corresponding to the other entities in the same category as each entity, are acquired and merged to obtain an entity ID list L_id^2. Through the method of this embodiment, the resulting entity ID list L_id^2 involves approximately 130,000 entities in total.
Further, the abstracts of the entity pages corresponding to all the entity IDs in L_id^2 are acquired, and the corpus is constructed from all the abstracts obtained.
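The expansion step above can be expressed as a SPARQL query against DBpedia. The query below is illustrative only: the patent names DBpedia's SPARQL interface but does not disclose its exact queries, and the `dct:subject` category modelling is a DBpedia convention assumed here:

```python
def related_entity_query(entity_uri):
    """Build a SPARQL query gathering entities that are related to, or
    share a Wikipedia category (dct:subject) with, a given entity."""
    return f"""
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?other WHERE {{
  {{ <{entity_uri}> ?relation ?other . FILTER(isIRI(?other)) }}
  UNION
  {{ <{entity_uri}> dct:subject ?cat . ?other dct:subject ?cat . }}
}}
"""

query = related_entity_query("http://dbpedia.org/resource/China")
```

Such a query would be posted to the DBpedia SPARQL endpoint once per entity ID in the current list, and the returned URIs merged into the next list.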
Step S2: word segmentation and word frequency statistics are carried out on texts in the corpus, and a subjective word list is constructed through screening;
Specifically, after the corpus is constructed, word segmentation is carried out on the abstracts in the corpus and word frequencies are counted; words with word frequency below a first preset threshold and words of length 1 are removed to obtain the objective word list Objws. In this embodiment, the first threshold is set to 300: words with word frequency below 300 and words of length 1 are removed, and the remainder form the objective word list Objws.
Further, the nouns, adverbs, adjectives, numerals, prepositions, conjunctions, and articles in the ECDICT dictionary whose word-frequency rank is smaller than a second preset threshold and whose Collins star rating is greater than 1 are acquired to obtain the word list Trws. In this embodiment, the second threshold is set to 10000; that is, the word list Trws uses words whose word-frequency rank is smaller than 10000 and whose Collins star rating is greater than 1.
The words of the list Objws are removed from the list Trws; the remaining words are mostly words such as "nice" and "beautiful" that express subjective feelings, and they constitute the subjective word list Subjws.
Step S3: preprocessing English push text to be identified to obtain standard push text;
specifically, the text can be preprocessed by existing methods such as text standardization, tokenization, text cleaning, and word normalization. Preprocessing of English tweets can also perform normalization, removal, restoration, and other operations by recognizing morphemes in specific formats.
In this embodiment, in view of the characteristics of English tweets, preprocessing applies tweet semantic standardization, informal morpheme standardization, irregular-capitalization correction, sentence-end value information extraction, and clause extraction to the English tweet, obtaining regular standard tweet text. Specifically, tweet preprocessing can be carried out through the following steps:
step S301: tweet semantic standardization, comprising: restoring tag semantics, extracting secondary morphemes, restoring special semantic punctuation, and separating word-end punctuation:
(1) Restoring tag semantics:
the tags in a tweet are the segments beginning with "#", such as "#BlackLivesMatter", characterized by merged words without spaces or punctuation. To address the poor readability of tags in tweets, tag semantics are restored by a compound strategy based on the generalized camel-case naming convention, a dictionary, and a greedy search algorithm: if the tag conforms to generalized camel case (Generalized Camel Case, GCC for short), its semantics are restored according to that naming rule; otherwise, tag semantics are restored by a matching method fusing a dictionary and a greedy algorithm.
Specifically, the tags in the tweet are first extracted to form a tag set. Each tag is judged against the generalized camel-case convention: if it conforms, semantic restoration is performed with a regular expression and the restored result is output; if not, semantic restoration is performed by the matching method fusing a dictionary and a greedy algorithm.
Camel case is a way of writing phrases without spaces or punctuation; here its scope is generalized to phrases containing numbers, ordinal suffixes, runs of connected uppercase words, and the like. The grammar is shown in Table 1.
Table 1: generalized camel-case naming grammar (the table itself is not reproduced in this text)
In this embodiment, the grammar rules of the generalized camel-case convention are all matched by the following regular expression, and semantic restoration is performed on the tags conforming to it:
(?P<content>[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]+|$|\d)|\d+(?:st|nd|rd|th)?|(?<=[a-z])[A-Z][a-z]+?(?=[^a-z]|$))。
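The regular expression above can be applied with `finditer` to split a camel-cased hashtag body back into words; the helper function name is illustrative:

```python
import re

# the generalized camel-case pattern quoted above, verbatim
GCC = re.compile(
    r"(?P<content>[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]+|$|\d)"
    r"|\d+(?:st|nd|rd|th)?|(?<=[a-z])[A-Z][a-z]+?(?=[^a-z]|$))"
)

def restore_hashtag(tag):
    """Restore the word sequence of a camel-cased hashtag body."""
    return " ".join(m.group("content") for m in GCC.finditer(tag))
```

The alternation covers capitalized and lowercase words, all-uppercase runs followed by a capitalized word or digits, and numbers with optional ordinal suffixes.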
Further, the matching method fusing a dictionary and a greedy algorithm comprises the following steps:
word lookup is first performed in the ECDICT dictionary; if there is no hit in ECDICT, WordNinja, based on a greedy search algorithm, is used for splitting. Specifically:
first, a fuzzy query statement is constructed, namely "SELECT * FROM stardict WHERE sw LIKE hashtag", where "hashtag" is the target tag, to query whether the tag exists in the ECDICT dictionary;
if the current tag cannot be matched to any ECDICT record, tag semantic restoration is performed with WordNinja, i.e., a maximum greedy matching algorithm over a known vocabulary;
if the current tag can be matched to words in the ECDICT dictionary, the dictionary is queried directly to judge whether the tag consists of only one word; if so, that word is returned. Otherwise, the similarity between the current tag and all retrieved ECDICT records is computed with the heuristic Gestalt pattern matching algorithm, and the most similar phrase is returned. Specifically, if the current tag consists of only one word, the corresponding word is matched; if the current tag is a phrase, there is no matching result, because a tweet tag contains no internal spaces. For example, if the current tag is "SameAs", although the entry "Same As" exists in ECDICT, there is no space between "Same" and "As" in the tag, so it is not matched; similarly, if the tag is a single word such as "same", the corresponding entry is matched.
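Python's `difflib.SequenceMatcher` implements exactly the Ratcliff/Obershelp "gestalt pattern matching" algorithm named above, so the most-similar-record step can be sketched as follows (the candidate list stands in for retrieved ECDICT records):

```python
import difflib

def most_similar_entry(tag, entries):
    """Return the candidate most similar to the hashtag under gestalt
    pattern matching (Ratcliff/Obershelp similarity ratio)."""
    return max(
        entries,
        key=lambda e: difflib.SequenceMatcher(None, tag.lower(), e.lower()).ratio(),
    )
```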
(2) Extracting secondary morphemes:
secondary morphemes in a tweet are morphemes such as emoji, emoticons, and interjections expressing the author's mood or habits, together with the "@username" grammar denoting a user;
since "@username" is a tweet grammar whose structure provides no semantic information, for the "@username" grammar a string is randomly generated, an uppercase letter followed by 5 random lowercase letters, as the nickname of the user referred to;
in addition, tweets crawled from Twitter are HTML-encoded and must first be converted to UTF-8 or another encoding format, i.e., decoded (also called transcoded). However, some HTML text conversions may fail in this process and thus require handling: for text with decoding errors in a tweet, the replacement content is determined according to a decoding table and substituted; for example, "&amp" needs to be replaced by "and". Invalid morphemes, such as a leading "RT" marking a reply, multimedia links, and mailbox addresses, are removed with corresponding regular expressions. Here, invalid morphemes are language elements that provide no valid information, are erroneous, or are not English.
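The decoding and cleanup steps above can be sketched as a small pipeline. The patterns below are simplified stand-ins for the embodiment's rules, and the final "&" restoration assumes the decoded ampersand should read "and":

```python
import html
import re

def decode_tweet(text):
    """Decode HTML-encoded tweet text and strip invalid morphemes."""
    text = html.unescape(text)                 # HTML entities -> characters
    text = re.sub(r"^RT\s+", "", text)         # leading reply/retweet marker
    text = re.sub(r"https?://\S+", "", text)   # multimedia links
    text = re.sub(r"\S+@\S+\.\S+", "", text)   # mailbox addresses
    text = re.sub(r"\s&\s", " and ", text)     # restore "&" to "and"
    return " ".join(text.split())
```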
(3) Restoring special semantic punctuation, including:
removing paired square brackets "[ ]"; restoring the equal sign "=" to "equal to";
for a tilde "~" or hyphen "-" followed by a name, candidate tildes and hyphens possibly followed by a name are obtained according to the following regular expression:
(?<=(?P<isPunc>[^\s])){0,1}(?P<preS>\s{0,})(?P<p>[-\~])(?=(?P<lastS>\s{0,})(?P<name>([A-Z][^\s]+\s?)+));
hyphens are then judged specially: if the punctuation currently being examined is a hyphen with no space before or after it, and the preceding element is not punctuation, it is considered a word-forming hyphen rather than one carrying special semantics.
Finally, if the punctuation (the tilde or hyphen) is followed by a name, the content to be restored should also take into account the surrounding morphemes of the tilde or hyphen, as shown in Table 2.
Table 2: restored content corresponding to the relative positions of the tilde or hyphen and other morphemes in the sentence (the table itself is not reproduced in this text)
For tildes "~" adjacent to numbers, candidate tildes possibly connected to numbers before and after are obtained according to the following regular expression:
(?P<preFigure>(?:\d|_NUM)){0,1}\s{0,}(?P<target>\~)\s{0,}(?=(?P<lastFigure>(?:\d|_NUM)));
where _NUM represents a possible number word.
If the tilde is followed by a number but not preceded by one, it is replaced with "approximately"; if both the preceding and following elements are numbers, it is replaced with "to".
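The two replacement cases can be sketched with substitutions applied in order, handling the digit-to-digit case first so that only the remaining tildes receive "approximately". A simplified rendering of the rule, ignoring the `_NUM` number-word case:

```python
import re

def restore_tilde(text):
    """"3~5" -> "3 to 5"; a bare "~5" -> "approximately 5"."""
    text = re.sub(r"(\d)\s*~\s*(\d)", r"\1 to \2", text)
    text = re.sub(r"~\s*(\d)", r"approximately \1", text)
    return text
```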
(4) Separating word-end punctuation:
word-end punctuation refers to punctuation at the beginning or end of a word; it is separated, by adding spaces on both sides of punctuation located before or after a word, to facilitate subsequent clause extraction. In addition, because the possessives of plural English nouns would otherwise be matched by these rules, the single quote "'" in word-final "z'" and "s'" forms that may mark a plural possessive is first replaced with a placeholder "@#" to protect the possessive from being split, and the placeholder is replaced back with the original "'" after all punctuation processing is completed.
After semantic standardization by the method of this embodiment, the semantic information of tweet tags, secondary morphemes, special semantic punctuation, and word-end punctuation can be restored, avoiding the loss of text content in subsequent processing caused by filtering out non-standard morphemes that may carry semantics.
Step S302: informal morpheme standardization. This embodiment combines multi-source English vocabularies with BERT for informal morpheme standardization.
This step mainly handles four kinds of abbreviations: common abbreviations with sentence-end punctuation (covering geography, dates, and physiology or medicine), Latin abbreviations, preposition abbreviations, and combined abbreviations of "personal pronoun + modal or auxiliary verb". Except for the common abbreviations with sentence-end punctuation, which are built from the ECDICT dictionary, the abbreviation types are all collected and organized from Wikipedia; in addition, all matching is done through the "free element regular expression".
Specifically, a free element is an element in a free state that meets one of the following characteristics:
a) the element is preceded by a space and followed by the end of the sentence;
b) the element is preceded by a period and followed by a space;
c) the element has spaces both before and after it;
d) the sentence contains only the target element.
The free element regular expression is: (.
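The regular expression is truncated in the source text. A hypothetical pattern implementing the four free-state conditions a) through d), for illustration only:

```python
import re

def free_element_pattern(element):
    """Build a regex matching `element` only in a free state: preceded
    by a space, a period, or the sentence start, and followed by a
    space, a period, or the sentence end."""
    e = re.escape(element)
    return re.compile(r"(?:(?<=\s)|(?<=\.)|^)" + e + r"(?=\s|\.|$)")

p = free_element_pattern("approx")
```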
In particular, in "preposition abbreviation" restoration, a free-state "2" or "4" is restored to "to" or "for"; however, in some contexts 2 and 4 do not represent prepositions but simply quantities. To avoid erroneous restoration, this embodiment uses a BERT model to determine whether a free "2" or "4" in the given text should be restored to the preposition "to" or "for", converting the problem into a binary classification of "restore or not", referred to as the "24 restoration problem" for short.
First, all abstracts in Abs are split into sentences using a clause extraction algorithm based on a dual-stack structure; sentences containing a free "2", "4", "to", or "for" are kept, the target token is replaced with a [MASK] tag, and [CLS] and [SEP] tags are added at the beginning and end of each sentence. The result is recorded as PData.
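The PData construction can be sketched as follows; the free-state pattern and the space-based tokenization are simplifying assumptions, since the exact preprocessing is not given in the patent:

```python
import re

FREE = r"(?:(?<=\s)|^)(2|4|to|for)(?=\s|$|\.)"

def to_pdata(sentence):
    """Turn a sentence containing a free-standing "2", "4", "to" or
    "for" into a masked BERT instance with [CLS]/[SEP] markers; return
    None if the sentence has no such target."""
    masked, n = re.subn(FREE, "[MASK]", sentence, count=1)
    if n == 0:
        return None
    return "[CLS] " + masked + " [SEP]"
```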
A fully connected layer with a two-class SoftMax structure is added on top of the BERT base model (cased);
the model is fine-tuned: PData is first divided into training, validation, and test sets in the ratio 6:2:2; the BERT weights are frozen; the embedding vector for [MASK] output by BERT is fed into the fully connected layer with the two-class SoftMax structure, the binary cross-entropy is computed, and the 24 restoration model is obtained through loss iteration.
The English tweet to be recognized is input into the trained 24 restoration model, which judges and restores its 2s and 4s to obtain the standard representations of 2 and 4.
Step S303: non-canonical uppercase word correction based on word morphology.
First, large-scale text is extracted from Wikipedia as a formal-text dataset; meanwhile, to increase the model's generalization over informal text, the IMDB movie-review dataset, which contains a large amount of informal text, is merged into the formal-text dataset to extend the training set's coverage of informal text.
Further, obtaining morpheme discrete embedding; in particular, when the human is used for judging the correct form of a word in a sentence, the embodiment mainly considers the case of the word and the adjacent words thereof and the factors of the words such as the common part of speech of the word, and the embodiment adopts the following characteristics as the embedded vector of each morpheme in the push:
Dimension 1: id of the morpheme in ECDICT (from 1, 0 indicates an unregistered word)
Dimension(s) 2: the morpheme comprises the following forms: only numbers, full lowercase, full uppercase, capitalization, partial uppercase in words, only period end punctuation, only paired punctuation, only connective punctuation, only other punctuation, various miscellaneous and empty placeholders.
Dimensions 3, 4, 5: the three most frequent parts of speech of the morpheme in ECDICT (0 indicates a null value); when the morpheme has fewer than three parts of speech, the remaining dimensions are null; for example, "apple" has only the noun part of speech, so the last two dimensions are null.
Here, common words refer to the words ranked in the top 8000 of the word-frequency ordering of the contemporary corpus provided in ECDICT, or words with a Collins star rating of at least 0.
Morphemes refer to words of various types (classified, for example, by part of speech or word frequency), punctuation marks, numbers, and so on. Connective punctuation refers to marks that generally express contextual relationships in text, including the ampersand "&", hyphen "-", colon ":", equals sign "=", underscore "_", vertical bar "|" and tilde "~"; other punctuation refers to the remaining marks in the ASCII punctuation set, excluding sentence-ending punctuation, paired punctuation and connective punctuation.
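A minimal sketch of the five-dimensional discrete morpheme features described above. The `ecdict_id` and `ecdict_pos` arguments are stand-in dictionaries for the real ECDICT queries, and only a subset of the dimension-2 form categories is shown:

```python
def word_form(tok):
    """Dimension 2: coarse form category of a morpheme (subset shown)."""
    if tok == "":
        return "empty"
    if tok.isdigit():
        return "digits"
    if tok.islower():
        return "all_lower"
    if tok.isupper():
        return "all_upper"
    if tok[:1].isupper() and tok[1:].islower():
        return "capitalized"
    return "misc"

def morpheme_features(tok, ecdict_id, ecdict_pos):
    """Five discrete dimensions: ECDICT id, form, top-3 parts of speech
    (padded with 0 when fewer than three parts of speech exist)."""
    pos = (ecdict_pos.get(tok.lower(), []) + [0, 0, 0])[:3]
    return [ecdict_id.get(tok.lower(), 0), word_form(tok)] + pos
```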
Furthermore, since fully lowercase words are the main component of normal English text, and most existing training corpora consist of longer formal or semi-formal text, feeding the raw corpus directly into the model would bias it toward classifying input as normal text and produce label imbalance. This embodiment therefore samples with a composite scheme combining radius-sliding-window downsampling and negative sampling, obtaining a perceptron training data set denoted Sam = {s_1, s_2, …, s_n}.
Specifically, downsampling includes: taking the words with non-canonical capitalized forms as core morphemes and, with WR as the radius (1 in this embodiment), introducing their neighboring morphemes as training corpus, where the non-canonical capitalized forms comprise fully uppercase words, capitalized words and words with uppercase letters inside the word. For example, in "He was ORDERED to leave russia", centering on "ORDERED" with a radius of 1 yields the training corpus "was ORDERED to". If there are not enough morphemes within radius WR of a core morpheme, the missing positions are filled with empty strings; in addition, if the ranges taken around several core morphemes within a clause overlap, only the two outermost endpoints of the overlapping ranges are kept and all overlapping ranges in between are merged. For example, in "I really LOVE to SWIM in river", centering on "LOVE" and "SWIM" with a radius of 1 yields the two corpora "really LOVE to" and "to SWIM in"; since "to" appears in both, they are merged into one corpus: "really LOVE to SWIM in".
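The radius-window downsampling with overlap merging can be sketched as follows. For illustration, any token containing an uppercase letter is treated as a core morpheme; the real system restricts this to non-canonical capitalization (excluding, e.g., ordinary sentence-initial capitals):

```python
def window_downsample(tokens, wr=1):
    """Take radius-wr windows around core morphemes and merge windows
    that overlap, as in the downsampling step."""
    def is_core(t):
        # simplified: token differs from its lowercase form
        return bool(t) and t != t.lower()

    spans = []
    for i, t in enumerate(tokens):
        if is_core(t):
            lo, hi = max(0, i - wr), min(len(tokens), i + wr + 1)
            if spans and lo <= spans[-1][1]:   # overlap: merge ranges
                spans[-1] = (spans[-1][0], hi)
            else:
                spans.append((lo, hi))
    return [" ".join(tokens[lo:hi]) for lo, hi in spans]
```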
Negative sampling includes: constructing negative examples on each clause obtained by downsampling, to strengthen the model's fit to negative cases, specifically one of the following:
directly retaining the original clause without modification; or
randomly converting a certain number of fully lowercase words in the clause to fully uppercase (the default is 30% of the word count of the current input sentence); or
converting all fully lowercase words to fully uppercase or capitalized form, where, depending on the style of the adjacent word containing uppercase letters, each conversion targets the same or the opposite style with 50% probability each.
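The second negative-sampling strategy (randomly uppercasing a fraction of the fully lowercase words) might look like the following sketch; the fixed seed is only for reproducibility:

```python
import random

def negative_sample(tokens, rate=0.3, rng=None):
    """Randomly convert a fraction (default 30% of the sentence's word
    count) of the fully lowercase words to fully uppercase."""
    rng = rng or random.Random(0)
    out = list(tokens)
    lower_idx = [i for i, t in enumerate(out) if t.islower()]
    k = max(1, int(rate * len(out))) if lower_idx else 0
    for i in rng.sample(lower_idx, min(k, len(lower_idx))):
        out[i] = out[i].upper()
    return out
```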
Further, the perceptron training data set is divided into training, validation and test sets at a ratio of 6:2:2, and a classification multi-layer perceptron is constructed and trained with binary cross-entropy as the loss function. The multi-layer perceptron comprises an Embedding layer, hidden layers and an output layer:
Embedding layer: converts the discrete features of each morpheme in each clause of the perceptron training data set Sam into continuous vector features. For a training clause s_i ∈ Sam, each morpheme in {w_1, w_2, …, w_n} ∈ s_i is embedded (i.e. v_j = Embedding(w_j), j ∈ [1, n]) and the embeddings are concatenated end to end to obtain the hidden representation of the current clause, h_i = v_1 ⊕ v_2 ⊕ … ⊕ v_n, where ⊕ denotes the vector concatenation operation;
Hidden layers: composed of 5 fully connected layers; each layer (denoted l) uses ELU as its activation function, raising the negative-half-axis slope layer by layer to increase the neurons' sensitivity to negative values, and applies Dropout at drop rate d for regularization. The hidden representation of each layer is formally defined as:
H_{l+1} = Dropout(ELU(W·H_l + b, 0.05·l), d)
output layer: for outputting the classification probability.
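The hidden-layer recurrence above can be sketched in plain Python. This is illustrative only: the weight shapes and the use of inverted dropout are assumptions, and training with binary cross-entropy is omitted:

```python
import math
import random

def elu(x, alpha):
    """ELU with a configurable negative-half-axis coefficient alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def hidden_forward(h, layers, d=0.0, rng=None):
    """Forward pass through the hidden stack: layer l uses ELU with
    alpha = 0.05 * l, followed by Dropout at rate d."""
    rng = rng or random.Random(0)
    for l, (W, b) in enumerate(layers, start=1):
        h = [elu(sum(w * x for w, x in zip(row, h)) + bi, 0.05 * l)
             for row, bi in zip(W, b)]
        if d > 0:  # inverted dropout for regularization
            h = [0.0 if rng.random() < d else x / (1.0 - d) for x in h]
    return h
```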
Through this step, the non-canonical uppercase words in the English tweet to be recognized are converted into the canonical all-lowercase or capitalized form.
Step S304: extracting English sentence-pushing end value information based on one-dimensional conjugated cellular automata;
The tweet text to be recognized, processed by steps S301-S303, is split into three segments: a beginning tag segment HS, a mixture MI of natural text and tags located inside the sentence, and an ending tag segment HE; the first word of MI is the first non-tag text after HS, and the last word of MI is the last non-tag text before HE. After the three fully segmented parts are obtained, the automaton evolves outward from MI.
Specifically, for the right-to-left one-dimensional evolution from MI into HS, the index of the current cell is denoted HS_index, each cell's neighbors are the adjacent words in HS, a forward evolution rule set SR is defined, and the state set is SS = {truncate, retain}. The rules of SR are as follows (in an OR relationship with each other):
(1) The last word or last phrase of HS is a connective word (the longest connective word is matched with an AC automaton built from the connective-word set);
(2) The last word of MI is a preposition;
(3) The first word of MI begins with a lowercase letter;
(4) Neither of the two adjacent words at the HS-MI boundary is a common word, as queried through ECDICT;
(5) The two adjacent words at the HS-MI boundary are merged, and ECDICT is queried for a corresponding compound word.
Here, connective words are the union of prepositions (and prepositional phrases), conjunctions (and conjunctive phrases), modal verbs and auxiliary verbs. The longest connective word is matched with an AC automaton built from the connective-word set; for example, building the AC automaton on the set {as, as well as} ensures that when matching "Mary as well as John" the match is "as well as" rather than "as".
When the automaton evolves under the constraint of the forward evolution rule set SR, the "truncate" and "retain" states in the state set SS behave as follows: if the current cell does not satisfy SR, HS is truncated to HS[HS_index:] and the automaton halts; if SR is satisfied, the element of HS at HS_index is retained at the foremost position of MI.
Further, for the left-to-right one-dimensional evolution from MI into HE, the backward evolution rule set ER is conjugate to the forward evolution rule set SR; its rules are as follows (in an OR relationship with each other):
(1) The last word of MI is a connective word;
(2) The first word of HE is a preposition;
(3) The first word of HE begins with a lowercase letter and the last word of MI is not the end of a sentence;
(4) Neither of the two adjacent words at the MI-HE boundary is a common word, as queried through ECDICT;
(5) The two adjacent words at the MI-HE boundary are merged, and ECDICT is queried for a corresponding compound word.
The automaton thus evolves from MI toward HS and HE, retaining the words that satisfy the forward evolution rule set SR and the backward evolution rule set ER, which yields the extraction result of the sentence-end valuable information.
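The truncate/retain evolution can be sketched as follows. The rule sets SR/ER are abstracted into predicate callbacks, since the real rules query ECDICT and the connective-word AC automaton; all names here are illustrative:

```python
def evolve_forward(hs_tokens, mi_tokens, rules):
    """Right-to-left evolution from MI into HS: a cell that satisfies
    some rule is retained at the front of MI; the first failing cell
    truncates HS and halts the automaton."""
    retained = []
    for idx in range(len(hs_tokens) - 1, -1, -1):  # right to left
        if any(rule(hs_tokens, idx, mi_tokens) for rule in rules):
            retained.insert(0, hs_tokens[idx])     # "retain"
        else:
            break                                  # "truncate"
    return retained + mi_tokens
```

The backward evolution toward HE is the mirror image (left to right, rules from ER).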
Step S305: clause extraction based on a dual stack structure, comprising:
If the last element of the sentence (which may be a letter or a punctuation mark) is not a sentence-ending punctuation mark, a sentence-ending punctuation mark is appended; two stack structures S_pair and S_ter are initialized for storing punctuation in the sentence, recording respectively the left paired punctuation marks traversed so far and the sentence-ending punctuation marks traversed so far;
All elements (including words and punctuation marks) obtained by splitting the current sentence on spaces are then traversed:
if the current element is a sentence-ending punctuation mark, S_pair and S_ter are compared to judge whether the current clause should be intercepted starting from the last sentence-ending punctuation mark or from the last paired symbol; the current index is then pushed onto S_ter as the candidate start index for the next sentence-ending punctuation mark;
if S_pair is currently non-empty and the current element is the right paired punctuation matching its last element, a clause is intercepted from that last paired punctuation mark to the end of the current element; it is then judged whether the last sentence-ending punctuation lies inside the current pair of symbols: if so, both S_pair and S_ter are popped, otherwise only the last element of S_pair is popped;
if the current element is a left paired punctuation mark, the index of the current element is pushed onto S_pair;
when a clause is intercepted, it is replaced in the sentence by a random placeholder name to avoid repeated matching.
After this processing, any unmatched paired punctuation remaining in the sentence is removed directly.
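As a toy sketch of the paired-punctuation half of the dual-stack idea (the sentence-end stack S_ter and the placeholder replacement are omitted for brevity):

```python
def extract_parenthesized_clauses(tokens):
    """S_pair tracks indices of left paired punctuation; a matching
    right mark triggers interception of the enclosed clause."""
    PAIRS = {"(": ")", "[": "]", "{": "}"}
    s_pair = []   # indices of left paired punctuation seen so far
    clauses = []
    for i, tok in enumerate(tokens):
        if tok in PAIRS:
            s_pair.append(i)
        elif s_pair and tok == PAIRS[tokens[s_pair[-1]]]:
            start = s_pair.pop()
            clauses.append(tokens[start:i + 1])
    return clauses
```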
Through the above preprocessing, the standard text of the tweet is obtained.
Step S4: extracting all noun phrases and noun clauses in the standard text by using a grammatical dependency analysis model, preprocessing the noun phrases and noun clauses based on the subjective word list, and constructing a set NP_p.
Specifically, the standard text (which may include clauses from the original text together with their complete tags) is first input into a Transformer-based grammatical dependency analysis model trained on a large corpus, to analyze the part of speech of each word and its role in the sentence; according to the dependency analysis result, the noun phrases or noun clauses Np_i involved in the tweet are extracted, where i ∈ {1, 2, …, n} and n is the number of noun phrases and noun clauses; the set of all extracted noun phrases and noun clauses is NP = {Np_1, Np_2, …, Np_i, …, Np_n}. The noun phrases and noun clauses extracted by the dependency model may include long noun phrases containing several named entities, or noun clauses containing several noun phrases.
Further, the noun phrases and noun clauses in the set NP are preprocessed based on the subjective word list to construct the set NP_p, including: traversing all noun phrases and noun clauses, and removing phrase-initial articles, quantifiers and stop words; then, based on the subjective word list, removing the subjective words contained therein, to obtain the preprocessed set NP_p.
Specifically, for phrase-initial articles: whether the word at the beginning of the phrase is "a", "an" or "the" is matched, and if so, it is removed;
For quantifiers: they are removed after matching with a regular expression built from the NUM alternation below;
where NUM represents possible number words, such as:
three|quarter|three|quarters|two|thirds|one|two|first|last|three|next|million|four|five|second|six|third|billion|hundred|thousand|seven|eight|ten|nine|dozen|fourth|twenty|fifth|thirty|fifteen|fifty|twelve|sixth|forty|seventh|eleven|eighth|zero|twentieth|ninth|nineteenth|trillion|sixteen|eighteen|fourteen|sixty|thirteen|seventeen|eighty|tenth|nineteen|seventy|eighteenth|ninety|seventeenth|sixteenth|fourteenth|twelfth|fifteenth|eleventh|thirteenth|hundredth|fiftieth|thirtieth|fortieth|sixtieth|seventieth|eightieth|ninetieth。
For subjective words: all subjective words present in the phrase are removed using the following templated regular expression, based on the subjective word list Subjws;
(?<=\s)k$|^k(?=\s)|(?<=\s)k(?=\s)|^k$;
where k is replaced with any piece of text to be matched, such as "hello" or "#".
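A minimal sketch of applying the templated pattern, substituting each word to be removed for k (the function name is illustrative):

```python
import re

def remove_words(text, words):
    """Remove standalone occurrences of each word using the templated
    pattern (?<=\\s)k$|^k(?=\\s)|(?<=\\s)k(?=\\s)|^k$ with k substituted."""
    for w in words:
        k = re.escape(w)
        pattern = rf"(?<=\s){k}$|^{k}(?=\s)|(?<=\s){k}(?=\s)|^{k}$"
        text = re.sub(pattern, "", text)
    return re.sub(r"\s+", " ", text).strip()
```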
For stop words: since the stop word list contains words such as "you" and "yourself", an AC automaton is built from the stop word list and the longest matching term in the noun phrase is removed; for example, if both "you" and "yourself" are matched, "yourself" is removed. According to the characteristics of English tweets and the task of this embodiment, the constructed stop word list comprises:
it,its,itself,i,me,my,myself,we,us,our,ours,ourselves,ye,u,you,your,ur,yours,urs,yourselves,yourself,thyself,thine,thee,thou,he,him,his,himself,she,her,herself,they,them,their,theirselves,diz,this,that,these,those,there,here,thing,other,another,others,some,something,someday,somewhere,somehow,someone,somebody,sometime,sometimes,somewhat,every,everything,everywhere,everyone,everybody,everyday,any,anything,anyone,anybody,anyway,anymore,no,nothing,none,null,nowhere,nobody,just,mere,each,many,much,more,most,who,whom,what,where,which,how,whoever,whomever,whosoever,whomsoever,yes,such,not,do,don,dont,does,doesn,doesnt,did,didn,didnt,true,false,is,are,was,were,be,have,has,had,having。
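The embodiment uses an Aho-Corasick automaton for longest-match removal; the following is a simplified regex-based sketch that emulates the longest-first behaviour without the automaton:

```python
import re

def strip_stop_terms(text, stop_terms):
    """Longest-first removal of stop terms at word boundaries; stands in
    for the AC automaton's longest-match rule."""
    for t in sorted(stop_terms, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(t)}\b", "", text, flags=re.I)
    return re.sub(r"\s+", " ", text).strip()
```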
Step S5: based on the set NP_p, constructing a tree-shaped parent-child hierarchy of the noun phrases and noun clauses in the English tweet, and extracting named entities to obtain the named entity recognition result of the English tweet.
Since the set NP_p may include longer clauses such as noun clauses, and a noun phrase may itself contain several other noun phrases, a tree-shaped parent-child relationship must first be established over the output set of noun phrases and noun clauses before the hierarchical structural analysis is performed.
Specifically, based on the inclusion relations of the noun phrases, each noun phrase or noun clause containing at least one other noun phrase is taken as a parent string and the noun phrases it contains as child strings, and a tree-shaped parent-child hierarchy is constructed; that is, a tree-shaped hierarchy NP_t is built from all noun phrases and noun clauses in the set NP_p. If a noun phrase has several parent strings, the longest parent string is taken as its parent node. The set of all parent strings in NP_t is denoted FP_t = {fp_1, fp_2, …, fp_n}. Noun phrases that have no parent string are saved to a named entity set NP_e.
Core nouns are then extracted based on the possessive structures of each parent string and saved to the named entity set NP_e.
Specifically, the following regular expression is used to identify the possessive structures in each parent string of NP_t and extract the core nouns:
([A-Z0-9]\w+\s)*[\w^('s|'z|z'|s')]*(([s|z](?='))|(?='s|'z));
That is, based on the noun possessive structure, the string is traversed leftward from the "'" to the first capitalized word; for the XX YY's and XX YY'z structures, the content from the first capitalized word to YY is extracted, yielding the core noun of the possessive; for the XX YYs' and XX YYz' structures, the content from the first capitalized word to YYs or YYz is extracted, yielding the core noun of the possessive. The extracted core nouns are saved to the named entity set NP_e, where X and Y represent arbitrary words.
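A loose sketch of possessive core-noun extraction; the pattern below is a simplified rewrite for illustration, not the patent's exact expression:

```python
import re

# Capitalized runs heading a possessive ('s / 'z / s' / z') form.
PAT = re.compile(r"((?:[A-Z0-9]\w*\s+)*[A-Z0-9]\w*s)'(?!\w)"   # XX YYs' / YYz'
                 r"|((?:[A-Z0-9]\w*\s+)*[A-Z0-9]\w*)'[sz]\b")  # XX YY's / YY'z

def core_nouns(text):
    """Return the capitalized run preceding each possessive marker."""
    return [m.group(1) or m.group(2) for m in PAT.finditer(text)]
```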
Further, all child strings in each parent string are removed using the templated regular expression, and the remaining content is merged into a new character string str_fp; the set formed by all child strings is denoted cp. In practice, this embodiment sorts the child strings by length and removes them from longest to shortest, so that the child strings carrying more information are removed first.
If the string str_fp satisfies the corresponding preset condition, str_fp is saved to the named entity set NP_e; otherwise, all noun phrases and noun clauses are extracted from str_fp again using the grammatical dependency analysis model, and the extracted noun phrases and noun clauses are saved to the named entity set NP_e. Finally, the child string set cp is merged into the named entity set NP_e.
The named entity set NP_e then replaces the set NP_p, and the tree-shaped parent-child hierarchy is reconstructed to extract named entities; preferably, the hierarchy can be reconstructed a plurality of times to extract named entities and obtain the named entity recognition result. The named entity set obtained by the j-th reconstruction of the hierarchy is denoted NP_e^j, with j an integer greater than 1, and the set obtained by the (j-1)-th reconstruction is denoted NP_e^{j-1}; when the difference set NP_e^j - NP_e^{j-1} is empty, the set NP_e^j is output as the named entity recognition result of the English tweet.
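The repeated reconstruction terminates when the entity set stops growing between iterations, i.e. a fixed point is reached. As a sketch, with the rebuild-and-extract step abstracted into a callback:

```python
def extract_entities(np_p, rebuild_and_extract):
    """Iterate rebuild-and-extract until the j-th and (j-1)-th entity
    sets differ by nothing; `rebuild_and_extract` stands in for building
    the parent-child tree and harvesting entities from it."""
    prev = set()
    cur = rebuild_and_extract(set(np_p))
    while cur - prev:          # difference set non-empty: keep iterating
        prev, cur = cur, rebuild_and_extract(cur)
    return cur
```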
The preset condition for the string str_fp includes: apart from the child strings it contains, str_fp includes only one word, and that word is a preposition or a hyphen; moreover, the number of child strings of the parent string is 2, the absolute value of the difference between the lengths of the two child strings is at most 2, and the lengths of both child strings are at most 3. If the string str_fp satisfies the preset condition, str_fp is saved to the named entity set NP_e. Preferably, if the string str_fp does not satisfy its preset condition, it can further be judged whether str_fp meets the following condition: str_fp is a non-empty string, and not all of its morphemes are invalid information (i.e. prepositions, conjunctions, punctuation, subjective words, stop words and articles); if this condition is met, noun phrases and noun clauses are extracted again with the grammatical dependency analysis model and saved to the named entity set NP_e, and the tree-shaped parent-child hierarchy is reconstructed to extract named entities; if not, the string str_fp can be removed directly.
In summary, the English tweet named entity extraction method based on subjective and objective word lists provided by this embodiment builds the two word lists from the multi-party-revised, lexically normalized text of knowledge websites such as Wikipedia, representing subjective words and objective words respectively, which addresses the problem that the large number of subjective words in informal English tweet text interferes with downstream natural language processing tasks. For the problem that the noun phrases extracted by the Transformer-based grammatical dependency analysis model may include unnecessary information (such as invalid conjunctions and prepositions) or mix various noun phrases and noun clauses, invalid information is filtered through the multiple word lists, and a tree structure is built over the recognized noun phrase set to extract nouns recursively, improving the accuracy of named entity recognition.
In another embodiment of the invention, a computer device is provided that includes at least one processor and at least one memory communicatively coupled to the processor; the memory stores instructions executable by the processor, which are executed by the processor to implement the English tweet named entity extraction method of the foregoing embodiments.
Those skilled in the art will appreciate that all or part of the flow of the methods of the embodiments described above may be accomplished by way of a computer program to instruct associated hardware, where the program may be stored on a computer readable storage medium. Wherein the computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory, etc.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims (10)

1. An English tweet named entity extraction method based on subjective and objective word lists, characterized by comprising the following steps:
acquiring English texts in multiple fields, and constructing a corpus;
word segmentation and word frequency statistics are carried out on texts in the corpus, and a subjective word list is constructed through screening;
preprocessing the English tweet text to be recognized to obtain standard tweet text;
extracting all noun phrases and noun clauses in the standard text by using a grammatical dependency analysis model, preprocessing the noun phrases and noun clauses based on the subjective word list, and constructing a set NP_p;
based on the set NP_p, constructing a tree-shaped parent-child hierarchy of the noun phrases and noun clauses in the English tweet, and extracting named entities to obtain the named entity recognition result of the English tweet.
2. The English tweet named entity extraction method according to claim 1, wherein constructing the tree-shaped parent-child hierarchy and extracting the named entities comprises:
based on the inclusion relations of the noun phrases and noun clauses in the set NP_p, taking each noun phrase or noun clause containing at least one noun phrase as a parent string and the noun phrases it contains as child strings, and establishing the tree-shaped inclusion relation;
extracting core nouns based on the possessive structures of each parent string, and saving the core nouns to a named entity set NP_e;
removing all child strings in each parent string and merging the remaining content into a new character string str_fp, the set formed by all child strings being denoted cp;
if the string str_fp satisfies the corresponding preset condition, saving str_fp to the named entity set NP_e; otherwise, extracting all noun phrases and noun clauses from str_fp again by using the grammatical dependency analysis model, and saving the extracted noun phrases and noun clauses to the named entity set NP_e;
merging the child string set cp into the named entity set NP_e;
and reconstructing the tree-shaped parent-child hierarchy to extract named entities and obtain the named entity recognition result.
3. The English tweet named entity extraction method according to claim 2, wherein the tree-shaped parent-child hierarchy is reconstructed a plurality of times to extract the named entities and obtain the named entity recognition result; the named entity set obtained by the j-th reconstruction of the hierarchy is denoted NP_e^j, j being an integer greater than 1, and the named entity set obtained by the (j-1)-th reconstruction is denoted NP_e^{j-1}; when the difference set NP_e^j - NP_e^{j-1} is empty, the set NP_e^j is output as the named entity recognition result of the English tweet.
4. The English tweet named entity extraction method according to claim 2, wherein the preset condition corresponding to the string str_fp includes: apart from the child strings it contains, the string str_fp includes only one word, and the word is a preposition or a hyphen; the number of child strings of the string str_fp is 2, the absolute value of the difference between the lengths of the two child strings is not more than 2, and the lengths of both child strings are not more than 3.
5. The English tweet named entity extraction method according to claim 2, wherein extracting the core noun based on the possessive structure of each parent string comprises:
traversing leftward from the "'" to the first capitalized word based on the noun possessive structure; for the XX YY's and XX YY'z structures, extracting the content from the first capitalized word to YY to obtain the core noun of the possessive; for the XX YYs' and XX YYz' structures, extracting the content from the first capitalized word to YYs or YYz to obtain the core noun of the possessive, where X and Y represent arbitrary words.
6. The English tweet named entity extraction method according to claim 1, wherein acquiring English texts in a plurality of fields and constructing the corpus comprises:
acquiring named entities of a plurality of fields and merging them to obtain an initial entity list L_init;
acquiring the entity ID corresponding to each entity in the initial entity list L_init, to obtain an entity ID list L_id^1;
acquiring the other entity IDs associated with the entity corresponding to each entity ID in the list L_id^1, merging them with the entity IDs corresponding to other entities of the same class as each entity, and obtaining an entity ID list L_id^2;
replacing L_id^1 with L_id^2, acquiring the other entity IDs associated with the entity corresponding to each entity ID in the list, merging them with the entity IDs corresponding to other entities of the same class as each entity, and obtaining an entity ID list L_id^3;
acquiring the abstracts of the entities corresponding to all entity IDs in L_id^3, and constructing the corpus based on the acquired abstracts.
7. The English tweet named entity extraction method according to claim 1 or 6, wherein constructing the subjective word list comprises:
segmenting the abstracts in the corpus and counting word frequencies, and removing words whose word frequency is smaller than a first preset threshold and words whose length is 1, to obtain an objective word list Objws;
acquiring the nouns, adverbs, adjectives, numerals, prepositions, conjunctions and articles whose word-frequency rank is smaller than a second preset threshold and whose Collins star rating in the ECDICT dictionary is greater than 1, to obtain a word list Trws;
removing the words in the word list Objws from the word list Trws to obtain the subjective word list Subjws.
8. The English tweet named entity extraction method according to claim 1, wherein preprocessing the noun phrases based on the subjective word list to construct the noun phrase set NP_p comprises: traversing all noun phrases and removing phrase-initial articles, quantifiers and stop words; and, based on the subjective word list, removing the subjective words contained therein, to obtain the preprocessed noun phrases and thus the noun phrase set NP_p.
9. The English tweet named entity extraction method according to claim 8, wherein
the phrase-initial articles, the quantifiers and the subjective words are removed with corresponding regular expressions respectively;
the stop words are removed with an AC automaton, the AC automaton being built from a pre-constructed stop word list.
10. A computer device comprising at least one processor and at least one memory communicatively coupled to the processor;
the memory stores instructions executable by the processor, the instructions being executed by the processor to implement the English tweet named entity extraction method based on subjective and objective word lists according to any one of claims 1 to 9.
CN202211458427.7A 2022-11-21 2022-11-21 English push named entity extraction method and device based on subjective and objective word list Pending CN116127971A (en)

Publications (1)

Publication Number Publication Date
CN116127971A true CN116127971A (en) 2023-05-16


