CN117195887A - English news element extraction method based on core word diffusion - Google Patents

English news element extraction method based on core word diffusion

Info

Publication number
CN117195887A
CN117195887A (application number CN202311222086.8A)
Authority
CN
China
Prior art keywords
word
core
candidate
core word
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311222086.8A
Other languages
Chinese (zh)
Inventor
曾祥健
苏俊斌
余清楚
牛榕婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202311222086.8A
Publication of CN117195887A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a WHO-element extraction method for English news scenarios. The method cleans and preprocesses web news data, combines the article title, lead paragraph, and article body into text data, and parses the text data into a tree structure; it extracts all nouns in all noun phrases from subtrees of the tree structure, screens core words from the noun phrases to recall a core word set, ranks the recalled core words by computing a feature value for each, and takes the first core word of the ranking result as the final core word; it then expands the core word into a complete element text by traversing to the leaf node corresponding to the core word, searching toward the parent nodes for a simple sentence node, and concatenating all text on the path from the simple sentence node to the leaf node to generate the element text. The application extracts WHO elements from multiple news samples with high accuracy, improves the recall rate of core words, and expands the recall set.

Description

English news element extraction method based on core word diffusion
Technical Field
The application relates to the field of computer technology and natural language processing, in particular to an English news element extraction method based on core word diffusion.
Background
In today's age of information explosion, news reports play a vital role, providing the public with a window for understanding the world. However, to fully understand a news event, a headline or short summary alone is not sufficient. To meet the public demand for accurate, complete news reports, journalists and news practitioners have adopted a compact but comprehensive framework, 5W1H: "What", "Why", "When", "Where", "Who", and "How".
Researchers in the news field need to analyze large amounts of news data about a given event, and extracting the 5W1H elements is a common approach. The traditional manual extraction method is inefficient and cannot meet the quantitative analysis needs of researchers in the big data age, so a computer-assisted news element extraction method is needed to help researchers process large amounts of news data.
News element extraction can also be framed as a question-answering task. The general 5W extraction flow for news text has three stages: preprocessing, candidate phrase extraction, and candidate scoring. The input to such a system is typically text, such as headlines, lead paragraphs, and body text; other systems use automatic speech recognition (ASR) to convert broadcast speech into text before feeding it to the system. The output is five phrases, one for each of the 5Ws, that together represent the dominant event of the given news text. The preprocessing stage performs sentence splitting, tokenization, and word-level NLP annotation, including part-of-speech tagging, coreference resolution, named entity recognition (NER), and semantic role labeling (SRL).
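The three-stage flow described above (preprocessing, candidate phrase extraction, candidate scoring) can be sketched roughly as below. Every helper here is an invented stand-in for illustration, not an existing system's API: real candidate extraction would use noun phrases from a parse, and real scoring would use the weighted features discussed later.

```python
def extract_element(title, lead, body):
    """Return the best-scoring candidate phrase, or None if nothing recalls."""
    text = preprocess(title, lead, body)
    candidates = extract_candidates(text)
    scored = [(score(c, text), c) for c in candidates]
    return max(scored)[1] if scored else None

def preprocess(title, lead, body):
    # Stand-in for cleaning + merging title, lead paragraph, and body.
    return " ".join(part.strip() for part in (title, lead, body) if part)

def extract_candidates(text):
    # Crude stand-in for noun phrase extraction: capitalized tokens only.
    return [w for w in text.split() if w[:1].isupper()]

def score(candidate, text):
    # Crude stand-in for candidate scoring: earlier mention scores higher.
    return -text.split().index(candidate)
```

The point of the skeleton is the data flow, not any individual heuristic: each stage can be swapped for the parser-based recall and feature-weighted ranking the application describes.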
The prior art processes each sentence from an article, and each sentence is processed according to the following flow:
recall: traverse the coreference cluster set returned by coreNLP and extract the first noun phrase NP; if its right sibling branch is a verb phrase VP and its direct parent node is S, put the word corresponding to the node into the recall candidates; if the node itself contains a VP subtree, discard it.
Ranking: for each recall candidate, compute three features: the number of times the candidate appears in the coreference resolution set, its position in the whole article, and whether it is a named entity; the three values are multiplied by the weights 0.095, 0.9, and 0.005 respectively and summed.
Recall in the prior art depends on the coreference resolution set returned by coreNLP; however, coreNLP is a model-based algorithm and does not return coreference results for every news article, which leaves the prior art with nothing to recall. In addition, prior-art recall methods typically impose many restrictions to keep irrelevant candidate words from interfering with ranking, which shrinks the recall set so much that the correct WHO element cannot be extracted from the article.
Since determining the subject of analysis is an important proposition for news researchers, the present application focuses on extracting the answer to the "Who" question for news.
Disclosure of Invention
The present application proposes the following technical solution to one or more of the above technical drawbacks of the prior art.
The method for extracting the English news elements based on the core word diffusion is characterized by comprising the following steps:
S1, a recall step: screening out a subset meeting the core word requirements from the text data of web news; specifically, extracting all nouns in all noun phrases of the text data corresponding to WHO elements to obtain a core word candidate set;
S2, a sorting step: forming a candidate core word initial matrix from the core word candidate set of S1, calculating feature values for each candidate core word in the matrix and performing weighted summation, sorting the candidate core words from largest to smallest by the weighted sum, taking the first core word of the sorting result as the final core word according to the parameter setting, and outputting the sorted candidate core word matrix;
and S3, an element text generation step: expanding the final core word of S2 into a complete element text; specifically, diffusing the core word according to the tree structure of the text data of S1 by finding the leaf node corresponding to the final core word in the tree structure, searching toward its parent nodes for a simple sentence node, and concatenating all text on the path from the simple sentence node to the leaf node to form the final WHO element text.
Still further, before the recall of step S1, the method further includes: cleaning the web news and combining the article title, lead paragraph, and article body of the web news into the text data.
A technical effect of the application is that cleaning the data before executing step S1 further improves the accuracy of WHO element extraction, ensures the integrity of the data, and improves the reliability of the subsequent steps.
Further, the feature value calculation in S2 includes calculating a position offset value, the word frequency of the candidate word, a word2vec value, and the Jaccard coefficient between the article title and the sentence containing the candidate word, and finally determining whether the candidate core word is a named entity: if it is a named entity, the corresponding feature value is set to 1, otherwise 0.
Further, the position offset value is calculated from the position of the candidate core word within its sentence and the position of that sentence within the full text, as follows:

x = a * (L_token_length - i_token_i + 1) / L_token_length + b * (L_s_length - i_s_i + 1) / L_s_length

where x is the position offset value; a and b are weight coefficients; L_token_length is the total number of words in the sentence containing the candidate core word; L_s_length is the total number of sentences; i_token_i is the position index of the candidate core word within its sentence; and i_s_i is the index of the sentence containing the candidate core word.
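The formula above can be written directly as a small function. The default a = b = 0.5 matches the weighting used in the worked embodiments later in the description; the function name and signature are illustrative, not the patent's code.

```python
def position_offset(token_length, token_i, s_length, s_i, a=0.5, b=0.5):
    """x = a*(L_token_length - i_token_i + 1)/L_token_length
         + b*(L_s_length - i_s_i + 1)/L_s_length

    token_length: words in the candidate's sentence; token_i: the
    candidate's position index in that sentence; s_length: total
    sentences; s_i: index of the candidate's sentence."""
    return (a * (token_length - token_i + 1) / token_length
            + b * (s_length - s_i + 1) / s_length)
```

For the "Yankees" embodiment described below, `position_offset(20, 2, 1, 1)` gives 0.975, matching the worked example.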
The position offset value describes the position factor of a candidate core word intuitively in the data: in general, the earlier a candidate core word appears, the more likely it is to be the final core word.
Still further, the word frequency of the candidate word is obtained by counting the frequency of each candidate core word in the candidate core word set and normalizing the counts.
Further, the word2vec value is obtained by loading a word2vec model to produce the word embedding sequence of the article title and that of the sentence containing the candidate core word, and then computing the cosine similarity of the two embedding sequences.
Converting English words into vectors allows the words and sentences of the English text to be analyzed more generally, so the correlation between candidate core words and the article title can be further analyzed.
Still further, the Jaccard coefficient is obtained by comparing the similarity and difference between the set of nouns in the article title and the set of all candidate core words in the sentence containing the candidate core word.
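The Jaccard coefficient between the two noun sets is simply intersection size over union size. A minimal sketch (the example sets are invented for illustration):

```python
def jaccard(a, b):
    """|A intersect B| / |A union B|; defined as 0.0 when both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Illustrative noun sets: one shared noun out of four distinct nouns.
title_nouns = {"Yankees", "stadium"}
sentence_nouns = {"Yankees", "lengths", "features"}
```

Here `jaccard(title_nouns, sentence_nouns)` is 1/4, since "Yankees" is the only shared noun among four distinct ones.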
Analyzing the similarity between the candidate core words and the nouns of the article title via the Jaccard coefficient makes it more apparent which candidate core word should be the final core word.
The application also proposes a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method described above.
The technical effects of the application: the method solves the problems of a small recall set and low recall rate in traditional news element extraction, screens article core words from the recall set, diffuses the core words into phrases, extracts the WHO elements of English articles, and improves the accuracy of WHO element extraction.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings.
Fig. 1 is a general flowchart of a WHO element extraction method in an English news scenario according to an embodiment of the application.
Fig. 2 is a tree structure diagram of English sentence parsing according to an embodiment of the application.
Fig. 3 is a flowchart of the recall step of a WHO element extraction method in an English news scenario according to an embodiment of the application.
Fig. 4 is a flowchart of the sorting step of a WHO element extraction method in an English news scenario according to an embodiment of the application.
Fig. 5 is a flowchart of the element text generation step of a WHO element extraction method in an English news scenario according to an embodiment of the application.
Fig. 6 is a tree structure diagram of English sentence parsing according to an embodiment of the application.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
Fig. 1 shows an overall flowchart of a WHO element extraction method in an English news scenario of the application.
The method comprises the following steps:
S1, a recall step: screening out a subset meeting the core word requirements from the text data of web news; specifically, extracting all nouns in all noun phrases of the text data corresponding to WHO elements to obtain a core word candidate set;
S2, a sorting step: forming a candidate core word initial matrix from the core word candidate set of S1, calculating feature values for each candidate core word in the matrix and performing weighted summation, sorting the candidate core words from largest to smallest by the weighted sum, taking the first core word of the sorting result as the final core word according to the parameter setting, and outputting the sorted candidate core word matrix;
and S3, an element text generation step: expanding the final core word of S2 into a complete element text; specifically, diffusing the core word according to the tree structure of the text data of S1 by finding the leaf node corresponding to the final core word in the tree structure, searching toward its parent nodes for a simple sentence node, and concatenating all text on the path from the simple sentence node to the leaf node to form the final WHO element text.
It should be noted that, before the recall of step S1, the method further includes: cleaning the web news and combining the article title, lead paragraph, and article body of the web news into the text data.
It should be noted that cleaning the text data of the web news includes, but is not limited to, filtering dirty or invalid characters and removing advertisement text.
After the text data is cleaned, it is processed and recombined by extracting the tree structure of the text with coreNLP and tagging the parts of speech of the words.
Note that coreNLP is a set of NLP tools from Stanford University; its integrated utility functions include word segmentation, part-of-speech tagging, and syntactic analysis.
In a specific exemplary embodiment, for the sentence "The Yankees went to great lengths to incorporate many features of their old stadium (bottom) into their new one (top)", the tree structure extracted by coreNLP parsing is shown in fig. 2:
ROOT is the root node; S is a child of ROOT; "(NP (DT The) (NNPS Yankees))" is a child of S; "(VP (VBD went) ...)" is a child of S and a sibling of "(NP (DT The) (NNPS Yankees))", where NP denotes a noun phrase and VP denotes a verb phrase;
the tree structure represents the hierarchical relationships of the sentence: in the exemplary sentence, "The Yankees" is the subject, "went" is the predicate, "The Yankees" and "went" are at the same level, and "great lengths" modifies "went" and sits below "went" in the tree.
Each sentence contains token data corresponding to its number of words; each token is an annotation of one word and contains the word's position index, its original text, and its part-of-speech (pos) information.
It should be noted that, the initial matrix of candidate core words in S2 is used to store various feature values of candidate words.
It should be noted that the feature value calculation includes calculating a position offset value, the word frequency of the candidate word, a word2vec value, and the Jaccard coefficient between the article title and the sentence containing the candidate word, and finally determining whether the candidate core word is a named entity: if it is a named entity, the corresponding feature value is set to 1, otherwise 0.
It should be noted that the position offset value is calculated from the position of the candidate word within its sentence and the position of that sentence within the full text, as follows:

x = a * (L_token_length - i_token_i + 1) / L_token_length + b * (L_s_length - i_s_i + 1) / L_s_length

where x is the position offset value; a and b are weight coefficients; L_token_length is the total number of words in the sentence containing the candidate word; L_s_length is the total number of sentences; i_token_i is the position index of the candidate word within its sentence; and i_s_i is the index of the sentence containing the candidate word.
The position offset value describes the position factor of a candidate core word: the earlier the candidate core word appears, the higher its probability of being the final core word.
It should be noted that the word frequency of the candidate word is obtained by counting the frequency of each candidate core word in the candidate core word set and normalizing the counts.
It should be noted that the word2vec value is obtained by loading a word2vec model to produce the word embedding sequence of the article title and that of the sentence containing the candidate core word, and then computing the cosine similarity of the two embedding sequences.
It should be noted that word2vec is a family of models for producing word vectors. It embeds words from a high-dimensional space (with dimension equal to the vocabulary size) into a continuous vector space of much lower dimension, mapping each word or phrase to a real-valued vector; this word vectorization is used here to measure the correlation between candidate core words and the article title.
It should be noted that the article title contains the important information of the English news text, so the higher the correlation between a sentence and the article title, the higher the probability that the sentence is a key sentence.
It should be noted that the Jaccard coefficient is obtained by comparing the similarity and difference between the set of nouns in the article title and the set of all candidate core words in the sentence containing the candidate core word.
It should be noted that the Jaccard coefficient is the ratio of the size of the intersection of two sets to the size of their union; it is used to compare similarity and difference between finite sample sets, and the larger the Jaccard coefficient, the higher the similarity of the samples.
Fig. 3 shows a flowchart of the recall step of a WHO element extraction method in an English news scenario according to an embodiment of the application. The step comprises:
s3-1, parsing tree structure data of an English news article from its text data;
s3-2, traversing all subtrees of the root node and judging whether each subtree is a noun phrase NP; if it is, further judging whether its parent node is a simple sentence S; if it is not a noun phrase NP, continuing to traverse all subtrees of the root node;
s3-3, judging whether the parent node of the noun phrase NP subtree found in S3-2 is a simple sentence S; if so, extracting all nouns from the NP subtree; otherwise, continuing to traverse all subtrees of the root node; and outputting all candidate nouns once all subtrees of the root node have been traversed.
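The recall traversal in S3-1 to S3-3 can be sketched over the bracketed parse strings that coreNLP returns. This is a minimal stand-in, assuming a hand-written parse in coreNLP's bracket notation; the parser and function names are illustrative:

```python
def parse(s):
    """Parse a bracketed constituency string into (label, children) tuples;
    children are either sub-tuples or leaf word strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0
    def read():
        nonlocal pos
        pos += 1                       # skip "("
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1                       # skip ")"
        return (label, children)
    return read()

def subtrees(node):
    """Yield the node and every nested subtree."""
    yield node
    for c in node[1]:
        if isinstance(c, tuple):
            yield from subtrees(c)

def recall_candidates(tree):
    """S3-2/S3-3: for every NP whose direct parent is a simple sentence S,
    extract the nouns (POS tags starting with NN) from that NP subtree."""
    found = []
    def walk(node, parent_label):
        label, children = node
        if label == "NP" and parent_label == "S":
            for slabel, schildren in subtrees(node):
                if slabel.startswith("NN"):
                    found.extend(w for w in schildren if isinstance(w, str))
        for c in children:
            if isinstance(c, tuple):
                walk(c, label)
    walk(tree, None)
    return found
```

On a simplified parse of the Yankees sentence, "Yankees" is recalled (its NP's parent is S) while "lengths" is not (its NP's parent is a PP), mirroring the embodiment in fig. 2.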
In an exemplary embodiment, the tree structure data shown in fig. 2 is parsed from the English text "The Yankees went to great lengths to incorporate many features of their old stadium (bottom) into their new one (top)"; all subtrees of the root node ROOT whose label is noun phrase NP are traversed, and the NP subtrees whose parent node is a simple sentence S are found, resulting in the candidate core word set { "The Yankees" }.
In an exemplary embodiment, the tree structure data shown in fig. 6 is parsed from the English text "NEW satellite images appear to confirm that North Korea is building a tunnel to carry out secret underground nuclear tests."; all subtrees of the root node ROOT whose label is noun phrase NP are traversed, and the NP subtrees whose parent node is a simple sentence S are found, resulting in the candidate core word set { "NEW satellite images", "North Korea" }.
It should be noted that the recall step aims to extract a subset meeting the requirements of core words from english news articles.
Note that WHO elements generally correspond to noun phrases, so that all nouns in all noun phrases need to be extracted from subtrees.
It should be noted that the noun phrase must be a direct child node of a simple sentence; specifically, in terms of the data structure, the label of the subtree corresponding to the noun phrase equals NP, and the label of its direct parent node equals S.
In one exemplary embodiment, the noun phrase "The Yankees" is a direct child of a simple sentence node S: the label of the subtree corresponding to "The Yankees" equals NP and the label of its direct parent equals S, so the noun "Yankees" is recalled into the candidate core word set.
In a specific embodiment, the recall step of the application avoids the prior-art dependence on the coreference resolution set returned by coreNLP, which can leave nothing to recall; it also avoids the smaller recall set and lower recall rate caused by the prior art's numerous restrictions intended to keep irrelevant candidate words from interfering with ranking.
Fig. 4 shows a flowchart of the sorting step of a WHO element extraction method in an English news scenario according to an embodiment of the application. The step comprises:
s4-1, inputting the candidate core words and extracting article titles;
s4-2, generating a candidate core word initial matrix;
s4-3, traversing each candidate core word;
s4-4, calculating the characteristic value of the candidate core word in S4-3;
s4-5, carrying out weighted summation on the characteristic values of the S4-4;
s4-6, taking the Top-N core words as the final core words according to the overall ranking result, and outputting the final core word matrix.
Note that N = 1 in S4-6, i.e., the candidate with the highest total score in the ranking result is taken as the final core word.
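The sorting step S4-1 to S4-6 reduces to weighting each candidate's feature row, summing, and keeping the Top-N. In this hedged sketch the feature values and the equal weights are invented for illustration; the application's actual features are the position offset, word frequency, word2vec value, Jaccard coefficient, and named-entity flag described above:

```python
def rank_candidates(features, weights, top_n=1):
    """S4-4..S4-6: weight and sum each candidate's feature row, sort the
    candidates by total score (descending), and keep the Top-N."""
    scores = {word: sum(f * w for f, w in zip(row, weights))
              for word, row in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Feature rows: [position offset, word frequency, word2vec similarity,
# Jaccard coefficient, is-named-entity]; all numbers are illustrative.
features = {
    "Yankees": [0.975, 0.40, 0.80, 0.50, 1.0],
    "lengths": [0.500, 0.20, 0.30, 0.00, 0.0],
}
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
```

With these numbers "Yankees" scores 0.735 against 0.2 for "lengths", so Top-1 selects "Yankees" as the final core word.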
It should be noted that, the candidate core word initial matrix described in S4-2 is used to store various feature values of the candidate word.
It should be noted that the feature value calculation in S4-3 includes calculating a position offset value, the word frequency of the candidate word, a word2vec value, and the Jaccard coefficient between the article title and the sentence containing the candidate word, and finally determining whether the candidate core word is a named entity: if it is a named entity, the corresponding feature value is set to 1, otherwise 0.
It should be noted that the position offset value is calculated from the position of the candidate word within its sentence and the position of that sentence within the full text, as follows:

x = a * (L_token_length - i_token_i + 1) / L_token_length + b * (L_s_length - i_s_i + 1) / L_s_length

where x is the position offset value; a and b are weight coefficients; L_token_length is the total number of words in the sentence containing the candidate word; L_s_length is the total number of sentences; i_token_i is the position index of the candidate word within its sentence; and i_s_i is the index of the sentence containing the candidate word.
The position offset value describes the position factor of a candidate core word: the earlier the candidate core word appears, the higher its probability of being the final core word.
It should be noted that the word frequency of the candidate word is obtained by counting the frequency of each candidate core word in the candidate core word set and normalizing the counts.
It should be noted that the word2vec value is obtained by loading a word2vec model to produce the word embedding sequence of the article title and that of the sentence containing the candidate core word, and then computing the cosine similarity of the two embedding sequences.
It should be noted that the word2vec model is a model for producing word vectors. It embeds words from a high-dimensional space (with dimension equal to the vocabulary size) into a continuous vector space of much lower dimension, mapping each word or phrase to a real-valued vector; this word vectorization is used here to measure the correlation between candidate core words and the article title.
It should be noted that sentence similarity calculation with word2vec involves segmenting each sentence into words, converting each word into a vector with the word2vec model, and summing the vectors of all words so that a sentence of any length is represented as a vector of fixed dimension; the cosine similarity between the word vectors of the two sentences is then computed for comparison.
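The averaging-plus-cosine procedure just described can be sketched with toy vectors. The tiny hand-made `VEC` table stands in for a loaded word2vec model (real use would load pretrained embeddings, e.g. via gensim's `KeyedVectors`); the vectors and vocabulary are invented for illustration:

```python
import math

# Toy 2-d word vectors standing in for a trained word2vec model.
VEC = {
    "yankees": [0.9, 0.1], "stadium": [0.8, 0.3],
    "went":    [0.1, 0.9], "great":   [0.2, 0.8],
}

def sentence_vec(words):
    """Average the vectors of in-vocabulary words: a sentence of any
    length becomes one fixed-dimension vector."""
    known = [VEC[w] for w in words if w in VEC]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

A title and a candidate's sentence are each reduced to one vector with `sentence_vec`, and `cosine` of the two vectors is the word2vec feature value.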
It should be noted that the article title contains the important information of the English news text, so the higher the correlation between a sentence and the article title, the higher the probability that the sentence is a key sentence.
It should be noted that the Jaccard coefficient is obtained by comparing the similarity and difference between the set of nouns in the article title and the set of all candidate core words in the sentence containing the candidate core word.
It should be noted that the Jaccard coefficient is the ratio of the size of the intersection of two sets to the size of their union; it is used to compare similarity and difference between finite sample sets, and the larger the Jaccard coefficient, the higher the similarity of the samples.
In one specific exemplary embodiment, the position offset value is calculated for the recalled core word "Yankees" from the English text "The Yankees went to great lengths to incorporate many features of their old stadium (bottom) into their new one (top)",
and according to the position offset value calculation formula:
the total number of words in the sentence containing the core word "Yankees" is L_token_length = 20, and the position index of "Yankees" within the sentence is i_token_i = 2; the weight coefficients a and b are both 0.5. Since this embodiment has only one sentence, the total number of sentences L_s_length = 1 and the sentence index i_s_i = 1. Substituting into the position offset formula gives x = 0.5*(20-2+1)/20 + 0.5*(1-1+1)/1 = 0.975.
Likewise, in a specific exemplary embodiment, for the English text "An ice bridge holding a vast Antarctic ice shelf in place has shattered and may herald a wider collapse caused by global warming, a scientist said Saturday. 'It's amazing how the ice has ruptured,' said David Vaughan, a glaciologist with the British Antarctic Survey. 'Two days ago it was intact,' he said, referring to a satellite image of the Wilkins ice shelf. The satellite picture, taken by the European Space Agency, showed that a strip of ice about 25 miles long that is believed to pin the ice shelf in place had snapped. The loss of the ice bridge could mean a wider breakup of the ice shelf, which is about the size of Connecticut", the core word recalled by the method described in the recall step is the second occurrence of "ice" in the sentence, i.e., "ice" used as a noun, and the position offset value is calculated for it;
and according to a position offset value calculation formula:
the total number of words in the sentence containing the core word "ice" is L_token_length = 27, and the position index of "ice" within the sentence is i_token_i = 8; the weight coefficients a and b are both 0.5; the total number of sentences L_s_length = 5, and the index of the sentence containing "ice" is i_s_i = 1. Substituting into the position offset formula gives x = 0.5*(27-8+1)/27 + 0.5*(5-1+1)/5 = 0.870.
Fig. 5 shows a flowchart of the element text generation step of a WHO element extraction method in an English news scenario according to an embodiment of the application. The step comprises:
s5-1, traversing each candidate core word in the candidate core word matrix;
s5-2, extracting the subscript of the sentence containing the final core word;
s5-3, traversing all subtrees of the sentence;
s5-4, finding leaf nodes to which the final core words belong in the subtrees;
s5-5, starting from the leaf node of S5-4, searching upward for the simple sentence node S to which it belongs;
and s5-6, connecting the branch path from the simple sentence node to the leaf node of the final core word to serve as the WHO element, forming the WHO element text.
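Steps S5-4 to S5-6 can be sketched as a climb from the core word's leaf toward the nearest S ancestor. One hedged reading of "connecting the branch path" (consistent with both embodiments below, but an interpretation, not the patent's code) is to emit the text of the highest NP on that path:

```python
def parse(s):
    """Parse a bracketed constituency string into (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0
    def read():
        nonlocal pos
        pos += 1                       # skip "("
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1                       # skip ")"
        return (label, children)
    return read()

def expand_to_element(tree, core_word):
    """S5-4..S5-6: find the leaf holding core_word, climb until a simple
    sentence node S, and emit the text of the highest NP on the path.
    Assumes the core word sits inside at least one NP."""
    parents = {}
    def link(node):
        for c in node[1]:
            if isinstance(c, tuple):
                parents[id(c)] = node
                link(c)
    link(tree)

    def find_leaf(node):
        label, children = node
        if len(children) == 1 and children[0] == core_word:
            return node
        for c in children:
            if isinstance(c, tuple):
                hit = find_leaf(c)
                if hit is not None:
                    return hit
        return None

    node, np_node = find_leaf(tree), None
    while id(node) in parents:         # climb toward ROOT
        node = parents[id(node)]
        if node[0] == "NP":
            np_node = node             # remember the highest NP so far
        if node[0] == "S":
            break                      # stop at the simple sentence node

    def leaves(n):
        for c in n[1]:
            if isinstance(c, tuple):
                yield from leaves(c)
            else:
                yield c
    return " ".join(leaves(np_node))
```

On simplified parses this diffuses the single core word "Yankees" into the phrase "The Yankees", and "Korea" into "North Korea", matching the two embodiments that follow.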
In a specific embodiment, after the recall and sorting steps are executed, the final core word "Yankees" is extracted along with the subscript of its sentence; searching upward from the leaf node containing "Yankees", a node labeled NP is found whose parent node is a simple sentence node S, and the branch path from the simple sentence node S to the node containing "Yankees" is connected as the WHO element, so the result "The Yankees" is the generated WHO element text.
Similarly, in a specific embodiment, after the recall and sorting steps are executed on the English text "NEW satellite images appear to confirm that North Korea is building a tunnel to carry out secret underground nuclear tests", the subscript of the sentence containing the final core word "Korea" is extracted; searching upward from the leaf node containing "Korea", a node labeled NP is found whose parent node is a simple sentence node S, and the branch path from the simple sentence node S to the node containing "Korea" is connected as the WHO element, so the result "North Korea" is the generated WHO element text.
It should be noted that, for the second embodiment illustrated in the present figure, conventional extraction methods cannot extract a WHO element, whereas the present method extracts "North Korea" with high accuracy.
The technical effects of the application are as follows: the recall step extracts candidate words from the news text and directly determines whether WHO elements can be extracted at all.

The sorting step selects the most representative word from the candidates as the core word and directly determines the accuracy of WHO element extraction; compared with the prior art, the sorting method is redesigned and the variety of feature values is expanded to match the recall method proposed in the application.

The two steps are complementary and neither can be dispensed with; compared with conventional methods, the present method successfully and accurately extracts WHO elements from texts on which conventional methods fail.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present application. Those skilled in the art will understand that, although the application has been described in detail with reference to the above embodiments, modifications and equivalents may be made without departing from the spirit and scope of the application, which are intended to be encompassed by the claims.

Claims (8)

1. The method for extracting the English news elements based on the core word diffusion is characterized by comprising the following steps:
s1, recall step, namely screening out a subset meeting the requirement of core words from text data of network news, specifically, extracting all nouns in all noun phrases corresponding to WHO elements of the text data to obtain a core word candidate set;
s2, a sorting step, namely forming an initial candidate core word matrix from the core word candidate set of S1, calculating feature values for each candidate core word in the matrix and performing a weighted summation, sorting the candidate core words in descending order of the weighted-sum results, taking the first core word of the sorting result as the final core word according to the parameter requirement, and outputting the sorted candidate core word matrix;
and S3, an element text generation step, namely expanding the final core word of S2 into a complete element text; specifically, diffusing the core word text along the tree structure of the text data of S1, finding the leaf node corresponding to the final core word of S2 in the tree structure, searching toward the parent nodes of the leaf node for a simple sentence node, and concatenating all texts on the path from the simple sentence node to the leaf node to form the final WHO element text.
2. The method for extracting English news elements based on core word diffusion according to claim 1, further comprising, before the recall step S1: cleaning the network news of S1 and combining the article titles, leads, and article topics of the network news into the text data.
3. The method for extracting English news elements based on core word diffusion according to claim 1, wherein calculating the feature values in S2 comprises calculating, for each candidate, a position offset value, a word2vec value, and the Jaccard coefficient between the article title and the sentence containing the candidate word, and finally determining whether the candidate core word is a named entity: if the candidate word is a named entity word, the corresponding feature value is set to 1, otherwise it is set to 0.
4. The method for extracting English news elements based on core word diffusion according to claim 3, wherein the position offset value is calculated from the position of the candidate core word inside its sentence and the position of that sentence in the full text, as follows:

x = a·(L_token_length − i_token_i + 1)/L_token_length + b·(L_s_length − i_s_i + 1)/L_s_length

wherein x is the position offset value, a and b are weight coefficients, L_token_length is the total word count of the sentence containing the candidate core word, L_s_length is the total sentence count, i_token_i is the position index of the candidate core word within its sentence, and i_s_i is the subscript of the sentence containing the candidate core word.
5. The method for extracting English news elements based on core word diffusion according to claim 3, wherein a word frequency value of each candidate is obtained by counting the frequency of each candidate core word in the candidate core word set and normalizing the resulting statistics.
6. The method for extracting English news elements based on core word diffusion according to claim 3, wherein the word2vec value is obtained by loading a word2vec model to predict the word embedding sequence of the article title and that of the sentence containing the candidate core word, and then computing the cosine similarity between the two word embedding sequences.
7. The English news element extraction method based on core word diffusion according to claim 3, wherein the Jaccard coefficient is obtained by comparing the similarity and difference between the noun set contained in the article title and the set of all candidate core words in the sentence containing the candidate core word.
8. A computer readable storage medium, on which computer program instructions are stored, which computer program instructions, when executed by a processor, implement the method of any of claims 1-7.
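The sorting machinery described in claims 2 to 7 can be sketched as a weighted scoring function. The weights, the word2vec stub, and all names below are illustrative assumptions, not values from the patent; a real implementation would add the position offset and an embedding-based similarity.

```python
# Sketch of the sorting step: compute per-candidate feature values
# (normalized word frequency, Jaccard coefficient against the title,
# named-entity flag), weight-sum them, and sort descending.
from collections import Counter

def jaccard(set_a, set_b):
    """Similarity between two word sets (cf. claim 7)."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def rank_candidates(candidates, title_nouns, sentence_words, named_entities,
                    weights=(0.4, 0.3, 0.3)):
    freq = Counter(candidates)
    max_freq = max(freq.values())
    scored = []
    for word in set(candidates):
        tf = freq[word] / max_freq                        # normalized frequency (cf. claim 5)
        jac = jaccard(title_nouns, sentence_words[word])  # title overlap (cf. claim 7)
        ne = 1.0 if word in named_entities else 0.0       # named-entity flag (cf. claim 3)
        score = weights[0] * tf + weights[1] * jac + weights[2] * ne
        scored.append((word, score))
    scored.sort(key=lambda pair: -pair[1])
    return scored  # scored[0][0] is the final core word

cands = ["Korea", "satellite", "images", "tunnel", "tests", "Korea"]
result = rank_candidates(
    cands,
    title_nouns={"Korea", "tests"},
    sentence_words={w: {"Korea", "satellite", "images", "tunnel", "tests"} for w in cands},
    named_entities={"Korea"},
)
print(result[0][0])  # Korea
```

With these assumed weights, the repeated named entity "Korea" outranks the other candidates and would be passed to the element text generation step as the final core word.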
CN202311222086.8A 2023-09-21 2023-09-21 English news element extraction method based on core word diffusion Pending CN117195887A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311222086.8A CN117195887A (en) 2023-09-21 2023-09-21 English news element extraction method based on core word diffusion


Publications (1)

Publication Number Publication Date
CN117195887A true CN117195887A (en) 2023-12-08

Family

ID=89003306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311222086.8A Pending CN117195887A (en) 2023-09-21 2023-09-21 English news element extraction method based on core word diffusion

Country Status (1)

Country Link
CN (1) CN117195887A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination