CN114676346A - News event processing method and device, computer equipment and storage medium - Google Patents

News event processing method and device, computer equipment and storage medium

Info

Publication number
CN114676346A
CN114676346A (application CN202210262081.7A)
Authority
CN
China
Prior art keywords
news
news event
target
event
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210262081.7A
Other languages
Chinese (zh)
Inventor
文彬
贺德涛
章林
孙静远
冯丽琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210262081.7A
Publication of CN114676346A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9537Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a news event processing method and device, computer equipment and a storage medium. The method comprises the following steps: acquiring a target news event and extracting attributes of the target news event; taking the abstract of the target news event as the input of a pre-trained vector generation model to obtain a first vector representation of the target news event; searching a pre-configured news database based on a preset similar screening rule to determine whether a historical news event satisfying the similar screening rule exists in the news database; when such a historical news event exists, determining the vector representation of its abstract; and determining the similarity between the first vector representation and the vector representation of the abstract of the historical news event, and determining the related news events of the target news event according to the comparison result. The method improves the processing efficiency of determining similar news.

Description

News event processing method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of big data analysis, in particular to a news event processing method, a news event processing device, computer equipment and a storage medium.
Background
With the rapid development and popularization of Internet technology, more and more users obtain the latest information through the network. News channels, through which users browse and subscribe to news provided by websites, are a convenient way for users to gain a comprehensive understanding of reported events and even subscribe to follow-up reports on those events.
News generally reports recently occurred events. For technology and finance events that unfold over a period of time, such as long-running listings or litigation, a single report cannot adequately convey the background and development of the event. To report current events well, the history of the event's development must be collated. The traditional approach is to manually search, filter, and sort news about related events and to track the whole process in chronological order. This consumes considerable human resources, and the news is not published in a timely manner.
Disclosure of Invention
The application provides a news event processing method and device, computer equipment and a storage medium.
A first aspect provides a news event processing method, including:
acquiring a target news event, and extracting attributes of the target news event, wherein the attributes comprise an abstract, a named entity and a type of the target news event;
Taking the abstract of the target news event as the input of a pre-trained vector generation model to obtain a first vector representation of the target news event;
searching in a pre-configured news database based on a preset similar screening rule to determine whether a historical news event meeting the similar screening rule exists in the news database;
if so, determining a second vector representation of the summary of the historical news event;
determining the similarity of the first vector representation and the second vector representation, and determining all related news events of the target news event according to a comparison result;
and arranging the target news event and all the related news events according to the occurrence time of the news events.
In some embodiments, extracting the abstract of the target news event from the attributes of the target news event includes:
cutting the target news event into sentences to obtain a sentence list;
inputting the sentence list into an abstract extraction model to obtain the abstract of the target news event; the abstract extraction model is obtained by adding an odd-even sentence coding layer after the feed-forward layer in a Bert model, extracting a decoder from the Transformer model, and combining the encoder and the decoder.
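The sentence-cutting step can be illustrated with a minimal sketch; the function name and the punctuation set below are illustrative assumptions, not the embodiment's actual implementation, which feeds the resulting sentence list to the Bert-based abstract extraction model.

```python
import re

def cut_sentences(text: str) -> list:
    # Split on Chinese or English sentence-ending punctuation,
    # keeping the delimiter attached to the sentence it terminates.
    parts = re.split(r'(?<=[。！？.!?])\s*', text.strip())
    return [s for s in parts if s]

news = "Company A filed for an IPO. The listing was approved! Trading begins next week."
print(cut_sentences(news))
# → ['Company A filed for an IPO.', 'The listing was approved!', 'Trading begins next week.']
```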
In some embodiments, extracting the named entity of the target news event from the attributes of the target news event includes:
inputting the abstract of the target news event into a pre-configured BERT-BiLSTM-CRF model to obtain the named entities in the abstract of the target news event; the BERT-BiLSTM-CRF model comprises a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer, wherein the BERT pre-training model layer encodes each character to obtain the word vector of the corresponding character, the BiLSTM network layer bidirectionally encodes the sequence formed by the word vectors to obtain new feature vectors, and the CRF inference layer outputs the named entity with the maximum probability based on the new feature vectors.
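The CRF inference layer's "named entity with the maximum probability" is conventionally computed by Viterbi decoding over emission and transition scores. The following is a minimal standard-library sketch under assumed toy scores; in the real model the BERT and BiLSTM layers would supply the emission scores, and the tag names here are hypothetical.

```python
def viterbi_decode(emissions, transitions, tags):
    # emissions[t][k]: score of tag k at token t (from the BiLSTM in the real model).
    # transitions[i][j]: score of moving from tag i to tag j.
    k = len(tags)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, ptrs = [], []
        for j in range(k):
            best = max(range(k), key=lambda i: score[i] + transitions[i][j])
            ptrs.append(best)
            new_score.append(score[best] + transitions[best][j] + emissions[t][j])
        score = new_score
        backpointers.append(ptrs)
    # Trace back from the best final tag.
    best = max(range(k), key=lambda j: score[j])
    path = [best]
    for ptrs in reversed(backpointers):
        path.append(ptrs[path[-1]])
    return [tags[i] for i in reversed(path)]

tags = ["O", "B-ORG", "I-ORG"]
trans = [[0, 0, -10],   # O cannot be followed by I-ORG
         [0, 0, 0],
         [0, 0, 0]]
emis = [[0, 5, 0],      # token 0 looks like the start of an organization
        [0, 0, 5],      # token 1 continues it
        [5, 0, 0]]      # token 2 is outside any entity
print(viterbi_decode(emis, trans, tags))  # → ['B-ORG', 'I-ORG', 'O']
```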
In some embodiments, extracting the type of the target news event from the attributes of the target news event includes:
clustering training news events with an LDA model and labeling each cluster of training news with a category;
taking the training news events and their category labels as training data for a Bert model, and training the Bert model to obtain a type analysis model;
and inputting the abstract of the target news event into the type analysis model to obtain the type of the target news event.
In some embodiments, searching in the pre-configured news database based on the preset similar screening rule includes:
searching the news database for historical news events of the same type as the target news event;
determining a key named entity of the historical news event according to the type of the target news event;
and screening out historical news events with the same key named entities as the target news event from the historical news events with the same type as the target news event.
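The two screening rules above can be sketched as a simple filter. The dictionary keys and the use of a subset test for "same key named entities" are assumptions for illustration, not the embodiment's actual data model.

```python
def screen_candidates(target, history):
    # Rule 1: keep historical events of the same type as the target.
    same_type = [h for h in history if h["type"] == target["type"]]
    # Rule 2: keep those sharing the target's key named entities.
    keys = set(target["key_entities"])
    return [h for h in same_type if keys <= set(h["key_entities"])]

target = {"type": "listing", "key_entities": ["Company A"]}
history = [
    {"id": 1, "type": "listing",    "key_entities": ["Company A"]},
    {"id": 2, "type": "listing",    "key_entities": ["Company B"]},
    {"id": 3, "type": "litigation", "key_entities": ["Company A"]},
]
print([h["id"] for h in screen_candidates(target, history)])  # → [1]
```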
In some embodiments, determining the similarity between the first vector representation and the vector representation of the abstract of the historical news event includes:
performing a similarity search in a pre-configured vector database according to the second vector representation, and determining whether a vector representation similar to the second vector representation exists in the vector database, wherein the vector database stores the vector representations obtained when the historical news events were processed;
and determining whether a vector representation similar to the first vector representation exists in the vector database based on the cosine similarity between the first vector representation and the vector representations in the vector database.
In some embodiments, the method for training the pre-trained vector generation model into which the abstract of the target news event is input includes:
acquiring a plurality of identical training news items and a plurality of similar training news items;
parsing the basic information of each training news item and extracting its abstract;
inputting the abstracts of the identical training news into the vector generation model as positive samples and the abstracts of the similar training news as negative samples, wherein the vector generation model converts each abstract into a vector with a Bert model, passes the two output vectors through an average pooling layer, and performs a similarity calculation to obtain the similarity of the two training news items;
and training the vector generation model according to the similarity of the two training news items.
A second aspect provides a news event processing apparatus, including:
the attribute extraction unit is used for acquiring a target news event and extracting the attribute of the target news event, wherein the attribute comprises the abstract, the named entity and the type of the target news event;
The vector representation unit is used for taking the abstract of the target news event as the input of a pre-trained vector generation model to obtain a first vector representation of the target news event;
the screening unit is used for searching in a pre-configured news database based on a preset similar screening rule and determining whether a historical news event meeting the similar screening rule exists in the news database; if so, determining a second vector representation of the summary of the historical news event;
the similarity judging unit is used for determining the similarity between the first vector representation and the second vector representation and determining the related news events of the target news event according to the comparison result;
and the sequencing unit is used for arranging the target news event and the related news event according to the occurrence time of the news event.
A third aspect provides a computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the target news event processing method described above.
A fourth aspect provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the target news event processing method described above.
According to the news event processing method and device, the computer equipment and the storage medium, a target news event is first obtained and its attributes are extracted, the attributes including the abstract, the named entities and the type of the target news event; next, the abstract of the target news event is taken as the input of a pre-trained vector generation model to obtain the first vector representation of the target news event; a pre-configured news database is then searched based on a preset similar screening rule to determine whether a historical news event satisfying the rule exists in the database; if so, the vector representation of the abstract of the historical news event is determined; finally, the similarity between the first vector representation of the target news event and the vector representation of the abstract of the historical news event is determined, the related news events of the target news event are determined according to the comparison result, and the target news event and the related news events are arranged according to the occurrence time of the news events. In this way, through the sBert model, the text comparison of identical news is converted into a vector-similarity comparison problem, and by obtaining the vector representations of all historical news in advance in a pre-configured vector database, semantic-level judgment of identical news can scale to collections on the order of hundreds of millions of news items while still returning results within about 100 milliseconds (a typical value); compared with the prior art, both matching precision and matching efficiency are improved.
Drawings
FIG. 1 is a diagram of an environment in which a news event processing method provided in one embodiment may be implemented;
FIG. 2 is a flow diagram of a news event processing method in one embodiment;
FIG. 3 is a schematic diagram of a twin network model of a news event processing method in one embodiment;
fig. 4 is a block diagram showing a structure of a news event processing apparatus according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, technical terms related to the embodiments of the present invention are first explained:
A twin neural network (also known as a Siamese neural network) is a coupled framework built from two artificial neural networks. It takes two samples as input and outputs their embeddings in a high-dimensional space in order to compare the degree of similarity of the two samples. In the narrow sense, a twin neural network is formed by coupling two networks with identical structure and shared weights. In the broad sense, a pseudo-twin neural network may be formed by coupling any two neural networks. Twin neural networks usually have deep structures and may consist of convolutional neural networks, recurrent neural networks, and so on. In the supervised learning paradigm, a twin neural network maximizes the distance between representations with different labels and minimizes the distance between representations with the same label. In the self-supervised or unsupervised paradigm, it minimizes the distance between the representation of an original input and that of a perturbed input (for example, an original image and a crop of that image). Twin neural networks can perform few-shot/one-shot learning and are not easily disturbed by erroneous samples, so they are suited to pattern recognition problems with strict fault-tolerance requirements, such as face recognition, fingerprint recognition, and target tracking.
Bert (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model: it represents the bidirectional encoder representations of the Transformer. Unlike other recent language representation models, Bert pre-trains deep bidirectional representations by jointly conditioning on context in all layers. As a result, the pre-trained Bert representation can be fine-tuned with just one additional output layer to build state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modification.
Simhash is one of the commonly used text-deduplication hash algorithms, similar to md5, crc32, etc. Its principle is to map a large piece of text to a hash value of only 8 bytes by weighting the keywords extracted from the text data. Simhash does not support direct similarity computation between texts, but the generated hash values can be compared with the Hamming distance algorithm, from which the similarity between texts is derived. Because the Hamming distance is computed over simhash results rather than the original text data, the computation is very cheap, and the simhash of each text can be computed in advance as soon as the text is obtained.
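The simhash-plus-Hamming-distance scheme just described can be sketched with the standard library; the choice of md5 as the per-keyword hash and the toy weights are assumptions for illustration.

```python
import hashlib

def simhash(weighted_keywords, bits=64):
    # weighted_keywords: {keyword: weight}. Each keyword's hash votes per
    # bit with its weight; the sign of the total vote fixes the output bit.
    votes = [0] * bits
    for word, weight in weighted_keywords.items():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

a = simhash({"listing": 3, "company": 2, "approval": 1})
b = simhash({"listing": 3, "company": 2, "exchange": 1})
print(hamming(a, b))  # similar keyword sets tend to give a small distance
```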
As shown in fig. 2, in an embodiment, a method for processing a news event is provided, which specifically includes the following steps:
Step 201, obtaining a target news event, and extracting attributes of the target news event, wherein the attributes include the abstract, the named entities and the type of the target news event. The target news event may be understood as any news event occurring within a short period before the current time; this embodiment does not specifically limit the short period, which can be set by a person skilled in the art according to when the target news event to be searched occurred.
In some embodiments, extracting the title and abstract of a historical news event includes:
step 2011, the historical news events are cut into sentences to obtain a sentence list;
wherein the historical news event is segmented using [CLS] tokens.
Step 2012, inputting the sentence list into the abstract extraction model to obtain the title and abstract of the historical news event; the abstract extraction model is obtained by adding an odd-even sentence coding layer after the feed-forward layer in a Bert model, extracting a decoder from the Transformer model, and combining the encoder and the decoder.
The Bert model may be obtained directly from the program management library in which it resides. Bert (Bidirectional Encoder Representations from Transformers) is a publicly available general natural language processing framework whose internal structure comprises an embedding layer, a multi-head attention layer and a feed-forward layer: the embedding layer represents the text as a matrix, the multi-head attention layer extracts text features from that matrix, and the feed-forward layer adjusts the internal parameters of the Bert model according to the text features so as to optimize the model.
The main purpose of the odd-even sentence coding layer is to identify whether the number of words in a sentence is odd or even, so that odd and even sentences can be encoded separately: the layer splits a sentence into groups of words and traverses them to count the words, thereby completing the identification of the number of words in the sentence.
The Transformer model is an open-source natural language processing model that includes a decoder.
In some embodiments, extracting the named entity of the target news event from the attributes of the target news event comprises:
inputting the abstract of the target news event into a pre-configured BERT-BiLSTM-CRF model to obtain the named entities in the abstract of the target news event; the BERT-BiLSTM-CRF model comprises a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer, wherein the BERT pre-training model layer encodes each character to obtain the word vector of the corresponding character, the BiLSTM network layer bidirectionally encodes the sequence formed by the word vectors to obtain new feature vectors, and the CRF inference layer outputs the named entity with the maximum probability based on the new feature vectors.
The named entity recognition model constructed on the Bert model effectively addresses the difficulty and low precision of entity recognition when labeled data is insufficient and entity boundaries are fuzzy, improving the performance and recognition accuracy of the entity recognition model.
In some embodiments, extracting the type of the target news event from the attributes of the target news event comprises:
clustering the training news events with an LDA (Latent Dirichlet Allocation) model and labeling each cluster of training news with a category;
marking the training news events and the categories of the news events as training data of a Bert model, and training the Bert model to obtain a type analysis model;
and inputting the abstract of the target news event into a type analysis model to obtain the type of the target news event.
It can be understood how the Bert model training proceeds. From the news texts in each category, a number of texts equal to a preset threshold is screened out, and the screened texts are input into the Bert model as training data. The Bert model is chosen because it introduces a self-attention mechanism and represents text as feature vectors; these are word-level, general-purpose features, so shuffling the order of words in a sentence does not affect the feature vector, which lets the method cope with unbalanced training data. After this screening step, the number of news texts used as training data is balanced across news categories, and the trained Bert model achieves high precision.
Specifically, the Bert model in this embodiment is implemented on the Transformer architecture (a neural network architecture based on the self-attention mechanism). The Bert model is pre-trained on the Masked LM task and the next-sentence-prediction task; the training data is then input into the Bert model, which is fine-tuned to match the training data. The target news event to be classified is finally input into the trained Bert model for classification. Because the precision of the Bert model in this embodiment is high, the classification effect and accuracy are very good.
Step 202, taking the abstract of the target news event as the input of a pre-trained vector generation model to obtain the first vector representation of the target news event;
It can be understood that the vector generation model is a twin network. Since the twin network consists of two parallel Bert models and each input is a pair of sentences, some preprocessing is required to prepare the training data: pairs of similar sentences and pairs of identical sentences must be constructed, where "identical" in this embodiment means identical at the semantic level. During training, the two sentences are input into the two Bert models of the twin network, which share parameters; the last-layer output of each is taken and, using an average pooling strategy, the per-dimension mean over all tokens is used as the embedding vector. Let the output vector of the first sentence be u and that of the second sentence be v; the cosine similarity of u and v is used as the optimization objective, and the BERT network is fine-tuned on this task.
As shown in fig. 3, the twin network uses a Bert pre-trained model to obtain sentence vectors from the text, passes them through pooling and a fully connected layer (dense) to obtain two outputs (u, v), and computes the cosine similarity of the two outputs to obtain the final similarity probability.
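The pooling-and-compare step of fig. 3 can be sketched as follows, with toy token vectors standing in for the last-layer Bert outputs (the numbers are assumptions for illustration).

```python
import math

def mean_pool(token_vectors):
    # Average each dimension over all token outputs -> one sentence embedding.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tv[d] for tv in token_vectors) / n for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

u = mean_pool([[1.0, 0.0], [3.0, 0.0]])   # token outputs of sentence 1
v = mean_pool([[0.0, 2.0], [0.0, 4.0]])   # token outputs of sentence 2
print(u, v, cosine(u, v))  # orthogonal embeddings -> similarity 0.0
```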
Step 203, searching in a pre-configured news database based on a preset similar screening rule, and determining whether a historical news event meeting the similar screening rule exists in the news database;
It can be understood that, when the historical news events are processed, the segmented titles and abstracts and the event classification results are written into Elasticsearch. Elasticsearch is a Lucene-based search server; this step writes the historical news into the search server and establishes a database in which news can be searched by event classification. Specifically, the method comprises the following steps:
(1) extracting the abstract of the historical news event;
(2) Extracting named entities in the abstract;
(3) determining the classification of the historical news events according to the abstract;
(4) marking a unique ID for the historical news event;
(5) establishing a link relation between the historical news event and its abstract, named entities, keywords, historical news event classification and unique ID;
(6) storing the historical news events and the link relations into the historical news database.
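Steps (1)-(6) can be mirrored with a small in-memory index standing in for Elasticsearch; the class and field names are illustrative, not the embodiment's actual schema.

```python
class NewsIndex:
    # In-memory stand-in for the Elasticsearch-backed historical news database.
    def __init__(self):
        self.by_id = {}        # unique ID -> linked record
        self.by_category = {}  # classification -> event IDs

    def add(self, event_id, summary, entities, category):
        # Link the event to its abstract, named entities, classification and ID.
        self.by_id[event_id] = {
            "summary": summary,
            "entities": entities,
            "category": category,
        }
        self.by_category.setdefault(category, []).append(event_id)

    def search(self, category):
        # Retrieve event IDs by event classification.
        return self.by_category.get(category, [])

idx = NewsIndex()
idx.add("news-001", "Company A files for listing.", ["Company A"], "listing")
idx.add("news-002", "Company B sues Company C.", ["Company B", "Company C"], "litigation")
print(idx.search("listing"))  # → ['news-001']
```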
In some embodiments, searching in the pre-configured news database based on the preset similar screening rule includes:
step 2031, searching a news database for historical news events with the same type as the target news event;
step 2032, determining key named entities of the historical news events according to the types of the target news events;
A key named entity is determined according to the type of the target news event. For example, if the event is classified as a listing, the company is extracted as the secondarily determined named entity, forming "event + named entity"; if the event is classified as litigation, the two companies are extracted as key entities, forming "company A + litigation + company B". That is, the recalled historical news is filtered by keyword (named entity), and only the historical news about the same entity and the same event is retained.
Step 2033, selecting the historical news events with the same key named entities as the target news event from the historical news events with the same type as the target news event.
It can be understood that the abstract, the named entities and the type of each historical news event are all stored in the news database in advance, so the similarity search can greatly improve matching efficiency.
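The type-dependent key-entity rule in the example above (a listing keys on one company, litigation on both parties) can be sketched as follows; the rule table and the assumption that entities arrive in relevance order are hypothetical.

```python
def key_entities(event_type, entities):
    # 'listing' keys on a single company -> event + named entity.
    if event_type == "listing":
        return entities[:1]
    # 'litigation' keys on both parties -> company A + litigation + company B.
    if event_type == "litigation":
        return entities[:2]
    # Default: no narrowing rule for this event type.
    return entities

print(key_entities("listing", ["Company A", "Exchange"]))      # → ['Company A']
print(key_entities("litigation", ["Company A", "Company B"]))  # → ['Company A', 'Company B']
```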
Step 204, if the historical news events meeting the similar screening rules exist in the news database, determining vector representation of the abstract of the historical news events;
in some embodiments, determining a similarity of the first vector representation to a second vector representation of a summary of historical news events comprises:
step 2041, according to the second vector representation, performing similarity search in a preconfigured vector database, and determining whether similar vector representation information represented by the second vector exists in the vector database; the vector database stores vector representation obtained by processing the historical news events when the historical news events are processed;
It can be understood that the vector obtained by average-pooling the abstract of the historical news event with the sBert model is inserted into the Milvus vector database (Milvus supports near-real-time search: data can be retrieved while it is being inserted). Historical news event vectors are searched in a vector database such as Milvus or Faiss.
Step 2042, determining whether a similar vector representation of the first vector representation exists in the vector database according to the cosine similarity between the first vector representation and the vector representations in the vector database.
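An illustrative stand-in for the Milvus/Faiss lookup in steps 2041–2042: a brute-force cosine-similarity search over stored summary vectors. The 0.8 threshold and the 3-dimensional toy vectors are assumed values for demonstration, not parameters stated in the patent:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def find_similar(first_vec, stored_vecs, threshold=0.8):
    """Return indices of stored vectors whose cosine similarity with the
    first vector representation reaches the (assumed) threshold."""
    return [i for i, v in enumerate(stored_vecs)
            if cosine(first_vec, v) >= threshold]

first = [0.9, 0.1, 0.4]             # target's first vector representation
stored = [[0.8, 0.2, 0.5],          # points the same way -> similar
          [-0.9, 0.3, 0.1]]         # points the opposite way -> dissimilar
print(find_similar(first, stored))  # → [0]
```

In production the same comparison would be delegated to Milvus or Faiss, which index the stored vectors instead of scanning them linearly.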
Step 205, determining the similarity between the first vector representation and the second vector representation of the abstract of the historical news event, and determining the related news events of the target news event according to the comparison result; the target news event and the related news events are arranged according to the occurrence time of the news events.
As shown in fig. 4, in an embodiment, a news event processing apparatus is provided, and the news event processing apparatus may be integrated in the computer device 110, and specifically may include:
the attribute extracting unit 411 is configured to acquire a target news event and extract the attributes of the target news event, where the attributes include the abstract, named entity, and type of the target news event;
a first vector representation unit 412, configured to use the abstract of the target news event as an input of a pre-trained vector generation model, to obtain a first vector representation of the target news event;
the screening unit 413 is configured to search a preconfigured news database based on a preset similar screening rule, and determine whether a historical news event satisfying the similar screening rule exists in the news database;
a second vector representing unit 414, configured to determine a second vector representation of the summary of the historical news event if there is a historical news event satisfying the similar filtering rule in the news database;
a similarity judging unit 415, configured to determine a similarity between the first vector representation and the second vector representation, and determine a related news event of the target news event according to the comparison result; the target news event and the related news events are arranged according to the occurrence time of the news events.
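A hypothetical sketch of how the five units (411–415) above could cooperate in one pipeline. The collaborator interfaces (extractor, encoder, news_db, vector_db) and field names are assumptions for illustration, not APIs defined by the patent:

```python
class NewsEventProcessor:
    def __init__(self, extractor, encoder, news_db, vector_db):
        self.extractor = extractor    # attribute extracting unit 411
        self.encoder = encoder        # first/second vector units 412, 414
        self.news_db = news_db        # backend of screening unit 413
        self.vector_db = vector_db    # backend of similarity judging unit 415

    def related_events(self, target_news):
        # 411: extract the abstract, named entities and type
        attrs = self.extractor.extract(target_news)
        # 412: encode the abstract into the first vector representation
        first_vec = self.encoder.encode(attrs["abstract"])
        # 413: screen the news database by type and key named entities
        candidates = self.news_db.screen(attrs["type"], attrs["entities"])
        if not candidates:
            return [target_news]
        # 414/415: compare with stored summary vectors, keep similar hits
        related = self.vector_db.similar(first_vec, candidates)
        # arrange target and related events by occurrence time
        return sorted(related + [target_news],
                      key=lambda e: e["occurred_at"])
```

Each collaborator can be swapped independently, e.g. the `vector_db` backend could be Milvus in production and an in-memory scan in tests.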
In one embodiment, a computer device is provided, which may include a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring a target news event, and extracting attributes of the target news event, wherein the attributes comprise an abstract, a named entity and a type of the target news event; taking the abstract of the target news event as the input of a pre-trained vector generation model to obtain a first vector representation of the target news event; searching in a preconfigured news database based on a preset similar screening rule to determine whether a historical news event meeting the similar screening rule exists in the news database; if so, determining a second vector representation of the abstract of the historical news event; determining the similarity between the first vector representation and the second vector representation, and determining related news events of the target news event according to the comparison result; the target news event and the related news events are arranged according to the occurrence time of the news events.
In one embodiment, a storage medium is provided that stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: acquiring a target news event, and extracting the attributes of the target news event, wherein the attributes comprise an abstract, a named entity and a type of the target news event; taking the abstract of the target news event as the input of a pre-trained vector generation model to obtain a first vector representation of the target news event; searching in a preconfigured news database based on a preset similar screening rule to determine whether a historical news event meeting the similar screening rule exists in the news database; if so, determining a second vector representation of the abstract of the historical news event; determining the similarity between the first vector representation and the second vector representation, and determining related news events of the target news event according to the comparison result; the target news event and the related news events are arranged according to the occurrence time of the news events.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; nevertheless, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of the present disclosure.
The above examples express only several embodiments of the present invention, and while their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, and all such changes and modifications fall within the scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A news event processing method, the method comprising:
acquiring a target news event, and extracting attributes of the target news event, wherein the attributes comprise an abstract, a named entity and a type of the target news event;
taking the abstract of the target news event as the input of a pre-trained vector generation model to obtain a first vector representation of the target news event;
searching in a pre-configured news database based on a preset similar screening rule to determine whether a historical news event meeting the similar screening rule exists in the news database;
if so, determining a second vector representation of the abstract of the historical news event;
determining the similarity of the first vector representation and the second vector representation, and determining all related news events of the target news event according to a comparison result; and arranging the target news event and all the related news events according to the occurrence time of the news events.
2. The news event processing method of claim 1, wherein extracting the abstract of the target news event from the attributes of the target news event comprises:
cutting the target news event to obtain a sentence list;
inputting the sentence list into an abstract extraction model to obtain the abstract of the target news event; the abstract extraction model is obtained by adding a parity sentence coding layer after the feed-forward layer of a BERT model to form an encoder, extracting a decoder from the Transformer model, and combining the encoder and the decoder.
3. The news event processing method of claim 1, wherein extracting the named entity of the target news event from the attributes of the target news event comprises:
inputting the abstract of the target news event into a preconfigured BERT-BiLSTM-CRF model to obtain the named entities in the abstract of the target news event; wherein the BERT-BiLSTM-CRF model comprises: a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer; the BERT pre-training model layer is used for encoding each character to obtain the word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding the sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer is used for outputting the named entity with the maximum probability based on the new feature vectors.
4. The news event processing method of claim 2, wherein extracting the type of the target news event from the attributes of the target news event comprises:
clustering training news events by using an LDA model, and carrying out category labeling on each cluster of training news;
taking the training news events and their category labels as training data for a BERT model, and training the BERT model to obtain a type analysis model;
and inputting the abstract of the target news event into the type analysis model to obtain the type of the target news event.
5. The news event processing method of claim 1, wherein the searching in the preconfigured news database based on the preset similar filtering rule comprises:
searching the news database for historical news events of the same type as the target news event;
determining a key named entity of the historical news event according to the type of the target news event;
and screening out historical news events with the same key named entities as the target news event from the historical news events with the same type as the target news event.
6. The news event processing method of claim 1, wherein determining the similarity of the first vector representation to the vector representation of the abstract of the historical news event comprises:
performing a similarity search in a preconfigured vector database according to the second vector representation, and determining whether a vector representation similar to the second vector representation exists in the vector database; the vector database stores vector representations obtained by processing the historical news events;
determining whether a similar vector representation of the first vector representation exists in the vector database according to cosine similarity between the first vector representation and vector representations in the vector database.
7. The news event processing method of claim 1, wherein the abstract of the target news event is used as an input of a pre-trained vector generation model, and the training method of the vector generation model comprises:
acquiring a plurality of identical training news items and a plurality of similar training news items;
analyzing the basic information of the training news, and extracting the abstracts of the training news;
taking the abstracts of the identical training news as positive samples and the abstracts of the similar training news as negative samples for the vector generation model; the vector generation model converts each abstract into a vector by using a BERT model with an average pooling layer, and similarity calculation is performed on the two output vectors to obtain the similarity of the two training news items;
and training the vector generation model according to the similarity of the two training news items.
8. A news event processing apparatus, comprising:
the attribute extraction unit is used for acquiring a target news event and extracting the attribute of the target news event, wherein the attribute comprises the abstract, the named entity and the type of the target news event;
the first vector representation unit is used for taking the abstract of the target news event as the input of a pre-trained vector generation model to obtain the first vector representation of the target news event;
the screening unit is used for searching in a pre-configured news database based on a preset similar screening rule and determining whether a historical news event meeting the similar screening rule exists in the news database;
a second vector representation unit, configured to determine a second vector representation of the abstract of the historical news event when a historical news event satisfying the similar screening rule exists in the news database;
the similarity judging unit is used for determining the similarity between the first vector representation and the vector representation of the abstract of the historical news event and determining the related news event of the target news event according to the comparison result; and arranging the target news event and the related news event according to the occurrence time of the news event.
9. A computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by the processor, cause the processor to carry out the steps of the news event processing method as claimed in any one of claims 1 to 7.
10. A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the news event processing method as claimed in any one of claims 1 to 7.
CN202210262081.7A 2022-03-17 2022-03-17 News event processing method and device, computer equipment and storage medium Pending CN114676346A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210262081.7A CN114676346A (en) 2022-03-17 2022-03-17 News event processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210262081.7A CN114676346A (en) 2022-03-17 2022-03-17 News event processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114676346A true CN114676346A (en) 2022-06-28

Family

ID=82074912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210262081.7A Pending CN114676346A (en) 2022-03-17 2022-03-17 News event processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114676346A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544214A (en) * 2022-12-02 2022-12-30 广州数说故事信息科技有限公司 Event processing method and device and computer readable storage medium
CN117272995A (en) * 2023-11-21 2023-12-22 长威信息科技发展股份有限公司 Repeated work order recommendation method and device
CN117272995B (en) * 2023-11-21 2024-01-30 长威信息科技发展股份有限公司 Repeated work order recommendation method and device

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN112084337B (en) Training method of text classification model, text classification method and equipment
US20200279105A1 (en) Deep learning engine and methods for content and context aware data classification
CN111386524B (en) Facilitating domain and client specific application program interface recommendations
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN114386421A (en) Similar news detection method and device, computer equipment and storage medium
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN114661872A (en) Beginner-oriented API self-adaptive recommendation method and system
CN111344695A (en) Facilitating domain and client specific application program interface recommendations
Zhang et al. Combining the attention network and semantic representation for Chinese verb metaphor identification
CN115934936A (en) Intelligent traffic text analysis method based on natural language processing
CN115827871A (en) Internet enterprise classification method, device and system
CN112270189B (en) Question type analysis node generation method, system and storage medium
CN114610882A (en) Abnormal equipment code detection method and system based on electric power short text classification
Zhang et al. A dynamic window split-based approach for extracting professional terms from Chinese courses
CN117453921B (en) Data information label processing method of large language model
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
Shahade et al. Deep learning approach-based hybrid fine-tuned Smith algorithm with Adam optimiser for multilingual opinion mining
Xiang et al. Context-Aware Text Matching Algorithm for Korean Peninsula Language Knowledge Base Based on Density Clustering
Liamthong et al. Text Representations of Math Tutorial Videos for Clustering, Retrieval, and Learning Gain Prediction
Liu Natural language processing and text mining with graph-structured representations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination