CN117910479A

CN117910479A - Method, device, equipment and medium for judging aggregated news

Info

Publication number: CN117910479A
Application number: CN202410308816.4A
Authority: CN
Inventors: 罗佳
Original assignee: Hunan Eefung Software Co ltd
Current assignee: Hunan Eefung Software Co ltd
Priority date: 2024-03-19
Filing date: 2024-03-19
Publication date: 2024-04-19
Anticipated expiration: 2044-03-19
Also published as: CN117910479B

Abstract

The invention belongs to the technical field of computer data processing and relates to an aggregated news judging method, an aggregated news judging device, computer equipment and a medium. The method comprises the following steps: a keyword extraction step S1: screening the important keywords of an article; a volume calculation step S2: vectorizing the text keywords using a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space; and an aggregated news judging step S3: classifying articles into aggregated news and non-aggregated news using the volume as an index. The method, device, computer equipment and medium can quickly identify whether a target article is aggregated news, and offer high reliability and high calculation speed.

Description

Method, device, equipment and medium for judging aggregated news

Technical Field

The invention relates to the technical field of computer data processing, in particular to an aggregated news judging method and device based on a semantic correlation matrix space, computer equipment and a computer readable storage medium.

Background

Aggregated news refers to news content, stories, articles, or information from multiple different sources integrated into a single article or page, enabling users to browse news stories from multiple sources at once. The items may come from different news websites, media institutions, blogs, social media, or other sources of information, and their subject matter is complex and diverse, possibly spanning industries and fields without any fixed pattern. Aggregated news may have negative effects on natural language processing analysis, mainly including:

Confusion caused by information diversity: when analyzing content of a particular domain or a single topic, the diverse content of aggregated news fragments the information and introduces material irrelevant to the topic or event under analysis, leading to errors in the results.

Information repetition and redundancy: aggregated news may contain a large amount of duplicate or redundant information, especially when multiple sources cover the same topic or event.

Uneven information quality: aggregated news draws on multiple sources, so the quality of its information may be uneven. Some sources may lack reliability or convey inaccurate information, which may mislead a natural language processing system.

In general natural language processing, and in topic extraction or event analysis based on document content, the presence of aggregated news seriously interferes with the analysis. To improve the quality of the analyzed data, aggregated news needs to be identified and filtered out. It is therefore necessary to develop a method for judging whether a news article is aggregated news.

Disclosure of Invention

In view of the above, the invention provides an aggregated news judging method, an aggregated news judging device, computer equipment and a computer readable storage medium based on a semantic correlation matrix space, which can rapidly identify whether a target article is aggregated news and offer high reliability and high calculation speed.

The technical scheme of the invention is as follows:

In a first aspect, the present invention provides a method for determining aggregated news, including the steps of:

Keyword extraction step S1: screening the important keywords of an article;

Volume calculation step S2: vectorizing the text keywords using a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;

Aggregated news judging step S3: classifying articles into aggregated news and non-aggregated news using the volume as an index.

In a second aspect, the present invention further provides a device for determining aggregated news, including:

Keyword extraction module: used for screening the important keywords of an article;

Volume calculation module: used for vectorizing the text keywords using a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;

Aggregated news judging module: used for classifying articles into aggregated news and non-aggregated news using the volume as an index.

In a third aspect, the present invention also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when the processor executes the computer program.

In a fourth aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the aggregated news judgment method described above.

Compared with the prior art, the method, the device, the computer equipment and the computer readable storage medium for judging the aggregated news have the following beneficial effects:

1. After model training is completed, the subsequent judging process is fully automatic: whether an article is aggregated news can be judged automatically simply by inputting the article data.

2. The algorithms used by the invention all operate on vectors and matrices, and semantic judgment is made by spatial reasoning, giving high calculation speed, high judgment efficiency and high accuracy.

3. When performing natural language processing analysis on massive data, the invention can quickly judge and filter aggregated news. The calculation process is independent of external data, the running environment and infrastructure, so invalid data can be filtered out during real-time text processing and analysis, significantly improving both the processing speed of text analysis and the accuracy of the results.

The preferred embodiments of the present invention and their advantageous effects will be described in further detail with reference to specific embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and constitute a part of this specification; together with the description, they serve to explain the invention. In the drawings:

FIG. 1 is a schematic diagram of the content of a model to be trained in accordance with the present invention;

FIG. 2 is a schematic diagram of an overall flow for aggregated news judgment according to the present invention;

FIG. 3 is a schematic diagram of the shape enclosed in space by three-dimensional vectors.

Detailed Description

The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.

The aggregated news judging method based on the semantic correlation matrix space provided by the embodiment of the application can be applied to computer equipment such as terminals, servers and the like. The terminal may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, which may be head-mounted devices, etc.; the server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.

Referring to fig. 1 and 2, the invention provides a method for determining aggregated news based on a semantic correlation matrix space, comprising the following steps:

Keyword extraction step S1: screening important keywords of the articles;

The keyword extraction step S1 includes:

Substep S11: an IDF model is trained on a document set, such as an existing collection of news articles, web text data, or a collection of academic papers.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method for measuring the importance of words in documents.

Let T denote the total number of documents in the document set and $n_t$ denote the number of documents containing the term t. The inverse document frequency IDF of the term t can be expressed as: $\mathrm{IDF}(t,T)=\log\frac{T}{n_t}$.
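As an illustration of substep S11, the following minimal Python sketch computes an IDF table from a small, already segmented corpus. It assumes the unsmoothed form IDF(t, T) = log(T / n_t), where n_t is the number of documents containing t; the function name `train_idf` and the toy corpus are hypothetical and are not part of the disclosure.

```python
import math
from collections import Counter


def train_idf(documents):
    """Return {term: IDF} for a corpus given as lists of tokens.

    Assumes the unsmoothed form IDF(t, T) = log(T / n_t).
    """
    total_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document, however often it occurs.
        doc_freq.update(set(tokens))
    return {term: math.log(total_docs / n) for term, n in doc_freq.items()}


# Hypothetical toy corpus, already segmented into words.
corpus = [
    ["market", "stocks", "rally"],
    ["market", "football", "final"],
    ["market", "election", "vote"],
]
idf = train_idf(corpus)
```

A term that appears in every document, such as "market" here, receives an IDF of zero and therefore cannot become a keyword, while terms confined to a single document receive the highest weight.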

Substep S12: the news article to be judged is segmented into words, the TF-IDF value of each word is calculated, and the top N words by TF-IDF value are selected as the keywords of the news article.

Let t denote a word, d denote a document, $f_{t,d}$ denote the number of occurrences of word t in document d, and $N_d$ denote the total number of words in document d. The term frequency TF of word t in document d is expressed as:

$\mathrm{TF}(t,d)=\frac{f_{t,d}}{N_d}$;

Combining the TF of the word t in the document d with its IDF over the whole document set gives the TF-IDF value of the word t in the document d:

TF-IDF(t,d,T)=TF(t,d)×IDF(t,T);

The above processing yields the keyword list of the news article. These keywords have high TF-IDF values, reflecting their importance and distinctiveness in the current news text.
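The keyword selection of substep S12 can be sketched as follows. This is a minimal illustration, assuming the article has already been segmented into words and that an IDF table is available from substep S11; the function name `top_n_keywords`, the handling of words missing from the IDF table, and the toy values are assumptions of the sketch.

```python
from collections import Counter


def top_n_keywords(tokens, idf, n):
    """Rank the terms of one segmented article by TF-IDF and keep the top n."""
    counts = Counter(tokens)
    total = len(tokens)
    scores = {
        # TF(t, d) = occurrences of t in d / total words in d.
        # Terms missing from the IDF table score 0 (an assumption of this sketch).
        term: (count / total) * idf.get(term, 0.0)
        for term, count in counts.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical IDF table, as would be produced in substep S11.
idf = {"the": 0.0, "market": 0.4, "rally": 1.1, "stocks": 1.1}
article = ["the", "stocks", "rally", "the", "market", "stocks"]
keywords = top_n_keywords(article, idf, n=2)
```

In this toy case "stocks" ranks first because it is both frequent in the article and rare in the corpus, whereas "the" is frequent but carries an IDF of zero.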

The volume calculation step S2 includes:

Substep S21: a Word2Vec model is trained on a document set, such as an existing collection of news articles, web text data, or a collection of academic papers, so as to learn the semantic relations among words.

Let the above document set be D and the vocabulary be V, where each document $d_i \in D$ contains a keyword set $K_i$. The training goal of the Word2Vec model is to learn a mapping function $f: V \to \mathbb{R}^{q}$ that maps each word in the vocabulary to a q-dimensional vector representation.

Substep S22: each keyword of the news article obtained in the keyword extraction step S1 is mapped to a high-dimensional vector by the Word2Vec model, and the vectors of all keywords of the article are combined into a multidimensional matrix A.

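For a document $d_i$, each keyword $k_j$ ($j = 1, \dots, p$) is mapped to its vector representation $v_j$ through the Word2Vec model. The vectors of all keywords are then combined into a matrix A, i.e.: $A = [v_1, v_2, \dots, v_p]^{\mathsf{T}} \in \mathbb{R}^{p \times q}$;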

Where p is the total number of keywords and q is the word vector dimension.
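The construction of matrix A in substep S22 can be sketched in Python as follows. A small hypothetical lookup table stands in for a trained Word2Vec model, and the sketch assumes that each keyword vector forms one row of A, so that A has p rows and q columns; the function name `build_matrix` and the rule for skipping unknown words are likewise assumptions.

```python
def build_matrix(keywords, embeddings):
    """Stack the vectors of the given keywords into a p x q matrix (list of rows)."""
    # Keywords absent from the embedding table are skipped (an assumption of
    # this sketch; the source does not say how unknown words are handled).
    return [list(embeddings[word]) for word in keywords if word in embeddings]


# Hypothetical 4-dimensional embeddings standing in for a trained Word2Vec model.
embeddings = {
    "stocks": [0.9, 0.1, 0.0, 0.2],
    "rally": [0.8, 0.2, 0.1, 0.1],
    "football": [0.0, 0.9, 0.3, 0.1],
}
A = build_matrix(["stocks", "football", "unknown"], embeddings)
p, q = len(A), len(A[0])  # p keywords, q vector dimensions
```

In practice q is typically in the hundreds while p is the top-N keyword count, so A is a wide matrix with far more columns than rows.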

Substep S23: matrix A is multiplied by its transpose $A^{\mathsf{T}}$; this orthogonal projection reduces the dimension of matrix A and converts it into a square matrix B.

Performing the orthogonal projection operation on matrix A yields the dimension-reduced square matrix B, namely: $B = A A^{\mathsf{T}} \in \mathbb{R}^{p \times p}$.

Substep S24: the determinant of matrix B is calculated to obtain the volume V of the matrix.

The determinant of matrix B is calculated, i.e. $V = \det(B)$. The value of this determinant represents the volume of matrix B. See FIG. 3, which uses a simplified diagram to show the volume enclosed in space by three-dimensional vectors.

In this way, the keywords in the news article are vectorized using the Word2Vec model, dimension conversion and dimension reduction are performed through matrix operations, and finally the volume V of the dimension-reduced matrix is calculated.
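Substeps S23 and S24 can be sketched in pure Python as follows. The sketch assumes the keyword vectors are stored as the rows of A, so that B = A·Aᵀ is a p × p matrix, and it takes V = det(B) as the volume index. The helper names, the singularity tolerance and the toy vectors are hypothetical choices for illustration only.

```python
def gram_matrix(A):
    """Return B = A * A^T for a p x q matrix A given as a list of rows."""
    return [
        [sum(x * y for x, y in zip(row_i, row_j)) for row_j in A]
        for row_i in A
    ]


def determinant(M):
    """Determinant of a square matrix by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in M]  # work on a copy
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:  # tolerance is an arbitrary choice
            return 0.0  # singular: the vectors are linearly dependent
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def matrix_volume(A):
    """Volume index V = det(A * A^T) of the keyword matrix A."""
    return determinant(gram_matrix(A))


# Two nearly parallel vectors versus two orthogonal ones (toy data).
focused = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
scattered = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

With these toy inputs, `matrix_volume(scattered)` is 1 while `matrix_volume(focused)` is about 0.01, so semantically close keywords yield a much smaller volume than unrelated ones.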

The aggregate news judging step S3 includes:

Substep S31: a batch of articles is randomly selected and manually labeled as aggregated news or not; the matrix volume of each article is calculated through the keyword extraction step S1 and the volume calculation step S2, and the resulting data is used as the training set of the algorithm;

Substep S32: on the basis of the training set generated in substep S31, candidate dividing thresholds $M \in [M_1, M_2, M_3, \dots, M_{10}]$ on the matrix volume are selected. For each candidate M, articles in the training set whose matrix volume is greater than M are judged to be aggregated news and the rest are judged not to be; the results are compared with the manual labels, and the precision, recall and F1 value corresponding to M are calculated. Iterating over the candidates gives $F1 \in [F1_1, F1_2, F1_3, \dots, F1_{10}]$, and the candidate M with the largest F1 value is selected as the unique threshold m;

Substep S33: for other articles to be inferred, the keyword extraction step S1 and the volume calculation step S2 are performed, and the judgment is made against the unique threshold m: if the matrix volume calculated for an article is greater than m, the article is judged to be aggregated news; otherwise, it is judged to be non-aggregated news.
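The threshold search of substep S32 and the judgment of substep S33 can be sketched as follows. This is a minimal illustration, assuming the matrix volumes of the labeled articles have already been computed; the function names, the evenly spaced candidate thresholds and the toy data are hypothetical, and ties between candidates are resolved in favor of the first one, which the source does not specify.

```python
def f1_score(labels, predictions):
    """F1 for the positive class (True = aggregated news)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_threshold(volumes, labels, candidates):
    """Return the candidate threshold with the highest F1 on the training set."""
    best_m, best_f1 = None, -1.0
    for m in candidates:
        predictions = [v > m for v in volumes]
        score = f1_score(labels, predictions)
        if score > best_f1:  # on ties, the first candidate wins
            best_m, best_f1 = m, score
    return best_m, best_f1


def is_aggregated(volume, threshold):
    """Substep S33: an article is aggregated news if its volume exceeds m."""
    return volume > threshold


# Hypothetical labeled training data: matrix volumes and manual labels.
volumes = [0.01, 0.02, 0.05, 0.40, 0.55, 0.90]
labels = [False, False, False, True, True, True]
candidates = [0.1 * k for k in range(1, 11)]  # ten candidate thresholds
m, best_f1 = select_threshold(volumes, labels, candidates)
```

Once m is fixed, judging a new article reduces to one volume computation and one comparison, for example `is_aggregated(0.7, m)`.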

Description of principle:

TF-IDF measures the general importance of words in a document, highlighting the subject matter and content of the document. Word2Vec rests on the distributional hypothesis, under which words that occur in similar contexts are considered to have similar semantics, so it can generate a dense vector representation of each word. Based on TF-IDF and Word2Vec, a document can therefore be transformed into a word matrix that expresses its semantics.

Because aggregated news generally contains descriptions of breaking events across various industries and fields, its content is usually semantically scattered and fragmented, and its word vectors are likewise scattered in direction and length in the geometric space. The volume of the polyhedron spanned by the word vectors in space is used to evaluate how dispersed the semantics of the word vectors in the document are: the more dispersed the multidimensional vectors, the larger the polyhedral volume. The volume of the polyhedron spanned by the matrix can be calculated quickly using the matrix determinant, so the volume index can serve as an important feature for evaluating whether a document is aggregated news.
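To make the link between the determinant and dispersion concrete, consider the simplest case of two keyword vectors $v_1$ and $v_2$ separated by an angle $\theta$. The derivation below assumes the two vectors form the rows of A; it is a standard identity for the determinant of a Gram matrix, given here only as an illustration.

```latex
B = A A^{\mathsf{T}}
  = \begin{pmatrix}
      v_1 \cdot v_1 & v_1 \cdot v_2 \\
      v_2 \cdot v_1 & v_2 \cdot v_2
    \end{pmatrix},
\qquad
\det(B) = \lVert v_1 \rVert^2 \lVert v_2 \rVert^2 - (v_1 \cdot v_2)^2
        = \lVert v_1 \rVert^2 \lVert v_2 \rVert^2 \sin^2\theta .
```

The determinant is zero when the two vectors are parallel, as with closely related keywords, and largest when they are orthogonal; it also grows with the vector lengths. In this two-vector case it equals the square of the area of the parallelogram spanned by $v_1$ and $v_2$, so it increases monotonically with that area.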

It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of the steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides an aggregated news judging device based on the semantic correlation matrix space, which comprises modules for implementing the steps of the corresponding embodiment of the aggregated news judging method based on the semantic correlation matrix space.

The aggregated news judging device based on the semantic correlation matrix space comprises:

The keyword extraction module comprises:

IDF training unit: used for training an IDF model on a document set, such as an existing collection of news articles, web text data, or a collection of academic papers.

Word segmentation calculation unit: used for segmenting the news article to be judged into words, calculating the TF-IDF value of each word, and selecting the top N words by TF-IDF value as the keywords of the news article.

Let t denote a word, d denote a document, $f_{t,d}$ denote the number of occurrences of word t in document d, and $N_d$ denote the total number of words in document d. The term frequency TF of word t in document d is expressed as: $\mathrm{TF}(t,d)=\frac{f_{t,d}}{N_d}$;

Combining the TF of the word t in the document d with its IDF over the whole document set gives the TF-IDF value of the word t in the document d: TF-IDF(t,d,T)=TF(t,d)×IDF(t,T);

The volume calculation module includes:

Word2Vec training unit: used for training a Word2Vec model on a document set, such as an existing collection of news articles, web text data, or a collection of academic papers, so as to learn the semantic relations among words.

Mapping unit: used for mapping each keyword of the news article obtained by the keyword extraction module to a high-dimensional vector through the Word2Vec model, and combining the vectors of all keywords of the article into a multidimensional matrix A.

For a document $d_i$, each keyword $k_j$ ($j = 1, \dots, p$) is mapped to its vector representation $v_j$ through the Word2Vec model. The vectors of all keywords are then combined into a matrix A, i.e.:

$A = [v_1, v_2, \dots, v_p]^{\mathsf{T}} \in \mathbb{R}^{p \times q}$;

Where p is the total number of keywords and q is the word vector dimension.

Square matrix conversion unit: used for multiplying matrix A by its transpose $A^{\mathsf{T}}$; this orthogonal projection reduces the dimension of matrix A and converts it into a square matrix B.

Volume calculation unit: used for calculating the determinant of matrix B to obtain the volume V of the matrix.

In this way, the keywords in the news article are vectorized using the Word2Vec model, dimension conversion and dimension reduction are performed through matrix operations, and finally the volume V of the dimension-reduced matrix is calculated.

The aggregate news judging module comprises:

Labeling unit: used for randomly selecting a batch of articles, manually labeling whether each is aggregated news, calculating the matrix volume of each article through the keyword extraction step S1 and the volume calculation step S2, and using the resulting data as the training set of the algorithm;

Threshold dividing unit: used for selecting, on the basis of the training set generated by the labeling unit, candidate dividing thresholds $M \in [M_1, M_2, M_3, \dots, M_{10}]$ on the matrix volume. For each candidate M, articles in the training set whose matrix volume is greater than M are judged to be aggregated news and the rest are judged not to be; the results are compared with the manual labels, and the precision, recall and F1 value corresponding to M are calculated. Iterating over the candidates gives $F1 \in [F1_1, F1_2, F1_3, \dots, F1_{10}]$, and the candidate M with the largest F1 value is selected as the unique threshold m;

Judging unit: used for processing other articles to be inferred with the keyword extraction module and the volume calculation module, and judging them against the unique threshold m: if the matrix volume calculated for an article is greater than m, the article is judged to be aggregated news; otherwise, it is judged to be non-aggregated news.

It should be understood that each module of the aggregated news determining device based on the semantic correlation matrix space is configured to execute each step in the embodiment of the corresponding method, and each step in the embodiment of the corresponding method has been explained in detail in the foregoing embodiment, and specific reference is made to the related description in the embodiment of the corresponding method, which is not repeated herein.

Based on the same inventive concept, the embodiment of the application also provides a computer device for realizing the above-mentioned aggregated news judging method based on the semantic correlation matrix space. The implementation scheme of the solution to the problem provided by the computer device is similar to the implementation scheme described in the above method, so the specific limitation in the embodiments of the computer device provided below may be referred to the limitation of the aggregated news judging method based on the semantic correlation matrix space hereinabove, and will not be described herein.

In one embodiment, a computer device, which may be a terminal, is provided that includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by a processor, implements a method for aggregated news judgment based on a semantic correlation matrix space. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the aggregated news judging method based on the semantic correlation matrix space according to the above embodiment when executing the computer program.

In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the aggregated news judgment method based on semantic correlation matrix space as described in the above embodiments.

In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the aggregated news judgment method based on semantic correlation matrix space as described in the above embodiments.

The aggregated news judging method, the aggregated news judging device, the aggregated news judging computer equipment, the aggregated news judging computer readable storage medium and the aggregated news judging computer program product based on the semantic correlation matrix space have the following beneficial effects:

4. The invention converts text into a matrix representation based on a semantic model, so that subsequent computation operates on matrices rather than on the raw text, which broadens the ways text can be processed and increases the processing speed.

5. The invention expresses the degree of dispersion of text content by the volume of the word vectors in the matrix space and uses it as the criterion for judging aggregated news, thereby greatly improving the accuracy of data processing.

6. The index adopted by the invention is derived from earlier manual analysis and summarization of a large amount of text content, and provides empirical guidance for the whole process.

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can take various forms such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this description.

The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims

1. The method for judging the aggregated news is characterized by comprising the following steps of:

Keyword extraction step S1: screening important keywords of the articles;

2. The method according to claim 1, wherein the keyword extraction step S1 includes:

Substep S11: training an IDF model based on the existing document set;

3. The aggregated news judging method according to claim 1, wherein the volume calculation step S2 includes:

Substep S21: training a Word2Vec model based on the existing document set so as to learn the semantic relations among words;

Substep S22: mapping each keyword of the news article obtained in the keyword extraction step S1 to a high-dimensional vector through the Word2Vec model, and combining the vectors of all keywords of the article into a multidimensional matrix A;

Substep S23: multiplying matrix A by its transpose $A^{\mathsf{T}}$, and performing orthogonal projection to reduce the dimension of matrix A and convert it into a square matrix B;

4. The aggregated news judging method according to claim 1, wherein the aggregated news judging step S3 includes:

5. A device for determining aggregated news, comprising:

6. The device for determining aggregated news according to claim 5, wherein the keyword extraction module comprises:

IDF training unit: used for training an IDF model based on the existing document set;

7. The device for determining aggregated news according to claim 5, wherein the volume calculation module comprises:

Word2Vec training unit: used for training a Word2Vec model based on the existing document set so as to learn the semantic relations among words;

Mapping unit: used for mapping each keyword of the news article obtained by the keyword extraction module to a high-dimensional vector through the Word2Vec model, and combining the vectors of all keywords of the article into a multidimensional matrix A;

Square matrix conversion unit: used for multiplying matrix A by its transpose $A^{\mathsf{T}}$, and performing orthogonal projection to reduce the dimension of matrix A and convert it into a square matrix B;

8. The device for determining aggregated news according to claim 5, wherein the aggregated news judging module includes:

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the aggregated news judgment method according to any one of claims 1 to 4 when the computer program is executed.

10. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the aggregated news judgment method according to any one of claims 1 to 4.