CN117910479A - Method, device, equipment and medium for judging aggregated news - Google Patents

Method, device, equipment and medium for judging aggregated news

Info

Publication number
CN117910479A
Authority
CN
China
Prior art keywords
news
articles
volume
judging
matrix
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410308816.4A
Other languages
Chinese (zh)
Other versions
CN117910479B (en)
Inventor
罗佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Eefung Software Co ltd
Original Assignee
Hunan Eefung Software Co ltd
Application filed by Hunan Eefung Software Co ltd
Priority to CN202410308816.4A
Publication of CN117910479A
Application granted
Publication of CN117910479B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Optimization (AREA)
  • Computational Linguistics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of computer data processing and relates to an aggregated news judging method, an aggregated news judging device, computer equipment and a medium. The method comprises the following steps: a keyword extraction step S1: screening the important keywords of an article; a volume calculation step S2: vectorizing the keywords with a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space; and an aggregated news judging step S3: classifying articles into aggregated news and non-aggregated news using the volume as the index. The method, device, computer equipment and medium can quickly identify whether a target article is aggregated news, and offer high reliability and high calculation speed.

Description

Method, device, equipment and medium for judging aggregated news
Technical Field
The invention relates to the technical field of computer data processing, in particular to an aggregated news judging method and device based on a semantic correlation matrix space, computer equipment and a computer readable storage medium.
Background
Aggregated news refers to news content, stories, articles or information from multiple different sources integrated into a single article or page, so that users can browse news stories from multiple sources at once. The items may come from different news websites, media institutions, blogs, social media or other information sources, and their subject matter is complex and diverse, possibly spanning industries and fields without any fixed pattern. Aggregated news can negatively affect natural language processing analysis, mainly in the following ways:
Information diversity leads to confusion: when analyzing content of a particular domain or a single topic, the content diversity of aggregated news makes the information overly fragmented and unrelated to the topic or event being analyzed, which causes errors in the results.
Information repetition and redundancy: aggregated news may contain a large amount of duplicate or redundant information, especially when multiple sources cover the same topic or event.
Uneven information quality: aggregated news draws on multiple sources, so information quality may vary. Some sources may be unreliable or convey inaccurate information, which may mislead a natural language processing system.
In general natural language processing, and in topic extraction or event analysis based on document content, the presence of aggregated news seriously interferes with the analysis. To improve the quality of the analyzed data, aggregated news needs to be identified and filtered out. It is therefore necessary to develop a method for determining whether a news article is aggregated news.
Disclosure of Invention
In view of the above, the invention provides an aggregated news judging method, an aggregated news judging device, computer equipment and a computer readable storage medium based on a semantic correlation matrix space, which can rapidly identify whether a target article is aggregated news and offer high reliability and high calculation speed.
The technical scheme of the invention is as follows:
In a first aspect, the present invention provides a method for determining aggregated news, including the steps of:
Keyword extraction step S1: screening the important keywords of an article;
Volume calculation step S2: vectorizing the keywords with a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
Aggregated news judging step S3: classifying articles into aggregated news and non-aggregated news using the volume as the index.
In a second aspect, the present invention further provides a device for determining aggregated news, including:
Keyword extraction module: used for screening the important keywords of an article;
Volume calculation module: used for vectorizing the keywords with a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
Aggregated news judging module: used for classifying articles into aggregated news and non-aggregated news using the volume as the index.
In a third aspect, the present invention also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when the processor executes the computer program.
In a fourth aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the aggregated news judgment method described above.
Compared with the prior art, the method, the device, the computer equipment and the computer readable storage medium for judging the aggregated news have the following beneficial effects:
1. After model training is completed, the subsequent judging process is fully automatic: whether an article is aggregated news is judged automatically once the article data is input.
2. The algorithm used by the invention operates entirely on vectors and matrices and makes semantic judgments geometrically, so that calculation is fast and judgment is efficient and accurate.
3. In natural language processing analysis of massive data, the invention can rapidly judge and filter aggregated news. The calculation is independent of external data, the running environment and infrastructure, and invalid data can be filtered during real-time text processing and analysis, thereby remarkably improving the processing speed of text analysis and the accuracy of the results.
The preferred embodiments of the present invention and their advantageous effects will be described in further detail with reference to specific embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and constitute a part of this specification; together with the description, they serve to explain the invention. In the drawings:
FIG. 1 is a schematic diagram of the models to be trained in the present invention;
FIG. 2 is a schematic diagram of the overall flow of aggregated news judgment according to the present invention;
FIG. 3 is a schematic diagram of the shape enclosed in space by three-dimensional vectors.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
The aggregated news judging method based on the semantic correlation matrix space provided by the embodiment of the application can be applied to computer equipment such as terminals, servers and the like. The terminal may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, which may be head-mounted devices, etc.; the server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
Referring to fig. 1 and 2, the invention provides a method for determining aggregated news based on a semantic correlation matrix space, comprising the following steps:
Keyword extraction step S1: screening the important keywords of an article;
Volume calculation step S2: vectorizing the keywords with a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
Aggregated news judging step S3: classifying articles into aggregated news and non-aggregated news using the volume as the index.
The keyword extraction step S1 includes:
substep S11: an IDF model is trained based on a set of documents, such as an existing news article set, web text data, academic discourse set, and the like.
TF-IDF (term frequency-Inverse Document Frequency) is a statistical method for measuring the importance of words in documents.
Let T denote the total number of all documents in the document collection,Representing the number of documents containing the term t, the inverse document frequency IDF of the term t can be expressed as: /(I)
Substep S12: dividing the news article to be judged into words, calculating TF-IDF value, and selecting TF-IDF value topN word as the key word of the news.
Let t denote a word, d denote a document,Representing the number of occurrences of word t in document d,/>Representing the total number of words in document d, the word frequency TF of word t in document d is expressed as:
Combining the TF of the word t in the document d with the IDF in the whole document set, the TF-IDF value of the word t in the document d can be obtained:
TF-IDF(t,d,T)=TF(t,d)×IDF(t,T);
and obtaining a keyword list corresponding to the news articles through the processing. These keywords have a high TF-IDF value reflecting their importance and uniqueness in the current news text.
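The keyword extraction step S1 can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: it assumes documents are already segmented into word lists, uses the unsmoothed logarithmic IDF, and the names `train_idf`, `top_n_keywords` and `default_idf` are hypothetical.

```python
import math
from collections import Counter

def train_idf(corpus):
    """Substep S11 (sketch): corpus is a list of tokenized documents.

    Returns a dict mapping each term to log(T / n_t), where T is the number
    of documents and n_t the number of documents containing the term.
    """
    total_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(total_docs / n) for term, n in doc_freq.items()}

def top_n_keywords(tokens, idf, n=5, default_idf=0.0):
    """Substep S12 (sketch): rank the words of one article by TF-IDF.

    default_idf is an assumed fallback for words unseen during IDF training.
    """
    counts = Counter(tokens)
    total_words = len(tokens)
    scores = {
        term: (count / total_words) * idf.get(term, default_idf)
        for term, count in counts.items()
    }
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:n]
```

A word that appears in every training document receives an IDF of zero and therefore never ranks as a keyword, which is the intended filtering effect.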
The volume calculation step S2 includes:
Substep S21: training a Word2Vec model on a document set, such as an existing collection of news articles, web text data or a collection of academic papers, so as to learn the semantic relations among words.
Let the above document set be $D$ and the vocabulary be $V$, where each document $d_i \in D$ contains a keyword set $K_i$. The training goal of the Word2Vec model is to learn a mapping function $f: V \rightarrow \mathbb{R}^{d}$ that maps each word in the vocabulary to a d-dimensional vector representation.
Substep S22: mapping each keyword obtained for the news article in the keyword extraction step S1 into a high-dimensional vector through the Word2Vec model, and combining the vectors of all keywords of the article into a matrix A.
For a document $d_i$, each keyword $k_j \in K_i$ obtains its vector representation $v_j = f(k_j)$ through the Word2Vec model, and the vectors of all keywords are then combined by column into a matrix A, i.e.: $A=[v_1, v_2, \ldots, v_p] \in \mathbb{R}^{q\times p}$
where p is the total number of keywords and q is the word vector dimension (the dimension d of the mapping above).
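As an illustration of substep S22, the sketch below stacks keyword vectors column by column into the matrix A. It assumes the trained word vectors are available as a plain dictionary from word to NumPy array (a stand-in for a Word2Vec model); the function name `build_keyword_matrix` and the choice to skip out-of-vocabulary keywords are assumptions, not details given in the source.

```python
import numpy as np

def build_keyword_matrix(keywords, embeddings):
    """Combine keyword vectors by column into a q x p matrix A.

    keywords:   list of keyword strings for one article
    embeddings: dict mapping word -> 1-D NumPy array of length q
                (hypothetical stand-in for a trained Word2Vec model)
    """
    # Assumption: keywords without a vector are skipped rather than zero-filled.
    vectors = [embeddings[word] for word in keywords if word in embeddings]
    if not vectors:
        raise ValueError("none of the keywords has a word vector")
    # column_stack turns each 1-D vector into one column of A.
    return np.column_stack(vectors)
```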
Substep S23: multiplying the transpose $A^{T}$ of matrix A by A, an orthogonal projection operation that reduces the dimension of matrix A and converts it into a square matrix B.
Performing the orthogonal projection operation on the matrix A gives the dimension-reduced square matrix B, namely: $B=A^{T}A \in \mathbb{R}^{p\times p}$
Substep S24: calculating the determinant of matrix B to obtain the volume V of the matrix.
Calculating the determinant of matrix B, i.e. $V=\det(B)$. The value of this determinant can represent the volume of matrix B. See FIG. 3, which uses a simplified diagram to show the volume enclosed in space by 3-dimensional vectors.
In summary, the keywords of the news article are vectorized with the Word2Vec model, dimension conversion and dimension reduction are then carried out through matrix operations, and finally the volume V of the dimension-reduced matrix is calculated.
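Substeps S23 and S24 amount to a few lines of NumPy. The sketch below assumes that A holds one keyword vector per column and that B is formed as the product of the transpose of A with A; `matrix_volume` is a hypothetical name.

```python
import numpy as np

def matrix_volume(A):
    """Return V = det(B) for B = A^T A, where the columns of A are keyword vectors.

    For a q x p matrix A, B is the p x p Gram matrix of the keyword vectors.
    """
    B = A.T @ A                      # square p x p matrix
    return float(np.linalg.det(B))   # determinant used as the volume index
```

For p keyword vectors this determinant equals the squared p-dimensional volume of the parallelotope they span, so it is zero whenever the vectors are linearly dependent. With many keywords or long vectors the value can overflow or underflow; `np.linalg.slogdet` is one way to work with its logarithm instead.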
The aggregate news judging step S3 includes:
Substep S31: randomly selecting a batch of articles, manually labeling whether each is aggregated news, calculating the matrix volume of each article through the keyword extraction step S1 and the volume calculation step S2, and using these data as the training set of the algorithm;
Substep S32: on the basis of the training set generated in substep S31, selecting candidate dividing thresholds $m \in [m_1, m_2, m_3, \ldots, m_{10}]$ on the matrix volume dimension; for each candidate m, judging articles in the training set whose matrix volume is larger than m to be aggregated news and the others to be non-aggregated news, comparing the results with the manual labels, and calculating the precision, recall and F1 value corresponding to m; iterating over the candidates gives $F1 \in [F1_1, F1_2, F1_3, \ldots, F1_{10}]$, and the threshold corresponding to the largest F1 value is selected as the unique threshold m;
Substep S33: performing the keyword extraction step S1 and the volume calculation step S2 on other articles at inference time and judging them against the unique threshold m: if the matrix volume calculated for an article is larger than the unique threshold m, the article is judged to be aggregated news; otherwise it is judged to be non-aggregated news.
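The threshold search in substeps S32 and S33 can be sketched as follows. The function names are hypothetical, how the candidate thresholds are generated is left to the caller because the source does not specify it, and the score is computed as the standard F1 from precision and recall.

```python
def f1_at_threshold(volumes, labels, m):
    """Score one candidate threshold m against the manual labels.

    volumes: matrix volumes of the labeled articles
    labels:  booleans, True where an article was labeled aggregated news
    Returns (precision, recall, f1).
    """
    predicted = [v > m for v in volumes]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def select_threshold(volumes, labels, candidates):
    """Substep S32 (sketch): keep the candidate with the largest F1 value."""
    return max(candidates, key=lambda m: f1_at_threshold(volumes, labels, m)[2])

def is_aggregated(volume, m):
    """Substep S33 (sketch): volume above the chosen threshold means aggregated news."""
    return volume > m
```

When several candidates reach the same F1 value, `max` returns the first of them in the candidate list.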
Description of principle:
TF-IDF measures the general importance of words in a document, highlighting the document's subject matter and content. Word2Vec is based on the distributional hypothesis, namely that words occurring in similar contexts have similar semantics, and can therefore generate a dense vector representation of each word. Based on TF-IDF and Word2Vec, a document can thus be transformed into a word matrix that expresses its semantics.
Since aggregated news generally contains descriptions of breaking events in various industries and fields, its content is usually semantically scattered and fragmented, and its word vectors are likewise scattered in direction and length in geometric space. The volume of the polyhedron that the word vectors span in space is used to evaluate how dispersed the semantics of the word vectors in a document are: the more dispersed the multidimensional vectors, the larger the volume of the polyhedron. The volume of the polyhedron spanned by a matrix can be calculated quickly with the matrix determinant, so the volume index can serve as an important feature for evaluating whether a document is aggregated news.
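The link between the determinant and the dispersion of the word vectors can be made concrete with the smallest case of two keyword vectors. The derivation below is a sketch using the standard Gram determinant identity; it assumes B is formed as the Gram matrix of the keyword vectors, and the angle $\theta$ between the two vectors is introduced here purely for illustration.

```latex
% Illustrative case p = 2: keyword vectors v_1, v_2 with angle \theta between them
A = [\, v_1 \;\; v_2 \,], \qquad
B = A^{T}A =
\begin{pmatrix}
\lVert v_1 \rVert^{2} & v_1 \cdot v_2 \\
v_1 \cdot v_2 & \lVert v_2 \rVert^{2}
\end{pmatrix}

\det(B) = \lVert v_1 \rVert^{2} \lVert v_2 \rVert^{2} - (v_1 \cdot v_2)^{2}
        = \lVert v_1 \rVert^{2} \lVert v_2 \rVert^{2} \sin^{2}\theta
```

The determinant vanishes when the two vectors point in the same direction (semantically redundant keywords) and is largest when they are orthogonal, and it grows with their lengths. It equals the squared area of the parallelogram spanned by the two vectors, and the same identity holds for p vectors and the p-dimensional volume they span.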
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of the steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides an aggregated news judging device based on the semantic correlation matrix space, which comprises modules for implementing the steps in the embodiment of the aggregated news judging method based on the semantic correlation matrix space.
The aggregated news judging device based on the semantic correlation matrix space comprises:
Keyword extraction module: used for screening the important keywords of an article;
Volume calculation module: used for vectorizing the keywords with a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
Aggregated news judging module: used for classifying articles into aggregated news and non-aggregated news using the volume as the index.
The keyword extraction module comprises:
IDF training unit: used for training an IDF model on a document set, such as an existing collection of news articles, web text data or a collection of academic papers.
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method for measuring the importance of words in documents.
Let $T$ denote the total number of documents in the document set and $n_t$ the number of documents containing the term $t$; in its standard form, the inverse document frequency IDF of the term $t$ can be expressed as: $\mathrm{IDF}(t,T)=\log\frac{T}{n_t}$
Word segmentation calculation unit: used for segmenting the news article to be judged into words, calculating the TF-IDF value of each word, and selecting the top-N words by TF-IDF value as the keywords of the news article.
Let $t$ denote a word and $d$ a document, $f_{t,d}$ the number of occurrences of word $t$ in document $d$, and $N_d$ the total number of words in document $d$; the term frequency TF of word $t$ in document $d$ is expressed as: $\mathrm{TF}(t,d)=\frac{f_{t,d}}{N_d}$
Combining the TF of the word t in the document d with its IDF over the whole document set gives the TF-IDF value of the word t in the document d: TF-IDF(t,d,T)=TF(t,d)×IDF(t,T);
The above processing yields the keyword list of the news article. These keywords have high TF-IDF values, reflecting their importance and distinctiveness in the current news text.
The volume calculation module includes:
Word2Vec training unit: used for training a Word2Vec model on a document set, such as an existing collection of news articles, web text data or a collection of academic papers, so as to learn the semantic relations among words.
Let the above document set be $D$ and the vocabulary be $V$, where each document $d_i \in D$ contains a keyword set $K_i$. The training goal of the Word2Vec model is to learn a mapping function $f: V \rightarrow \mathbb{R}^{d}$ that maps each word in the vocabulary to a d-dimensional vector representation.
Mapping unit: used for mapping each keyword obtained for the news article by the keyword extraction module into a high-dimensional vector through the Word2Vec model, and combining the vectors of all keywords of the article into a matrix A.
For a document $d_i$, each keyword $k_j \in K_i$ obtains its vector representation $v_j = f(k_j)$ through the Word2Vec model, and the vectors of all keywords are then combined by column into a matrix A, i.e.: $A=[v_1, v_2, \ldots, v_p] \in \mathbb{R}^{q\times p}$
where p is the total number of keywords and q is the word vector dimension (the dimension d of the mapping above).
Square matrix conversion unit: used for multiplying the transpose $A^{T}$ of matrix A by A, an orthogonal projection operation that reduces the dimension of matrix A and converts it into a square matrix B.
Performing the orthogonal projection operation on the matrix A gives the dimension-reduced square matrix B, namely: $B=A^{T}A \in \mathbb{R}^{p\times p}$
Volume calculation unit: used for calculating the determinant of matrix B to obtain the volume V of the matrix.
Calculating the determinant of matrix B, i.e. $V=\det(B)$. The value of this determinant can represent the volume of matrix B. See FIG. 3, which uses a simplified diagram to show the volume enclosed in space by 3-dimensional vectors.
In summary, the keywords of the news article are vectorized with the Word2Vec model, dimension conversion and dimension reduction are then carried out through matrix operations, and finally the volume V of the dimension-reduced matrix is calculated.
The aggregate news judging module comprises:
Labeling unit: used for randomly selecting a batch of articles, manually labeling whether each is aggregated news, calculating the matrix volume of each article through the keyword extraction step S1 and the volume calculation step S2, and using these data as the training set of the algorithm;
Boundary dividing unit: used for selecting, on the basis of the training set generated by the labeling unit, candidate dividing thresholds $m \in [m_1, m_2, m_3, \ldots, m_{10}]$ on the matrix volume dimension; for each candidate m, judging articles in the training set whose matrix volume is larger than m to be aggregated news and the others to be non-aggregated news, comparing the results with the manual labels, and calculating the precision, recall and F1 value corresponding to m; iterating over the candidates gives $F1 \in [F1_1, F1_2, F1_3, \ldots, F1_{10}]$, and the threshold corresponding to the largest F1 value is selected as the unique threshold m;
Judging unit: used for processing other articles at inference time with the keyword extraction module and the volume calculation module and judging them against the unique threshold m: if the matrix volume calculated for an article is larger than the unique threshold m, the article is judged to be aggregated news; otherwise it is judged to be non-aggregated news.
It should be understood that each module of the aggregated news determining device based on the semantic correlation matrix space is configured to execute each step in the embodiment of the corresponding method, and each step in the embodiment of the corresponding method has been explained in detail in the foregoing embodiment, and specific reference is made to the related description in the embodiment of the corresponding method, which is not repeated herein.
Based on the same inventive concept, the embodiment of the application also provides a computer device for realizing the above-mentioned aggregated news judging method based on the semantic correlation matrix space. The implementation scheme of the solution to the problem provided by the computer device is similar to the implementation scheme described in the above method, so the specific limitation in the embodiments of the computer device provided below may be referred to the limitation of the aggregated news judging method based on the semantic correlation matrix space hereinabove, and will not be described herein.
In one embodiment, a computer device, which may be a terminal, is provided that includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by a processor, implements a method for aggregated news judgment based on a semantic correlation matrix space. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the aggregated news judging method based on the semantic correlation matrix space according to the above embodiment when executing the computer program.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the aggregated news judgment method based on semantic correlation matrix space as described in the above embodiments.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the aggregated news judgment method based on semantic correlation matrix space as described in the above embodiments.
The aggregated news judging method, the aggregated news judging device, the aggregated news judging computer equipment, the aggregated news judging computer readable storage medium and the aggregated news judging computer program product based on the semantic correlation matrix space have the following beneficial effects:
1. After model training is completed, the subsequent judging process is fully automatic: whether an article is aggregated news is judged automatically once the article data is input.
2. The algorithm used by the invention operates entirely on vectors and matrices and makes semantic judgments geometrically, so that calculation is fast and judgment is efficient and accurate.
3. In natural language processing analysis of massive data, the invention can rapidly judge and filter aggregated news. The calculation is independent of external data, the running environment and infrastructure, and invalid data can be filtered during real-time text processing and analysis, thereby remarkably improving the processing speed of text analysis and the accuracy of the results.
4. The invention converts text into a matrix representation based on a semantic model, so that subsequent calculation is carried out on the matrix rather than on the raw text, which broadens the ways in which text can be processed and increases the text processing speed.
5. The invention expresses the degree of dispersion of text content through the volume of the word vectors in the matrix space and uses it as the criterion for judging aggregated news, thereby greatly improving the accuracy of data processing.
6. The index adopted by the invention is derived from earlier manual analysis and summarization of a large amount of text content, and provides empirical guidance for the whole process.
Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can take various forms such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application. They are described in detail, but this should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of the application shall be determined by the appended claims.

Claims (10)

1. A method for judging aggregated news, characterized by comprising the following steps:
a keyword extraction step S1: screening the important keywords of an article;
a volume calculation step S2: vectorizing the text keywords using a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
an aggregated news judging step S3: classifying articles into aggregated news and non-aggregated news using the volume as an index.
2. The method for judging aggregated news according to claim 1, wherein the keyword extraction step S1 comprises:
Substep S11: training an IDF model based on an existing document set;
Substep S12: segmenting the news article to be judged into words, calculating TF-IDF values, and selecting the top N words by TF-IDF value as the keywords of the news article.
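As an illustrative sketch only, and not part of the claims, substeps S11 and S12 might look as follows in Python. The function names train_idf and top_n_keywords, the smoothed IDF formula, the tie-breaking rule and the toy corpus are all assumptions introduced here; the articles are assumed to be already segmented into words.

```python
import math
from collections import Counter


def train_idf(tokenized_docs):
    """Substep S11 (sketch): learn an IDF weight for every word in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    # Smoothed IDF; this particular formula is an assumption, not from the patent.
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    default_idf = math.log(1 + n_docs) + 1.0  # weight for words never seen in training
    return idf, default_idf


def top_n_keywords(tokens, idf, default_idf, n=5):
    """Substep S12 (sketch): rank the words of one segmented article by TF-IDF."""
    if not tokens:
        return []
    counts = Counter(tokens)
    total = len(tokens)
    scores = {w: (c / total) * idf.get(w, default_idf) for w, c in counts.items()}
    # Highest score first; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n]


# Toy data (hypothetical): three training documents and one article to judge.
corpus = [
    ["market", "stocks", "fall", "today"],
    ["market", "weather", "rain", "today"],
    ["market", "football", "final", "today"],
]
idf, default_idf = train_idf(corpus)
article = ["stocks", "stocks", "market", "rally", "today"]
keywords = top_n_keywords(article, idf, default_idf, n=2)
print(keywords)
```

Words that appear in every training document ("market", "today") receive the lowest IDF weight, so they are unlikely to be selected as keywords.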
3. The method for judging aggregated news according to claim 1, wherein the volume calculation step S2 comprises:
Substep S21: training a Word2Vec model based on the existing document set so as to learn the semantic relations among words;
Substep S22: mapping each keyword obtained for the news article in the keyword extraction step S1 into a high-dimensional vector through the Word2Vec model, and combining the vectors corresponding to all phrases of the article into a multidimensional matrix A;
Substep S23: multiplying the matrix A with its transpose A^T, thereby performing an orthogonal projection that reduces the dimension of the matrix A and converts it into a square matrix B;
Substep S24: calculating the determinant of the matrix B to obtain the volume V of the matrix.
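As an illustrative sketch only, and not part of the claims, substeps S23 and S24 might be written as follows in Python. It assumes that the keyword vectors from substep S22 are already available as rows of A (the Word2Vec lookup is omitted) and that B is formed as A multiplied by A^T; the function names, the singularity tolerance of 1e-12 and the toy vectors are hypothetical. For reference, the determinant of such a Gram matrix equals the squared volume of the parallelotope spanned by the row vectors.

```python
def gram_matrix(rows):
    """Substep S23 (sketch): B = A * A^T, where each row of A is one keyword vector."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in rows] for r1 in rows]


def determinant(square):
    """Substep S24 (sketch): determinant by Gaussian elimination with partial pivoting."""
    m = [list(map(float, row)) for row in square]  # work on a copy
    n = len(m)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest absolute entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # singular: the keyword vectors are linearly dependent
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def matrix_volume(keyword_vectors):
    """Volume index V of one article: det(A * A^T)."""
    return determinant(gram_matrix(keyword_vectors))


# Toy vectors (hypothetical): nearly parallel keywords versus orthogonal keywords.
focused = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(matrix_volume(focused), matrix_volume(diverse))
```

Keyword vectors that point in similar directions span a small volume, while semantically scattered keywords span a large one, which is why the volume can serve as an index of how many unrelated topics an article mixes.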
4. The method for judging aggregated news according to claim 1, wherein the aggregated news judging step S3 comprises:
Substep S31: randomly selecting a batch of articles, manually labeling whether each article is aggregated news, calculating the matrix volume corresponding to each article through the keyword extraction step S1 and the volume calculation step S2, and taking the resulting data as the training set of the algorithm;
Substep S32: on the basis of the training set generated in substep S31, selecting division thresholds M ∈ [M1, M2, M3, ..., M10] on the matrix volume dimension as candidate thresholds; for each M, judging the articles in the training set whose matrix volume is larger than M to be aggregated news and the others to be non-aggregated news, comparing the results with the manual labels, and calculating the accuracy, recall and F1 value corresponding to M; iterating in this way to obtain F1 ∈ [F1_1, F1_2, F1_3, ..., F1_10]; and selecting the threshold M corresponding to the largest F1 value as the unique threshold m;
Substep S33: performing the keyword extraction step S1 and the volume calculation step S2 on other articles to be inferred, and judging them against the unique threshold m for aggregated news: if the matrix volume calculated for an article is larger than the unique threshold m, the article is judged to be aggregated news; otherwise it is judged to be non-aggregated news.
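As an illustrative sketch only, and not part of the claims, the threshold selection of substep S32 and the decision rule of substep S33 might be written as follows in Python. The function names, the toy volumes and labels, and the candidate thresholds are hypothetical. The sketch reads the "accuracy" paired with recall and F1 as precision, which is an assumption about the translated term.

```python
def f1_for_threshold(volumes, labels, threshold):
    """Score the rule "volume > threshold means aggregated news" against manual labels."""
    predicted = [v > threshold for v in volumes]
    tp = sum(p and y for p, y in zip(predicted, labels))        # true positives
    fp = sum(p and not y for p, y in zip(predicted, labels))    # false positives
    fn = sum((not p) and y for p, y in zip(predicted, labels))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def select_threshold(volumes, labels, candidates):
    """Substep S32 (sketch): keep the candidate with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda m: f1_for_threshold(volumes, labels, m)[2])


def is_aggregated(volume, threshold):
    """Substep S33 (sketch): an article is aggregated news if its volume exceeds m."""
    return volume > threshold


# Toy training set (hypothetical): matrix volumes with manual labels (True = aggregated).
volumes = [0.01, 0.02, 0.05, 0.40, 0.55, 0.90]
labels = [False, False, False, True, True, True]
candidates = [0.0, 0.1, 0.5, 0.8]

best = select_threshold(volumes, labels, candidates)
print(best, is_aggregated(0.3, best))
```

Selecting by F1 rather than by recall or precision alone keeps the chosen threshold from drifting toward labeling every article, or no article, as aggregated news.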
5. A device for judging aggregated news, comprising:
a keyword extraction module, used for screening the important keywords of an article;
a volume calculation module, used for vectorizing the text keywords using a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
an aggregated news judging module, used for classifying articles into aggregated news and non-aggregated news using the volume as an index.
6. The device for judging aggregated news according to claim 5, wherein the keyword extraction module comprises:
an IDF training unit, used for training an IDF model based on an existing document set;
a word segmentation and calculation unit, used for segmenting the news article to be judged into words, calculating TF-IDF values, and selecting the top N words by TF-IDF value as the keywords of the news article.
7. The device for judging aggregated news according to claim 5, wherein the volume calculation module comprises:
a Word2Vec training unit, used for training a Word2Vec model based on the existing document set so as to learn the semantic relations among words;
a mapping unit, used for mapping each keyword obtained for the news article in the keyword extraction module into a high-dimensional vector through the Word2Vec model, and combining the vectors corresponding to all phrases of the article into a multidimensional matrix A;
a square matrix conversion unit, used for multiplying the matrix A with its transpose A^T, thereby performing an orthogonal projection that reduces the dimension of the matrix A and converts it into a square matrix B;
a volume calculation unit, used for calculating the determinant of the matrix B to obtain the volume V of the matrix.
8. The device for judging aggregated news according to claim 5, wherein the aggregated news judging module comprises:
a labeling unit, used for randomly selecting a batch of articles, manually labeling whether each article is aggregated news, calculating the matrix volume corresponding to each article through the keyword extraction step S1 and the volume calculation step S2, and taking the resulting data as the training set of the algorithm;
a boundary dividing unit, used for, on the basis of the training set generated by the labeling unit, selecting division thresholds M ∈ [M1, M2, M3, ..., M10] on the matrix volume dimension as candidate thresholds; for each M, judging the articles in the training set whose matrix volume is larger than M to be aggregated news and the others to be non-aggregated news, comparing the results with the manual labels, and calculating the accuracy, recall and F1 value corresponding to M; iterating in this way to obtain F1 ∈ [F1_1, F1_2, F1_3, ..., F1_10]; and selecting the threshold M corresponding to the largest F1 value as the unique threshold m;
a judging unit, used for processing other articles to be inferred with the keyword extraction module and the volume calculation module, and judging them against the unique threshold m for aggregated news: if the matrix volume calculated for an article is higher than the unique threshold m, the article is judged to be aggregated news; otherwise it is judged to be non-aggregated news.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the aggregated news judgment method according to any one of claims 1 to 4 when the computer program is executed.
10. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the aggregated news judgment method according to any one of claims 1 to 4.
CN202410308816.4A 2024-03-19 2024-03-19 Method, device, equipment and medium for judging aggregated news Active CN117910479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410308816.4A CN117910479B (en) 2024-03-19 2024-03-19 Method, device, equipment and medium for judging aggregated news

Publications (2)

Publication Number Publication Date
CN117910479A (en) 2024-04-19
CN117910479B CN117910479B (en) 2024-06-04

Family

ID=90693987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410308816.4A Active CN117910479B (en) 2024-03-19 2024-03-19 Method, device, equipment and medium for judging aggregated news

Country Status (1)

Country Link
CN (1) CN117910479B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095737A (en) * 2016-06-07 2016-11-09 杭州凡闻科技有限公司 Documents Similarity computational methods and similar document the whole network retrieval tracking
CN107122413A (en) * 2017-03-31 2017-09-01 北京奇艺世纪科技有限公司 A kind of keyword extracting method and device based on graph model
CN110781377A (en) * 2019-09-03 2020-02-11 腾讯科技(深圳)有限公司 Article recommendation method and device
CN111009321A (en) * 2019-08-14 2020-04-14 电子科技大学 Application method of machine learning classification model in juvenile autism auxiliary diagnosis
CN112861990A (en) * 2021-03-05 2021-05-28 电子科技大学 Topic clustering method and device based on keywords and entities and computer-readable storage medium
WO2021184674A1 (en) * 2020-03-17 2021-09-23 上海爱数信息技术股份有限公司 Text keyword extraction method, electronic device, and computer readable storage medium
CN113742464A (en) * 2021-07-28 2021-12-03 北京智谱华章科技有限公司 News event discovery algorithm and device based on heterogeneous information network
CN113887107A (en) * 2021-10-13 2022-01-04 国网山东省电力公司电力科学研究院 Hexahedron volume calculation method and system based on digital twin body
US20220036011A1 (en) * 2020-07-30 2022-02-03 InfoAuthN AI Inc. Systems and Methods for Explainable Fake News Detection
WO2022227207A1 (en) * 2021-04-30 2022-11-03 平安科技(深圳)有限公司 Text classification method, apparatus, computer device, and storage medium
CN116933052A (en) * 2023-07-18 2023-10-24 国网信息通信产业集团有限公司北京分公司 Substation data online monitoring system and method
US20230350973A1 (en) * 2022-04-27 2023-11-02 Regents Of The University Of Michigan Methods and Systems for Multilinear Discriminant Analysis Via Invariant Theory for Data Classification
CN117271894A (en) * 2023-09-22 2023-12-22 江苏大学 Paper recommendation method based on hybrid network and DPP
CN117473249A (en) * 2023-09-28 2024-01-30 中国电信股份有限公司技术创新中心 Modeling method and detection method of network flow detection model and related equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
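Yin Qian; Hu Xuegang; Xie Fei; Wu Xindong: "Keyword extraction from Chinese news web pages based on density clustering patterns" (基于密度聚类模式的中文新闻网页关键词提取), Journal of Guangxi Normal University (Natural Science Edition), no. 01, 15 March 2009 (2009-03-15), pages 1 - 5 *
Qin Jidong; Peng Huafeng: "Target classification algorithm based on multiple discriminant analysis and Zernike moments" (基于多重判别分析和Zernike矩的目标分类算法), Journal of Information Engineering University, no. 06, 15 December 2017 (2017-12-15), pages 1 - 6 *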

Also Published As

Publication number Publication date
CN117910479B (en) 2024-06-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant