CN117910479A - Method, device, equipment and medium for judging aggregated news - Google Patents
- Publication number
- CN117910479A (CN202410308816.4A)
- Authority
- CN
- China
- Prior art keywords
- news
- articles
- volume
- judging
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention belongs to the technical field of computer data processing and relates to a method, a device, computer equipment and a medium for judging aggregated news. The method comprises: a keyword extraction step S1 of screening the important keywords of an article; a volume calculation step S2 of vectorizing the text keywords with a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space; and an aggregated news judging step S3 of classifying articles into aggregated news and non-aggregated news using the volume as the index. The method, device, computer equipment and medium can quickly identify whether a target article is aggregated news, and offer high reliability and high calculation speed.
Description
Technical Field
The invention relates to the technical field of computer data processing, in particular to an aggregated news judging method and device based on a semantic correlation matrix space, computer equipment and a computer readable storage medium.
Background
Aggregated news refers to news content, stories, articles, or information from multiple different sources integrated into a single article or page, so that users can browse news stories from multiple sources at once. Such news may come from different news websites, media institutions, blogs, social media, or other information sources, and its subject matter is complex and diverse, possibly spanning industries and fields without any fixed pattern. Aggregated news can negatively affect natural language processing analysis, mainly in the following ways:
Diversity of information leads to confusion: when analyzing content of a particular domain or a single topic, the diverse content of aggregated news fragments the information excessively and introduces material unrelated to the analyzed topic or event, causing errors in the results.
Information repetition and redundancy: aggregated news may contain a large amount of duplicate or redundant information, especially when multiple sources cover the same topic or event.
Uneven information quality: aggregated news draws on multiple sources, so its information quality may be uneven. Some sources may be unreliable or convey inaccurate information, which may mislead a natural language processing system.
In general natural language processing, and in topic extraction or event analysis based on document content, the presence of aggregated news seriously interferes with the analysis. To improve the quality of the analyzed data, aggregated news needs to be identified and filtered out. It is therefore necessary to develop a method for judging whether a news article is aggregated news.
Disclosure of Invention
In view of the above, the invention provides an aggregated news judging method, an aggregated news judging device, computer equipment and a computer readable storage medium based on a semantic correlation matrix space, which can rapidly identify whether a target article is an aggregated news and have the advantages of high reliability, high calculation speed and the like.
The technical scheme of the invention is as follows:
In a first aspect, the present invention provides a method for judging aggregated news, comprising the following steps:
Keyword extraction step S1: screening the important keywords of an article;
Volume calculation step S2: vectorizing the text keywords with a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
Aggregated news judging step S3: classifying articles into aggregated news and non-aggregated news using the volume as the index.
In a second aspect, the present invention further provides a device for judging aggregated news, comprising:
Keyword extraction module: used for screening the important keywords of an article;
Volume calculation module: used for vectorizing the text keywords with a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
Aggregated news judging module: used for classifying articles into aggregated news and non-aggregated news using the volume as the index.
In a third aspect, the present invention also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when the processor executes the computer program.
In a fourth aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the aggregated news judgment method described above.
Compared with the prior art, the method, the device, the computer equipment and the computer readable storage medium for judging the aggregated news have the following beneficial effects:
1. After model training is completed, the subsequent judging process is fully automatic: given article data as input, the method automatically judges whether the article is aggregated news.
2. The algorithms used by the invention all operate on vectors and matrices, and semantic judgment is performed geometrically, so the calculation is fast and the judgment is efficient and accurate.
3. When performing natural language processing analysis on massive data, the invention can quickly judge and filter aggregated news. The calculation does not depend on external data, the running environment, or infrastructure, and invalid data can be filtered out while text is processed and analyzed in real time, which significantly improves the speed of text analysis and the accuracy of its results.
The preferred embodiments of the present invention and their advantageous effects will be described in further detail with reference to specific embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and constitute a part of this specification; together with the description they serve to explain the invention. In the drawings:
FIG. 1 is a schematic diagram of the content of a model to be trained in accordance with the present invention;
FIG. 2 is a schematic diagram of an overall flow for aggregated news judgment according to the present invention;
FIG. 3 is a schematic diagram of the shape enclosed in space by three-dimensional vectors.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
The aggregated news judging method based on the semantic correlation matrix space provided by the embodiment of the application can be applied to computer equipment such as terminals and servers. The terminal may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, or portable wearable device such as a head-mounted device; the server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
Referring to FIG. 1 and FIG. 2, the invention provides a method for judging aggregated news based on a semantic correlation matrix space, comprising the following steps:
Keyword extraction step S1: screening the important keywords of an article;
Volume calculation step S2: vectorizing the text keywords with a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
Aggregated news judging step S3: classifying articles into aggregated news and non-aggregated news using the volume as the index.
The keyword extraction step S1 includes:
Substep S11: training an IDF model based on a document set, such as an existing news article set, web text data, or an academic paper set.
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method for measuring the importance of words in documents.
Let $T$ denote the total number of documents in the document set and $n_t$ the number of documents containing the word $t$; the inverse document frequency IDF of the word $t$ can then be expressed as: $\mathrm{IDF}(t,T)=\log\frac{T}{n_t}$.
Substep S12: segmenting the news article to be judged into words, calculating the TF-IDF value of each word, and selecting the top N words by TF-IDF value as the keywords of the news.
Let $t$ denote a word, $d$ a document, $f_{t,d}$ the number of occurrences of the word $t$ in the document $d$, and $N_d$ the total number of words in the document $d$; the term frequency TF of the word $t$ in the document $d$ is expressed as:
$\mathrm{TF}(t,d)=\frac{f_{t,d}}{N_d}$;
Combining the TF of the word $t$ in the document $d$ with its IDF over the whole document set gives the TF-IDF value of the word $t$ in the document $d$:
$\text{TF-IDF}(t,d,T)=\mathrm{TF}(t,d)\times\mathrm{IDF}(t,T)$;
The above processing yields the keyword list corresponding to the news article. These keywords have high TF-IDF values, reflecting their importance and uniqueness in the current news text.
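The keyword extraction step can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the documents are already segmented into words, uses the common logarithmic IDF form log(T / n_t), and gives words unseen in the training set an IDF of 0. The function names `train_idf` and `top_keywords`, the `default_idf` parameter, and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def train_idf(corpus):
    """Substep S11 (sketch): corpus is a list of tokenized documents.

    Returns a dict mapping each word t to log(T / n_t), where T is the
    number of documents and n_t the number of documents containing t.
    """
    total_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each word once per document
    return {t: math.log(total_docs / n) for t, n in doc_freq.items()}


def top_keywords(tokens, idf, top_n=3, default_idf=0.0):
    """Substep S12 (sketch): rank the words of one article by TF-IDF.

    default_idf is an assumption for words absent from the training set.
    """
    counts = Counter(tokens)
    total_words = len(tokens)
    scores = {
        t: (c / total_words) * idf.get(t, default_idf)
        for t, c in counts.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:top_n]]


# Hypothetical toy corpus, already segmented into words.
corpus = [
    ["market", "stocks", "rise", "today"],
    ["team", "wins", "match", "today"],
    ["market", "rates", "cut", "today"],
]
idf = train_idf(corpus)
article = ["stocks", "stocks", "market", "today", "rise"]
keywords = top_keywords(article, idf, top_n=2)
print(keywords)
```

In this toy corpus the word "today" appears in every document, so its IDF is zero and it never ranks as a keyword, which is the filtering effect the step relies on.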
The volume calculation step S2 includes:
Substep S21: training a Word2Vec model based on a document set, such as an existing news article set, web text data, or an academic paper set, so as to learn the semantic relations among words.
Let the document set be $D$ and the vocabulary be $V$, where each document $d_i \in D$ contains a keyword set $K_i$. The training goal of the Word2Vec model is to learn a mapping function $f: V \to \mathbb{R}^{d}$ that maps each word in the vocabulary to a $d$-dimensional vector representation (here $d$ is the word vector dimension, written $q$ below).
Substep S22: mapping each keyword of the news obtained in the keyword extraction step S1 to a high-dimensional vector through the Word2Vec model, and combining the vectors of all keywords of the article into a matrix A.
For a document $d_i$ with keywords $k_1, k_2, \ldots, k_p$, the vector representation $v_j = f(k_j)$ of each keyword is obtained through the Word2Vec model, and the vectors of all keywords are then combined by column into a matrix A, i.e.: $A=[v_1, v_2, \ldots, v_p]$;
Where p is the total number of keywords and q is the word vector dimension, so that A is a $q \times p$ matrix.
Substep S23: multiplying the matrix A by its transpose $A^{T}$, an orthogonal projection operation that reduces the dimension of the matrix A and converts it into a square matrix B.
Performing the orthogonal projection operation on the matrix A yields the dimension-reduced square matrix B, i.e.: $B=A^{T}A$, a $p \times p$ matrix.
Substep S24: calculating the determinant of the matrix B to obtain the volume V of the matrix.
The determinant of the matrix B is calculated, i.e. $V=\det(B)$. The value of this determinant can represent the volume of the matrix B. See FIG. 3, which uses a simplified graph to represent the volume enclosed in space by 3-dimensional vectors.
In summary, the keywords in the news article are vectorized with the Word2Vec model, dimension conversion and dimension reduction are then performed through matrix operations, and finally the volume V of the dimension-reduced matrix is calculated.
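The volume calculation can be sketched with NumPy as below. The sketch assumes the keyword vectors are stacked as the columns of A (a q × p matrix) and that the square matrix is formed as B = AᵀA (p × p); the hand-written three-dimensional `toy_vectors` stand in for a trained Word2Vec model and are purely hypothetical. Note that the determinant of AᵀA is, strictly, the squared volume of the parallelepiped spanned by the columns of A; since squaring preserves ordering for non-negative values, it can still serve as a threshold index. The determinant is zero whenever p exceeds q, so the number of keywords should not exceed the vector dimension.

```python
import numpy as np


def matrix_volume(keyword_vectors):
    """Return det(A^T A), where the keyword vectors are the columns of A."""
    A = np.column_stack(keyword_vectors)  # shape (q, p)
    B = A.T @ A                           # shape (p, p), Gram matrix
    return float(np.linalg.det(B))


# Hypothetical 3-dimensional "word vectors" used in place of Word2Vec output.
toy_vectors = {
    "stocks": np.array([1.0, 0.0, 0.0]),
    "shares": np.array([0.9, 0.1, 0.0]),
    "market": np.array([0.8, 0.0, 0.1]),
    "football": np.array([0.0, 1.0, 0.0]),
    "weather": np.array([0.0, 0.0, 1.0]),
}

# Keywords of a single-topic article point in similar directions.
focused = [toy_vectors[w] for w in ("stocks", "shares", "market")]
# Keywords of a mixed-topic article point in unrelated directions.
scattered = [toy_vectors[w] for w in ("stocks", "football", "weather")]

vol_focused = matrix_volume(focused)
vol_scattered = matrix_volume(scattered)
print(vol_focused, vol_scattered)
```

The semantically scattered keyword set yields a much larger value than the focused one, which is the property the judging step uses.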
The aggregate news judging step S3 includes:
Substep S31: randomly selecting a batch of articles, manually labeling whether each article is aggregated news, calculating the matrix volume corresponding to each article through the keyword extraction step S1 and the volume calculation step S2, and taking these data as the training set of the algorithm;
Substep S32: on the basis of the training set generated in substep S31, selecting candidate dividing thresholds $M \in [M_1, M_2, M_3, \ldots, M_{10}]$ on the matrix volume; for each candidate M, judging the articles in the training set whose matrix volume is larger than M to be aggregated news and the remaining articles to be non-aggregated news, comparing these results with the manual labels, and calculating the precision, recall and F1 value corresponding to M; iterating over the candidates yields $F1 \in [F1_1, F1_2, F1_3, \ldots, F1_{10}]$, and the threshold M corresponding to the largest F1 value is selected as the unique threshold m;
Substep S33: for other articles to be inferred, performing the keyword extraction step S1 and the volume calculation step S2 and judging against the unique threshold m: if the matrix volume calculated for an article is larger than the unique threshold m, the article is judged to be aggregated news; otherwise it is judged to be non-aggregated news.
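The threshold selection and inference substeps can be sketched as follows. This is an illustration under stated assumptions: the labeled volumes and the ten candidate thresholds are invented toy values, ties between candidates are broken by taking the first one, and the helper names `f1_for_threshold`, `select_threshold` and `is_aggregated` are hypothetical.

```python
def f1_for_threshold(volumes, labels, threshold):
    """F1 value when every article with volume > threshold is called aggregated."""
    predicted = [v > threshold for v in volumes]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_threshold(volumes, labels, candidates):
    """Pick the candidate with the largest F1 (first one wins on ties)."""
    return max(candidates, key=lambda m: f1_for_threshold(volumes, labels, m))


def is_aggregated(volume, threshold):
    """Inference: an article is aggregated news if its volume exceeds m."""
    return volume > threshold


# Hypothetical training set: matrix volumes with manual labels
# (True means the article was labeled as aggregated news).
volumes = [0.01, 0.02, 0.05, 0.40, 0.55, 0.90]
labels = [False, False, False, True, True, True]
# Ten hypothetical candidate thresholds, playing the role of M1..M10.
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

best = select_threshold(volumes, labels, candidates)
print(best, f1_for_threshold(volumes, labels, best))
print(is_aggregated(0.7, best), is_aggregated(0.03, best))
```

Selecting by F1 rather than by precision or recall alone balances missed aggregated articles against ordinary articles wrongly filtered out.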
Description of principle:
TF-IDF measures the general importance of words in a document, highlighting the subject matter and content of the document. Word2Vec is based on the distributional hypothesis that words occurring in similar contexts have similar semantics, and can therefore generate a dense vector representation of each word. Thus, based on TF-IDF and Word2Vec, a document can be transformed into a word matrix that expresses its semantics.
Since aggregated news generally contains descriptions of breaking events in various industries and fields, its content is generally scattered and fragmented semantically, and its word vectors are likewise scattered in direction and length in geometric space. The volume of the polyhedron that the word vectors span in space is used to evaluate the degree of dispersion of the word vectors' semantics in the document: the more dispersed the multidimensional vectors, the greater the volume of the polyhedron. The volume of the polyhedron spanned by a matrix can be calculated quickly from a matrix determinant, so the volume index can serve as an important feature for evaluating whether a document is aggregated news.
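As a worked check of this argument, consider the smallest case of two keyword vectors. The sketch assumes the square matrix is formed as B = AᵀA with the keyword vectors as the columns of A; the symbols a, b and θ (the angle between the two vectors) are introduced here for illustration only.

```latex
A = \begin{bmatrix} a & b \end{bmatrix}, \qquad
B = A^{T}A =
\begin{bmatrix}
a \cdot a & a \cdot b \\
a \cdot b & b \cdot b
\end{bmatrix},
\qquad
\det(B) = \|a\|^{2}\,\|b\|^{2} - (a \cdot b)^{2}
        = \|a\|^{2}\,\|b\|^{2}\,\sin^{2}\theta .
```

The determinant is zero when the two vectors are parallel (identical semantics), grows as the angle between them opens, and is largest when they are orthogonal (unrelated semantics). It equals the squared area of the parallelogram spanned by a and b, which is the two-vector analogue of the volume used above.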
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the application also provides an aggregated news judging device based on the semantic correlation matrix space, whose modules implement the steps of the embodiment of the aggregated news judging method based on the semantic correlation matrix space described above.
The aggregated news judging device based on the semantic correlation matrix space comprises:
Keyword extraction module: used for screening the important keywords of an article;
Volume calculation module: used for vectorizing the text keywords with a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
Aggregated news judging module: used for classifying articles into aggregated news and non-aggregated news using the volume as the index.
The keyword extraction module comprises:
IDF training unit: used for training an IDF model based on a document set, such as an existing news article set, web text data, or an academic paper set.
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method for measuring the importance of words in documents.
Let $T$ denote the total number of documents in the document set and $n_t$ the number of documents containing the word $t$; the inverse document frequency IDF of the word $t$ can then be expressed as: $\mathrm{IDF}(t,T)=\log\frac{T}{n_t}$.
Word segmentation calculation unit: used for segmenting the news article to be judged into words, calculating the TF-IDF value of each word, and selecting the top N words by TF-IDF value as the keywords of the news.
Let $t$ denote a word, $d$ a document, $f_{t,d}$ the number of occurrences of the word $t$ in the document $d$, and $N_d$ the total number of words in the document $d$; the term frequency TF of the word $t$ in the document $d$ is expressed as: $\mathrm{TF}(t,d)=\frac{f_{t,d}}{N_d}$;
Combining the TF of the word $t$ in the document $d$ with its IDF over the whole document set gives the TF-IDF value of the word $t$ in the document $d$: $\text{TF-IDF}(t,d,T)=\mathrm{TF}(t,d)\times\mathrm{IDF}(t,T)$;
The above processing yields the keyword list corresponding to the news article. These keywords have high TF-IDF values, reflecting their importance and uniqueness in the current news text.
The volume calculation module includes:
Word2Vec training unit: used for training a Word2Vec model based on a document set, such as an existing news article set, web text data, or an academic paper set, so as to learn the semantic relations among words.
Let the document set be $D$ and the vocabulary be $V$, where each document $d_i \in D$ contains a keyword set $K_i$. The training goal of the Word2Vec model is to learn a mapping function $f: V \to \mathbb{R}^{d}$ that maps each word in the vocabulary to a $d$-dimensional vector representation (here $d$ is the word vector dimension, written $q$ below).
Mapping unit: used for mapping each keyword of the news obtained by the keyword extraction module to a high-dimensional vector through the Word2Vec model, and combining the vectors of all keywords of the article into a matrix A.
For a document $d_i$ with keywords $k_1, k_2, \ldots, k_p$, the vector representation $v_j = f(k_j)$ of each keyword is obtained through the Word2Vec model, and the vectors of all keywords are then combined by column into a matrix A, i.e.:
$A=[v_1, v_2, \ldots, v_p]$;
Where p is the total number of keywords and q is the word vector dimension, so that A is a $q \times p$ matrix.
Square matrix conversion unit: used for multiplying the matrix A by its transpose $A^{T}$, an orthogonal projection operation that reduces the dimension of the matrix A and converts it into a square matrix B.
Performing the orthogonal projection operation on the matrix A yields the dimension-reduced square matrix B, i.e.: $B=A^{T}A$, a $p \times p$ matrix.
Volume calculation unit: used for calculating the determinant of the matrix B to obtain the volume V of the matrix.
The determinant of the matrix B is calculated, i.e. $V=\det(B)$. The value of this determinant can represent the volume of the matrix B. See FIG. 3, which uses a simplified graph to represent the volume enclosed in space by 3-dimensional vectors.
In summary, the keywords in the news article are vectorized with the Word2Vec model, dimension conversion and dimension reduction are then performed through matrix operations, and finally the volume V of the dimension-reduced matrix is calculated.
The aggregate news judging module comprises:
Labeling unit: used for randomly selecting a batch of articles, manually labeling whether each article is aggregated news, calculating the matrix volume corresponding to each article through the keyword extraction step S1 and the volume calculation step S2, and taking these data as the training set of the algorithm;
Boundary dividing unit: used for selecting, on the basis of the training set generated by the labeling unit, candidate dividing thresholds $M \in [M_1, M_2, M_3, \ldots, M_{10}]$ on the matrix volume; for each candidate M, judging the articles in the training set whose matrix volume is larger than M to be aggregated news and the remaining articles to be non-aggregated news, comparing these results with the manual labels, and calculating the precision, recall and F1 value corresponding to M; iterating over the candidates yields $F1 \in [F1_1, F1_2, F1_3, \ldots, F1_{10}]$, and the threshold M corresponding to the largest F1 value is selected as the unique threshold m;
Judging unit: used for processing other articles to be inferred with the keyword extraction module and the volume calculation module and judging against the unique threshold m: if the matrix volume calculated for an article is larger than the unique threshold m, the article is judged to be aggregated news; otherwise it is judged to be non-aggregated news.
It should be understood that each module of the aggregated news determining device based on the semantic correlation matrix space is configured to execute each step in the embodiment of the corresponding method, and each step in the embodiment of the corresponding method has been explained in detail in the foregoing embodiment, and specific reference is made to the related description in the embodiment of the corresponding method, which is not repeated herein.
Based on the same inventive concept, the embodiment of the application also provides a computer device for realizing the above-mentioned aggregated news judging method based on the semantic correlation matrix space. The implementation scheme of the solution to the problem provided by the computer device is similar to the implementation scheme described in the above method, so the specific limitation in the embodiments of the computer device provided below may be referred to the limitation of the aggregated news judging method based on the semantic correlation matrix space hereinabove, and will not be described herein.
In one embodiment, a computer device, which may be a terminal, is provided that includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by a processor, implements a method for aggregated news judgment based on a semantic correlation matrix space. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the aggregated news judging method based on the semantic correlation matrix space according to the above embodiment when executing the computer program.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the aggregated news judgment method based on semantic correlation matrix space as described in the above embodiments.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the aggregated news judgment method based on semantic correlation matrix space as described in the above embodiments.
The aggregated news judging method, the aggregated news judging device, the aggregated news judging computer equipment, the aggregated news judging computer readable storage medium and the aggregated news judging computer program product based on the semantic correlation matrix space have the following beneficial effects:
1. After model training is completed, the subsequent judging process is fully automatic: given article data as input, the method automatically judges whether the article is aggregated news.
2. The algorithms used by the invention all operate on vectors and matrices, and semantic judgment is performed geometrically, so the calculation is fast and the judgment is efficient and accurate.
3. When performing natural language processing analysis on massive data, the invention can quickly judge and filter aggregated news. The calculation does not depend on external data, the running environment, or infrastructure, and invalid data can be filtered out while text is processed and analyzed in real time, which significantly improves the speed of text analysis and the accuracy of its results.
4. The invention converts text into a matrix representation based on a semantic model, so that subsequent calculation operates on the matrix rather than on the raw text, which broadens the ways in which text can be processed and increases the text processing speed.
5. The invention expresses the degree of dispersion of text content by the volume of its word vectors in the matrix space and uses this as the criterion for judging aggregated news, which greatly improves the accuracy of data processing.
6. The index adopted by the invention was obtained from earlier manual analysis and summarization of a large amount of text content, and provides empirical guidance for the whole process.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can take various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.
Claims (10)
1. A method for judging aggregated news, characterized by comprising the following steps:
Keyword extraction step S1: screening the important keywords of an article;
Volume calculation step S2: vectorizing the text keywords using a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
Aggregated news judging step S3: classifying the articles into aggregated news and non-aggregated news using the volume as an index.
2. The method according to claim 1, wherein the keyword extraction step S1 includes:
Substep S11: training an IDF model based on the existing document set;
Substep S12: segmenting the news article to be judged into words, calculating the TF-IDF value of each word, and selecting the top N words by TF-IDF value as the keywords of the news article.
3. The aggregate news judging method of claim 1, wherein the volume calculating step S2 includes:
Substep S21: training a Word2Vec model based on the existing document set so as to learn the semantic relation among words;
Substep S22: mapping each keyword of the news article obtained in the keyword extraction step S1 into a high-dimensional vector through the Word2Vec model, and combining the vectors of all keywords of the article into a multidimensional matrix A;
Substep S23: multiplying the matrix A by its transposed matrix A^T, thereby performing an orthogonal projection that reduces the dimension of the matrix A and converts it into a square matrix B;
Substep S24: calculating the determinant of the square matrix B to obtain the volume V of the matrix.
4. The aggregated news judging method according to claim 1, wherein the aggregated news judging step S3 includes:
Substep S31: randomly selecting a batch of articles, manually labeling whether each article is aggregated news, calculating the matrix volume corresponding to each article through the keyword extraction step S1 and the volume calculation step S2, and using these data as the training set of the algorithm;
Substep S32: on the basis of the training set generated in substep S31, selecting candidate dividing thresholds m ∈ [m1, m2, m3, ..., m10] on the matrix volume dimension; for each candidate m, judging the articles in the training set whose matrix volume is larger than m to be aggregated news and the remaining articles to be non-aggregated news, comparing the results with the manual labels, and calculating the precision, recall and F1 value corresponding to m; iterating over the candidates to obtain F1 ∈ [F1_1, F1_2, F1_3, ..., F1_10], and selecting the threshold corresponding to the maximum F1 value as the unique threshold m;
Substep S33: performing the keyword extraction step S1 and the volume calculation step S2 on the other articles to be inferred, and judging them against the unique threshold m: if the matrix volume calculated for an article is larger than the unique threshold m, the article is judged to be aggregated news; otherwise it is judged to be non-aggregated news.
5. A device for determining aggregated news, comprising:
Keyword extraction module: used for screening the important keywords of an article;
Volume calculation module: used for vectorizing the text keywords using a semantic model, combining the word vectors of each article into a matrix, performing an orthogonal projection operation on the matrix to reduce its dimension, and calculating the volume of the matrix in space;
Aggregated news judging module: used for classifying the articles into aggregated news and non-aggregated news using the volume as an index.
6. The device for determining aggregated news according to claim 5, wherein the keyword extraction module comprises:
IDF training unit: used for training an IDF model based on the existing document set;
Word segmentation calculation unit: used for segmenting the news article to be judged into words, calculating the TF-IDF value of each word, and selecting the top N words by TF-IDF value as the keywords of the news article.
7. The device for determining aggregated news according to claim 5, wherein the volume calculation module comprises:
Word2Vec training unit: used for training a Word2Vec model based on the existing document set so as to learn the semantic relations among words;
Mapping unit: used for mapping each keyword of the news article obtained by the keyword extraction module into a high-dimensional vector through the Word2Vec model, and combining the vectors of all keywords of the article into a multidimensional matrix A;
Square matrix conversion unit: used for multiplying the matrix A by its transposed matrix A^T, thereby performing an orthogonal projection that reduces the dimension of the matrix A and converts it into a square matrix B;
Volume calculation unit: used for calculating the determinant of the square matrix B to obtain the volume V of the matrix.
8. The device for determining aggregated news according to claim 5, wherein the aggregated news judging module includes:
Labeling unit: used for randomly selecting a batch of articles, manually labeling whether each article is aggregated news, calculating the matrix volume corresponding to each article through the keyword extraction step S1 and the volume calculation step S2, and using these data as the training set of the algorithm;
Boundary dividing unit: used for, on the basis of the training set generated by the labeling unit, selecting candidate dividing thresholds m ∈ [m1, m2, m3, ..., m10] on the matrix volume dimension; for each candidate m, judging the articles in the training set whose matrix volume is larger than m to be aggregated news and the remaining articles to be non-aggregated news, comparing the results with the manual labels, and calculating the precision, recall and F1 value corresponding to m; iterating over the candidates to obtain F1 ∈ [F1_1, F1_2, F1_3, ..., F1_10], and selecting the threshold corresponding to the maximum F1 value as the unique threshold m;
Judging unit: used for processing the other articles to be inferred with the keyword extraction module and the volume calculation module, and judging them against the unique threshold m: if the matrix volume calculated for an article is larger than the unique threshold m, the article is judged to be aggregated news; otherwise it is judged to be non-aggregated news.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the aggregated news judgment method according to any one of claims 1 to 4 when the computer program is executed.
10. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the aggregated news judgment method according to any one of claims 1 to 4.
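For readers who want a concrete picture of the keyword extraction recited in claim 2 and the threshold selection recited in claim 4, the following plain-Python sketch may help. It is illustrative only and makes several assumptions that the claims do not specify: the function names (`train_idf`, `top_n_keywords`, `f1_at`, `select_threshold`) are hypothetical, documents are assumed to be already segmented into word lists, a smoothed IDF formula is used, words unseen during IDF training are treated as maximally rare, and the candidate thresholds are simply supplied by the caller.

```python
import math
from collections import Counter


def train_idf(corpus):
    """Substep S11 (sketch): corpus is a list of tokenized documents."""
    n = len(corpus)
    df = Counter(word for doc in corpus for word in set(doc))
    # Smoothed IDF; the exact formula is an assumption, not taken from the patent.
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def top_n_keywords(tokens, idf, n):
    """Substep S12 (sketch): keep the n words with the highest TF-IDF value."""
    tf = Counter(tokens)
    default_idf = max(idf.values(), default=1.0)  # unseen words: assumed rare
    scores = {w: (c / len(tokens)) * idf.get(w, default_idf) for w, c in tf.items()}
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


def f1_at(threshold, volumes, labels):
    """F1 value when articles with volume > threshold are predicted aggregated."""
    pairs = list(zip(volumes, labels))
    tp = sum(1 for v, y in pairs if v > threshold and y)
    fp = sum(1 for v, y in pairs if v > threshold and not y)
    fn = sum(1 for v, y in pairs if v <= threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_threshold(volumes, labels, candidates):
    """Substep S32 (sketch): pick the candidate threshold with the largest F1."""
    return max(candidates, key=lambda m: f1_at(m, volumes, labels))


# Toy data, made up for illustration only.
corpus = [["market", "rises", "today"],
          ["market", "falls", "today"],
          ["storm", "hits", "coast"]]
idf = train_idf(corpus)
keywords = top_n_keywords(["storm", "storm", "market", "today"], idf, 1)

volumes = [0.1, 0.2, 0.8, 0.9]          # matrix volumes of labeled articles
labels = [False, False, True, True]     # manual labels: True = aggregated news
m = select_threshold(volumes, labels, [0.05, 0.5, 0.85])
print(keywords, m)
```

Once a threshold has been selected this way, inference in the sense of substep S33 reduces to a single comparison of an article's matrix volume against m.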
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410308816.4A CN117910479B (en) | 2024-03-19 | 2024-03-19 | Method, device, equipment and medium for judging aggregated news |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410308816.4A CN117910479B (en) | 2024-03-19 | 2024-03-19 | Method, device, equipment and medium for judging aggregated news |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117910479A (en) | 2024-04-19 |
CN117910479B (en) | 2024-06-04 |
Family
ID=90693987
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410308816.4A Active CN117910479B (en) | 2024-03-19 | 2024-03-19 | Method, device, equipment and medium for judging aggregated news |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117910479B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106095737A (en) * | 2016-06-07 | 2016-11-09 | 杭州凡闻科技有限公司 | Documents Similarity computational methods and similar document the whole network retrieval tracking |
CN107122413A (en) * | 2017-03-31 | 2017-09-01 | 北京奇艺世纪科技有限公司 | A kind of keyword extracting method and device based on graph model |
CN110781377A (en) * | 2019-09-03 | 2020-02-11 | 腾讯科技(深圳)有限公司 | Article recommendation method and device |
CN111009321A (en) * | 2019-08-14 | 2020-04-14 | 电子科技大学 | Application method of machine learning classification model in juvenile autism auxiliary diagnosis |
CN112861990A (en) * | 2021-03-05 | 2021-05-28 | 电子科技大学 | Topic clustering method and device based on keywords and entities and computer-readable storage medium |
WO2021184674A1 (en) * | 2020-03-17 | 2021-09-23 | 上海爱数信息技术股份有限公司 | Text keyword extraction method, electronic device, and computer readable storage medium |
CN113742464A (en) * | 2021-07-28 | 2021-12-03 | 北京智谱华章科技有限公司 | News event discovery algorithm and device based on heterogeneous information network |
CN113887107A (en) * | 2021-10-13 | 2022-01-04 | 国网山东省电力公司电力科学研究院 | Hexahedron volume calculation method and system based on digital twin body |
US20220036011A1 (en) * | 2020-07-30 | 2022-02-03 | InfoAuthN AI Inc. | Systems and Methods for Explainable Fake News Detection |
WO2022227207A1 (en) * | 2021-04-30 | 2022-11-03 | 平安科技(深圳)有限公司 | Text classification method, apparatus, computer device, and storage medium |
CN116933052A (en) * | 2023-07-18 | 2023-10-24 | 国网信息通信产业集团有限公司北京分公司 | Substation data online monitoring system and method |
US20230350973A1 (en) * | 2022-04-27 | 2023-11-02 | Regents Of The University Of Michigan | Methods and Systems for Multilinear Discriminant Analysis Via Invariant Theory for Data Classification |
CN117271894A (en) * | 2023-09-22 | 2023-12-22 | 江苏大学 | Paper recommendation method based on hybrid network and DPP |
CN117473249A (en) * | 2023-09-28 | 2024-01-30 | 中国电信股份有限公司技术创新中心 | Modeling method and detection method of network flow detection model and related equipment |
Non-Patent Citations (2)
Title |
---|
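Yin Qian; Hu Xuegang; Xie Fei; Wu Xindong: "Keyword extraction from Chinese news web pages based on density clustering patterns", Journal of Guangxi Normal University (Natural Science Edition), no. 01, 15 March 2009 (2009-03-15), pages 1 - 5 * |
Qin Jidong; Peng Huafeng: "Target classification algorithm based on multiple discriminant analysis and Zernike moments", Journal of Information Engineering University, no. 06, 15 December 2017 (2017-12-15), pages 1 - 6 * |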
Also Published As
Publication number | Publication date |
---|---|
CN117910479B (en) | 2024-06-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |