CN112580355B - News information topic detection and real-time aggregation method - Google Patents


Info

Publication number
CN112580355B
Authority
CN
China
Prior art keywords
real
task
text
time
news information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011613849.8A
Other languages
Chinese (zh)
Other versions
CN112580355A (en)
Inventor
吴琼
刘武雷
王元卓
郭建永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Original Assignee
Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority to CN202011613849.8A
Publication of CN112580355A
Application granted
Publication of CN112580355B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of natural language processing and relates in particular to a method for news information topic detection and real-time aggregation. The method pushes news information in real time through data acquisition, data processing, construction of a text fusion model, and real-time aggregation. A text feature model is built with a multi-feature fusion method, and a distributed real-time streaming computation method distributes the topic clustering tasks to different computing nodes. This improves the accuracy and timeliness of real-time news aggregation and solves the performance problem of a single node. The aggregated news can finally be pushed to end users through a terminal device, which is convenient and practical.

Description

News information topic detection and real-time aggregation method
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a multi-data news information topic detection and real-time aggregation method.
Background
The continuous innovation and rapid development of information technology have profoundly affected how news spreads. Internet-based media platforms keep multiplying, and both the speed and the volume of news grow daily. Online news has become disorganized: different media platforms forward and copy the same news items, so homogenization is a serious problem. How to use computer technology to automatically mine and analyze noteworthy hot topics from this vast body of information, and to present the aggregated hot news to users comprehensively and in real time, is therefore a current research focus for online news. In addition, as the volume of online news data grows rapidly, the original serial topic discovery and tracking methods often cannot run effectively on massive news data sets because of constraints such as memory capacity, and they struggle to meet timeliness requirements.
Disclosure of Invention
As news data grows sharply, conventional serial topic discovery and tracking methods often cannot run effectively on massive news data sets because of constraints such as memory capacity, and they struggle to meet timeliness requirements. To address these shortcomings, the invention provides a topic detection and aggregation method for multi-data news information. It builds a text feature model with a multi-feature fusion method and distributes the topic clustering tasks to different computing nodes, which improves the accuracy and timeliness of real-time news aggregation and solves the performance problem of a single node.
The technical scheme adopted by the invention for solving the technical problems is as follows: a news information topic detection and real-time aggregation method comprises the following steps:
step one, distributed data acquisition: collecting news information from an internet news media website in real time through a distributed collection program to serve as original data;
step two, data preprocessing: performing text denoising, Chinese word segmentation, stop-word filtering, part-of-speech tagging, keyword extraction and named entity recognition on the original data to obtain the document set D to be processed;
step three, constructing a text feature model: the text feature model is constructed by utilizing a multi-feature fusion method, and the model construction method comprises the following steps:
(1) obtaining the topic features of the text by combining named entity recognition with an LDA model, taking the document set D as input, and calculating the topic similarity $sim(p,q)_{lda}$ of texts p and q:
[Formula image GDA0003099045890000021]
In the formula: p and q are the probability vectors of the texts, and $D_{KL}$ is the vector distance calculated using relative entropy;
(2) obtaining the semantic features of the text with a Word2Vec model, and calculating the semantic similarity $sim(p,q)_{v2q}$ of text p and text q using cosine similarity:
$$sim(p,q)_{v2q} = \frac{\sum_{i} p_i q_i}{\sqrt{\sum_{i} p_i^2}\,\sqrt{\sum_{i} q_i^2}}$$
(3) fusing the topic features and the semantic features with weighting factors to obtain the text fusion model:
$$sim(p,q) = \alpha \cdot sim(p,q)_{lda} + \beta \cdot sim(p,q)_{v2q}$$
In the formula: $\alpha$ and $\beta$ are weighting factors, with $\alpha + \beta = 1$;
(4) adding a time attenuation factor to the text fusion model to update it, the updated text similarity being calculated as:
$$sim(p,q) = e^{-k(t_2 - t_1)} \cdot \alpha \cdot sim(p,q)_{lda} + e^{-k(t_2 - t_1)} \cdot \beta \cdot sim(p,q)_{v2q}$$
In the formula: k is the attenuation factor, and $t_2$ and $t_1$ are the update times of the two articles;
step four, distributed real-time clustering: the method for clustering news information in real time by adopting a distributed real-time clustering algorithm comprises the following steps:
(1) vectorizing the collected and preprocessed text, transferring the vector data to task scheduling nodes of a distributed real-time aggregation algorithm according to an input sequence, uniformly numbering the tasks by the task scheduling nodes, and then issuing the tasks to task execution nodes;
(2) traversing the feature vectors of the text by the task execution nodes, and calculating the similarity of each vector and other vectors of the calculation nodes according to the updated text fusion model to obtain a candidate similarity set;
(3) selecting the maximum similarity from the candidate similarity set and recording the feature vector corresponding to it, to form a feature vector similarity set;
(4) filtering out of the feature vector similarity set the combinations whose similarity is smaller than a specified threshold to obtain a filtering set, and outputting the result to the message middleware;
(5) taking the filtering set out of the message middleware, merging the sets that contain the same text and outputting them, until no cluster is updated any more, to obtain the news information clustered in real time.
Step five, pushing in real time: pushing the news information clustered in real time to the user in real time through a visualization tool.
In the above news information topic detection and real-time aggregation method, the internet news media data is various news information from various media platforms.
According to the news information topic detection and real-time aggregation method, in step one, data acquisition adopts a distributed architecture design: a task generation module generates the acquisition tasks, and a task execution module executes them.
According to the news information topic detection and real-time aggregation method, message middleware can be arranged between the task generation module and the task execution module, and the two modules are respectively in communication connection with the message middleware to finish data transmission.
According to the news information topic detection and real-time aggregation method, the distributed acquisition program comprises a task scheduling center and task acquisition nodes. The task scheduling center obtains tasks from the task list and issues the acquisition tasks to specific task acquisition nodes through the message middleware, generating the corresponding acquisition tasks to be executed; the task acquisition nodes execute the acquisition tasks and download the news page data.
The invention has the beneficial effects that:
The text similarity calculation builds a text feature model with a multi-feature fusion method, obtains the topic features of the text with named entity recognition and an LDA model, and takes both named entity factors and time factors fully into account to construct a framework for text similarity calculation.
The invention adopts a distributed real-time clustering algorithm to distribute topic clustering tasks to different computing nodes, improves the accuracy and real-time performance of news information real-time aggregation, and solves the performance problem under a single node.
According to the topic detection and real-time aggregation method for multi-data news information, real-time pushing of news information is achieved through data acquisition, data processing, text fusion model construction and real-time aggregation. During data acquisition, a task generation module generates the acquisition tasks and a task execution module executes them; the scheduler can dynamically expand or reduce the resources of the two modules according to the task volume without affecting normal operation of the system, which ensures acquisition efficiency.
Drawings
FIG. 1 is a schematic view of the overall process of the present invention.
Fig. 2 is a schematic diagram of data acquisition processing according to the present invention.
Detailed Description
The invention provides a topic detection and aggregation method for multi-data news information on the basis of constructing a text feature model by using a multi-feature fusion method, and topic clustering tasks are distributed to different computing nodes, so that the accuracy and the real-time performance of news information real-time aggregation are improved, and the performance problem under a single node is solved. The invention is further illustrated with reference to the following figures and examples.
As shown in fig. 1, the news information topic detection and real-time aggregation method of the present invention includes the following steps.
Step one, distributed data acquisition: the news information from the internet news media website is collected in real time through a distributed collection program to serve as original data.
The method comprises the following steps:
(1) generating the acquisition tasks: generating the corresponding acquisition tasks according to the data volume of the data source, and sending the generated tasks to the message middleware;
(2) receiving the acquisition tasks from the message middleware, executing them, and acquiring data according to each task to obtain the first data.
The internet news media data is news information of various kinds from various media platforms (including but not limited to the websites of traditional news media in various cities and internet news sites). In a specific implementation, Scrapy can be used as the framework of the acquisition program: the task acquisition module extracts tasks according to the initialized data sources and the task extraction rules, and writes the parsed tasks to Kafka as acquisition tasks; the acquisition module reads the tasks from Kafka, performs data acquisition, and completes preprocessing and storage. During operation, the scheduler can dynamically start or suspend some of the task acquisition or execution nodes according to the number of tasks pending in Kafka.
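The split between task generation and task execution through a message queue can be sketched as follows. This is a minimal single-process illustration, not the patented implementation: Python's `queue.Queue` stands in for the message middleware, threads stand in for acquisition nodes, and `generate_tasks`, `run_worker`, `fake_fetch` and `collect` are hypothetical names introduced only for this example.

```python
import queue
import threading


def generate_tasks(sources, task_queue):
    """Task generation: turn each data source into one acquisition task."""
    for task_id, url in enumerate(sources):
        task_queue.put({"id": task_id, "url": url})


def run_worker(task_queue, fetch, results, lock):
    """Task execution: pull tasks until the queue is empty, then stop."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return
        page = fetch(task["url"])
        with lock:
            results[task["id"]] = page
        task_queue.task_done()


def fake_fetch(url):
    # Placeholder for a real page download (assumption for this sketch).
    return "<html>" + url + "</html>"


def collect(sources, n_workers=3):
    task_queue = queue.Queue()  # stand-in for the message middleware
    results, lock = {}, threading.Lock()
    generate_tasks(sources, task_queue)
    workers = [
        threading.Thread(target=run_worker,
                         args=(task_queue, fake_fetch, results, lock))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


pages = collect(["https://example.com/a", "https://example.com/b"])
```

Because the generator and the workers only share the queue, either side can be scaled or replaced without touching the other, which is the property the text attributes to placing middleware between the two modules.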
Alternatively, a distributed architecture design is adopted: the task generation module generates the acquisition tasks, and the task execution module executes them. A message middleware can be arranged between the task generation module and the task execution module, and the two modules are each communicatively connected to the message middleware to complete data transmission. The scheduler can dynamically expand or reduce the resources of the two modules according to the task volume without affecting normal operation of the system, which ensures acquisition efficiency.
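One way a scheduler might decide how many execution nodes to run from the pending task count is sketched below. The text does not specify a scaling policy, so the proportional rule, the default limits, and the names `target_worker_count` and `scaling_action` are all assumptions made for illustration.

```python
import math


def target_worker_count(backlog, tasks_per_worker=100,
                        min_workers=1, max_workers=16):
    """Return how many execution nodes to keep running for a given backlog.

    Assumed policy: one node per `tasks_per_worker` pending tasks,
    clamped to the range [min_workers, max_workers].
    """
    needed = math.ceil(backlog / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))


def scaling_action(current_workers, backlog, **limits):
    """Translate the target count into a start / suspend / hold decision."""
    target = target_worker_count(backlog, **limits)
    if target > current_workers:
        return ("start", target - current_workers)
    if target < current_workers:
        return ("suspend", current_workers - target)
    return ("hold", 0)


action = scaling_action(current_workers=2, backlog=950)
```

In this sketch the decision depends only on the backlog, so nodes can be started or suspended while the rest of the system keeps running.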
Step two, data preprocessing: performing text denoising, Chinese word segmentation, stop-word filtering, part-of-speech tagging, keyword extraction and named entity recognition on the original data to obtain the document set D to be processed;
for the Chinese word segmentation, because the current Chinese word segmentation technology is relatively mature and the word segmentation effect of the mainstream word segmentation tool is relatively close, the open-source Chinese word segmentation tool, such as the jieba word segmentation, can be directly used in the implementation.
For stop-word filtering, starting from a stop-word set commonly used in natural language processing, new stop words can be added to the stop-word library to refine it according to the characteristics of news aggregation.
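Extending a general stop-word set with news-specific entries can be sketched as follows. The example works on text that has already been segmented (for instance by a tool such as jieba, which is not called here); both word lists and the function name `filter_stop_words` are hypothetical and chosen only for illustration.

```python
# General-purpose stop words (illustrative subset, not a real stop-word library).
COMMON_STOP_WORDS = {"的", "了", "是", "在", "和"}
# News-specific additions (hypothetical examples of boilerplate reporting terms).
NEWS_STOP_WORDS = {"记者", "报道", "本报讯"}


def filter_stop_words(tokens, extra_stop_words=()):
    """Drop empty tokens and any token found in the combined stop-word set."""
    stop_words = COMMON_STOP_WORDS | NEWS_STOP_WORDS | set(extra_stop_words)
    return [t for t in tokens if t.strip() and t not in stop_words]


tokens = ["记者", "报道", "北京", "的", "天气", "是", "晴天"]
cleaned = filter_stop_words(tokens)
```

Keeping the news-specific words in a separate set makes it easy to grow that list without modifying the general one.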
Step three, constructing a text feature model: constructing a text feature model by using a multi-feature fusion method, wherein the model construction method comprises the following steps:
(1) Obtaining the topic features of the text with named entity recognition and an LDA model, taking the document set D as input, and calculating the topic similarity $sim(p,q)_{lda}$ of texts p and q:
[Formula image GDA0003099045890000071]
In the formula: p and q are the probability vectors of the texts, and $D_{KL}$ is the vector distance calculated using relative entropy.
(2) Obtaining the semantic features of the text with a Word2Vec model, and calculating the semantic similarity $sim(p,q)_{v2q}$ of text p and text q using cosine similarity:
$$sim(p,q)_{v2q} = \frac{\sum_{i} p_i q_i}{\sqrt{\sum_{i} p_i^2}\,\sqrt{\sum_{i} q_i^2}}$$
In the formula: $p_i$ and $q_i$ are the components of the vectors representing texts p and q, respectively.
The significance of the Word2Vec (word vector) model is that it converts natural language into vectors that a computer can compute with. Word2Vec is a word vector model proposed by Google. The Word2Vec tool mainly comprises two models: the continuous bag-of-words model (CBOW) and the skip-gram model. CBOW obtains word vectors by training to predict a target word from its context; skip-gram obtains word vectors by training to predict the surrounding words from the target word. In the specific implementation, because skip-gram performs well on large-scale corpora, skip-gram is used to build the word vectors, with the news2016zh corpus as the training corpus, and the trained model is used to represent texts.
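The difference between the two training objectives can be seen in the training examples each one draws from a sentence. The sketch below only generates those examples and does not train a model; the window size and the function names `skip_gram_pairs` and `cbow_examples` are assumptions for illustration.

```python
def skip_gram_pairs(tokens, window=2):
    """Skip-gram: each target word is paired with every word in its window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the words in the window jointly predict the word in the middle."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


sentence = ["news", "topic", "detection"]
pairs = skip_gram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)
```

Skip-gram yields one example per (target, neighbour) pair, so it produces more training signal per sentence than CBOW, which yields one example per position.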
(3) Fusing the topic features and the semantic features with weighting factors to obtain the text fusion model:
$$sim(p,q) = \alpha \cdot sim(p,q)_{lda} + \beta \cdot sim(p,q)_{v2q}$$
In the formula: $\alpha$ and $\beta$ are weighting factors, with $\alpha + \beta = 1$.
(4) Adding a time attenuation factor to the text fusion model to update it; the updated text similarity is calculated as follows:
$$sim(p,q) = e^{-k(t_2 - t_1)} \cdot \alpha \cdot sim(p,q)_{lda} + e^{-k(t_2 - t_1)} \cdot \beta \cdot sim(p,q)_{v2q}$$
In the formula: k is the attenuation factor, and $t_2$ and $t_1$ are the update times of the two articles.
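The four sub-steps of the text fusion model can be sketched together as follows. The weighted fusion and the time attenuation follow the formulas given in the text. The topic similarity is an assumption: the text only states that it is based on relative entropy, so this sketch uses one minus the normalized Jensen-Shannon divergence as a stand-in. The default values of `alpha` and `k` and all function names are likewise hypothetical.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def topic_similarity(p, q):
    """ASSUMPTION: 1 - Jensen-Shannon divergence, normalized to [0, 1].

    The text only says the topic similarity uses relative entropy; this
    symmetric, bounded variant is a stand-in for the unreproduced formula.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 1.0 - js / math.log(2)


def cosine_similarity(p, q):
    """Cosine similarity of two text vectors (0.0 if either is all zeros)."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm else 0.0


def fused_similarity(topic_p, topic_q, vec_p, vec_q, t1, t2, alpha=0.5, k=0.01):
    """Weighted fusion with time attenuation; assumes t2 >= t1."""
    beta = 1.0 - alpha  # the text requires alpha + beta = 1
    decay = math.exp(-k * (t2 - t1))
    return (decay * alpha * topic_similarity(topic_p, topic_q)
            + decay * beta * cosine_similarity(vec_p, vec_q))


same_time = fused_similarity([0.5, 0.5], [0.5, 0.5], [1.0, 0.0], [1.0, 0.0], t1=0, t2=0)
later = fused_similarity([0.5, 0.5], [0.5, 0.5], [1.0, 0.0], [1.0, 0.0], t1=0, t2=10, k=0.1)
```

With identical inputs the fused score starts at its maximum and shrinks as the gap between the two update times grows, so older articles are less likely to be merged into a fresh topic.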
Step four, distributed real-time clustering: the method for clustering news information in real time by adopting a distributed real-time clustering algorithm comprises the following steps:
(1) Vectorizing the collected and preprocessed text, transferring the vector data in input order to the task scheduling nodes of the distributed real-time aggregation algorithm, the task scheduling nodes numbering the tasks uniformly and then issuing them to the task execution nodes.
(2) Traversing the feature vectors of the text by the task execution nodes, and calculating the similarity of each vector and other vectors of the calculation nodes according to the updated text fusion model to obtain a candidate similarity set;
(3) selecting the maximum similarity from the candidate similarity set and recording the feature vector corresponding to it, to form a feature vector similarity set;
(4) filtering out of the feature vector similarity set the combinations whose similarity is smaller than a specified threshold to obtain a filtering set, and outputting the result to the message middleware;
(5) taking the filtering set out of the message middleware, merging the sets that contain the same text and outputting them, until no cluster is updated any more, to obtain the news information clustered in real time.
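Sub-steps (2) to (5) can be sketched on a single node as follows. This is an in-memory illustration under stated assumptions: plain cosine similarity replaces the updated fusion model, no message middleware or task scheduling is involved, and the names `best_matches`, `filter_pairs` and `merge_sets` are hypothetical.

```python
import math


def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def best_matches(vectors, similarity):
    """Sub-steps (2)-(3): for each text, keep its most similar other text."""
    best = {}
    for i, vi in vectors.items():
        candidates = [(similarity(vi, vj), j) for j, vj in vectors.items() if j != i]
        if candidates:
            best[i] = max(candidates)
    return best


def filter_pairs(best, threshold):
    """Sub-step (4): drop pairs whose similarity is below the threshold."""
    return [{i, j} for i, (score, j) in best.items() if score >= threshold]


def merge_sets(pairs, all_ids):
    """Sub-step (5): merge sets sharing a text until no cluster changes."""
    clusters = [set(p) for p in pairs]
    changed = True
    while changed:
        changed = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if clusters[a] & clusters[b]:
                    clusters[a] |= clusters.pop(b)
                    changed = True
                    break
            if changed:
                break
    covered = set().union(*clusters) if clusters else set()
    clusters += [{i} for i in all_ids if i not in covered]  # unmatched texts
    return sorted(sorted(c) for c in clusters)


vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.1, 0.9]}
clusters = merge_sets(filter_pairs(best_matches(vectors, cosine), 0.8), vectors)
```

Because each node only emits its above-threshold pairs, the final merge can run wherever the filtered sets are collected, independently of how the similarity work was distributed.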
Step five, pushing in real time: pushing the news information clustered in real time to the user in real time through a visualization tool.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and scope of the present invention are intended to be covered thereby.

Claims (5)

1. A news information topic detection and real-time aggregation method is characterized in that: the method comprises the following steps:
step one, distributed data acquisition: collecting news information from an internet news media website in real time through a distributed collection program to serve as original data;
step two, data preprocessing: performing text denoising, Chinese word segmentation, stop-word filtering, part-of-speech tagging, keyword extraction and named entity recognition on the original data to obtain the document set D to be processed;
step three, constructing a text feature model: the text feature model is constructed by utilizing a multi-feature fusion method, and the model construction method comprises the following steps:
(1) obtaining the topic features of the text with named entity recognition and an LDA model, taking the document set D as input, and calculating the topic similarity $sim(p,q)_{lda}$ of texts p and q:
[Formula image FDA0003099045880000011]
In the formula: p and q are the probability vectors of the texts, and $D_{KL}$ is the vector distance calculated using relative entropy;
(2) obtaining the semantic features of the text with a Word2Vec model, and calculating the semantic similarity $sim(p,q)_{v2q}$ of text p and text q using cosine similarity:
$$sim(p,q)_{v2q} = \frac{\sum_{i} p_i q_i}{\sqrt{\sum_{i} p_i^2}\,\sqrt{\sum_{i} q_i^2}}$$
(3) fusing the topic features and the semantic features with weighting factors to obtain the text fusion model:
$$sim(p,q) = \alpha \cdot sim(p,q)_{lda} + \beta \cdot sim(p,q)_{v2q}$$
In the formula: $\alpha$ and $\beta$ are weighting factors, with $\alpha + \beta = 1$;
(4) adding a time attenuation factor to the text fusion model to update it, the updated text similarity being calculated as:
$$sim(p,q) = e^{-k(t_2 - t_1)} \cdot \alpha \cdot sim(p,q)_{lda} + e^{-k(t_2 - t_1)} \cdot \beta \cdot sim(p,q)_{v2q}$$
In the formula: k is the attenuation factor, and $t_2$ and $t_1$ are the update times of the two articles;
step four, distributed real-time clustering: the method for clustering news information in real time by adopting a distributed real-time clustering algorithm comprises the following steps:
(1) vectorizing the collected and preprocessed text, transferring the vector data to task scheduling nodes of a distributed real-time aggregation algorithm according to an input sequence, uniformly numbering the tasks by the task scheduling nodes, and then issuing the tasks to task execution nodes;
(2) traversing the feature vectors of the text by the task execution nodes, and calculating the similarity of each vector and other vectors of the calculation nodes according to the updated text fusion model to obtain a candidate similarity set;
(3) selecting the maximum similarity from the candidate similarity set and recording the feature vector corresponding to it, to form a feature vector similarity set;
(4) filtering combinations with the similarity smaller than a specified threshold value from the feature vector similarity set to obtain a filtering set, and outputting the result to the message middleware;
(5) taking out a filtering set from the message middleware, merging and outputting the sets with the same text until all clusters are not updated any more, and obtaining real-time clustered news information;
step five, pushing in real time: pushing the news information clustered in real time to the user in real time through a visualization tool.
2. The news information topic detection and real-time aggregation method as claimed in claim 1, wherein: the internet news media website data is various news information from various media platforms.
3. The news information topic detection and real-time aggregation method as claimed in claim 1, wherein: in step one, data acquisition adopts a distributed architecture design, a task generation module generates the acquisition tasks, and a task execution module executes them.
4. The news information topic detection and real-time aggregation method as claimed in claim 3, wherein: and a message middleware can be arranged between the task generating module and the task executing module, and the two modules are respectively in communication connection with the message middleware to finish data transmission.
5. The news information topic detection and real-time aggregation method as claimed in claim 1, wherein: the distributed acquisition program comprises a task scheduling center and task acquisition nodes, wherein the task scheduling center acquires tasks from a task list and issues the acquisition tasks to the specific task acquisition nodes to generate corresponding acquisition tasks to be executed; the task acquisition node is used for executing an acquisition task and downloading and acquiring page news data.
CN202011613849.8A 2020-12-30 2020-12-30 News information topic detection and real-time aggregation method Active CN112580355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011613849.8A CN112580355B (en) 2020-12-30 2020-12-30 News information topic detection and real-time aggregation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011613849.8A CN112580355B (en) 2020-12-30 2020-12-30 News information topic detection and real-time aggregation method

Publications (2)

Publication Number Publication Date
CN112580355A (en) 2021-03-30
CN112580355B (en) 2021-08-31

Family

ID=75145101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011613849.8A Active CN112580355B (en) 2020-12-30 2020-12-30 News information topic detection and real-time aggregation method

Country Status (1)

Country Link
CN (1) CN112580355B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117077632B (en) * 2023-10-18 2024-01-09 北京国科众安科技有限公司 Automatic generation method for information theme

Citations (5)

Publication number Priority date Publication date Assignee Title
CN107526819A (en) * 2017-08-29 2017-12-29 江苏飞搏软件股份有限公司 A kind of big data the analysis of public opinion method towards short text topic model
CN109492157A (en) * 2018-10-24 2019-03-19 华侨大学 Based on RNN, the news recommended method of attention mechanism and theme characterizing method
CN109885675A (en) * 2019-02-25 2019-06-14 合肥工业大学 Method is found based on the text sub-topic for improving LDA
US10460035B1 (en) * 2016-12-26 2019-10-29 Cerner Innovation, Inc. Determining adequacy of documentation using perplexity and probabilistic coherence
CN110795533A (en) * 2019-10-22 2020-02-14 王帅 Long text-oriented theme detection method

Family Cites Families (20)

Publication number Priority date Publication date Assignee Title
US8781989B2 (en) * 2008-01-14 2014-07-15 Aptima, Inc. Method and system to predict a data value
CN105677769B (en) * 2015-12-29 2018-01-05 广州神马移动信息科技有限公司 Keyword recommendation method and system based on a latent Dirichlet allocation (LDA) model
CN106202065B (en) * 2016-06-30 2018-12-21 中央民族大学 Across the language topic detecting method of one kind and system
CN106951463A (en) * 2017-02-27 2017-07-14 宇龙计算机通信科技(深圳)有限公司 News push method and system
US10671936B2 (en) * 2017-04-06 2020-06-02 Universite Paris Descartes Method for clustering nodes of a textual network taking into account textual content, computer-readable storage device and system implementing said method
CN107423337A (en) * 2017-04-27 2017-12-01 天津大学 News topic detection method based on LDA Fusion Models and multi-level clustering
CN107463605B (en) * 2017-06-21 2021-06-11 北京百度网讯科技有限公司 Method and device for identifying low-quality news resource, computer equipment and readable medium
CN107861939B (en) * 2017-09-30 2021-05-14 昆明理工大学 Domain entity disambiguation method fusing word vector and topic model
CN108509517B (en) * 2018-03-09 2021-05-11 东南大学 Streaming topic evolution tracking method for real-time news content
CN108519971B (en) * 2018-03-23 2022-02-11 中国传媒大学 Cross-language news topic similarity comparison method based on parallel corpus
CN108920482B (en) * 2018-04-27 2020-08-21 浙江工业大学 Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN108920508A (en) * 2018-05-29 2018-11-30 福建新大陆软件工程有限公司 Textual classification model training method and system based on LDA algorithm
CN108829799A (en) * 2018-06-05 2018-11-16 中国人民公安大学 Based on the Text similarity computing method and system for improving LDA topic model
CN109033320B (en) * 2018-07-18 2021-02-12 无码科技(杭州)有限公司 Bilingual news aggregation method and system
CN109710728B (en) * 2018-11-26 2022-05-17 西南电子技术研究所(中国电子科技集团公司第十研究所) Automatic news topic discovery method
CN111858918A (en) * 2019-04-30 2020-10-30 中移(苏州)软件技术有限公司 News classification method and device, network element and storage medium
CN110297988B (en) * 2019-07-06 2020-05-01 四川大学 Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN110738053A (en) * 2019-10-14 2020-01-31 广东南方新媒体科技有限公司 News theme recommendation algorithm based on semantic analysis and supervised learning model
CN111144453A (en) * 2019-12-11 2020-05-12 中科院计算技术研究所大数据研究院 Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data
CN111460289B (en) * 2020-03-27 2024-03-29 北京百度网讯科技有限公司 News information pushing method and device

Also Published As

Publication number Publication date
CN112580355A (en) 2021-03-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant