CN113934910A - Automatic optimization and updating theme library construction method and hot event real-time updating method - Google Patents

Info

Publication number
CN113934910A
Authority
CN
China
Prior art keywords
word
text
document
title
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111188831.2A
Other languages
Chinese (zh)
Inventor
周洁琴
周金明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Inspector Intelligent Technology Co Ltd
Original Assignee
Nanjing Inspector Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Inspector Intelligent Technology Co Ltd
Priority to CN202111188831.2A
Publication of CN113934910A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/3332: Query translation
    • G06F16/3335: Syntactic pre-processing, e.g. stopword elimination, stemming
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for constructing an automatically optimized and updated topic library and a method for updating hot events in real time. The topic library construction method comprises the following steps: acquiring historical document data for the topic library by determining the internet information sources of the historical document data and crawling, with a crawler tool, the historical document data of the determined source websites over a period of time; performing text preprocessing on the acquired historical document data and merging the title and body text of each preprocessed document; computing the weight of the words appearing in each document with the TF-IDF algorithm; clustering the documents with a combination of clustering algorithms; and setting a crawling cycle for the crawler tool to update the keyword dictionary of each topic. Keyword dictionaries for the different topics are generated by crawling historical web page data, and the topic library is then updated automatically by directionally crawling web page data in real time according to those keyword dictionaries.

Description

Automatic optimization and updating theme library construction method and hot event real-time updating method
Technical Field
The invention relates to the field of natural language processing, and in particular to a method for constructing an automatically optimized and updated topic library and a method for updating hot events in real time.
Background
With the rapid development of the internet, every internet platform publishes a large number of hot events each day. Because the volume of information is huge and hard to supervise, hot events suffer from serious information overload, topic content is of uneven quality, and duplicated content under different titles and clickbait ("title party") headlines are common, so users cannot accurately learn what the current hot events are. Automatically discovering hot events from online news is therefore very important.
In the process of implementing the invention, the inventors found at least the following problems in the prior art. Current automatic hot-event mining techniques in China require the topic to be determined manually, or cannot update the topic automatically, and therefore lag behind events. Their coverage is insufficient: different groups pay different degrees of attention to hot events under different topics, so hot events need to be presented per topic. Clickbait also occurs, where the title does not match the content in order to attract attention. An automatic hot-event mining technique is therefore needed that mines hot events directly and automatically, improves mining efficiency, provides automatically updated topic libraries, and improves the accuracy of hot-topic identification.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a method for constructing an automatically optimized and updated topic library and a method for updating hot events in real time, which automatically extract hot events in real time from massive internet text. The technical solution is as follows:
In a first aspect, a method for constructing an automatically optimized and updated topic library is provided, comprising the following steps:
acquiring historical document data for the topic library: determining the internet information sources of the historical document data,
and crawling the historical document data of the determined source websites over a period of time with a crawler tool;
performing text preprocessing on the acquired historical document data, including web page title extraction and body text extraction: analyzing the web page structure of each website to which the historical document data belongs, and extracting the title and body text of each web page in a multi-process manner using regular expression rules specific to each website; removing noise data and filtering out redundant, meaningless information in the web pages;
merging the title and body text of each preprocessed historical document, and performing word segmentation and stop word removal on the sentences;
computing the weight of each word appearing in each document with the TF-IDF algorithm, using the following formulas:
[Formula images BDA0003300378660000021, BDA0003300378660000022, BDA0003300378660000023]
where count(W) is the number of occurrences of word W, |D_i| is the total number of words in document D_i, n is the total number of documents, and I(W, D_i) indicates whether document D_i contains word W (1 if it does, 0 otherwise);
clustering the documents with a combination of the LDA and k-means algorithms: obtaining the document-topic distribution from an LDA topic model, using that distribution as the input to k-means, clustering all documents into c categories, and using the several highest-weight keywords in each category as that category's topic name;
processing the news data of each topic and extracting keywords, including selecting verbs and nouns and removing duplicates, to obtain all distinct keywords under each topic as that topic's keyword dictionary, so that the c categories yield a topic library containing c topics;
setting a crawling cycle for the crawler tool, directionally crawling news data with the keyword dictionary of each topic, and preprocessing the news data (preferably including word segmentation and duplicate removal) to update the keyword dictionary of each topic.
Preferably, the internet information sources are well-known news websites and portal websites, used as corpus sources.
Preferably, crawling the historical document data of the determined source websites over a period of time specifically means crawling the historical data of the most recent year.
Preferably, using the several highest-weight keywords in each category as the category's topic name specifically means using the top 3 highest-weight keywords in each category.
Preferably, the number of categories c is determined by the silhouette coefficient method.
Preferably, using the several highest-weight keywords in each category as the category's topic name specifically means using the first three highest-weight keywords in each category.
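The three formula images of the TF-IDF step are not reproduced in this text. As a hedged reading, a standard TF-IDF form that is consistent with the symbol definitions given for that step (count(W), |D_i|, n, I(W, D_i)) would be the following. Whether the original adds smoothing (for example, a +1 term) or fixes a particular logarithm base cannot be determined from the text, so the exact form and the symbol names TF, IDF, and Weight are assumptions.

```latex
\begin{align}
\mathrm{TF}(W, D_i) &= \frac{\mathrm{count}(W)}{|D_i|} \\
\mathrm{IDF}(W) &= \log \frac{n}{\sum_{i=1}^{n} I(W, D_i)} \\
\mathrm{Weight}_{W, D_i} &= \mathrm{TF}(W, D_i) \cdot \mathrm{IDF}(W)
\end{align}
```

Under this form, a word that occurs in every document has IDF equal to zero and therefore contributes no weight.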
In a second aspect, the invention provides a method for updating hot events in real time, comprising the following steps:
obtaining the keyword dictionary of each topic in the topic library with the topic library construction method of any of the possible implementations above; directionally crawling the latest news web pages according to the keyword dictionary of each topic; applying automatic text summarization to the news text under each topic and using the generated core content as the title (to avoid clickbait titles); computing the similarity Similarity_{X,Y} between the summary results under each topic, where two texts whose similarity exceeds a given threshold are treated as the same text i; and counting the number Count_i of text i and the total number of documents under that topic;
computing the heat value of each news item with the following heat value formula:
[Formula image BDA0003300378660000031]
where H_0 is the initial heat value; a is the count weight coefficient, Count_i is the number of occurrences of text i, and N is the total number of documents; b is the time weight coefficient, k is the time coefficient, T_1 is the news release time, and T_0 is the current time;
sorting by heat value in descending order, setting the number p of hot events to display, and updating the top p hot events under each topic periodically (for example, daily or weekly).
Preferably, automatic text summarization is applied to the news text under each topic and the generated core content is used as the title, specifically as follows:
(1) Split document D_i into complete sentences to obtain a set of sentences.
(2) Preprocess each sentence with a word segmentation tool (for example, tools for Chinese or Japanese), including word segmentation, stop word removal, and part-of-speech tagging.
(3) First compute the weight of each word with the TF-IDF algorithm:
[Formula image BDA0003300378660000033]
Then adjust this weight by a title factor and a part-of-speech factor to obtain a new weight:
[Formula image BDA0003300378660000034]
The calculation is as follows:
[Formula image BDA0003300378660000035]
where T_w is the title factor: if word w appears in the document title, T_w > 1, and its specific value is tuned according to the quality of the resulting text summary (a larger T_w treats the title as more important); if word w does not appear in the document title, T_w = 1. P_w is the part-of-speech coefficient: if word w is an entity name, P_w > 1; otherwise P_w = 1.
From the new weight
[Formula image BDA0003300378660000036]
a text sentence vector d_i is generated, expressed as:
[Formula image BDA0003300378660000032]
(4) Compute the similarity between sentences using cosine similarity and construct a similarity matrix:
[Formula image BDA0003300378660000041]
where X_i is the weight Weight_{W,X} of the i-th word of text sentence vector X, and Y_i is the weight Weight_{W,Y} of the i-th word of text sentence vector Y.
(5) Iterate with TextRank to obtain a score for each sentence.
(6) Sort the sentences by importance and select those whose importance exceeds a preset value as candidate summary sentences.
(7) Extract sentences from the candidates to form the summary result, subject to the configured word-count or sentence-count limit.
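The formula images for steps (3) to (5) are not reproduced in this text. The cosine similarity and the TextRank update have well-known standard forms, shown below. The adjusted-weight line is only an assumption: a product is one form consistent with T_w = 1 and P_w = 1 leaving the weight unchanged, but the original formula may differ. The symbols S, d (damping factor), w_{ji} (edge weight taken from the similarity matrix), In, and Out follow common TextRank notation and do not come from the source.

```latex
\begin{align}
\mathrm{Weight}'_{w} &= \mathrm{Weight}_{w} \cdot T_w \cdot P_w
  && \text{(assumed form)} \\
\mathrm{Similarity}_{X,Y} &=
  \frac{\sum_{i} X_i Y_i}{\sqrt{\sum_{i} X_i^{2}}\,\sqrt{\sum_{i} Y_i^{2}}} \\
S(V_i) &= (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)}
  \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}}\, S(V_j)
\end{align}
```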
Compared with the prior art, the above technical solutions have the following beneficial effects: keyword dictionaries for the different topics are generated by crawling historical web page data, and the topic library is then updated by directionally crawling web page data in real time according to those dictionaries. For the latest web page data, automatic summarization extracts a text summary to serve as the title, which effectively avoids clickbait content, and the hot events of each topic are ranked by heat value.
Detailed Description
In order to clarify the technical solution and the working principle of the present invention, the embodiments of the present disclosure will be described in further detail below. All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
The numerical descriptions of the terms "(1)", "(2)", "(3)" etc. in the description and claims of this application are for distinguishing between similar items and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that the embodiments of the application described herein may be practiced in sequences other than those described herein.
In a first aspect, an embodiment of the disclosure provides a method for constructing an automatically optimized and updated topic library, comprising the following steps:
Acquiring historical document data for the topic library: determine the internet information sources of the historical document data. Possible sources include portal websites, news websites, Wikipedia, Baidu Baike, forums, blogs, and the like. Preferably, for the sake of accuracy and authority, well-known news websites and portal websites are used as corpus sources, such as People's Daily Online, ifeng.com (Phoenix), Sohu, Tencent, Sina, NetEase, and Xinhuanet. The historical document data of the determined source websites over a period of time is then crawled with a crawler tool, preferably the historical data of one year.
Text preprocessing is performed on the acquired historical document data, including web page title extraction and body text extraction: the web page structure of each website to which the historical document data belongs is analyzed, and the title and body text of each web page are extracted in a multi-process manner using regular expression rules specific to each website. Noise data such as advertisements, registration prompts, and copyright notices is removed and redundant, meaningless information in the web pages is filtered out, which improves the effectiveness of information extraction.
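The extraction step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the site names, the regular expressions in `SITE_RULES`, the `NOISE` pattern, and the sample page are all hypothetical, and real rules would have to be written per site after inspecting its markup.

```python
import re
from concurrent.futures import ProcessPoolExecutor

# Hypothetical per-site extraction rules; real rules depend on each site's markup.
SITE_RULES = {
    "site-a.example": {
        "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
        "body": re.compile(r'<div class="article">(.*?)</div>', re.S),
    },
    "site-b.example": {
        "title": re.compile(r"<title>(.*?)</title>", re.S),
        "body": re.compile(r"<article>(.*?)</article>", re.S),
    },
}

# Noise blocks removed before extraction: scripts, styles, and ad containers.
NOISE = re.compile(
    r'<script\b.*?</script>|<style\b.*?</style>|<div class="ad">.*?</div>',
    re.S | re.I,
)
TAG = re.compile(r"<[^>]+>")


def clean(fragment):
    """Strip remaining tags and collapse whitespace."""
    return re.sub(r"\s+", " ", TAG.sub(" ", fragment)).strip()


def extract(job):
    """job is a (site, html) pair; returns the page title and body text."""
    site, html = job
    rules = SITE_RULES[site]
    html = NOISE.sub("", html)
    title = rules["title"].search(html)
    body = rules["body"].search(html)
    return {
        "title": clean(title.group(1)) if title else "",
        "text": clean(body.group(1)) if body else "",
    }


def extract_all(jobs, workers=4):
    """Run extract() over many pages in separate processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract, jobs))


SAMPLE = (
    "<html><h1> Flood warning issued </h1>"
    '<div class="article"><p>Rivers rose overnight.</p>'
    '<div class="ad">Buy now!</div><p>Crews are on site.</p></div>'
    "<footer>Copyright 2021</footer></html>"
)
page = extract(("site-a.example", SAMPLE))
```

On platforms that start worker processes by spawning, `extract_all` should be called from under an `if __name__ == "__main__":` guard. Regular expressions are used here because the text names them; they are fragile against nested markup, which is why the noise blocks are removed before the body pattern is applied.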
The title and body text of each preprocessed historical document are merged, and word segmentation and stop word removal are performed on the sentences.
The weight of each word appearing in each document is computed with the TF-IDF algorithm, using the following formulas:
[Formula images BDA0003300378660000051, BDA0003300378660000052, BDA0003300378660000053]
where count(w) is the number of occurrences of word w, |D_i| is the total number of words in document D_i, n is the total number of documents, and I(w, D_i) indicates whether document D_i contains word w (1 if it does, 0 otherwise).
The documents are clustered with a combination of the LDA and k-means algorithms: the document-topic distribution is obtained from an LDA topic model and used as the input to k-means, all documents are clustered into c categories, and the several (three) highest-weight keywords in each category are used as that category's topic name. Preferably, the number of categories c is determined by the silhouette coefficient method, choosing the value of c that gives the best document clustering.
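The clustering and the choice of c can be sketched in plain Python as follows. The sketch assumes the document-topic distributions have already been produced by an LDA model; the matrix `theta` is made-up illustrative data, and the deterministic farthest-point initialization is a simplification chosen for reproducibility, not something the source specifies.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centers(points, c):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centers = [points[0]]
    while len(centers) < c:
        centers.append(max(points, key=lambda p: min(dist(p, q) for q in centers)))
    return centers


def kmeans(points, c, iters=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = init_centers(points, c)
    labels = []
    for _ in range(iters):
        labels = [min(range(c), key=lambda j: dist(p, centers[j])) for p in points]
        updated = []
        for j in range(c):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(p, points[j]) for j in idx) / len(idx)
                for lab, idx in clusters.items() if lab != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_c(points, candidates):
    """Pick the candidate c with the highest mean silhouette coefficient."""
    best = max(candidates, key=lambda c: silhouette(points, kmeans(points, c)))
    return best, kmeans(points, best)


# Hypothetical document-topic distributions for six documents and three LDA topics.
theta = [
    [0.90, 0.05, 0.05], [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05], [0.10, 0.85, 0.05],
    [0.05, 0.05, 0.90], [0.05, 0.10, 0.85],
]
best_c, labels = choose_c(theta, [2, 3, 4])
```

Clustering on the low-dimensional topic distributions rather than on raw word counts keeps the k-means distance computation cheap and less sensitive to vocabulary size, which is presumably one reason for combining the two algorithms.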
The news data of each topic is processed and keywords are extracted, including selecting verbs and nouns and removing duplicates, to obtain all distinct keywords under each topic as that topic's keyword dictionary, so that the c categories yield a topic library containing c topics. Each topic is named by its several (three) highest-weight keywords.
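The weighting and topic-naming steps can be illustrated with the sketch below. It assumes the unsmoothed TF-IDF form TF = count(w)/|D_i| and IDF = log(n / number of documents containing w), which matches the symbol definitions in the text but is not confirmed by it. Summing a word's weights over the documents of a category before ranking is also an assumption, since the source does not say how weights are aggregated per category. The tokenized documents and labels are made-up examples.

```python
import math
from collections import Counter, defaultdict


def tfidf_weights(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # each document counts once per word
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            w: (cnt / len(tokens)) * math.log(n / doc_freq[w])
            for w, cnt in counts.items()
        })
    return weights


def topic_name(weights, labels, category, k=3):
    """Top-k words by summed weight over the documents of one category."""
    totals = defaultdict(float)
    for doc_weights, label in zip(weights, labels):
        if label == category:
            for w, value in doc_weights.items():
                totals[w] += value
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ranked[:k]]


# Hypothetical segmented documents and their cluster labels.
docs = [
    ["flood", "river", "flood", "rescue"],
    ["river", "flood", "dam"],
    ["match", "goal", "goal", "team"],
]
labels = [0, 0, 1]
weights = tfidf_weights(docs)
names = {c: topic_name(weights, labels, c) for c in sorted(set(labels))}
```

Ties are broken alphabetically so that the topic name is reproducible between runs.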
A crawling cycle is set for the crawler tool (for example, daily or weekly), news data is directionally crawled with the keyword dictionary of each topic, and the news data is preprocessed (including word segmentation and duplicate removal) to update the keyword dictionary of each topic.
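Building and refreshing the keyword dictionary can be sketched as follows. The input is assumed to be part-of-speech-tagged tokens produced by some segmentation tool, with tags that start with "n" for nouns and "v" for verbs. That tag scheme, the function names, and the sample data are assumptions of this sketch, not details from the source.

```python
def build_keyword_dictionary(tagged_docs, keep=("n", "v")):
    """tagged_docs maps topic name -> list of documents, each a list of (word, pos)."""
    dictionary = {}
    for topic, docs in tagged_docs.items():
        keywords, seen = [], set()
        for doc in docs:
            for word, pos in doc:
                # Keep nouns and verbs only, and drop repeated words.
                if pos.startswith(keep) and word not in seen:
                    seen.add(word)
                    keywords.append(word)
        dictionary[topic] = keywords
    return dictionary


def update_keyword_dictionary(dictionary, topic, tagged_doc, keep=("n", "v")):
    """Merge keywords from one newly crawled, tagged document into a topic."""
    known = set(dictionary.setdefault(topic, []))
    for word, pos in tagged_doc:
        if pos.startswith(keep) and word not in known:
            known.add(word)
            dictionary[topic].append(word)
    return dictionary


# Hypothetical tagged documents for one topic.
tagged = {
    "dam/flood/rescue": [
        [("dam", "n"), ("burst", "v"), ("quickly", "d"), ("dam", "n")],
        [("flood", "n"), ("the", "r"), ("burst", "v")],
    ]
}
keyword_dict = build_keyword_dictionary(tagged)
update_keyword_dictionary(
    keyword_dict, "dam/flood/rescue",
    [("levee", "n"), ("dam", "n"), ("very", "d")],
)
```

In a periodic job, `update_keyword_dictionary` would be called once per newly crawled document in each crawling cycle, so the dictionary grows with new vocabulary while existing keywords keep their order.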
In a second aspect, an embodiment of the disclosure provides a method for updating hot events in real time, comprising the following steps:
The keyword dictionary of each topic in the topic library is obtained with the topic library construction method of any of the possible implementations above. The latest news web pages are directionally crawled according to the keyword dictionary of each topic, automatic text summarization is applied to the news text under each topic, and the generated core content is used as the title (to avoid clickbait titles). The similarity Similarity_{X,Y} between the summary results under each topic is computed; two texts whose similarity exceeds a given threshold are treated as the same text i, and the number Count_i of text i and the total number of documents under that topic are counted;
the heat value of each news item is computed with the following heat value formula:
[Formula image BDA0003300378660000061]
where H_0 is the initial heat value; a is the count weight coefficient, Count_i is the number of occurrences of text i, and N is the total number of documents; b is the time weight coefficient, k is the time coefficient, T_1 is the news release time, and T_0 is the current time;
the news items are sorted by heat value in descending order, the number p of hot events to display is set, and the top p hot events under each topic are updated periodically (for example, daily or weekly).
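The duplicate counting and heat ranking can be sketched as below. The heat formula image is not reproduced in this text, so the form used here, H = H_0 + a * Count_i / N + b * exp(-k * (T_0 - T_1)), is only an assumption that uses every symbol the text defines; the original formula may combine them differently. The token-overlap `jaccard` function is a simple stand-in for the cosine similarity of summaries, and the coefficient values, the hour-based time unit, and the sample data are hypothetical.

```python
import math


def jaccard(a, b):
    """Token-overlap similarity; a stand-in for the cosine similarity of summaries."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def count_duplicates(summaries, similarity=jaccard, threshold=0.6):
    """Greedy grouping: a summary joins the first group whose representative it resembles."""
    groups = []  # each entry is [representative, count]
    for text in summaries:
        for group in groups:
            if similarity(group[0], text) > threshold:
                group[1] += 1
                break
        else:
            groups.append([text, 1])
    return [(rep, count) for rep, count in groups]


def heat(count, total_docs, release_time, now, h0=1.0, a=10.0, b=5.0, k=0.1):
    """ASSUMED form: H = H0 + a * Count_i / N + b * exp(-k * (T0 - T1))."""
    return h0 + a * count / total_docs + b * math.exp(-k * (now - release_time))


def top_events(events, total_docs, now, p):
    """events: (title, count, release_time) tuples; returns the p hottest."""
    ranked = sorted(events, key=lambda e: heat(e[1], total_docs, e[2], now), reverse=True)
    return ranked[:p]


# Hypothetical data: times are hours on an arbitrary clock.
groups = count_duplicates([
    "dam breach floods town",
    "dam breach floods the town",
    "team wins final",
])
events = [("A", 5, 0.0), ("B", 2, 23.0), ("C", 1, 10.0)]
hottest = top_events(events, total_docs=10, now=24.0, p=2)
```

With these made-up coefficients the recent item B outranks the more frequently repeated but older item A, which illustrates how the time term and the count term trade off; the actual balance depends on the values chosen for a, b, and k.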
Preferably, automatic text summarization is applied to the news text under each topic and the generated core content is used as the title, specifically as follows:
1. Split document D_i into complete sentences to obtain a set of sentences.
2. Preprocess each sentence with a word segmentation tool (for example, tools for Chinese or Japanese), including word segmentation, stop word removal, and part-of-speech tagging. Stop word removal filters out high-frequency words that carry little meaning, such as "and" and "are". Part-of-speech tagging determines the part of speech of each word (noun, verb, adverb, adjective, and so on) and identifies entity names in the text, including names of people, places, and organizations.
3. First compute the weight of each word with the TF-IDF algorithm:
[Formula image BDA0003300378660000062]
Then adjust this weight by a title factor and a part-of-speech factor to obtain a new weight:
[Formula image BDA0003300378660000063]
This accounts not only for how often a word appears in a single document and for its weight across the whole document set, but also for the importance of the word itself when it appears in the title or is an entity name. The calculation is as follows:
[Formula image BDA0003300378660000064]
where T_w is the title factor: if word w appears in the document title, T_w > 1, and its specific value is tuned according to the quality of the resulting text summary (a larger T_w treats the title as more important); if word w does not appear in the document title, T_w = 1. P_w is the part-of-speech coefficient: if word w is an entity name, P_w > 1; otherwise P_w = 1.
From the new weight
[Formula image BDA0003300378660000073]
a text sentence vector d_i is generated, expressed as:
[Formula image BDA0003300378660000071]
4. Compute the similarity between sentences using cosine similarity and construct a similarity matrix:
[Formula image BDA0003300378660000072]
where X_i is the weight Weight_{W,X} of the i-th word of text sentence vector X, and Y_i is the weight Weight_{W,Y} of the i-th word of text sentence vector Y.
5. Iterate with TextRank to obtain a score for each sentence.
6. Sort the sentences by importance and select those whose importance exceeds a preset value as candidate summary sentences.
7. Extract sentences from the candidates to form the summary result, subject to the configured word-count or sentence-count limit.
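Steps 3 to 7 can be sketched end to end as follows. Several details are assumptions rather than facts from the source: the adjusted weight is taken to be the product of the TF-IDF weight, T_w, and P_w; a sentence vector holds the adjusted weight of each vocabulary word that occurs in the sentence and 0 elsewhere; and the damping factor 0.85 is the value commonly used with TextRank. The input weights, title words, and sentences are made-up examples.

```python
import math


def adjusted_weight(weight, in_title, is_entity, t_w=1.5, p_w=1.2):
    """Assumed multiplicative form: weight * T_w * P_w, where each factor is 1 if unused."""
    return weight * (t_w if in_title else 1.0) * (p_w if is_entity else 1.0)


def sentence_vector(tokens, vocab, weights):
    """Adjusted weight for vocabulary words present in the sentence, 0 otherwise."""
    present = set(tokens)
    return [weights[w] if w in present else 0.0 for w in vocab]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def textrank(sim, d=0.85, iters=100, tol=1e-8):
    """Weighted TextRank over a sentence similarity matrix."""
    n = len(sim)
    out = [sum(sim[j][k] for k in range(n) if k != j) for j in range(n)]
    scores = [1.0] * n
    for _ in range(iters):
        new = [
            (1 - d) + d * sum(sim[j][i] / out[j] * scores[j]
                              for j in range(n) if j != i and out[j] > 0)
            for i in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores


def summarize(sentences, scores, min_score, max_sentences):
    """Keep the best-scoring sentences above min_score, in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = [i for i in ranked if scores[i] > min_score][:max_sentences]
    return [sentences[i] for i in sorted(chosen)]


# Hypothetical TF-IDF weights, title words, and segmented sentences.
tfidf = {"dam": 0.4, "breach": 0.3, "flood": 0.4,
         "rescue": 0.3, "weather": 0.2, "sunny": 0.2}
title_words, entity_words = {"dam", "breach"}, set()
weights = {w: adjusted_weight(v, w in title_words, w in entity_words)
           for w, v in tfidf.items()}
sentences = [["dam", "breach", "flood"], ["flood", "rescue", "dam"], ["weather", "sunny"]]
vocab = sorted(tfidf)
vectors = [sentence_vector(s, vocab, weights) for s in sentences]
sim = [[cosine(x, y) for y in vectors] for x in vectors]
scores = textrank(sim)
summary = summarize(sentences, scores, min_score=0.5, max_sentences=2)
```

The third sentence shares no words with the other two, so it receives only the baseline score of 1 - d and is left out of the summary. Returning the chosen sentences in document order keeps the summary readable even when the highest-scoring sentence is not the first one.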
The invention has been described above by way of example. Its specific implementation is evidently not limited to the manner described: any insubstantial modification made using the method concept and technical solution of the invention, and any direct application of that concept and technical solution to other settings without modification, falls within the protection scope of the invention.

Claims (9)

1. A method for constructing an automatically optimized and updated topic library, characterized by comprising the following steps:
acquiring historical document data for the topic library: determining the internet information sources of the historical document data,
and crawling the historical document data of the determined source websites over a period of time with a crawler tool;
performing text preprocessing on the acquired historical document data, including web page title extraction and body text extraction: analyzing the web page structure of each website to which the historical document data belongs, and extracting the title and body text of each web page in a multi-process manner using regular expression rules specific to each website; removing noise data and filtering out redundant, meaningless information in the web pages;
merging the title and body text of each preprocessed historical document, and performing word segmentation and stop word removal on the sentences;
computing the weight of each word appearing in each document with the TF-IDF algorithm, using the following formulas:
[Formula images FDA0003300378650000011, FDA0003300378650000012, FDA0003300378650000013]
where count(w) is the number of occurrences of word w, |D_i| is the total number of words in document D_i, n is the total number of documents, and I(w, D_i) indicates whether document D_i contains word w (1 if it does, 0 otherwise);
clustering the documents with a combination of the LDA and k-means algorithms: obtaining the document-topic distribution from an LDA topic model, using that distribution as the input to k-means, clustering all documents into c categories, and using the several highest-weight keywords in each category as that category's topic name;
processing the news data of each topic and extracting keywords, including selecting verbs and nouns and removing duplicates, to obtain all distinct keywords under each topic as that topic's keyword dictionary, so that the c categories yield a topic library containing c topics;
setting a crawling cycle for the crawler tool, directionally crawling news data with the keyword dictionary of each topic, and preprocessing the news data to update the keyword dictionary of each topic.
2. The method for constructing an automatically optimized and updated topic library according to claim 1, wherein the internet information sources are well-known news websites and portal websites, used as corpus sources.
3. The method for constructing an automatically optimized and updated topic library according to claim 1, wherein crawling the historical document data of the determined source websites over a period of time with a crawler tool specifically means crawling the historical data of the most recent year.
4. The method for constructing an automatically optimized and updated topic library according to claim 1, wherein using the several highest-weight keywords in each category as the category's topic name specifically means using the top 3 highest-weight keywords in each category.
5. The method for constructing an automatically optimized and updated topic library according to claim 1, wherein the number of categories c is determined by the silhouette coefficient method.
6. The method for constructing an automatically optimized and updated topic library according to claim 1, wherein using the several highest-weight keywords in each category as the category's topic name specifically means using the first three highest-weight keywords in each category.
7. The method for constructing an automatically optimized and updated topic library according to any one of claims 1 to 6, wherein the preprocessing of the news data comprises word segmentation and duplicate removal.
8. A method for updating hot events in real time, characterized by comprising the following steps:
obtaining the keyword dictionary of each topic in the topic library with the method for constructing an automatically optimized and updated topic library according to any one of claims 1 to 7; directionally crawling the latest news web pages according to the keyword dictionary of each topic; applying automatic text summarization to the news text under each topic and using the generated core content as the title; computing the similarity Similarity_{X,Y} between the summary results under each topic, where two texts whose similarity exceeds a given threshold are treated as the same text i; and counting the number Count_i of text i and the total number of documents under that topic;
computing the heat value of each news item with the following heat value formula:
[Formula image FDA0003300378650000021]
where H_0 is the initial heat value; a is the count weight coefficient, Count_i is the number of occurrences of text i, and N is the total number of documents; b is the time weight coefficient, k is the time coefficient, T_1 is the news release time, and T_0 is the current time;
sorting by heat value in descending order, setting the number p of hot events to display, and updating the top p hot events under each topic periodically.
9. The method according to claim 8, wherein automatic text summarization is applied to the news text under each topic and the generated core content is used as the title, specifically as follows:
(1) splitting document D_i into complete sentences to obtain a set of sentences;
(2) preprocessing each sentence with a word segmentation tool, the preprocessing comprising word segmentation, stop word removal, and part-of-speech tagging;
(3) first computing the weight of each word with the TF-IDF algorithm:
[Formula image FDA0003300378650000031]
then adjusting this weight by a title factor and a part-of-speech factor to obtain a new weight:
[Formula image FDA0003300378650000032]
the calculation being as follows:
[Formula image FDA0003300378650000033]
where T_w is the title factor: if word w appears in the document title, T_w > 1, and its specific value is tuned according to the quality of the resulting text summary (a larger T_w treats the title as more important); if word w does not appear in the document title, T_w = 1; P_w is the part-of-speech coefficient: if word w is an entity name, P_w > 1; otherwise P_w = 1;
from the new weight
[Formula image FDA0003300378650000034]
generating a text sentence vector d_i, expressed as:
[Formula image FDA0003300378650000035]
(4) calculating the similarity between sentences by using the cosine similarity, and constructing a similarity matrix;
$$\cos(X, Y) = \frac{\sum_{i} X_i Y_i}{\sqrt{\sum_{i} X_i^2}\,\sqrt{\sum_{i} Y_i^2}}$$
wherein $X_i$ is the weight of the i-th word of text sentence vector $X$, and $Y_i$ is the weight of the i-th word of text sentence vector $Y$;
(5) performing iterative calculation with TextRank to obtain a score for each sentence;
(6) sorting the sentences according to the importance degree, and selecting a plurality of sentences with the importance degree larger than a preset value as candidate abstract sentences; and
(7) extracting sentences from the candidate abstract sentences to form an abstract result according to the set word count or sentence word count requirement.
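Steps (3) to (7) of claim 9 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation. The adjusted weight is taken to be the product TF-IDF * T_w * P_w, because the actual formula is given only as a figure. IDF is computed over the sentences of a single document with a smoothed form. Sentences are assumed to be already segmented into words with stop words removed (steps 1 and 2). The entity words are passed in as a set rather than obtained from a part-of-speech tagger. All function names and default values (`t_w`, `p_w`, damping 0.85, 50 iterations) are hypothetical.

```python
import math
from collections import Counter


def adjusted_weights(sentences, title_words, entity_words, t_w=1.5, p_w=1.2):
    """Step 3: one {word: weight} vector per tokenized sentence.

    ASSUMPTION: adjusted weight = TF-IDF * T_w * P_w (the patent's formula is an image).
    """
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s))  # sentence frequency per word
    vectors = []
    for s in sentences:
        vec = {}
        for w, c in Counter(s).items():
            idf = math.log((1 + n) / (1 + df[w])) + 1  # smoothed IDF (an assumption)
            title_factor = t_w if w in title_words else 1.0
            pos_factor = p_w if w in entity_words else 1.0
            vec[w] = (c / len(s)) * idf * title_factor * pos_factor
        vectors.append(vec)
    return vectors


def cosine(x, y):
    """Step 4: cosine similarity of two sparse weight vectors."""
    dot = sum(v * y.get(w, 0.0) for w, v in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0


def textrank(sim, d=0.85, iters=50):
    """Step 5: weighted TextRank scores over the similarity matrix."""
    n = len(sim)
    out_sum = [sum(sim[j][i] for i in range(n) if i != j) for j in range(n)]
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - d) + d * sum(
                sim[j][i] / out_sum[j] * scores[j]
                for j in range(n) if j != i and out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, title_words, entity_words, min_score=0.0, max_sentences=2):
    """Steps 6 and 7: keep sentences scoring above min_score, capped at max_sentences."""
    vecs = adjusted_weights(sentences, title_words, entity_words)
    sim = [[cosine(a, b) for b in vecs] for a in vecs]
    scores = textrank(sim)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    candidates = [i for i in ranked if scores[i] > min_score]
    return sorted(candidates[:max_sentences])  # indices, restored to document order
```

Calling `summarize(sentences, title_words, entity_words)` returns the indices of the selected sentences in their original order, which can then be joined to form the abstract. The sketch caps the result by sentence count; a word-count limit could be applied in the same place.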
CN202111188831.2A 2021-10-12 2021-10-12 Automatic optimization and updating theme library construction method and hot event real-time updating method Withdrawn CN113934910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111188831.2A CN113934910A (en) 2021-10-12 2021-10-12 Automatic optimization and updating theme library construction method and hot event real-time updating method

Publications (1)

Publication Number Publication Date
CN113934910A true CN113934910A (en) 2022-01-14

Family

ID=79278993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111188831.2A Withdrawn CN113934910A (en) 2021-10-12 2021-10-12 Automatic optimization and updating theme library construction method and hot event real-time updating method

Country Status (1)

Country Link
CN (1) CN113934910A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975246A (en) * 2023-08-03 2023-10-31 深圳市博锐高科科技有限公司 Data acquisition method, device, chip and terminal
CN116975246B (en) * 2023-08-03 2024-04-26 深圳市博锐高科科技有限公司 Data acquisition method, device, chip and terminal

Similar Documents

Publication Publication Date Title
CN109189942B (en) Construction method and device of patent data knowledge graph
CN112131863B (en) Comment opinion theme extraction method, electronic equipment and storage medium
CN109635297B (en) Entity disambiguation method and device, computer device and computer storage medium
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
CN111950273A (en) Network public opinion emergency automatic identification method based on emotion information extraction analysis
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
Gunawan et al. Automatic text summarization for Indonesian language using textteaser
WO2008014702A1 (en) Method and system of extracting new words
WO2007143914A1 (en) Method, device and inputting system for creating word frequency database based on web information
CN108038099B (en) Low-frequency keyword identification method based on word clustering
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN106570120A (en) Process for realizing searching engine optimization through improved keyword optimization
CN111090994A (en) Chinese-internet-forum-text-oriented event place attribution province identification method
Rathod Extractive text summarization of Marathi news articles
CN116561295A (en) Internet data extraction system
Jia et al. A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth
CN107908749B (en) Character retrieval system and method based on search engine
CN111444713A (en) Method and device for extracting entity relationship in news event
CN111680505B (en) Method for extracting unsupervised keywords of MarkDown feature perception
CN113934910A (en) Automatic optimization and updating theme library construction method and hot event real-time updating method
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
Shah et al. An automatic text summarization on Naive Bayes classifier using latent semantic analysis
Doostmohammadi et al. Perkey: A persian news corpus for keyphrase extraction and generation
CN109597879B (en) Service behavior relation extraction method and device based on 'citation relation' data
Kannan et al. Text document clustering using statistical integrated graph based sentence sensitivity ranking algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20220114)