CN115712700A - Hot word extraction method, system, computer device and storage medium - Google Patents


Info

Publication number
CN115712700A
Authority
CN
China
Prior art keywords
word
word segmentation
segmentation processing
keywords
processing result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211446313.0A
Other languages
Chinese (zh)
Inventor
王晓婷
李勃
容冰
杨书豪
王倩
储成君
刘侗一
李雅婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Environmental Planning Institute Of Ministry Of Ecology And Environment
Original Assignee
Environmental Planning Institute Of Ministry Of Ecology And Environment

Abstract

The application discloses a hot word extraction method, system, computer device and storage medium for analyzing keywords in the ecological environment field using text information published over a period of time. The method specifically comprises: collecting text data; performing word segmentation; extracting keywords based on the word segmentation result; performing word frequency statistics and relevance analysis based on the word segmentation result; performing topic clustering and co-occurrence network analysis based on the word segmentation result; and screening hot words based on the results of keyword extraction, word frequency statistics and topic clustering. The technical scheme has the following advantages: keywords of the ecological environment field over a period of time are analyzed, providing directional suggestions and references for macro policy management and for public opinion publicity and guidance; and, building on text data analysis and word frequency statistics, the importance and representativeness of related words are considered during word screening, yielding more accurate keywords.

Description

Hot word extraction method, system, computer device and storage medium
Technical Field
The present application relates to the field of semantic analysis, and in particular, to a method, system, computer device, and storage medium for extracting hotwords.
Background
Hot word extraction is currently widely applied in the field of government affairs, but has not yet been realized in the field of environmental protection. The ecological environment field differs from other public event fields in the following ways:
Events in the ecological environment field are generally highly specialized. Ecological environment protection is a discipline formed at the intersection of professional disciplines such as engineering, science and management; it is closely linked with the basic disciplines and has its own logic and disciplinary system. Ecological environment protection touches every aspect of production and daily life, but apart from a few landmark environmental pollution events, the general public rarely pays attention to it and finds its inner operating logic hard to understand. Therefore, analysis based on conventionally collected material such as public opinion text often overlooks some hot spots, and common word segmentation methods may split professional terms into meaningless words.
Words generated in the ecological environment field are closely related, and several environmental hot words often point to the same thing. For example, words such as PM2.5, haze and atmospheric pollution often appear together, so extracting them by means such as word frequency statistics produces repetition.
Therefore, how to extract the hot words in the ecological environment field from massive texts becomes a technical problem to be solved in the field.
Disclosure of Invention
In view of this, the present application provides a hot word extraction method, system, computer device and storage medium, so as to extract hot words in the ecological environment field from massive texts accurately and to improve production efficiency.
According to the application, a hotword extraction method is provided, and the method comprises the following steps:
step 1: collecting text data in the ecological environment field;
step 2: word segmentation processing;
step 3: extracting keywords based on the word segmentation processing result, and/or performing word frequency statistics based on the word segmentation processing result, and/or performing topic clustering based on the word segmentation processing result;
step 4: performing hot word screening based on the results of keyword extraction, word frequency statistics and topic clustering.
As a modification of the above method, the step 3 further includes: and performing relevance analysis based on the word segmentation processing result.
As a modification of the above method, the step 3 further includes: and performing co-occurrence network analysis based on the word segmentation processing result.
As an improvement of the method, the word segmentation processing uses the Jieba library word segmentation component to generate word combinations that carry semantics.
As an improvement of the method, the keyword extraction uses a TextRank algorithm based on MMR optimization to generate the set of summary words that best expresses the meaning of the text.
As an improvement of the method, the word frequency statistics use the TF-IWF algorithm to generate a word set ordered from high to low frequency of occurrence.
As an improvement of the method, the topic clustering uses an LDA model to generate the core keywords of each topic and their specific probabilities.
As an improvement of the method, the hot word screening extracts the top-ranked word sets from the results generated in step 3.
The present application also provides a hotword extraction system, the system including:
the data collecting module is used for collecting text information in the field of ecological environment;
the word segmentation processing module is used for carrying out word segmentation processing on the collected data;
the hot word calculation module is used for extracting keywords based on the word segmentation processing result, and/or performing word frequency statistics based on the word segmentation processing result, and/or performing topic clustering based on the word segmentation processing result;
and the hot word screening module is used for screening hot words based on the results of keyword extraction, word frequency statistics and topic clustering.
The present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any one of the above when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the method according to any of the preceding claims.
According to the technical scheme of the application, the application has the advantages that:
1. The application reviews hot spots and focal points in the ecological environment field over a period of time, analyzes the keywords of that period, and provides directional suggestions and references for macro policy management and for public opinion publicity and guidance.
2. In selecting text data, the application combines authoritative and non-authoritative sources such as the media, public opinion and official bodies, so the data sources are broad.
3. Building on word frequency statistics, the text data analysis method uses cluster analysis to establish the connections among different words, and finally weighs the importance and representativeness of the connected words during word screening, thereby providing more accurate keywords.
4. Given the specialized nature of the ecological environment field, the application provides, based on practical experience, a lexicon for the ecological environment field that is used for segmenting and screening text words.
5. Considering that ecological environment hot words differ from words on other topics in part of speech and meaning, and are characterized by long words, compound words and special terms, the application uses the Jieba word segmentation component with a specified segmentation dictionary in place of the default dictionary, together with a custom dictionary (containing important specific terms such as biodiversity protection, carbon emission rights trading, "three lines and one list", ecological product value realization and ecological environment protection planning), which ensures that key specific terms are not split apart during segmentation and are retained.
6. When selecting summary sentences, TextRank tends to select sentences that are highly similar to many other sentences, so the selected sentences end up highly redundant while sentences that carry other topic information but stand alone are missed. The MMR-optimized TextRank algorithm used in the application overcomes this defect.
7. The TF-IDF algorithm weights the TF value by the inverse document frequency (IDF) and takes words with large weights as keywords. However, the simple structure of IDF cannot effectively reflect the importance of words or the distribution of characteristic words, so it does not perform its weight-adjusting function well; combined with the high frequency of characteristic words among ecological environment hot words, this makes TF-IDF insufficiently precise. The application therefore uses paragraph labeling to give words at different positions different position weights, computes word similarity for synonymous words with high word frequency in the segmentation result, merges words with high similarity, and ranks words by weight with the TF-IWF algorithm to obtain the keywords. This solves the low keyword extraction precision caused by ignoring highly similar words when extracting Chinese keywords for ecological environment hot words.
8. When the volume of ecological environment hot word text is very large, LDA sampling is slow and topic classification becomes less efficient. The LDA parallel optimization method and process improve the efficiency of the LDA model on large text data, addressing the broad sources and large data volume of ecological environment hot words.
Additional features and advantages of the present application will be described in detail in the detailed description which follows.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate an embodiment of the invention and, together with the description, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of a text analysis-based ecological environment domain hotword extraction method;
FIG. 2 is a schematic diagram of word segmentation processing;
FIG. 3 is a diagram showing the TF (word frequency)-IWF (inverse word frequency) results obtained from the 2021 text data of the WeChat official account of the Ministry of Ecology and Environment;
FIG. 4 is a diagram showing the relevance analysis (correspondence analysis) results obtained from the 2021 text data of the WeChat official account of the Ministry of Ecology and Environment;
FIG. 5 is a diagram showing the cluster analysis results obtained from the 2021 text data of the WeChat official account of the Ministry of Ecology and Environment;
FIG. 6 is a diagram showing the co-occurrence network analysis results obtained from the 2021 text data of the WeChat official account of the Ministry of Ecology and Environment.
Detailed Description
The technical solutions of the present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
The application aims to review hot spots and focal topics in the field of ecological environment and to extract high-attention key points or keywords from ecological environment events over a period of time. The period may be the past week, month or year, for example; a time period of any length can be used as input for hot word extraction in the application.
As shown in fig. 1, the hot word extraction method of the present application includes the following steps:
step 1: collecting text data in the ecological environment field;
The text data of the ecological environment field collected by the present application may be any text data related to that field, including:
All article texts related to ecological environment events within a period of time, collected from officially published text media (including WeChat official accounts). In official text media, some columns contain a large number of articles and are published very frequently, so they need to be excluded. In addition, the topics of the articles published on the WeChat account of the Ministry of Ecology and Environment are mixed and lean toward publicity content and articles for the general public, which cannot reflect the professional character of the Ministry's protection work or the time trend of hot spots; press conference transcripts are therefore needed as an auxiliary analysis sample, and the transcripts of the Ministry's monthly press conferences are extracted for this purpose. Meanwhile, in order to analyze the direction of public attention, articles published within a period of time in ecological environment columns of influential newspapers, such as the ecological environment protection column of the Sunshine Journal and the public opinion channel of the Chinese Environment Journal, can be selected as analysis samples.
In addition, sources such as history and literature networks are consulted. These analysis samples step outside the premise that the collected data is mainly on ecological environment themes: ecological environment hot word topics are extracted from broad text, so that the relationship between the ecological environment hot word cluster and the clusters of other topics can be studied.
Other ecological environment field data that is not officially released may also be collected.
Step 2: word segmentation processing;
Word segmentation is a basic step of Chinese text processing and a basic module of Chinese human-machine natural language interaction. Unlike English, Chinese sentences have no word boundaries, so Chinese natural language processing usually begins with word segmentation to generate word combinations that carry semantics. The quality of word segmentation directly affects downstream modules such as part-of-speech tagging and syntax tree parsing.
The analysis sample used by the method is a multi-channel, multi-source and multi-structure text data set. For text word segmentation, the HanLP method, the FoolNLTK method or the Jieba library can be used to preprocess the text data.
The HanLP method is a word segmentation method based on three principles, namely HMM-Bigram, character-based word formation and dictionary-based segmentation, and comprises shortest-path segmentation, N-shortest-path segmentation, perceptron segmentation, CRF segmentation and high-speed dictionary segmentation algorithms.
The FoolNLTK method performs word segmentation with a recurrent neural network and uses the BiLSTM algorithm (that is, a bidirectional LSTM model).
The application preprocesses the text data with Jieba, a Python-based Chinese word segmentation library, and selects a custom dictionary (containing important specific terms such as biodiversity protection, carbon emission rights trading, "three lines and one list", ecological product value realization and ecological environment protection planning) according to the characteristics of ecological environment hot words. The Jieba word segmentation component supports word segmentation, part-of-speech tagging, keyword extraction and other functions for Chinese text; its main workflow is shown in fig. 2. In Jieba segmentation, a directed acyclic graph (DAG) of the sentence is first generated by looking up the dictionary; then, depending on the selected mode, the sentence is either cut along the shortest path found according to the dictionary or cut directly. For out-of-vocabulary words (words not in the dictionary), an HMM is used for new word discovery.
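The dictionary lookup, DAG construction and path search described above can be illustrated with a minimal sketch. It is not Jieba's code: the toy frequency tables `BASE_FREQ` and `CUSTOM_FREQ`, their values, and the function names are assumptions made for illustration, and the path is chosen by maximum log-probability under a unigram model.

```python
import math

# Toy word frequencies standing in for a segmentation dictionary (hypothetical values).
BASE_FREQ = {"加强": 400, "生物": 500, "多样性": 300, "保护": 600}
# The same dictionary plus one domain term, as a custom dictionary would add.
# The frequency must be high enough for the long term to beat its parts.
CUSTOM_FREQ = {**BASE_FREQ, "生物多样性保护": 1000}


def build_dag(sentence, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in freq]
        dag[start] = ends or [start]  # fall back to a single character
    return dag


def segment(sentence, freq):
    """Cut the sentence along the maximum-probability path through the DAG."""
    log_total = math.log(sum(freq.values()))
    n = len(sentence)
    dag = build_dag(sentence, freq)
    # route[i] = (best log-probability of sentence[i:], end index of its first word)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):  # dynamic programming from right to left
        route[i] = max(
            (math.log(freq.get(sentence[i:e + 1], 1)) - log_total + route[e + 1][0], e)
            for e in dag[i]
        )
    words, i = [], 0
    while i < n:
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


print(segment("加强生物多样性保护", BASE_FREQ))    # ['加强', '生物', '多样性', '保护']
print(segment("加强生物多样性保护", CUSTOM_FREQ))  # ['加强', '生物多样性保护']
```

With the base table the domain term is split into three common words; once the term is present in the dictionary with a sufficient frequency, it is kept whole, which is the effect the custom dictionary is intended to have.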
The word segmentation function builds a prefix tree (trie) from the dictionary, uses regular expressions to obtain runs of consecutive Chinese characters and English characters, and splits the text into a list of phrases. For each phrase, it uses the DAG (built by dictionary lookup) and dynamic programming to obtain the maximum-probability path or shortest path and cuts the sentence accordingly, or cuts it directly; the segmented results and the non-Chinese parts are then concatenated in order as the final segmentation result. Characters in the DAG that are not found in the dictionary are combined into new fragments and segmented with an HMM model, which amounts to recognizing new words outside the dictionary. Chinese characters are labeled with four states, BEMS: B is the beginning position of a word, E is the end position, M is a middle position, and S is a single-character word. The emission probability matrix, the initial probability vector and the transition probability matrix between states are each stored in a dictionary file, and the Viterbi algorithm is used to solve for the most probable hidden state sequence from these probabilities, thereby finding new words. For part-of-speech analysis, the parts of speech of dictionary words are taken directly from the dictionary, and new words are handled by an HMM-based new word and part-of-speech discovery module. The HMM model constructed above is:
Figure BDA0003950484510000081
in the formula, S and O represent a state sequence and an observation sequence, respectively.
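As a minimal sketch of how the Viterbi algorithm recovers the most probable BEMS state sequence S for an observed character sequence O, the following uses hand-set toy probabilities. The tables `START`, `TRANS` and `EMIT`, the `FLOOR` value for unseen characters, and the function names are hypothetical; a real system would load trained probabilities from its dictionary files.

```python
import math

STATES = "BMES"
NEG_INF = float("-inf")

# Hypothetical toy parameters: initial, transition and emission probabilities.
START = {"B": 0.6, "M": 0.0, "E": 0.0, "S": 0.4}
TRANS = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.4, "E": 0.6},
    "E": {"B": 0.5, "S": 0.5},
    "S": {"B": 0.5, "S": 0.5},
}
EMIT = {"B": {"雾": 0.5}, "M": {}, "E": {"霾": 0.5}, "S": {"雾": 0.01, "霾": 0.01}}
FLOOR = 1e-6  # emission probability assumed for characters a state has not seen


def log(p):
    return math.log(p) if p > 0 else NEG_INF


def viterbi(chars):
    """Return the most probable BEMS tag for each character."""
    if not chars:
        return []
    # scores[t][s] = best log-probability of any state path ending in s at position t
    scores = [{s: log(START[s]) + log(EMIT[s].get(chars[0], FLOOR)) for s in STATES}]
    back = [{}]
    for t in range(1, len(chars)):
        scores.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[t - 1][p] + log(TRANS[p].get(s, 0.0)))
            scores[t][s] = (scores[t - 1][prev] + log(TRANS[prev].get(s, 0.0))
                            + log(EMIT[s].get(chars[t], FLOOR)))
            back[t][s] = prev
    last = max("ES", key=lambda s: scores[-1][s])  # a word can only end in E or S
    path = [last]
    for t in range(len(chars) - 1, 0, -1):  # follow back-pointers to the start
        path.append(back[t][path[-1]])
    return path[::-1]


def cut(chars):
    """Group characters into words: a word closes at every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, viterbi(chars)):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


print(viterbi("雾霾"), cut("雾霾"))  # ['B', 'E'] ['雾霾']
```

Under these toy parameters the two characters are tagged B and E and therefore joined into one new word, even though that word appears in no dictionary.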
Step 3: extracting keywords based on the word segmentation processing result, and/or performing word frequency statistics based on the word segmentation processing result, and/or performing topic clustering based on the word segmentation processing result;
The Word2Vec algorithm or the TextRank algorithm can be used for keyword extraction.
The Word2Vec approach comprises two steps: Word2Vec word vector representation and K-Means clustering. The main idea is that the words in an article, represented as word vectors, are clustered with the K-Means algorithm; the cluster center is taken as the main keyword of the text, the distance (that is, the similarity) between every other word and the cluster center is computed, and the top-K words closest to the cluster center are selected as keywords. The similarity between words can be computed from the vectors generated by Word2Vec.
The application uses the TextRank algorithm to generate the set of summary words that best expresses the meaning of the text. The MMR-optimized TextRank algorithm is based on PageRank and uses a voting mechanism to rank the important components of a text, finally producing a keyword sequence ordered from most to least important. If two words co-occur within a window of fixed size, an edge is considered to exist between them, and the score is defined as follows:
$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}$$
in the formula, V_i is the object whose PR value is to be calculated;
V_j is an object that links in to V_i;
S(V_j) is the PR value of V_j;
In(V_i) is the set of all objects that link in to V_i;
Out(V_j) is the set of objects that V_j points to;
|Out(V_j)| is the number of objects in that set;
d is the damping coefficient, which brings into the model the probability of jumping from a given node to any other node.
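The scoring scheme can be sketched as follows, assuming the standard PageRank-style update in which each word's score is (1 - d) plus d times the sum, over its neighbours, of the neighbour's score divided by the neighbour's number of links. The MMR step is likewise only an illustrative assumption about how redundancy might be penalized: the source does not give its formula, and the names `textrank`, `mmr_select`, `char_jaccard` and the parameter `lam` are hypothetical.

```python
from collections import defaultdict


def textrank(words, window=3, d=0.85, iterations=50):
    """Rank words by iterating the PageRank-style score on a co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:  # words sharing a window with w
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - d) + d * sum(scores[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def char_jaccard(a, b):
    """Toy similarity: overlap of the character sets of two words."""
    return len(set(a) & set(b)) / len(set(a) | set(b))


def mmr_select(ranked, similarity, k=3, lam=0.7):
    """Pick k words, trading relevance against similarity to words already picked."""
    candidates = dict(ranked)
    selected = []
    while candidates and len(selected) < k:
        def mmr(word):
            redundancy = max((similarity(word, s) for s in selected), default=0.0)
            return lam * candidates[word] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        del candidates[best]
    return selected


tokens = ["大气", "污染", "治理", "大气", "雾霾", "治理", "大气", "PM2.5"]
print(textrank(tokens, window=2)[0][0])  # 大气
ranked = [("大气污染", 1.0), ("大气污染物", 0.95), ("生物多样性", 0.6)]
print(mmr_select(ranked, char_jaccard, k=2, lam=0.5))  # ['大气污染', '生物多样性']
```

In the second call the near-duplicate of the first pick is passed over in favour of a lower-scored word on a different topic, which is the redundancy-reducing behaviour attributed to MMR.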
The word frequency statistics use the TF-IDF algorithm or its optimized variant, the TF-IWF (term frequency-inverse word frequency) algorithm, to generate a word set ordered from high to low frequency of occurrence. The weights calculated by TF-IDF are generally very small, even close to 0, and its accuracy is not high; the TF-IWF algorithm avoids the problem of overly small weights. The TF-IWF algorithm (a weighting algorithm based on inverse word frequency) is a statistical method for evaluating the importance of a word in a document: the inverse document frequency is replaced by the inverse word frequency, which reduces the influence of similar texts in the document set/corpus on word weights and expresses more accurately the importance of a word in the document being examined. The core idea is that if a word appears frequently in an article (high TF) but rarely in other documents, the word is considered to have good category-distinguishing ability. The model is as follows:
$$\mathrm{TF\text{-}IWF}_{ij} = tf_{ij} \times iwf_i$$
where
$$tf_{ij} = \frac{n_{ij}}{\sum_{k} n_{kj}}$$
in the formula, n_{ij} is the number of times word i appears in document j; the denominator is the total number of occurrences of all words in document j;
$$iwf_i = \log \frac{\sum_{k} nt_k}{nt_i}$$
where nt_k denotes the frequency of word k in the corpus, so that the numerator is the total frequency of all words in the corpus and the denominator is the total frequency of the given word i in the corpus.
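A minimal sketch of the TF-IWF weighting follows, assuming the usual form in which the inverse word frequency is the logarithm of the total corpus word count divided by the corpus count of the given word. The function name `tf_iwf` and the toy documents are illustrative only; the position weighting and similar-word merging mentioned elsewhere in the application are not included.

```python
import math
from collections import Counter


def tf_iwf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    corpus_counts = Counter(word for doc in docs for word in doc)
    corpus_total = sum(corpus_counts.values())
    weights = []
    for doc in docs:
        counts = Counter(doc)
        doc_total = len(doc)
        weights.append({
            # tf = share of the document; iwf = log(corpus total / corpus count)
            word: (count / doc_total) * math.log(corpus_total / corpus_counts[word])
            for word, count in counts.items()
        })
    return weights


docs = [["碳", "排放", "碳", "治理"], ["治理", "排放", "治理", "治理"]]
weights = tf_iwf(docs)
ranked = sorted(weights[0], key=weights[0].get, reverse=True)
print(ranked)  # ['碳', '排放', '治理']
```

In the first document the word that is frequent locally but comparatively rare in the corpus ranks first, while the word that dominates the corpus as a whole ranks last.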
In the example, fig. 3 shows the TF (word frequency)-IWF (inverse word frequency) result for the 2021 text data of the WeChat official account of the Ministry of Ecology and Environment. The TF-IWF result mainly serves as a reference for selecting words for later steps: words can be selected by TF, by IWF, or by considering both, before further analysis such as clustering, topic extraction and co-occurrence.
In the example, fig. 4 shows the relevance analysis (correspondence analysis) result, based on the TF-IWF analysis of the 2021 text data of the WeChat official account of the Ministry of Ecology and Environment. The relevance analysis graph relates the TF and IWF results to the node data sets (for example, a data set per day or a data set per article). It is mainly used to explore heterogeneity in the patterns of word occurrence, for example what relevance exists among the words of the data set for the first day of each month, or among the words of the data set for each month.
Text data in the ecological environment field comes from broad sources, carries multiple meanings, is highly specialized, and co-occurs at high frequency with text data from other fields such as social and economic development. Therefore, on the basis of text vectorization and relying on the results of word frequency statistics and keyword extraction, topics related to the ecological environment are extracted from text data with diverse wording, complex structure and mixed semantics, and the clustering relations among different topics (ecological environment, economic development, energy industry, transportation, etc.) are analyzed.
Topic clustering can be performed using either LSA (latent semantic analysis model) or LDA models.
LSA is fast and easy to implement, but it requires that words be normally distributed in the documents, which is a strict condition, and it performs poorly on data sets with nonlinear dependencies.
The application uses the LDA model for topic clustering. In LDA, each document is represented as a probability distribution over topics and each topic as a probability distribution over words, so the fitted model gives the core keywords of each topic with their specific probabilities, and a sequence of core keywords ordered from highest to lowest probability can finally be obtained. LDA is a topic model in which the topics of each document in a document set are given as a probability distribution; once the topic distribution of the documents has been extracted, topic clustering or text classification can be performed according to it. The main model is the latent Dirichlet allocation model, as follows:
Figure BDA0003950484510000101
LDA is an unsupervised Bayesian topic model; its core relation, summed over all topics, is:
P(word | document) = Σ over topics of P(word | topic) × P(topic | document)
The mathematical expression is:
$$P(w \mid d) = \sum_{t} P(w \mid t)\, P(t \mid d)$$
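As a small worked example of this relation, the sketch below mixes two hand-written topic-word distributions according to a document's topic proportions, summing the product over topics. The topic names, words and probabilities are invented for illustration; fitting these distributions from data is the job of an LDA implementation and is not shown.

```python
def word_given_document(p_word_topic, p_topic_doc):
    """P(w|d) = sum over topics t of P(w|t) * P(t|d)."""
    vocab = {word for dist in p_word_topic.values() for word in dist}
    return {
        word: sum(p_topic_doc[t] * p_word_topic[t].get(word, 0.0) for t in p_topic_doc)
        for word in vocab
    }


# Hypothetical fitted distributions: two topics and one document.
p_word_topic = {
    "climate": {"碳排放": 0.7, "治理": 0.3},
    "water": {"水质": 0.6, "治理": 0.4},
}
p_topic_doc = {"climate": 0.8, "water": 0.2}

p = word_given_document(p_word_topic, p_topic_doc)
for word in sorted(p, key=p.get, reverse=True):
    print(word, round(p[word], 2))  # 碳排放 0.56, 治理 0.32, 水质 0.12
```

The word shared by both topics receives contributions from each (0.8 × 0.3 + 0.2 × 0.4 = 0.32), and the resulting word probabilities sum to 1.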
In the example, fig. 5 shows the cluster analysis result, based on the TF-IWF analysis of the 2021 text data of the WeChat official account of the Ministry of Ecology and Environment. The cluster analysis diagram uses the LDA model to perform a preliminary topic clustering of the words and displays the clusters; it is mainly used to explore how words group together to form topics. In fig. 5, for example, each shape represents one cluster, and 8 clusters appear in the example data. The clusters also show the relevance between words (where words overlap) and the differences in importance among words of the same shape (markers of the same shape are numerous and vary in size). However, this co-occurrence clustering is only an exploratory step toward forming topics and does not complete topic extraction; topics are extracted in the next step through optimization of the LDA model.
In the example, fig. 6 shows the co-occurrence network analysis result, based on the cluster analysis of the 2021 text data of the WeChat official account of the Ministry of Ecology and Environment. The co-occurrence network analysis chart uses the LDA model to extract topics from the topic clusters, and is mainly used to extract the topics of the analyzed data. In fig. 6, for example, the clusters separate into independent topics, and 8 topics appear in the example data. The "ecological environment" topic is central and the largest; it contains words such as "ecology", "environment", "governance", "economy" and "development", and within it the most important words have the highest frequency of occurrence and the largest markers. Of the other topics, some are sub-topics of the "ecological environment" topic, such as the "climate change" topic, and some are topics parallel to it, such as the topic on the left. The analysis thus moves from unstructured text data to structured text data and extracts the topics of the whole data set and the most important words in each topic; the extracted topics can then be compared across the time series.
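The edges of a word co-occurrence network can be sketched as follows, assuming that two words co-occur when they appear in the same document and that only pairs seen at least `min_count` times are kept. The function name, the threshold and the toy documents are hypothetical, and the LDA-based topic extraction applied on top of the network in the example is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs, min_count=2):
    """Count document-level co-occurrence of word pairs and keep the frequent ones."""
    pair_counts = Counter()
    for doc in docs:
        # sorted() gives each unordered pair one canonical key
        for a, b in combinations(sorted(set(doc)), 2):
            pair_counts[(a, b)] += 1
    return {pair: count for pair, count in pair_counts.items() if count >= min_count}


docs = [
    ["PM2.5", "雾霾", "大气污染"],
    ["雾霾", "大气污染"],
    ["生物多样性", "保护"],
]
print(cooccurrence_edges(docs))  # {('大气污染', '雾霾'): 2}
```

Words that repeatedly appear together, such as haze and atmospheric pollution, end up linked, which is the kind of relation that lets several hot words pointing to the same thing be grouped rather than counted separately.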
Step 4: hot word screening;
After synonyms and professionally irrelevant words are removed, the keywords of the top ten topics are extracted based on word frequency statistics, TF-IWF statistics, cluster analysis and co-occurrence network analysis, yielding at most 30 words. Screening and extraction of hot words must take into account the results of the three statistical methods as well as relevance to the ecological environment and the hot events of the year. For example, the fifteenth meeting of the Conference of the Parties to the Convention on Biological Diversity was held in Kunming, Yunnan in 2021, and "biodiversity" appears many times in the word frequency and cluster analyses, so it is selected as a hot word of that year. The central words of each topic cluster are also considered (central words are mostly also high-frequency words; where they differ, the reason can be analyzed further). The ecological environment hot words for the target time period are thereby obtained.
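One way the screening step might be mechanized is sketched below. The source describes the criteria but not a scoring rule, so the voting scheme (a word earns one vote for each method that ranks it in its top `top_n`), the function name `screen_hotwords`, and the sample rankings, synonym map and irrelevant-word list are all assumptions made for illustration. Judgments such as relevance to the hot events of the year are left to manual review.

```python
def screen_hotwords(rankings, irrelevant=(), synonyms=None, top_n=10, max_words=30):
    """Merge ranked lists from several methods into one hot word list.

    rankings:   {method name: words ordered from most to least important}
    irrelevant: professionally irrelevant words to drop
    synonyms:   {variant: canonical word} used to merge synonyms
    """
    synonyms = synonyms or {}
    votes = {}
    for ranked in rankings.values():
        seen = set()
        for word in ranked[:top_n]:
            word = synonyms.get(word, word)  # collapse synonyms to one form
            if word in irrelevant or word in seen:
                continue
            seen.add(word)
            votes[word] = votes.get(word, 0) + 1
    # most votes first; ties broken by the word itself for a stable order
    ordered = sorted(votes, key=lambda w: (-votes[w], w))
    return ordered[:max_words]


rankings = {
    "textrank": ["生物多样性", "碳排放", "会议"],
    "tf_iwf": ["碳排放", "生物多样性", "雾霾"],
    "lda": ["生物多样性", "大气污染", "雾霾天气"],
}
hot = screen_hotwords(rankings, irrelevant={"会议"}, synonyms={"雾霾天气": "雾霾"})
print(hot)  # ['生物多样性', '碳排放', '雾霾', '大气污染']
```

The defaults `top_n=10` and `max_words=30` mirror the "top ten" and "at most 30 words" figures in the text; a word supported by all three methods, such as "biodiversity" in the 2021 example, naturally ranks first.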
The present application further provides a computer device comprising: at least one processor, memory, at least one network interface, and a user interface. The various components in the device are coupled together by a bus system. It will be appreciated that a bus system is used to enable communications among the components. The bus system includes a power bus, a control bus, and a status signal bus in addition to a data bus.
The user interface may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, track ball, touch pad, or touch screen, etc.).
It will be appreciated that the memory in the embodiments disclosed herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, the memory stores the following elements, executable modules or data structures, or a subset or superset thereof: an operating system and an application program.
The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs, including various application programs such as a Media Player (Media Player), a Browser (Browser), etc., are used to implement various application services. The program for implementing the method of the embodiment of the present disclosure may be included in the application program.
In the above embodiments of the present application, the processor may further be configured to call a program or an instruction stored in the memory, specifically a program or an instruction stored in the application programs, in order to perform the steps of the above method.
The above method may be applied in or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The processor may implement or perform the methods, steps, and logic blocks disclosed above. A general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed above may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules within the decoding processor. The software modules may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, microcontrollers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques of this application may be implemented by modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
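As a rough illustration of how such procedures and functions might be organized, the sketch below arranges the four method steps (collection, word segmentation, calculation, screening) as plain Python functions. All function names are hypothetical and do not come from the application. The whitespace tokenizer is only a stand-in for a real Chinese word segmenter, and raw counting is only a stand-in for the keyword, word frequency, and topic calculations.

```python
from collections import Counter
from typing import Iterable


def collect_texts(sources: Iterable[str]) -> list[str]:
    """Step 1 (stand-in): keep the non-empty documents as the collected corpus."""
    return [s.strip() for s in sources if s.strip()]


def segment(text: str) -> list[str]:
    """Step 2 (stand-in): split on whitespace; a real system would call a segmenter."""
    return text.lower().split()


def count_words(docs: list[list[str]]) -> Counter:
    """Step 3 (stand-in): raw corpus-wide word counts."""
    counts: Counter = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


def screen_hot_words(counts: Counter, top_k: int) -> list[str]:
    """Step 4: keep the top_k highest-ranked words (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def run_pipeline(sources: Iterable[str], top_k: int = 3) -> list[str]:
    """Chain the four steps in the order the method describes."""
    docs = [segment(t) for t in collect_texts(sources)]
    return screen_hot_words(count_words(docs), top_k)
```

Keeping each step as a separate function means a stand-in can later be swapped for a real segmenter or scoring algorithm without touching the rest of the chain.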
The present application may also provide a non-volatile storage medium for storing a computer program. The computer program may realize the steps of the above-described method embodiments when executed by a processor.
The preferred embodiments of the present application have been described in detail above. However, the present application is not limited to the specific details of those embodiments; various simple modifications can be made to the technical solution within the scope of its technical idea, and all such modifications fall within the scope of protection of the present application.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner; to avoid unnecessary repetition, the possible combinations are not described separately in the present application.
In addition, the various embodiments of the present application may themselves be combined in any manner, and such combinations should likewise be regarded as disclosed in the present application as long as they do not depart from its idea.

Claims (11)

1. A hot word extraction method, the method comprising:
step 1: collecting text data in the ecological environment field;
step 2: performing word segmentation processing;
step 3: extracting keywords based on the word segmentation processing result, and/or performing word frequency statistics based on the word segmentation processing result, and/or performing topic clustering based on the word segmentation processing result;
step 4: performing hot word screening based on the results of the keyword extraction, word frequency statistics, and topic clustering.
2. The hot word extraction method according to claim 1, wherein step 3 further comprises: performing relevance analysis based on the word segmentation processing result.
3. The hot word extraction method according to claim 1, wherein step 3 further comprises: performing co-occurrence network analysis based on the word segmentation processing result.
4. The hot word extraction method according to claim 1, wherein the word segmentation processing uses the word segmentation component of the Jieba library to generate semantically meaningful word combinations.
5. The hot word extraction method according to claim 1, wherein the keyword extraction adopts a TextRank algorithm based on MMR optimization to generate the set of summary words best suited to expressing the meaning of the text.
6. The hot word extraction method according to claim 1, wherein the word frequency statistics adopt the TF-IWF algorithm to generate a set of words with a high occurrence frequency.
7. The hot word extraction method according to claim 1, wherein the topic clustering adopts an LDA model to generate the core keywords of the topics and their specific probabilities.
8. The hot word extraction method according to claim 1, wherein the hot word screening extracts the set of top-ranked words from the results generated in step 3.
9. A hot word extraction system, the system comprising:
a data collection module, used for collecting text information in the field of ecological environment;
a word segmentation processing module, used for performing word segmentation processing on the collected data;
a hot word calculation module, used for extracting keywords based on the word segmentation processing result, and/or performing word frequency statistics based on the word segmentation processing result, and/or performing topic clustering based on the word segmentation processing result;
a hot word screening module, used for screening hot words based on the results of the keyword extraction, word frequency statistics, and topic clustering.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method according to any one of claims 1 to 8.
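The word frequency statistics and screening steps recited in the claims above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the TF-IWF formula is one common formulation (term frequency within a document multiplied by the logarithm of total corpus tokens over the word's corpus count), summing scores across documents is an assumption, and the voting rule in `vote_hot_words` is a hypothetical way of combining several top-ranked lists. All function names are illustrative.

```python
import math
from collections import Counter


def tf_iwf(docs: list[list[str]]) -> dict[str, float]:
    """Score each word by TF-IWF, summed over all documents (assumed aggregation).

    TF  = count of the word in a document / length of that document
    IWF = log(total tokens in corpus / count of the word in corpus)
    """
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    total_tokens = sum(corpus_counts.values())
    scores: dict[str, float] = {}
    for doc in docs:
        if not doc:
            continue  # skip empty documents to avoid division by zero
        for word, n in Counter(doc).items():
            tf = n / len(doc)
            iwf = math.log(total_tokens / corpus_counts[word])
            scores[word] = scores.get(word, 0.0) + tf * iwf
    return scores


def top_words(scores: dict[str, float], k: int) -> list[str]:
    """Return the k highest-scoring words (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def vote_hot_words(rankings: list[list[str]], k: int, min_votes: int = 2) -> list[str]:
    """Keep words that fall in the top k of at least min_votes rankings."""
    votes = Counter(w for ranking in rankings for w in set(ranking[:k]))
    return sorted(w for w, v in votes.items() if v >= min_votes)
```

In use, `rankings` would hold one best-first word list per calculation (keyword extraction, word frequency statistics, topic clustering); lowering `min_votes` to 1 turns the vote into a plain union of the top-ranked words.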
CN202211446313.0A 2022-11-18 2022-11-18 Hot word extraction method, system, computer device and storage medium Pending CN115712700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211446313.0A CN115712700A (en) 2022-11-18 2022-11-18 Hot word extraction method, system, computer device and storage medium

Publications (1)

Publication Number Publication Date
CN115712700A true CN115712700A (en) 2023-02-24

Family

ID=85233687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211446313.0A Pending CN115712700A (en) 2022-11-18 2022-11-18 Hot word extraction method, system, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN115712700A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107153658A (en) * 2016-03-03 2017-09-12 常州普适信息科技有限公司 A kind of public sentiment hot word based on weighted keyword algorithm finds method
CN113360646A (en) * 2021-06-02 2021-09-07 华院计算技术(上海)股份有限公司 Text generation method and equipment based on dynamic weight and storage medium
CN114912446A (en) * 2022-04-29 2022-08-16 中证信用增进股份有限公司 Keyword extraction method and device and storage medium
CN114996444A (en) * 2022-06-28 2022-09-02 中国人民解放军63768部队 Automatic news summarization method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
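Ma Hong; Cai Yongming: "Topic analysis of Chinese texts with a co-word network LDA model: the case of traffic law literature (2000-2016)", New Technology of Library and Information Service (现代图书情报技术), no. 12 *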

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076963A (en) * 2023-10-17 2023-11-17 北京国科众安科技有限公司 Information heat analysis method based on big data platform
CN117076963B (en) * 2023-10-17 2024-01-02 北京国科众安科技有限公司 Information heat analysis method based on big data platform
CN117669550B (en) * 2023-11-13 2024-04-30 东风日产数据服务有限公司 Topic mining method, system, equipment and medium based on text center

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination