CN117076963A - Information heat analysis method based on big data platform - Google Patents

Information heat analysis method based on big data platform

Info

Publication number
CN117076963A
Authority
CN
China
Prior art keywords
information
word
clustering
matrix
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311337284.9A
Other languages
Chinese (zh)
Other versions
CN117076963B (en)
Inventor
胡红亮
郭传斌
聂雯莹
丁荣
杨万波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Guoke Zhongan Technology Co ltd
Original Assignee
Beijing Guoke Zhongan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guoke Zhongan Technology Co ltd filed Critical Beijing Guoke Zhongan Technology Co ltd
Priority to CN202311337284.9A
Publication of CN117076963A
Application granted
Publication of CN117076963B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/42 - Data-driven translation
    • G06F40/44 - Statistical methods, e.g. probability models
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an information heat analysis method based on a big data platform, relating to the technical field of natural language processing. The method comprises: analysing a first clustering result, obtained by using a clustering algorithm to cluster information acquired within a preset time period across a plurality of target fields, to generate a first hot word set; calculating the semantic degree between each piece of information in the first clustering result and the corresponding first hot word set; and acquiring the release time and user behavior data of each piece of information in the first clustering result and combining them with the semantic degree to calculate the information heat. By clustering the multi-field information with a clustering algorithm to generate the corresponding hot word set, calculating the semantic degree of each piece of information in a category through the hot word set, and then combining the semantic degree with multidimensional data to calculate the heat, information heat values are calculated quickly and automatically from a large amount of news information while the traditional manual acquisition approach is effectively replaced, which plays an important role in the analysis of information in multiple fields.

Description

Information heat analysis method based on big data platform
Technical Field
The invention relates to the technical field of natural language processing, in particular to an information heat analysis method based on a big data platform.
Background
At present, society generates a large amount of information every day, and acquiring hot-spot information is becoming increasingly important for analysis. Traditional information heat acquisition relies on manually setting judgment criteria and then reading and comprehensively judging the information against those criteria; however, the data types in the information field are diverse, and the manual approach is inefficient.
Therefore, an information heat analysis method based on a big data platform can automatically calculate heat values from a large amount of news information, obtain hot information quickly, replace the traditional manual approach, and improve efficiency.
Disclosure of Invention
The invention provides an information heat analysis method based on a big data platform, which clusters multi-field information with a clustering algorithm and then generates a corresponding hot word set; the semantic degree of each piece of information in a category is calculated through the hot word set, and the semantic degree is then combined with multidimensional data to calculate the heat, so that information heat values are calculated quickly and automatically from a large amount of news information while the traditional manual acquisition approach is effectively replaced, which plays an important role in the analysis of information in multiple fields.
The invention provides an information heat analysis method based on a big data platform, which comprises the following steps:
step 1: clustering information acquired in a preset time period of a plurality of target fields by adopting a clustering algorithm to obtain a first clustering result;
step 2: analyzing the first clustering result to generate a first hotword set;
step 3: calculating the semantic degree between each piece of information in the first clustering result and the correspondingly generated first hot word set;
step 4: acquiring the release time and user behavior data of each piece of information in the first clustering result, and combining them with the semantic degree to calculate the information heat.
Preferably, clustering information acquired in a preset time period of a plurality of target fields by using a clustering algorithm to obtain a first clustering result, including:
step 11: acquiring information data in preset time periods of all target fields from a network at regular time;
step 12: removing abnormal data from the acquired information data to obtain first data;
step 13: constructing a corresponding first word segmentation dictionary based on professional vocabularies related to the first data to segment the first data, and then marking parts of speech and removing stop words to obtain a target data set;
step 14: obtaining the confusion degree (perplexity) of the target data set, so as to obtain the optimal clustering number K;
step 15: randomly selecting K pieces of first information from the target data set as the centers of the initial clusters, calculating the distance from each remaining piece of information to the K pieces of first information, and assigning each remaining piece of information to the nearest cluster;
step 16: randomly selecting one of the remaining pieces of information, calculating the total cost value of replacing a piece of representative information with it, and, if the total cost value is less than zero, performing the replacement to form K new clusters and re-clustering, repeating until no new clusters appear, and then outputting the K first clustering results.
Preferably, obtaining the confusion degree of the target data set, so as to obtain the optimal clustering number, including:
the confusion degree PP(D) of the target data set is calculated as follows:
PP(D) = exp( - ( Σ_{w=1..M} Σ_{n=1..N_w} log p(w_n) ) / ( Σ_{w=1..M} N_w ) ); wherein M denotes the total number of pieces of information in the target data set; N_w denotes the number of words in information w; w_n denotes a word in information w; and p(w_n) denotes the probability that the word w_n is generated in information w;
according to the confusion degree calculation formula, experiments are carried out to construct a line graph of clustering number versus confusion degree; the clustering number at the inflection point of the line graph where the confusion degree is minimal is selected as the optimal clustering number and output.
Preferably, analyzing the first clustering result to generate a first hot word set includes:
step 21: performing new word recognition on the information in the first clustering result so as to generate a new word dictionary;
step 22: based on the new word dictionary, performing word segmentation on the information in the first clustering result to obtain a plurality of candidate words;
step 23: calculating from three dimensions of word frequency, word frequency increasing rate and information acquisition source influence to obtain a first heat corresponding to the candidate word;
step 24: and according to all the acquired first heat, collecting candidate words corresponding to the first heat which is larger than a preset heat threshold to obtain a first heat word set, and outputting the first heat word set.
Preferably, the method for identifying new words of information in the first clustering result, thereby generating a new word dictionary, includes:
step 31: recording the number of occurrences of left and right adjacent words corresponding to all words in the information in the first clustering result;
step 32: obtaining corresponding left information entropy and right information entropy and statistical information entropy after left and right combination according to the proportion of each word in the left adjacent word and the right adjacent word of each word;
step 33: respectively acquiring word frequencies of left and right adjacent words, and calculating word aggregation degree based on word frequencies of adjacent occurrence of corresponding words;
step 34: calculating the score Score(D) of each word combination D and, according to the magnitude of Score(D), regarding a word combination D that has not been discovered before as a new word and adding it to a custom new word dictionary;
wherein the score is calculated according to the following formula:
Score(D) = λ·P + μ·N; wherein λ denotes a manually set coefficient controlling the word aggregation (cohesion) degree; μ denotes a manually set coefficient controlling the importance of the information entropy; P denotes the word aggregation degree; and N denotes the information entropy.
Preferably, the first heat corresponding to the candidate word is obtained by calculating from three dimensions of word frequency, word frequency increasing rate and information acquisition source influence, including:
step 41: acquiring word frequency increasing rate and corresponding word frequency increasing rate weight of the candidate words, and outputting the candidate words with the word frequency increasing rate larger than a preset increasing threshold as first words;
step 42: classifying all the obtained first words according to the parts of speech to obtain the number and the occupation ratio of each part of speech word, giving part of speech weight to the corresponding first word based on the occupation ratio, and combining the word frequency and the position weight of the first word to obtain the word frequency weight;
step 43: determining an information acquisition source corresponding to the information containing the first word, thereby obtaining a corresponding information acquisition source influence weight;
step 44: the first heat of the first word is calculated according to the following formula:
H_c^t = W_tf(c, t) · W_gr(c, t) · Σ_{i ∈ I_c^t} W_src(i); wherein H_c^t denotes the first heat of the word c within the preset time period t; W_tf(c, t) denotes the word frequency weight of the word c in the corresponding first clustering result within the preset time period t; W_gr(c, t) denotes the word frequency increase rate weight of the word c in the corresponding first clustering result within the preset time period t; W_src(i) denotes the information acquisition source influence weight of the information i in the information set I_c^t; and I_c^t denotes the set of information containing the word c within the preset time period t.
Preferably, calculating the related semanteme degree of each piece of information in the first clustering result and the corresponding generated first hotword set includes:
step 51: each sentence forming the information in the first clustering result is used as a target sentence and is sequentially encoded into a first matrix formed by word vectors, and then a first hot word set generated by the corresponding first clustering result is encoded into a second matrix formed by the word vectors;
step 52: comparing the similarity of the target sentence and all the hot words in the corresponding first hot word set based on word granularity by using an interactive attention mechanism to obtain a first attention matrix;
step 53: interacting the first attention matrix with the first matrix and with the second matrix respectively to generate an attention-weighted new sentence matrix and a new set matrix;
step 54: splicing and fusing the new sentence matrix and the first matrix to obtain a first representation matrix, and splicing and fusing the new set matrix and the second matrix to obtain a second representation matrix;
step 55: respectively inputting the first representation matrix and the second representation matrix into a Transformer encoder for deep semantic encoding to obtain a first semantic feature vector and a second semantic feature vector;
step 56: after the first semantic feature vector and the second semantic feature vector are fused, performing feature weight adjustment through a fully connected network, and then obtaining the semantic degree p_i between the target sentence and the corresponding first hot word set by using a Softmax normalization function;
step 57: collecting the semantic degrees of all target sentences constituting the information with respect to the corresponding first hot word set to obtain the related semantic degree of the information and the correspondingly generated first hot word set, P = {p_1, p_2, ..., p_N}, wherein p_i denotes the semantic degree of the i-th target sentence and N denotes the total number of sentences constituting the information.
Preferably, the user behavior data comprises download count, read count and favorite count data.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of a method for information heat analysis based on a big data platform according to an embodiment of the invention;
FIG. 2 is a flow chart of information calculation heat in an embodiment of the invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
An embodiment of the present invention provides a method for thermal analysis of information based on a big data platform, as shown in fig. 1, including:
step 1: clustering information acquired in a preset time period of a plurality of target fields by adopting a clustering algorithm to obtain a first clustering result;
step 2: analyzing the first clustering result to generate a first hotword set;
step 3: calculating the semantic degree between each piece of information in the first clustering result and the correspondingly generated first hot word set;
step 4: acquiring the release time and user behavior data of each piece of information in the first clustering result, and combining them with the semantic degree to calculate the information heat.
In this embodiment, the target fields are pre-selected, such as health food, medical care and fitness; the preset time period is set in advance; information refers to content that brings value to users within a relatively short time because it is obtained and used in a timely manner; the first clustering result is obtained by using a clustering algorithm to cluster, by similarity, all the information data acquired within the preset time period after preprocessing, where the preprocessing includes removing abnormal data, part-of-speech tagging, word segmentation and stop word removal.
In this embodiment, the first hot word set specifically considers three aspects, namely word frequency, word frequency increase rate and the influence of information acquisition sources, where the information acquisition sources include microblogs, WeChat official accounts and related websites; the related semantic degree is a numerical description of how similar each piece of information in the first clustering result is to the semantic features of the correspondingly generated first hot word set; the user behavior data consists of download count, read count and favorite count data; the information heat is calculated from the release time of each piece of information in the first clustering result and the user behavior data, combined with the semantic degree.
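As a purely illustrative sketch of how such a combination could be computed, the Python snippet below mixes the semantic degree, a recency factor derived from the release time, and logarithmically dampened user behavior counts; the weighting scheme, the exponential time decay and all parameter values are assumptions for illustration and are not the formula claimed by the patent.

import math
import time

def information_heat(semantic_degree, release_ts, downloads, reads, favorites,
                     now_ts=None, half_life_hours=24.0,
                     w_sem=0.5, w_time=0.2, w_behavior=0.3):
    # Recency decays by half every half_life_hours after release (assumed decay form).
    now_ts = time.time() if now_ts is None else now_ts
    age_hours = max(0.0, (now_ts - release_ts) / 3600.0)
    recency = math.exp(-math.log(2.0) * age_hours / half_life_hours)
    # Dampen large user behavior counts so a single viral item does not dominate.
    behavior = math.log1p(downloads + reads + favorites)
    return w_sem * semantic_degree + w_time * recency + w_behavior * behavior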
in this embodiment, for example, there is a hot news automatic acquisition step as follows, and the information calculation hot flow chart is shown in fig. 2;
step s1: acquiring information data every day at regular time;
step s2: clustering the information data with the LDA clustering algorithm, where the number of classes n must be predefined before clustering; once clustering is completed, the n classes of information are obtained;
step s3: after clustering, n class-level phrase sets are generated;
step s4: calculating the semantic degree of each piece of information in each class against the class's phrase set, where the semantic degree is calculated with the TF-IDF (term frequency-inverse document frequency) algorithm (a sketch of this step follows the list);
step s5: acquiring the release time and user behavior of each piece of information;
step s6: integrating the semantic degree, the release time and the comprehensive user behavior value of each piece of information to calculate its heat;
step s7: and ordering all the information according to the hotness value to obtain the hotness news.
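As a rough illustration of step s4, the sketch below scores each piece of information in a class against the class's phrase set using TF-IDF vectors and cosine similarity from scikit-learn; the choice of scikit-learn and of cosine similarity as the comparison measure is an assumption for illustration, not something specified by the patent.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_degree_tfidf(class_docs, phrase_set):
    # class_docs: word-segmented documents of one class, each joined by spaces.
    # phrase_set: the hot phrase set of that class, treated as one pseudo-document.
    query = " ".join(phrase_set)
    matrix = TfidfVectorizer().fit_transform(class_docs + [query])
    doc_vectors, query_vector = matrix[:-1], matrix[-1]
    # One similarity score per document: its semantic degree with the phrase set.
    return cosine_similarity(doc_vectors, query_vector).ravel().tolist()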
The beneficial effects of the technical scheme are as follows: the multi-field information is clustered with a clustering algorithm to generate a corresponding hot word set; the semantic degree of each piece of information in a category is calculated through the hot word set, and the semantic degree is then combined with multidimensional data to calculate the heat, so that information heat values are calculated quickly and automatically from a large amount of news information while the traditional manual acquisition approach is effectively replaced, which plays an important role in the analysis of information in multiple fields.
The embodiment of the invention provides an information heat analysis method based on a big data platform, which adopts a clustering algorithm to cluster information acquired in a preset time period of a plurality of target fields to obtain a first clustering result, and comprises the following steps:
step 11: acquiring information data in preset time periods of all target fields from a network at regular time;
step 12: removing abnormal data from the acquired information data to obtain first data;
step 13: constructing a corresponding first word segmentation dictionary based on professional vocabularies related to the first data to segment the first data, and then marking parts of speech and removing stop words to obtain a target data set;
step 14: obtaining the confusion degree of the target data set, so as to obtain the optimal clustering number K;
step 15: randomly selecting K pieces of first information from the target data set as the centers of the initial clusters, calculating the distance from each remaining piece of information to the K pieces of first information, and assigning each remaining piece of information to the nearest cluster;
step 16: randomly selecting one of the remaining pieces of information, calculating the total cost value of replacing a piece of representative information with it, and, if the total cost value is less than zero, performing the replacement to form K new clusters and re-clustering, repeating until no new clusters appear, and then outputting the K first clustering results.
In this embodiment, the target fields are pre-selected, such as health food, medical care and fitness; the preset time period is set in advance; information refers to content that brings value to users within a relatively short time because it is obtained and used in a timely manner, and information data refers to text data; the first data are obtained by removing abnormal data from the acquired information data, where the purpose of removing the abnormal data is to ensure that subsequent data processing is accurate.
In this embodiment, the first word segmentation dictionary is constructed based on the specialized vocabulary involved in the first data and is used for segmenting the first data; part of speech refers to the grammatical feature of a word used as the basis for classifying it, such as nouns, verbs and adjectives; stop words can be removed using a stop word list downloaded from the network, where a stop word is a word that is essentially useless or carries no meaning, such as an auxiliary word or a modal particle; the target data set is the data set obtained after segmenting the first data, tagging parts of speech and removing stop words; the confusion degree is used to obtain the optimal number of clusters; the total cost value is the sum of the costs of all non-center objects obtained with a cost function, where the cost function measures the quality of the clustering result and is expressed as TC = E1 - E2, in which E1 denotes the sum of squared distances between all objects in the target data set and the current cluster centers when the remaining piece of information that replaces the representative information serves as a cluster center, and E2 denotes the sum of squared distances between all objects in the target data set and the current cluster centers when the representative information serves as the cluster center.
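A minimal sketch of this swap cost test, in the spirit of a k-medoids (PAM) style algorithm, is shown below; it assumes the pieces of information have already been converted into numeric vectors and that squared Euclidean distance is used, both of which are assumptions for illustration.

import numpy as np

def swap_cost(points, medoid_ids, old_medoid, candidate):
    # points: (N, d) array of information vectors; medoid_ids: indices of current cluster centers.
    # Returns TC = E_new - E_old; the swap of old_medoid for candidate is kept when TC < 0.
    def sum_sq(center_ids):
        centers = points[center_ids]                                    # (K, d)
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
        return d2.min(axis=1).sum()                                     # each object to its nearest center
    new_ids = [candidate if m == old_medoid else m for m in medoid_ids]
    return sum_sq(new_ids) - sum_sq(medoid_ids)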
The beneficial effects of the technical scheme are as follows: the optimal clustering number is determined in advance by solving the confusion degree after preprocessing the information data, and the optimal cluster center is found by combining the cost function so as to realize the classification of similar information into the same class, thereby laying a foundation for the subsequent hot word acquisition.
The embodiment of the invention provides an information heat analysis method based on a big data platform, which is used for obtaining the confusion degree of a target data set so as to obtain the optimal clustering number, and comprises the following steps:
the confusion degree PP(D) of the target data set is calculated as follows:
PP(D) = exp( - ( Σ_{w=1..M} Σ_{n=1..N_w} log p(w_n) ) / ( Σ_{w=1..M} N_w ) ); wherein M denotes the total number of pieces of information in the target data set; N_w denotes the number of words in information w; w_n denotes a word in information w; and p(w_n) denotes the probability that the word w_n is generated in information w;
according to the confusion degree calculation formula, experiments are carried out to construct a line graph of clustering number versus confusion degree; the clustering number at the inflection point of the line graph where the confusion degree is minimal is selected as the optimal clustering number and output.
In this embodiment, the clustering number-confusion degree line graph is a graph constructed from experiments with the clustering number as the independent variable and the confusion degree as the dependent variable; it is used to determine the optimal clustering number, which improves how clearly the information data are partitioned.
The beneficial effects of the technical scheme are as follows: solving the confusion degree with the formula quickly yields the optimal clustering number, which improves the distinguishability of the information data and benefits the accuracy of subsequent information clustering.
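For concreteness, the sketch below estimates a perplexity-style confusion degree for several candidate cluster numbers with gensim's LdaModel, so that the clustering number versus confusion degree curve can be plotted and its inflection point chosen; the use of gensim (and its log_perplexity helper, which returns a per-word log2 bound) is an assumption for illustration.

from gensim import corpora, models

def confusion_by_k(tokenized_docs, k_values):
    # tokenized_docs: list of word lists, one per piece of information.
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    scores = {}
    for k in k_values:
        lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                              num_topics=k, passes=5, random_state=0)
        scores[k] = 2 ** (-lda.log_perplexity(corpus))  # perplexity = 2^(-per-word bound)
    return scores  # plot k against scores[k] and pick the inflection point as K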
The embodiment of the invention provides an information heat analysis method based on a big data platform, which analyzes the first clustering result to generate a first hot word set, and comprises the following steps:
step 21: performing new word recognition on the information in the first clustering result so as to generate a new word dictionary;
step 22: based on the new word dictionary, performing word segmentation on the information in the first clustering result to obtain a plurality of candidate words;
step 23: calculating from three dimensions of word frequency, word frequency increasing rate and information acquisition source influence to obtain a first heat corresponding to the candidate word;
step 24: and according to all the acquired first heat, collecting candidate words corresponding to the first heat which is larger than a preset heat threshold to obtain a first heat word set, and outputting the first heat word set.
In this embodiment, the first clustering result is the result obtained by using a clustering algorithm to cluster, by similarity, all information data acquired within the preset time period after preprocessing, where the preprocessing includes removing abnormal data, part-of-speech tagging, word segmentation and stop word removal.
In this embodiment, the purpose of new word recognition is to identify newly coined words and internet phrases that may exist in the information, so that new words interfere as little as possible with the subsequent clustering task and cause errors; the new word dictionary is obtained by collecting the identified new words; candidate words are the words obtained by re-segmenting the information in the first clustering result with the new word dictionary; the first heat is the degree of attention obtained by comprehensively considering and calculating three aspects, namely word frequency, word frequency increase rate and information acquisition source influence, where the information acquisition sources include microblogs, WeChat official accounts and websites; the preset heat threshold is set in advance; the first hot word set is obtained by combining all acquired hot words.
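For illustration only, the candidate words could be produced with the jieba segmentation library after loading the custom new word dictionary; the patent does not name a particular segmenter, so jieba, the file format of the dictionary and the stop word handling are assumptions.

import jieba

def candidate_words(documents, new_word_dict_path, stop_words):
    # Load the custom new word dictionary (one word per line) so new words are kept intact.
    jieba.load_userdict(new_word_dict_path)
    # Segment each piece of information and drop stop words and whitespace tokens.
    return [[w for w in jieba.lcut(doc) if w.strip() and w not in stop_words]
            for doc in documents]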
The beneficial effects of the technical scheme are as follows: the candidate words are obtained by generating a new word dictionary and then segmenting the information in the first clustering result; the first heat degree of the candidate words is determined from three angles of word frequency, word frequency increasing rate and information acquisition source influence, so that rapid and accurate screening is realized, a first heat word set is obtained through collection, and data support is provided for acquisition of subsequent semanteme.
The embodiment of the invention provides an information heat analysis method based on a big data platform, which carries out new word recognition on information in a first clustering result so as to generate a new word dictionary, and comprises the following steps:
step 31: recording the number of occurrences of left and right adjacent words corresponding to all words in the information in the first clustering result;
step 32: obtaining corresponding left information entropy and right information entropy and statistical information entropy after left and right combination according to the proportion of each word in the left adjacent word and the right adjacent word of each word;
step 33: respectively acquiring word frequencies of left and right adjacent words, and calculating word aggregation degree based on word frequencies of adjacent occurrence of corresponding words;
step 34: calculating the score Score(D) of each word combination D and, according to the magnitude of Score(D), regarding a word combination D that has not been discovered before as a new word and adding it to a custom new word dictionary;
wherein the score is calculated according to the following formula:
Score(D) = λ·P + μ·N; wherein λ denotes a manually set coefficient controlling the word aggregation (cohesion) degree; μ denotes a manually set coefficient controlling the importance of the information entropy; P denotes the word aggregation degree; and N denotes the information entropy.
In the embodiment, the information entropy is a measure of how much information is, and can be used for measuring whether the left-right collocation of a word is rich or not, and the higher the information entropy is, the more abundant the information quantity is represented; the statistical information entropy is obtained by combining the corresponding information entropy of the word left and right adjacent words.
In this embodiment, for example, for 'smart' in 'smart city', the left neighbors are rich (they may be 'research', 'is', 'about' and so on), while the right neighbor is very scarce, almost only 'city'; at this point 'smart city' should be segmented as a single word.
In this embodiment, a high word aggregation degree means that the characters making up the word mainly occur together rather than being matched at random, and it is calculated as follows:
P = p(a, b) / ( p(a) × p(b) ); wherein P denotes the word aggregation degree; p(a, b) denotes the probability that the words a and b occur together; p(a) denotes the frequency of occurrence of the word a; and p(b) denotes the frequency of occurrence of the word b.
In this embodiment, the new words should satisfy that the left-right collocation of the words is sufficiently rich and the word aggregation is sufficiently high, that is, the words contained in the words always appear simultaneously.
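The sketch below scores one candidate word combination from neighbor counts and co-occurrence statistics, combining a cohesion term and a neighbor entropy term as Score = λ·P + μ·N; taking the minimum of the left and right entropies as N and the default coefficient values are assumptions for illustration, consistent with but not dictated by the definitions above.

import math
from collections import Counter

def neighbor_entropy(neighbors):
    # Shannon entropy of the distribution of neighboring words.
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values()) if total else 0.0

def new_word_score(freq_ab, freq_a, freq_b, total_tokens,
                   left_neighbors, right_neighbors, lam=1.0, mu=1.0):
    # Cohesion P: how much more often a and b co-occur than chance would predict.
    p_ab = freq_ab / total_tokens
    cohesion = p_ab / ((freq_a / total_tokens) * (freq_b / total_tokens))
    # Entropy N: richness of the contexts around the combination (min of both sides, assumed).
    n = min(neighbor_entropy(left_neighbors), neighbor_entropy(right_neighbors))
    return lam * cohesion + mu * n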
The beneficial effects of the technical scheme are as follows: the information in the first clustering result is effectively identified by combining the information entropy and the word aggregation degree, so that a new word dictionary is constructed to be used for subsequent information and word segmentation, and the subsequent hot word extraction task can be prevented from being interfered by the new word as much as possible to generate errors.
The embodiment of the invention provides an information heat analysis method based on a big data platform, which calculates and obtains a first heat corresponding to a candidate word from three dimensions of word frequency, word frequency growth rate and information acquisition source influence, and comprises the following steps:
step 41: acquiring word frequency increasing rate and corresponding word frequency increasing rate weight of the candidate words, and outputting the candidate words with the word frequency increasing rate larger than a preset increasing threshold as first words;
step 42: classifying all the obtained first words according to the parts of speech to obtain the number and the occupation ratio of each part of speech word, giving part of speech weight to the corresponding first word based on the occupation ratio, and combining the word frequency and the position weight of the first word to obtain the word frequency weight;
step 43: determining an information acquisition source corresponding to the information containing the first word, thereby obtaining a corresponding information acquisition source influence weight;
step 44: the first heat of the first word is calculated according to the following formula:
H_c^t = W_tf(c, t) · W_gr(c, t) · Σ_{i ∈ I_c^t} W_src(i); wherein H_c^t denotes the first heat of the word c within the preset time period t; W_tf(c, t) denotes the word frequency weight of the word c in the corresponding first clustering result within the preset time period t; W_gr(c, t) denotes the word frequency increase rate weight of the word c in the corresponding first clustering result within the preset time period t; W_src(i) denotes the information acquisition source influence weight of the information i in the information set I_c^t; and I_c^t denotes the set of information containing the word c within the preset time period t.
In this embodiment, the word frequency increase rate weight refers to the relative importance of the word frequency trend in evaluating a word's heat, where a word whose word frequency increase rate is greater than the preset increase threshold has the potential to become a hot word; the preset increase threshold is set in advance; the first word refers to a candidate word whose word frequency increase rate is greater than the preset increase threshold; part of speech refers to the grammatical feature of a word used as the basis for classifying it, such as nouns, verbs and adjectives.
In this embodiment, the position weight refers to the relative importance of a word's position in evaluating its heat; for example, a word in the title is more likely to represent the subject of the information; the word frequency weight combines word frequency, word position and part-of-speech information to describe the relative importance for evaluating word heat; the first heat is the degree of attention obtained by comprehensively considering word frequency, word frequency increase rate and information acquisition source influence, where the information acquisition sources include microblogs, WeChat official accounts and websites; the information acquisition source influence weight is obtained from the user count, PageRank value, number of backlinks and number of searches and is used to identify hot words accurately, because information from a more influential acquisition source is more likely to become hot news and is therefore more likely to generate hot words.
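Under the reconstructed form of the formula given above (itself an inference from the listed definitions rather than the patent's own rendering), the computation reduces to a few lines:

def first_heat(tf_weight, growth_weight, source_weights):
    # First heat of word c in period t: frequency weight times growth-rate weight
    # times the summed source influence weights of the information items containing c.
    return tf_weight * growth_weight * sum(source_weights)

# Example: first_heat(0.8, 1.3, [0.9, 0.6, 0.7]) == 0.8 * 1.3 * 2.2 == 2.288 (approximately)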
The beneficial effects of the technical scheme are as follows: the characteristics of the hot words are researched from three angles of word frequency, word frequency increasing rate and information acquisition source influence, so that the credibility of the heat of the acquired candidate words is ensured, and a foundation is laid for the follow-up accurate and effective selection of the hot words.
The embodiment of the invention provides an information heat analysis method based on a big data platform, which calculates the related semanteme degree of each piece of information in a first clustering result and a first heat word set correspondingly generated, and comprises the following steps:
step 51: each sentence forming the information in the first clustering result is used as a target sentence and is sequentially encoded into a first matrix formed by word vectors, and then a first hot word set generated by the corresponding first clustering result is encoded into a second matrix formed by the word vectors;
step 52: comparing the similarity of the target sentence and all the hot words in the corresponding first hot word set based on word granularity by using an interactive attention mechanism to obtain a first attention matrix;
step 53: interacting the first attention matrix with the first matrix and with the second matrix respectively to generate an attention-weighted new sentence matrix and a new set matrix;
step 54: splicing and fusing the new sentence matrix and the first matrix to obtain a first representation matrix, and splicing and fusing the new set matrix and the second matrix to obtain a second representation matrix;
step 55: respectively inputting the first representation matrix and the second representation matrix into a Transformer encoder for deep semantic encoding to obtain a first semantic feature vector and a second semantic feature vector;
step 56: after the first semantic feature vector and the second semantic feature vector are fused, performing feature weight adjustment through a fully connected network, and then obtaining the semantic degree p_i between the target sentence and the corresponding first hot word set by using a Softmax normalization function;
step 57: collecting the semantic degrees of all target sentences constituting the information with respect to the corresponding first hot word set to obtain the related semantic degree of the information and the correspondingly generated first hot word set, P = {p_1, p_2, ..., p_N}, wherein p_i denotes the semantic degree of the i-th target sentence and N denotes the total number of sentences constituting the information.
In this embodiment, the target sentence is used to describe any one sentence constituting information; the first matrix is in a sentence matrix form obtained by constructing a target sentence according to word vectors; the second matrix refers to encoding a first hot word set generated by the first clustering result into a matrix form composed of word vectors; the interaction attention mechanism is used for carrying out one-time interaction on the word vector matrix before the sentence semantic vector is acquired, so that the accuracy of the semantic degree between the acquired information and the hotword is improved; the first attention matrix is obtained based on the similarity of the target sentence and the word granularity of all hot words in the corresponding first hot word set, wherein the word granularity is used for segmenting the target sentence, and the first attention matrix has the advantage of being capable of well preserving the semantics and boundary information of the words.
In this embodiment, the new sentence matrix is a matrix obtained by performing interactive attention weighting on the first attention matrix and the first matrix; the new set matrix is a matrix obtained by carrying out interactive attention weighting on the first attention matrix and the second matrix; the first representation matrix is obtained by splicing and fusing a new sentence matrix and the first matrix, and aims to enrich sentence characteristic information; the second expression matrix is obtained by splicing and fusing the new set matrix and the second matrix, and the purpose of the second expression matrix is to enrich sentence characteristic information.
In this embodiment, for example, there are a first matrix X, encoding the target sentence s, and a second matrix Y, encoding the first hot word set H, from which a first attention matrix Z is generated; by weighting the first matrix X and the second matrix Y with the first attention matrix Z through interactive attention, a new sentence matrix A and a new set matrix B are obtained, and their row vectors are generated as follows:
a_i = Σ_j α_ij · y_j and b_j = Σ_i β_ij · x_i; wherein z_ij, the element in row i and column j of the first attention matrix Z, denotes the similarity between the i-th word of the target sentence s and the j-th word of the first hot word set H; α_ij and β_ij denote the attention weights obtained by normalizing z_ij over the hot words and over the sentence words respectively (for example with a Softmax); y_j denotes the word vector of the j-th word in the first hot word set H; and x_i denotes the word vector of the i-th word in the target sentence s;
the row vectors a_i of the new sentence matrix A and b_j of the new set matrix B are then spliced and fused with the row vectors x_i of the first matrix X and y_j of the second matrix Y respectively to obtain the first representation matrix R and the second representation matrix W, whose corresponding row vectors are expressed as r_i = [x_i ; a_i] and w_j = [y_j ; b_j].
In this embodiment, the first semantic feature vector and the second semantic feature vector are the encoded vectors obtained by extracting features from the first representation matrix and the second representation matrix with a Transformer encoder, the purpose of which is to obtain deep semantic information; the fully connected network turns high-dimensional information into low-dimensional information, preserving useful information and adjusting the feature weights; the Softmax normalization function is used to predict the semantic degree between the target sentence and the corresponding first hot word set.
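A compact PyTorch sketch of the interaction described above (word-level attention between a sentence matrix and a hot word set matrix, attention-weighted matrices, concatenation, Transformer encoding, fusion through a fully connected layer and Softmax) is given below; the layer sizes, the mean pooling over token positions and the two-class output head are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticDegree(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(4 * dim, 2)  # fused features -> related / unrelated logits

    def forward(self, X, Y):
        # X: (n_sentence_words, dim) word vectors of the target sentence (first matrix).
        # Y: (n_hot_words, dim) word vectors of the hot word set (second matrix).
        Z = X @ Y.t()                              # first attention matrix of word similarities
        A = F.softmax(Z, dim=-1) @ Y               # attention-weighted new sentence matrix
        B = F.softmax(Z, dim=0).t() @ X            # attention-weighted new set matrix
        R = torch.cat([X, A], dim=-1).unsqueeze(0) # first representation matrix
        W = torch.cat([Y, B], dim=-1).unsqueeze(0) # second representation matrix
        h1 = self.encoder(R).mean(dim=1)           # first semantic feature vector
        h2 = self.encoder(W).mean(dim=1)           # second semantic feature vector
        logits = self.fc(torch.cat([h1, h2], dim=-1))
        return F.softmax(logits, dim=-1)[0, 1]     # semantic degree of the sentence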
The beneficial effects of the technical scheme are as follows: by using an interactive attention mechanism to let the word vector matrices interact before the sentence semantic vectors are obtained, the semantic information of the sentences and their interaction information with the hot words are considered together; combining this with the Transformer encoder and the Softmax normalization function to collect the semantic degrees, with respect to the first hot word set, of all sentences constituting a piece of information makes it possible to determine the related semantic degree of the information and the correspondingly generated first hot word set effectively, providing accurate data support for the subsequent rapid automatic calculation of information heat values from a large amount of news information.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (8)

1. An information heat analysis method based on a big data platform is characterized by comprising the following steps:
step 1: clustering information acquired in a preset time period of a plurality of target fields by adopting a clustering algorithm to obtain a first clustering result;
step 2: analyzing the first clustering result to generate a first hotword set;
step 3: calculating the semantic degree between each piece of information in the first clustering result and the correspondingly generated first hot word set;
step 4: acquiring the release time and user behavior data of each piece of information in the first clustering result, and combining them with the semantic degree to calculate the information heat.
2. The information heat analysis method based on a big data platform according to claim 1, wherein clustering information acquired in a preset time period of a plurality of target fields by using a clustering algorithm to obtain a first clustering result comprises:
step 11: acquiring information data in preset time periods of all target fields from a network at regular time;
step 12: removing abnormal data from the acquired information data to obtain first data;
step 13: constructing a corresponding first word segmentation dictionary based on professional vocabularies related to the first data to segment the first data, and then marking parts of speech and removing stop words to obtain a target data set;
step 14: obtaining the confusion degree of the target data set, so as to obtain the optimal clustering number K;
step 15: randomly selecting K pieces of first information from the target data set as the centers of the initial clusters, calculating the distance from each remaining piece of information to the K pieces of first information, and assigning each remaining piece of information to the nearest cluster;
step 16: randomly selecting one of the remaining pieces of information, calculating the total cost value of replacing a piece of representative information with it, and, if the total cost value is less than zero, performing the replacement to form K new clusters and re-clustering, repeating until no new clusters appear, and then outputting the K first clustering results.
3. The information heat analysis method based on a big data platform according to claim 2, wherein obtaining the confusion degree of the target data set, thereby obtaining the optimal clustering number, comprises:
the purpose is thatConfusion of target data setsThe formula for the calculation of (2) is as follows:
the method comprises the steps of carrying out a first treatment on the surface of the Wherein M represents the total number of information in the target data set; />Expressed as the number of words in the information w; />Expressed as words in the information w; />Expressed as word->Probability generated in the information w;
and (3) according to a confusion degree calculation formula, carrying out experiment construction to obtain a clustering number-confusion degree line graph, selecting the corresponding clustering number with the minimum confusion degree and at the inflection point from the line graph as the optimal clustering number, and outputting the selected clustering number.
4. The method of claim 1, wherein analyzing the first clustering result to generate a first hotword set comprises:
step 21: performing new word recognition on the information in the first clustering result so as to generate a new word dictionary;
step 22: based on the new word dictionary, performing word segmentation on the information in the first clustering result to obtain a plurality of candidate words;
step 23: calculating from three dimensions of word frequency, word frequency increasing rate and information acquisition source influence to obtain a first heat corresponding to the candidate word;
step 24: and according to all the acquired first heat, collecting candidate words corresponding to the first heat which is larger than a preset heat threshold to obtain a first heat word set, and outputting the first heat word set.
5. The method of claim 4, wherein performing new word recognition on the information in the first clustering result to generate a new word dictionary, comprises:
step 31: recording left adjacent words and right adjacent words corresponding to all words in the information in the first clustering result and the occurrence frequency of each adjacent word;
step 32: obtaining corresponding left information entropy and right information entropy and statistical information entropy after left and right combination according to the proportion of each word in the left adjacent word and the right adjacent word of each word;
step 33: respectively acquiring word frequencies of left and right adjacent words, and calculating word aggregation degree based on word frequencies of adjacent occurrence of corresponding words;
step 34: calculating the score Score(D) of each word combination D and, according to the magnitude of Score(D), regarding a word combination D that has not been discovered before as a new word and adding it to a custom new word dictionary;
wherein the score is calculated according to the following formula:
Score(D) = λ·P + μ·N; wherein λ denotes a manually set coefficient controlling the word aggregation (cohesion) degree; μ denotes a manually set coefficient controlling the importance of the information entropy; P denotes the word aggregation degree; and N denotes the information entropy.
6. The method of claim 4, wherein the calculating the first heat corresponding to the candidate word from three dimensions of word frequency, word frequency increase rate and information acquisition source influence comprises:
step 41: acquiring word frequency increasing rate and corresponding word frequency increasing rate weight of the candidate words, and outputting the candidate words with the word frequency increasing rate larger than a preset increasing threshold as first words;
step 42: classifying all the obtained first words according to the parts of speech to obtain the number and the occupation ratio of each part of speech word, giving part of speech weight to the corresponding first word based on the occupation ratio, and combining the word frequency and the position weight of the first word to obtain the word frequency weight;
step 43: determining an information acquisition source corresponding to the information containing the first word, thereby obtaining a corresponding information acquisition source influence weight;
step 44: the first heat of the first word is calculated according to the following formula:
H_c^t = W_tf(c, t) · W_gr(c, t) · Σ_{i ∈ I_c^t} W_src(i); wherein H_c^t denotes the first heat of the word c within the preset time period t; W_tf(c, t) denotes the word frequency weight of the word c in the corresponding first clustering result within the preset time period t; W_gr(c, t) denotes the word frequency increase rate weight of the word c in the corresponding first clustering result within the preset time period t; W_src(i) denotes the information acquisition source influence weight of the information i in the information set I_c^t; and I_c^t denotes the set of information containing the word c within the preset time period t.
7. The method of claim 1, wherein calculating the semanteme of each piece of information in the first clustering result relative to the corresponding generated first hotword set comprises:
step 51: each sentence forming the information in the first clustering result is used as a target sentence and is sequentially encoded into a first matrix formed by word vectors, and then a first hot word set generated by the corresponding first clustering result is encoded into a second matrix formed by the word vectors;
step 52: comparing the similarity of the target sentence and all the hot words in the corresponding first hot word set based on word granularity by using an interactive attention mechanism to obtain a first attention matrix;
step 53: interacting the first attention matrix with the first matrix and with the second matrix respectively to generate an attention-weighted new sentence matrix and a new set matrix;
step 54: splicing and fusing the new sentence matrix and the first matrix to obtain a first representation matrix, and splicing and fusing the new set matrix and the second matrix to obtain a second representation matrix;
step 55: respectively inputting the first representation matrix and the second representation matrix into a Transformer encoder for deep semantic encoding to obtain a first semantic feature vector and a second semantic feature vector;
step 56: after the first semantic feature vector and the second semantic feature vector are fused, performing feature weight adjustment through a fully connected network, and then obtaining the semantic degree p_i between the target sentence and the corresponding first hot word set by using a Softmax normalization function;
step 57: collecting the semantic degrees of all target sentences constituting the information with respect to the corresponding first hot word set to obtain the related semantic degree of the information and the correspondingly generated first hot word set, P = {p_1, p_2, ..., p_N}, wherein p_i denotes the semantic degree of the i-th target sentence and N denotes the total number of sentences constituting the information.
8. The method for thermal analysis of information based on big data platform according to claim 1, wherein the user behavior data consists of download count, read count and favorite count data.
CN202311337284.9A (priority date 2023-10-17, filing date 2023-10-17) - Information heat analysis method based on big data platform - Active - granted as CN117076963B (en)

Priority Applications (1)

Application Number: CN202311337284.9A (granted as CN117076963B) | Priority Date: 2023-10-17 | Filing Date: 2023-10-17 | Title: Information heat analysis method based on big data platform

Applications Claiming Priority (1)

Application Number: CN202311337284.9A (granted as CN117076963B) | Priority Date: 2023-10-17 | Filing Date: 2023-10-17 | Title: Information heat analysis method based on big data platform

Publications (2)

Publication Number Publication Date
CN117076963A (en) | 2023-11-17
CN117076963B (en) | 2024-01-02

Family

ID=88706490

Family Applications (1)

Application Number: CN202311337284.9A (Active, granted as CN117076963B) | Priority Date: 2023-10-17 | Filing Date: 2023-10-17 | Title: Information heat analysis method based on big data platform

Country Status (1)

Country Link
CN (1) CN117076963B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130046771A1 (en) * 2011-08-15 2013-02-21 Lockheed Martin Corporation Systems and methods for facilitating the gathering of open source intelligence
CN103577501A (en) * 2012-08-10 2014-02-12 深圳市世纪光速信息技术有限公司 Hot topic searching system and hot topic searching method
CN103678670A (en) * 2013-12-25 2014-03-26 福州大学 Micro-blog hot word and hot topic mining system and method
CN109635192A (en) * 2018-12-05 2019-04-16 宁波深擎信息科技有限公司 Magnanimity information temperature seniority among brothers and sisters update method and platform towards micro services
CN110162796A (en) * 2019-05-31 2019-08-23 阿里巴巴集团控股有限公司 Special Topics in Journalism creation method and device
CN113076416A (en) * 2021-03-15 2021-07-06 北京明略软件系统有限公司 Information heat evaluation method and device and electronic equipment
CN115034206A (en) * 2022-06-20 2022-09-09 科大国创云网科技有限公司 Customer service hot spot event discovery method and system
CN115510500A (en) * 2022-11-18 2022-12-23 北京国科众安科技有限公司 Sensitive analysis method and system for text content
CN115712700A (en) * 2022-11-18 2023-02-24 生态环境部环境规划院 Hot word extraction method, system, computer device and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
曹彬; 顾怡立; 谢珍真; 陈震: "A Public Opinion Monitoring System Based on Big Data Technology" (一种基于大数据技术的舆情监控系统), 信息网络安全, no. 12, pages 38-42 *

Also Published As

Publication number Publication date
CN117076963B (en) 2024-01-02

Similar Documents

Publication Publication Date Title
Orabi et al. Deep learning for depression detection of twitter users
Li et al. Sentiment analysis of danmaku videos based on naïve bayes and sentiment dictionary
Amir et al. Modelling context with user embeddings for sarcasm detection in social media
CN113435203B (en) Multi-modal named entity recognition method and device and electronic equipment
CN107291723B (en) Method and device for classifying webpage texts and method and device for identifying webpage texts
Amir et al. Quantifying mental health from social media with neural user embeddings
CN106897363B (en) Text recommendation method based on eye movement tracking
CN112667899A (en) Cold start recommendation method and device based on user interest migration and storage equipment
US8275772B2 (en) Content and quality assessment method and apparatus for quality searching
CN111914062B (en) Long text question-answer pair generation system based on keywords
Balog et al. On interpretation and measurement of soft attributes for recommendation
Torki A document descriptor using covariance of word vectors
CN112188312A (en) Method and apparatus for determining video material of news
Xie et al. Learning tfidf enhanced joint embedding for recipe-image cross-modal retrieval service
Hallac et al. user2vec: Social media user representation based on distributed document embeddings
Gupta et al. Depression detection on social media with the aid of machine learning platform: A comprehensive survey
CN116882414B (en) Automatic comment generation method and related device based on large-scale language model
CN114003726A (en) Subspace embedding-based academic thesis difference analysis method
CN117076963B (en) Information heat analysis method based on big data platform
Achilles et al. Using Surface and Semantic Features for Detecting Early Signs of Self-Harm in Social Media Postings.
Khan et al. Stress detection from Twitter posts using LDA
CN109254993B (en) Text-based character data analysis method and system
CN115510326A (en) Internet forum user interest recommendation algorithm based on text features and emotional tendency
Duane Melodic patterns and tonal cadences: Bayesian learning of cadential categories from contrapuntal information
Thomas et al. Synthesized feature space for multiclass emotion classification

Legal Events

Date Code Title Description
PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant