CN115248888A - Data identification system for searching hot words through big data - Google Patents
Data identification system for searching hot words through big data
- Publication number
- CN115248888A (application CN202210065399.6A)
- Authority
- CN
- China
- Prior art keywords
- classification
- vocabulary
- hot
- module
- big data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9532—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of data identification, and in particular to a data identification system for searching hot words through big data. The system comprises a vocabulary income unit, a classification storage unit, a recognition unit, a vocabulary classification unit and a big data analysis unit. The big data analysis unit performs classification analysis, so that the system as a whole considers whether the category to which a word belongs has a certain degree of heat, reflecting the hot topics and livelihood issues of a given period. Even if the count of an individual word is artificially inflated, the heat of its category is difficult to change, which avoids the problem of count inflation.
Description
Technical Field
The invention relates to the technical field of data identification, in particular to a data identification system for searching hot words through big data.
Background
Hot words, i.e. hot vocabulary, are a lexical phenomenon that reflects the problems and things that people in a country or region are generally concerned about during a given period. They carry the characteristics of their era and reflect the hot topics and livelihood issues of that period. Their main forms of expression include speech, text and internet images.
In the prior art:
Chinese patent application No. CN201810737959.1 discloses a method, storage medium, electronic device and system for automatically measuring headline hot words, and relates to the field of big data. Consecutive time periods are set, and the number of occurrences of every hot word to be measured is counted in each period. The occurrence counts of all hot words to be measured are summed to obtain the total number of occurrences, and the occurrence count of each hot word in each period is divided by this total to obtain the proportion (duty ratio) of that hot word for that period. A preset heat measurement algorithm then calculates the heat value of each hot word from the occurrence counts and proportions obtained in each period; the higher the proportion, the higher the heat value.
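The counting step of the cited method can be illustrated with a minimal Python sketch. The function name `period_proportions` and the input layout are hypothetical, and the sketch assumes the "total" is the number of occurrences within the same period; the cited application's preset heat measurement algorithm is not described here and is therefore not modeled.

```python
from collections import Counter

def period_proportions(period_tokens):
    """For each time period, return each word's share of that period's occurrences.

    period_tokens: one list of candidate hot words per consecutive time period.
    Assumption: the total is taken over all candidate words in the same period.
    """
    result = []
    for tokens in period_tokens:
        counts = Counter(tokens)
        total = sum(counts.values())
        # An empty period yields an empty mapping instead of dividing by zero.
        result.append({w: c / total for w, c in counts.items()} if total else {})
    return result

periods = [["a", "a", "b"], ["b", "b", "b", "c"]]
shares = period_proportions(periods)
```

Because the share depends only on raw counts, repeatedly submitting one word raises its share directly, which is the weakness the present application sets out to address.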
However, judging the heat of a hot word by its occurrence count alone is not sufficiently representative. Counts can easily be artificially inflated, in which case the measured heat no longer satisfies the definition of a hot word, which refers to matters of general public concern.
In addition, measuring by counts and proportions requires a large amount of computation, so the cost of the whole process is high, while the only result achieved is a measurement of heat.
Disclosure of Invention
The object of the present invention is to provide a data recognition system for searching hot words through big data, so as to solve the problems described in the background art.
To achieve this object, a data recognition system for searching hot words through big data is provided, comprising a vocabulary income unit, a classification storage unit, a recognition unit, a vocabulary classification unit and a big data analysis unit, wherein:
the big data analysis unit is used for performing big data classification analysis on the vocabulary of the recording end;
the vocabulary income unit is used for recording the vocabulary of the recording end, and the vocabulary classification unit is used for classifying the recorded vocabulary according to the hot word categories obtained by the analysis of the big data analysis unit;
the recognition unit is used for recognizing the hot classifications analyzed by the big data analysis unit;
and the classification storage unit stores the vocabulary recorded by the vocabulary income unit according to hot word category.
As a further improvement of the technical solution, the big data analysis unit includes a data search module, a data analysis module, and a hotword classification establishment module, wherein:
the data search module is used for retrieving vocabulary data of the Internet in real time;
the data analysis module is used for carrying out classification analysis by combining topics in the Internet;
the hot word classification establishing module establishes classification information of corresponding attributes according to the classification analysis structure of the data analysis module.
As a further improvement of the technical scheme, the topics in the data analysis module comprise hot topics and civil topics in the Internet.
As a further improvement of the technical solution, the classification storage unit establishes classification storage blocks according to hot word category, and the vocabulary recorded by the vocabulary income unit is stored in the corresponding classification storage block according to its hot word category.
As a further improvement of the technical solution, the data analysis module adopts the ID3 algorithm for classification analysis, with the following steps:
S1, calculating the information gain of each hot word attribute;
S2, selecting the attribute A with the largest information gain;
S3, placing the hot words that have the same value of attribute A into the same subset;
S4, performing the above operations recursively on the subset obtained for each value.
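For reference, the information gain used in step S1 is conventionally defined through Shannon entropy. The symbols below are notation introduced here and do not appear in the application: $D$ is the set of hot words, $p_k$ is the fraction of $D$ belonging to category $k$, and $D_v$ is the subset of $D$ whose attribute $A$ takes the value $v$.

```latex
H(D) = -\sum_{k} p_k \log_2 p_k,
\qquad
\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```

Step S2 then picks the attribute $A$ that maximizes $\mathrm{Gain}(D, A)$.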
As a further improvement of the technical solution, the big data analysis unit further includes a classification analysis module, and the classification analysis module is configured to perform heat analysis on the classification storage block to obtain a hot classification.
As a further improvement of the technical solution, the recognition unit includes a recognition factor determination module, a hot word recognition module and a hot word output module; the recognition factor determination module determines the recognition factors according to the hot classifications analyzed by the classification analysis module; the hot word recognition module is used for recognizing words in the hot classifications; the hot word output module is used for outputting hot words.
As a further improvement of the technical scheme, the hot word recognition module adopts a heat reduction algorithm for recognition, and the algorithm steps are as follows:
first, the count $x_n$ of the n-th word in the hot classification is obtained;
then, the users who recorded the word are counted, and the heat reduction operation is performed.
Compared with the prior art, the invention has the following beneficial effects:
1. In the data recognition system for searching hot words through big data, the big data analysis unit performs classification analysis, so that the system as a whole considers whether the category to which a word belongs has a certain degree of heat, reflecting the hot topics and livelihood issues of a given period. Even if the count of an individual word is artificially inflated, the heat of its category is difficult to change, which avoids the problem of count inflation.
2. In the data recognition system for searching hot words through big data, restricting recognition to hot classifications reduces the recognition workload, and heat reduction is applied based on the psychology that people's interest gradually diminishes, so that the measured heat is more authentic and representative.
Drawings
FIG. 1 is a block diagram of an integral unit module of the present invention;
FIG. 2 is a block diagram of a big data analysis unit according to one embodiment of the present invention;
FIG. 3 is a block diagram of a big data analysis unit module according to the present invention;
FIG. 4 is a block diagram of an identification unit module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The present invention provides a data recognition system for searching hot words through big data. Referring to fig. 1, the system includes a vocabulary income unit, a classification storage unit, a recognition unit, a vocabulary classification unit and a big data analysis unit, wherein:
the big data analysis unit is used for performing big data classification analysis on the vocabulary of the recording end;
the vocabulary income unit is used for recording the vocabulary of the recording end, and the vocabulary classification unit is used for classifying the recorded vocabulary according to the hot word categories obtained by the analysis of the big data analysis unit;
the recognition unit is used for recognizing the hot classifications analyzed by the big data analysis unit;
the classification storage unit stores the vocabulary recorded by the vocabulary income unit according to hot word category.
The working principle is as follows:
First, vocabulary is collected by the vocabulary income unit. Before collection, the big data analysis unit performs classification analysis to obtain the hot word categories, and the vocabulary classification unit then classifies the collected vocabulary according to these categories (the collected vocabulary can be updated in real time according to the hot topics and livelihood issues of a period). After classification, the words are not analyzed directly. Instead, the big data analysis unit analyzes the hot word categories to obtain the hot classifications; hot word categories that do not belong to a hot classification are not used as recognition factors. The recognition unit then performs recognition within the hot classifications using a heat reduction algorithm: each word has a specific count within a hot classification, but because that count may not be authentic, heat reduction is applied to words entered repeatedly by the same user. The reduction draws on the psychology that people's interest gradually diminishes, so that the resulting heat is more authentic and representative.
The specific principle is illustrated by the following examples:
Example 1
Referring to fig. 2, the big data analysis unit includes a data search module, a data analysis module and a hot word classification establishing module. The data search module retrieves internet data in real time, and the data analysis module then analyzes it in combination with hot word data in the internet data, such as hot topics and livelihood issues, specifically using the ID3 algorithm, with the following steps:
S1, calculating the information gain of each hot word attribute;
S2, selecting the attribute A with the largest information gain;
S3, placing the hot words that have the same value of attribute A into the same subset, so that the several values of A yield several subsets;
S4, performing the operation recursively on the subset obtained for each value (i.e. the tree-building algorithm). If a subset contains only a single attribute, the branch becomes a leaf node, the attribute is determined, and control returns to the point of the recursive call; the recursion also ends when the specified tree depth is reached or when all hot words in the subset belong to one attribute.
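Steps S1 to S4 can be sketched in Python as follows. This is a generic textbook-style ID3 under stated assumptions, not the applicant's implementation: the attribute names (`source`, `form`), the category labels and the `max_depth` default are hypothetical, and each hot word is assumed to be described by a dictionary of discrete attribute values.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of category labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """S1: information gain obtained by splitting on one attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def id3(rows, labels, attrs, max_depth=5):
    """Build a decision tree; leaves are category names, inner nodes are dicts."""
    # Stop when one category remains, no attributes remain, or the depth limit is hit.
    if len(set(labels)) == 1 or not attrs or max_depth == 0:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))  # S2
    remaining = [a for a in attrs if a != best]
    tree = {"attr": best, "children": {}}
    for value in set(row[best] for row in rows):  # S3: one subset per value of A
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree["children"][value] = id3(  # S4: recurse on each subset
            [rows[i] for i in idx], [labels[i] for i in idx], remaining, max_depth - 1
        )
    return tree

rows = [
    {"source": "news", "form": "text"},
    {"source": "news", "form": "image"},
    {"source": "forum", "form": "text"},
    {"source": "forum", "form": "image"},
]
labels = ["livelihood", "livelihood", "entertainment", "entertainment"]
tree = id3(rows, labels, ["source", "form"])
```

In this toy data the `source` attribute separates the two categories completely, so it is chosen at the root and both branches end in leaf nodes.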
Finally, the hot word classification establishing module establishes classification information for the corresponding attributes, and the vocabulary classification unit then establishes classification storage blocks in the classification storage unit according to this classification information. The vocabulary recorded by the vocabulary income unit is thus stored in the classification storage block of its hot word category, which makes vocabulary collection more orderly.
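A classification storage block can be pictured as one container per hot word category. The sketch below is only an in-memory illustration; the class name `ClassifiedStore` and the example categories are hypothetical, and a real system would presumably use persistent storage.

```python
class ClassifiedStore:
    """One storage block (a list) per hot word category."""

    def __init__(self, categories):
        # Blocks are created up front from the classification information.
        self.blocks = {category: [] for category in categories}

    def add(self, word, category):
        """Store a recorded word in the block of its hot word category."""
        if category not in self.blocks:
            raise KeyError(f"no storage block for category {category!r}")
        self.blocks[category].append(word)

store = ClassifiedStore(["livelihood", "entertainment"])
store.add("housing prices", "livelihood")
store.add("film festival", "entertainment")
store.add("school enrollment", "livelihood")
```

Keeping words grouped by category is what later allows heat to be analyzed per block rather than per word.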
Example 2
As shown in fig. 3, the big data analysis unit further includes a classification analysis module. After the hot word categories are determined, the classification analysis module performs heat analysis on the classification storage blocks; that is, it analyzes the heat of each hot word category over a time period, in combination with internet data, to obtain the hot classifications. The classification analysis module updates the hot classifications in real time, so they change as the heat of topics differs between time periods.
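The application does not state how the heat of a category is scored, so the following Python sketch makes a simple assumption: a category is hot in a time window if it receives at least `min_count` recorded words in that window. The function name, the `(timestamp, category)` event layout and the threshold are all hypothetical.

```python
from collections import Counter

def hot_categories(events, start, end, min_count):
    """Return the categories with at least min_count events in [start, end).

    events: iterable of (timestamp, category) pairs, one per recorded word.
    """
    counts = Counter(cat for ts, cat in events if start <= ts < end)
    return {cat for cat, n in counts.items() if n >= min_count}

events = [
    (1, "livelihood"), (2, "livelihood"), (3, "entertainment"),
    (11, "entertainment"), (12, "entertainment"),
]
first_window = hot_categories(events, 0, 10, min_count=2)
second_window = hot_categories(events, 10, 20, min_count=2)
```

Evaluating the same events over two windows gives different hot classifications, which mirrors the real-time behavior described in this example.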
Example 3
Referring to fig. 4, the recognition unit includes a recognition factor determination module, a hot word recognition module and a hot word output module. First, the recognition factor determination module determines the recognition factors according to the hot classifications analyzed by the classification analysis module; that is, the hot word recognition module recognizes only words in the hot classifications. It specifically adopts a heat reduction algorithm, with the following steps:
first, the count $x_n$ of the n-th word in the hot classification is obtained;
then, the users who recorded the word are counted. Suppose the same user has entered the word y times. A decreasing interval applies: when y < 1000, no heat reduction is performed; when y ≥ 1000, heat reduction is applied to the word count, drawing on the psychology that people's curiosity about a novel thing gradually diminishes as they encounter it more often. Specifically, for every 1000 entries, 1100 is subtracted from the total count of the word, until the count is reduced to 0. This improves the authenticity of the word count and further avoids the problem of count inflation. Finally, if $x_n$ is greater than the hot word threshold, the word is taken as a hot word, and the hot word output module outputs it.
It should be noted that the value 1000 is merely an example; the actual value is set according to the situation.
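One possible reading of the heat reduction step is sketched below in Python. The text leaves several details open, so the sketch assumes that the penalty is applied per user, that 1100 is subtracted for each full block of 1000 entries by the same user, and that the reduced count is clamped at 0. The names `reduced_count`, `hot_words`, `block` and `penalty` are hypothetical, as are the example data and threshold.

```python
def reduced_count(per_user_counts, block=1000, penalty=1100):
    """Heat-reduced count x_n of one word.

    per_user_counts: mapping from user id to the number of times (y) that
    user entered the word. Assumption: penalty is subtracted once per full
    block of entries by the same user; the result never drops below 0.
    """
    total = sum(per_user_counts.values())
    for y in per_user_counts.values():
        if y >= block:  # users below the block size are not penalized
            total -= (y // block) * penalty
    return max(total, 0)

def hot_words(category_counts, threshold, block=1000, penalty=1100):
    """Return the words of a hot classification whose reduced count exceeds threshold."""
    return {
        word
        for word, per_user in category_counts.items()
        if reduced_count(per_user, block, penalty) > threshold
    }

counts = {
    "housing prices": {"u1": 400, "u2": 350, "u3": 300},
    "spam phrase": {"bot": 3000, "u1": 10},
}
result = hot_words(counts, threshold=500)
```

Under these assumptions a word entered by many ordinary users keeps its full count, while a word inflated by a single account loses more than that account contributed.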
The foregoing shows and describes the general principles, essential features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely present preferred embodiments and are not intended to limit the invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (8)
1. A data recognition system for searching hot words through big data, characterized by comprising a vocabulary income unit, a classification storage unit, a recognition unit, a vocabulary classification unit and a big data analysis unit, wherein:
the big data analysis unit is used for performing big data classification analysis on the vocabulary of the recording end;
the vocabulary income unit is used for recording the vocabulary of the recording end, and the vocabulary classification unit is used for classifying the recorded vocabulary according to the hot word categories obtained by the analysis of the big data analysis unit;
the recognition unit is used for recognizing the hot classifications analyzed by the big data analysis unit;
and the classification storage unit stores the vocabulary recorded by the vocabulary income unit according to hot word category.
2. The data recognition system for searching for hotwords by big data according to claim 1, wherein: the big data analysis unit comprises a data search module, a data analysis module and a hot word classification building module, wherein:
the data search module is used for retrieving vocabulary data of the Internet in real time;
the data analysis module is used for carrying out classification analysis by combining topics in the Internet;
the hot word classification establishing module establishes classification information of corresponding attributes according to the classification analysis structure of the data analysis module.
3. The data recognition system for searching for hotwords by big data according to claim 2, wherein: topics in the data analysis module comprise hot topics and civil topics in the Internet.
4. The data recognition system for searching for hotwords by big data according to claim 3, wherein: the classification storage unit establishes classification storage blocks according to hot word category, and the vocabulary recorded by the vocabulary income unit is stored in the corresponding classification storage block according to its hot word category.
5. The data recognition system for searching for hotwords by big data according to claim 4, wherein: the data analysis module adopts the ID3 algorithm for classification analysis, with the following steps:
S1, calculating the information gain of each hot word attribute;
S2, selecting the attribute A with the largest information gain;
S3, placing the hot words that have the same value of attribute A into the same subset;
S4, performing the above operations recursively on the subset obtained for each value.
6. The data recognition system for searching for hotwords by big data according to claim 5, wherein: the big data analysis unit further comprises a classification analysis module, and the classification analysis module is used for carrying out heat analysis on the classification storage blocks to obtain hot classifications.
7. The data recognition system for searching for hotwords by big data of claim 6, wherein: the recognition unit comprises a recognition factor determination module, a hot word recognition module and a hot word output module; the recognition factor determination module determines the recognition factors according to the hot classifications analyzed by the classification analysis module; the hot word recognition module is used for recognizing words in the hot classifications; the hot word output module is used for outputting hot words.
8. The data recognition system for searching for hotwords by big data of claim 7, wherein: the hot word recognition module adopts a heat reduction algorithm for recognition, with the following steps:
first, the count $x_n$ of the n-th word in the hot classification is obtained;
then, the users who recorded the word are counted, and the heat reduction operation is performed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210065399.6A CN115248888A (en) | 2022-01-19 | 2022-01-19 | Data identification system for searching hot words through big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210065399.6A CN115248888A (en) | 2022-01-19 | 2022-01-19 | Data identification system for searching hot words through big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115248888A (en) | 2022-10-28 |
Family
ID=83698346
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210065399.6A Pending CN115248888A (en) | 2022-01-19 | 2022-01-19 | Data identification system for searching hot words through big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115248888A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101923544A (en) * | 2009-06-15 | 2010-12-22 | 北京百分通联传媒技术有限公司 | Method for monitoring and displaying Internet hot spots |
CN106503233A (en) * | 2016-11-03 | 2017-03-15 | 北京挖玖电子商务有限公司 | Top search term commending system |
CN111523041A (en) * | 2020-04-30 | 2020-08-11 | 掌阅科技股份有限公司 | Recommendation method of heat data, computing device and computer storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117371436A (en) * | 2023-10-09 | 2024-01-09 | 北京睿企信息科技有限公司 | Hot word acquisition system with incremental heat |
CN117371436B (en) * | 2023-10-09 | 2024-04-12 | 北京睿企信息科技有限公司 | Hot word acquisition system with incremental heat |
Similar Documents
Publication | Title |
---|---|
CN109299271B (en) | Training sample generation method, text data method, public opinion event classification method and related equipment |
CN108932945B (en) | Voice instruction processing method and device |
WO2021073116A1 (en) | Method and apparatus for generating legal document, device and storage medium |
CN110543564B (en) | Domain label acquisition method based on topic model |
CN111008337B (en) | Deep attention rumor identification method and device based on ternary characteristics |
WO2002025479A1 (en) | A document categorisation system |
CN111414461A (en) | Intelligent question-answering method and system fusing knowledge base and user modeling |
CN110826618A (en) | Personal credit risk assessment method based on random forest |
CN116150651A (en) | AI-based depth synthesis detection method and system |
CN115248888A (en) | Data identification system for searching hot words through big data |
CN117131345A (en) | Multi-source data parameter evaluation method based on data deep learning calculation |
CN112100341B (en) | Intelligent question classification and recommendation method for rapid expressive force test |
CN113282641A (en) | Webpage search data information intelligent classification management method and system based on user behavior deep analysis and computer storage medium |
CN114943285B (en) | Intelligent auditing system for internet news content data |
Leng et al. | Audio scene recognition based on audio events and topic model |
CN114490951B (en) | Multi-label text classification method and model |
CN114443930A (en) | News public opinion intelligent monitoring and analyzing method, system and computer storage medium |
CN113158669B (en) | Method and system for identifying positive and negative comments of employment platform |
CN110119465B (en) | Mobile phone application user preference retrieval method integrating LFM potential factors and SVD |
Zhong et al. | Gender recognition of speech based on decision tree model |
CN113177164A (en) | Multi-platform collaborative new media content monitoring and management system based on big data |
CN114339859B (en) | Method and device for identifying WiFi potential users of full-house wireless network and electronic equipment |
CN111353297B (en) | Biomedical literature topic extraction method based on field topic interaction density |
CN116823069B (en) | Intelligent customer service quality inspection method based on text analysis and related equipment |
CN111723223B (en) | Multi-label image retrieval method based on subject inference |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |