CN115248888A - Data identification system for searching hot words through big data - Google Patents


Info

Publication number: CN115248888A
Authority: CN (China)
Prior art keywords: classification, vocabulary, hot, module, big data
Legal status: Pending
Application number: CN202210065399.6A
Other languages: Chinese (zh)
Inventor: 白艳
Current Assignee: Xijing University
Original Assignee: Xijing University
Priority date: 2022-01-19
Filing date: 2022-01-19
Publication date: 2022-10-28
Application filed by Xijing University
Priority to CN202210065399.6A
Publication of CN115248888A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9532: Query formulation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention relates to the technical field of data identification, and in particular to a data identification system for searching for hot words through big data. The system comprises a vocabulary intake unit, a classification storage unit, a recognition unit, a vocabulary classification unit and a big data analysis unit. Because the big data analysis unit performs classification analysis, the system as a whole considers whether the category to which a word belongs is itself popular, so that the identified hot words reflect the hot topics and livelihood issues of a given period. Even if the count of an individual word is artificially inflated, the heat of its category is difficult to change, which avoids the count-inflation problem.

Description

Data identification system for searching hot words through big data
Technical Field
The invention relates to the technical field of data identification, in particular to a data identification system for searching hot words through big data.
Background
Hot words are popular vocabulary items. As a lexical phenomenon, they reflect the issues and things that people in a country or region are generally concerned about during a given period. They are characteristic of their era and reflect the hot topics and livelihood issues of that period. Their main forms of expression include spoken language, written text and internet images.
In the prior art:
Chinese patent application No. CN201810737959.1 discloses a method, storage medium, electronic device and system for automatically measuring hot words in titles, in the field of big data. Consecutive time periods are set, and the number of occurrences of every hot word to be measured is counted in each period. The occurrence counts of all hot words to be measured are summed to obtain a total, and the count of each hot word in each period is divided by that total to obtain the word's share (proportion) for the period. A preset heat measurement algorithm then computes the heat value of each hot word from its per-period count and share; the higher the share, the higher the heat value.
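The counting and share step of this prior-art method can be illustrated with a small sketch. It assumes that the total is taken over all hot words within the same period, which is one reading of the description; the function name `shares_by_period` and the sample words are hypothetical, and the "preset heat measurement algorithm" itself is not specified in the source, so it is not modelled here.

```python
def shares_by_period(counts):
    """Return each word's share of all occurrences within each period.

    counts maps period -> {word: occurrence count}.
    Assumption: the total is taken per period, over all words measured.
    """
    shares = {}
    for period, word_counts in counts.items():
        total = sum(word_counts.values())
        shares[period] = {
            word: (n / total if total else 0.0)
            for word, n in word_counts.items()
        }
    return shares


# Hypothetical sample data: two periods, two words.
counts = {
    "week1": {"rent": 30, "meme": 10},
    "week2": {"rent": 5, "meme": 15},
}
shares = shares_by_period(counts)
```

Because the share depends only on raw counts, inflating one word's count directly raises its share, which is the weakness discussed next.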
However, judging the heat of a hot word by occurrence counts alone is not sufficiently representative, because counts can easily be artificially inflated. The definition of a hot word refers to issues and things that are "generally of concern", and an inflated count does not meet that definition.
Moreover, measuring heat from counts and shares requires a large amount of computation, so the whole process is costly, yet all it achieves is a heat measurement.
Disclosure of Invention
The present invention aims to provide a data identification system for searching for hot words through big data, so as to solve the problems identified in the background above.
To achieve this object, a data identification system for searching for hot words through big data is provided, comprising a vocabulary intake unit, a classification storage unit, a recognition unit, a vocabulary classification unit and a big data analysis unit, wherein:
the big data analysis unit performs big data classification analysis on the vocabulary from the input end;
the vocabulary intake unit records the vocabulary from the input end, and the vocabulary classification unit classifies the recorded vocabulary according to the hot-word categories obtained by the big data analysis unit;
the recognition unit performs recognition on the popular categories obtained by the big data analysis unit;
and the classification storage unit stores the vocabulary recorded by the vocabulary intake unit according to hot-word category.
As a further improvement of the technical solution, the big data analysis unit includes a data search module, a data analysis module and a hot-word classification establishing module, wherein:
the data search module retrieves vocabulary data from the Internet in real time;
the data analysis module performs classification analysis in combination with topics on the Internet;
the hot-word classification establishing module establishes classification information for the corresponding attributes according to the classification analysis result of the data analysis module.
As a further improvement of the technical solution, the topics used by the data analysis module comprise hot topics and livelihood topics on the Internet.
As a further improvement of the technical solution, the classification storage unit establishes classification storage blocks according to hot-word category, and the vocabulary recorded by the vocabulary intake unit is stored in the storage block corresponding to its hot-word category.
As a further improvement of the technical solution, the data analysis module uses the ID3 algorithm for classification analysis, with the following steps:
S1: calculate the information gain of each hot-word attribute;
S2: select the attribute A with the largest information gain;
S3: place hot words that take the same value for A into the same subset;
S4: recurse on the subset for each value.
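For reference, the information gain used in steps S1 and S2 is conventionally defined in ID3 through entropy. The notation below is assumed here and does not come from the source: S is the set of hot words, C the set of categories, p_c the fraction of S in category c, and S_v the subset of S whose value for attribute A is v.

```latex
H(S) = -\sum_{c \in C} p_c \log_2 p_c,
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

Step S2 then selects the attribute A that maximizes Gain(S, A).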
As a further improvement of the technical solution, the big data analysis unit further includes a classification analysis module, which performs heat analysis on the classification storage blocks to obtain the popular categories.
As a further improvement of the technical solution, the recognition unit includes a recognition factor determination module, a hot-word recognition module and a hot-word output module; the recognition factor determination module determines the recognition factors from the popular categories obtained by the classification analysis module; the hot-word recognition module recognizes words in the popular categories; the hot-word output module outputs the hot words.
As a further improvement of the technical solution, the hot-word recognition module uses a heat-reduction algorithm for recognition, with the following steps:
first, the count x_n of the nth word in the popular category is obtained;
then, the users who entered the word are counted, and the heat-reduction operation is performed.
Compared with the prior art, the invention has the following beneficial effects:
1. In this data identification system for searching for hot words through big data, the big data analysis unit performs classification analysis, so the system as a whole considers whether the category to which a word belongs is itself popular, and the result reflects the hot topics and livelihood issues of a given period. Even if a word's count is artificially inflated, the heat of its category is difficult to change, which avoids the count-inflation problem.
2. In this data identification system for searching for hot words through big data, restricting recognition to popular categories reduces the recognition workload, and heat reduction, which draws on the psychology that people's interest gradually declines, makes the resulting heat more authentic and representative.
Drawings
FIG. 1 is a block diagram of the overall units of the present invention;
FIG. 2 is a block diagram of a big data analysis unit according to one embodiment of the present invention;
FIG. 3 is a block diagram of a big data analysis unit module according to the present invention;
FIG. 4 is a block diagram of an identification unit module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The present invention provides a data identification system for searching for hot words through big data. Referring to FIG. 1, the system includes a vocabulary intake unit, a classification storage unit, a recognition unit, a vocabulary classification unit and a big data analysis unit, wherein:
the big data analysis unit performs big data classification analysis on the vocabulary from the input end;
the vocabulary intake unit records the vocabulary from the input end, and the vocabulary classification unit classifies the recorded vocabulary according to the hot-word categories obtained by the big data analysis unit;
the recognition unit performs recognition on the popular categories obtained by the big data analysis unit;
the classification storage unit stores the vocabulary recorded by the vocabulary intake unit according to hot-word category.
The working principle is as follows:
Vocabulary is first collected by the vocabulary intake unit. Before intake, the big data analysis unit performs classification analysis to obtain the hot-word categories (updated in real time according to the hot topics and livelihood issues of the period), and the vocabulary classification unit then classifies the collected vocabulary by those categories. After classification, the words are not analyzed directly. Instead, the big data analysis unit analyzes the hot-word categories to obtain the popular categories, and hot-word categories that are not popular are not used as recognition factors. The recognition unit then performs recognition within the popular categories using a heat-reduction algorithm: each word has a specific count within a popular category, but because that count may not be authentic, heat reduction is applied to words entered repeatedly by the same user. The reduction draws on the psychology that people's interest gradually declines, which makes the resulting heat more authentic and representative.
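The flow described above can be sketched end to end under simplifying assumptions. The names `run_pipeline`, `categorize` and `is_popular` are hypothetical, the popularity test is passed in as a plain function, and the per-user heat reduction is left out here to keep the sketch short.

```python
from collections import defaultdict


def run_pipeline(entries, categorize, is_popular, hot_threshold):
    """Return the hot words found in popular categories, sorted by name.

    entries: iterable of (user, word) pairs from the input end.
    categorize: word -> hot-word category (stands in for the classification unit).
    is_popular: (category, block) -> bool (stands in for the heat analysis).
    """
    # Classification storage unit: one storage block per hot-word category.
    blocks = defaultdict(list)
    for user, word in entries:
        blocks[categorize(word)].append((user, word))

    hot_words = []
    for category, block in blocks.items():
        if not is_popular(category, block):
            continue  # non-popular categories are not used as recognition factors
        counts = defaultdict(int)
        for _user, word in block:
            counts[word] += 1
        hot_words.extend(w for w, n in counts.items() if n > hot_threshold)
    return sorted(hot_words)


# Hypothetical sample data.
entries = [("u1", "rent"), ("u2", "rent"), ("u3", "wages"), ("u1", "meme")]
categories = {"rent": "livelihood", "wages": "livelihood", "meme": "entertainment"}
result = run_pipeline(
    entries,
    categorize=categories.get,
    is_popular=lambda category, block: len(block) >= 3,
    hot_threshold=1,
)
```

Only the "livelihood" block is examined word by word in this example, which illustrates how skipping non-popular categories reduces the recognition workload.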
The specific principle is illustrated by the following examples:
Example 1
Referring to FIG. 2, the big data analysis unit includes a data search module, a data analysis module and a hot-word classification establishing module. The data search module retrieves Internet data in real time, and the data analysis module then analyzes it in combination with hot-word data, such as the hot topics and livelihood issues found in the Internet data, specifically using the ID3 algorithm, whose steps are as follows:
S1: calculate the information gain of each hot-word attribute;
S2: select the attribute A with the largest information gain;
S3: place hot words that take the same value for A into the same subset, so that the several values of A yield several subsets;
S4: recurse on the subset for each value (that is, the tree-building algorithm). If a subset contains only a single attribute, the branch becomes a leaf node, the attribute is decided, and control returns to the point of the recursive call. The recursion also ends when the tree reaches the specified depth or when all hot words in the subset belong to one attribute.
Finally, the hot-word classification establishing module establishes classification information for the corresponding attributes, and the hot-word classification unit then creates classification storage blocks in the classification storage unit according to that information. The vocabulary recorded by the vocabulary intake unit is stored in the storage block for its hot-word category, which keeps vocabulary intake orderly.
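Steps S1 to S4 of Example 1 can be sketched as a small ID3 tree builder. This is a minimal illustration, not the patent's implementation: the attribute names (`source`, `scope`), the category labels and the function names are hypothetical, and ties or empty attribute lists are resolved by a majority vote.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """S1: entropy of the labels minus the weighted entropy after splitting on attr."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def id3(rows, labels, attrs, max_depth=5):
    """Build a nested-dict decision tree; leaves are category labels."""
    # Stop: one category left, no attributes left, or the depth limit reached.
    if len(set(labels)) == 1 or not attrs or max_depth == 0:
        return Counter(labels).most_common(1)[0][0]
    # S2: pick the attribute with the largest information gain.
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    # S3: group hot words that share the same value of the chosen attribute.
    groups = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[best], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    # S4: recurse on each subset.
    return {best: {value: id3(r, l, remaining, max_depth - 1)
                   for value, (r, l) in groups.items()}}


# Hypothetical hot-word attributes and categories.
rows = [
    {"source": "news", "scope": "national"},
    {"source": "news", "scope": "local"},
    {"source": "forum", "scope": "national"},
    {"source": "forum", "scope": "local"},
]
labels = ["livelihood", "livelihood", "entertainment", "entertainment"]
tree = id3(rows, labels, ["source", "scope"])
```

In this sample, `source` separates the two categories perfectly while `scope` carries no information, so the tree splits once on `source` and stops.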
Example 2
As shown in FIG. 3, the big data analysis unit further includes a classification analysis module. After the hot-word categories are determined, the classification analysis module performs heat analysis on the classification storage blocks; that is, it analyzes the heat of each hot-word category over a time period, in combination with Internet data, to obtain the popular categories. The classification analysis module determines the popular categories in real time, so the popular categories change as the heat of topics differs between time periods.
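A time-windowed ranking of popular ("hot") categories might look like the sketch below. The source does not state how category heat is measured, so this assumes it is simply the number of entries per category inside the window; `popular_categories`, `top_k` and the sample records are hypothetical.

```python
from collections import Counter
from datetime import datetime


def popular_categories(records, start, end, top_k=1):
    """Return the top_k categories by entry count within [start, end).

    records: iterable of (category, timestamp) pairs.
    Assumption: heat is measured as the raw entry count in the window.
    """
    counts = Counter(cat for cat, ts in records if start <= ts < end)
    return [cat for cat, _ in counts.most_common(top_k)]


# Hypothetical sample data: the popular category changes between months.
records = [
    ("livelihood", datetime(2022, 1, 5)),
    ("livelihood", datetime(2022, 1, 20)),
    ("entertainment", datetime(2022, 1, 21)),
    ("entertainment", datetime(2022, 2, 3)),
    ("entertainment", datetime(2022, 2, 10)),
    ("livelihood", datetime(2022, 2, 11)),
]
january = popular_categories(records, datetime(2022, 1, 1), datetime(2022, 2, 1))
february = popular_categories(records, datetime(2022, 2, 1), datetime(2022, 3, 1))
```

Re-running the function on a sliding window is one way to obtain the real-time behavior the example describes.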
Example 3
Referring to FIG. 4, the recognition unit includes a recognition factor determination module, a hot-word recognition module and a hot-word output module. The recognition factor determination module first determines the recognition factors from the popular categories obtained by the classification analysis module; that is, the hot-word recognition module recognizes only words in the popular categories. It specifically uses a heat-reduction algorithm, whose steps are as follows:
First, the count x_n of the nth word in the popular category is obtained.
Then, the users who entered the word are counted. Suppose the same user entered the word y times; a reduction band applies to that user. If y < 1000, no heat reduction is applied. If y ≥ 1000, heat reduction is applied to that user's entries, drawing on the psychology that curiosity about something new fades as people encounter it more often: for every 1000 entries, 1100 is subtracted from the word's total count, until the count is reduced to 0. This makes the word count more authentic and further avoids the count-inflation problem. Finally, if x_n is greater than the hot-word threshold, the word is taken as a hot word, and the hot-word output module outputs it.
It should be noted that the value 1000 is only an example; the actual value is set according to circumstances.
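One possible reading of the heat-reduction step is sketched below. The description is ambiguous about exactly how the subtraction is applied, so this sketch assumes that each user who entered the word at least `band` times causes `penalty` to be subtracted once for every complete band of their entries, with the result floored at 0. The defaults 1000 and 1100 are the example figures from the text; the function and parameter names are hypothetical.

```python
def reduced_count(entries_per_user, band=1000, penalty=1100):
    """Return a word's count after per-user heat reduction.

    entries_per_user maps user -> number of times that user entered the word.
    Assumption: users below `band` entries are left untouched; for others,
    `penalty` is subtracted once per complete band, and the count never
    drops below 0.
    """
    total = sum(entries_per_user.values())
    for y in entries_per_user.values():
        if y >= band:
            total -= (y // band) * penalty
    return max(total, 0)


def is_hot_word(entries_per_user, hot_threshold, band=1000, penalty=1100):
    """A word is hot when its reduced count exceeds the hot-word threshold."""
    return reduced_count(entries_per_user, band, penalty) > hot_threshold


# Hypothetical sample data: one heavy user and two ordinary users.
organic = {"u1": 999, "u2": 500}
inflated = {"u1": 2500, "u2": 300, "u3": 10}
```

With these figures, a word entered 2500 times by a single user ends up with a lower reduced count than a word entered 1499 times by two ordinary users, which is the intended effect of discounting repeated entries from one source.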
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and the preferred embodiments of the present invention are described in the above embodiments and the description, and are not intended to limit the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (8)

1. A data identification system for searching for hot words through big data, characterized by comprising a vocabulary intake unit, a classification storage unit, a recognition unit, a vocabulary classification unit and a big data analysis unit, wherein:
the big data analysis unit performs big data classification analysis on the vocabulary from the input end;
the vocabulary intake unit records the vocabulary from the input end, and the vocabulary classification unit classifies the recorded vocabulary according to the hot-word categories obtained by the big data analysis unit;
the recognition unit performs recognition on the popular categories obtained by the big data analysis unit;
and the classification storage unit stores the vocabulary recorded by the vocabulary intake unit according to hot-word category.
2. The data identification system for searching for hot words through big data according to claim 1, wherein the big data analysis unit comprises a data search module, a data analysis module and a hot-word classification establishing module, wherein:
the data search module retrieves vocabulary data from the Internet in real time;
the data analysis module performs classification analysis in combination with topics on the Internet;
the hot-word classification establishing module establishes classification information for the corresponding attributes according to the classification analysis result of the data analysis module.
3. The data identification system for searching for hot words through big data according to claim 2, wherein the topics used by the data analysis module comprise hot topics and livelihood topics on the Internet.
4. The data identification system for searching for hot words through big data according to claim 3, wherein the classification storage unit establishes classification storage blocks according to hot-word category, and the vocabulary recorded by the vocabulary intake unit is stored in the storage block corresponding to its hot-word category.
5. The data identification system for searching for hot words through big data according to claim 4, wherein the data analysis module uses the ID3 algorithm for classification analysis, with the following steps:
S1: calculate the information gain of each hot-word attribute;
S2: select the attribute A with the largest information gain;
S3: place hot words that take the same value for A into the same subset;
S4: recurse on the subset for each value.
6. The data identification system for searching for hot words through big data according to claim 5, wherein the big data analysis unit further comprises a classification analysis module, which performs heat analysis on the classification storage blocks to obtain the popular categories.
7. The data identification system for searching for hot words through big data according to claim 6, wherein the recognition unit comprises a recognition factor determination module, a hot-word recognition module and a hot-word output module; the recognition factor determination module determines the recognition factors from the popular categories obtained by the classification analysis module; the hot-word recognition module recognizes words in the popular categories; the hot-word output module outputs the hot words.
8. The data identification system for searching for hot words through big data according to claim 7, wherein the hot-word recognition module uses a heat-reduction algorithm for recognition, with the following steps:
first, the count x_n of the nth word in the popular category is obtained;
then, the users who entered the word are counted, and the heat-reduction operation is performed.
CN202210065399.6A 2022-01-19 2022-01-19 Data identification system for searching hot words through big data Pending CN115248888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210065399.6A CN115248888A (en) 2022-01-19 2022-01-19 Data identification system for searching hot words through big data

Publications (1)

Publication Number Publication Date
CN115248888A true CN115248888A (en) 2022-10-28

Family

ID=83698346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210065399.6A Pending CN115248888A (en) 2022-01-19 2022-01-19 Data identification system for searching hot words through big data

Country Status (1)

Country Link
CN (1) CN115248888A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923544A (en) * 2009-06-15 2010-12-22 北京百分通联传媒技术有限公司 Method for monitoring and displaying Internet hot spots
CN106503233A (en) * 2016-11-03 2017-03-15 北京挖玖电子商务有限公司 Top search term commending system
CN111523041A (en) * 2020-04-30 2020-08-11 掌阅科技股份有限公司 Recommendation method of heat data, computing device and computer storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117371436A (en) * 2023-10-09 2024-01-09 北京睿企信息科技有限公司 Hot word acquisition system with incremental heat
CN117371436B (en) * 2023-10-09 2024-04-12 北京睿企信息科技有限公司 Hot word acquisition system with incremental heat

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination