CN115455269A - Article heat analysis method and device, data processing architecture and analysis system - Google Patents

Publication number
CN115455269A
Authority
CN
China
Prior art keywords
article
analysis
popularity
heat
data source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211006785.4A
Other languages
Chinese (zh)
Other versions
CN115455269B (en)
Inventor
吴钟健
乔素林
唐雪
蔡华
先树森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huayun Tianxia Nanjing Technology Co ltd
Original Assignee
Huayun Tianxia Nanjing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huayun Tianxia Nanjing Technology Co ltd
Priority to CN202211006785.4A
Publication of CN115455269A
Application granted
Publication of CN115455269B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/258 Data format conversion from or to a database
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to an article heat (popularity) analysis method, an article heat analysis device, a data processing architecture, and an analysis system. The method supports configurable retrieval dimensions and propagates heat between associated articles. It automatically collects, in a personalized manner, website news, paper publications, and digital-format media articles from the industry and related fields, picks up fresh hotspots, and analyzes trend dynamics. It provides data-driven think-tank research reports for office personnel engaged in industry information analysis and in internal and external corporate publicity work at various enterprises, realizes hotspot analysis tailored to each industry and each user together with news aggregation for new events, and provides data-intelligence support for industry information collection and accurate manuscript reduction.

Description

Article heat analysis method and device, data processing architecture and analysis system
Technical Field
The present disclosure relates to the field of text processing technologies, and in particular, to an article popularity analysis method, an article popularity analysis device, a data processing architecture, and an article popularity analysis system.
Background
Industry information analysis and internal and external corporate publicity work at various enterprises require the relevant office personnel to read, confirm, and give feedback on a large volume of articles published online within their system. The articles to be read, classified, and judged are scattered across multiple intranet systems and portal websites, each judgment relies on personal subjective assessment, and the ability to identify hot articles needs improvement. When hot articles are screened, it is difficult to trace the full context of a hot event, to accurately determine the source of the event, or to judge which articles are the important ones under that event; identifying and flagging such articles by manual effort alone is hard to do effectively.
In the prior art, the identification of hot articles mainly depends on general indicators such as click rate, number of references, and the overall traffic of the external web portal. There is no personalized solution that addresses the subjective requirements of article selection (post, industry, timeliness) or association-based heat recommendation along the event context. What enterprises need is an information search and extraction method that is tailored to each industry and each user and whose dimensions are extensible.
Disclosure of Invention
In order to solve the above problems, the present application provides an article popularity analysis method, an article popularity analysis device, a data processing architecture and an analysis system.
In one aspect, the application provides an article popularity analysis method, which comprises the following steps:
collecting and extracting online article data sources through a data collector;
carrying out format conversion and preprocessing on the extracted article data source to obtain a preprocessed data source;
and carrying out heat analysis on the preprocessed data source according to a preset analysis algorithm by utilizing an operator module in the analysis controller to obtain heat value ranking information of each article.
As an optional embodiment of the present application, the format conversion of the extracted article data source includes:
presetting a standardized raw conversion format;
based on the preset standardized raw conversion format, carrying out format conversion on the extracted article data source to obtain a standardized and converted done file;
detecting whether the done file is successfully converted in a standardized way:
if yes, outputting the done file;
otherwise, entering the format conversion step again.
As an optional implementation of the present application, performing heat analysis on the preprocessed data source according to a preset analysis algorithm, by using an operator module in the analysis controller, to obtain heat ranking information of each article includes:
acquiring the preprocessed data source;
performing word-level dimension analysis on the preprocessed data source by using an operator module configured in the analysis controller, and respectively calculating the heat information of the basic hot words and the new words;
and adjusting the heat ranking of the basic hot words and the new words by using a change rate operator, and outputting corresponding heat values.
As an optional implementation of the present application, performing heat analysis on the preprocessed data source according to a preset analysis algorithm, by using an operator module in the analysis controller, to obtain heat ranking information of each article further includes:
collecting the heat values of the basic hot words and the new words;
calculating the heat value of each collected article by a statistical analysis method, that is, performing paragraph article analysis to obtain the real-time heat ranking information of the article;
the heat value calculation formula is as follows:
[Formula image BDA0003809179080000021: heat value calculation formula; not reproduced in this text]
wherein lev is the article level weight; a is the exposure operator, calculated from the access amount and reference amount indicators; k and j are the indices of the basic hot words and new words contained in the article, respectively; loc is the position weight operator; poc is the part-of-speech weight; and coo is a composite operator that is added after all articles have been analyzed for heat: for articles whose hot words satisfy the co-occurrence operator, the heat values of the other articles are propagated and accumulated. The specific algorithm is as follows:
[Formula image BDA0003809179080000031: algorithm for the coo operator; not reproduced in this text]
As an optional implementation of the present application, performing heat analysis on the preprocessed data source according to a preset analysis algorithm, by using an operator module in the analysis controller, to obtain heat ranking information of each article further includes:
presetting an encoding rule;
based on the encoding rule, performing one-hot encoding on the article by using the basic hot words to construct a label vector corresponding to the article;
and inputting the label vector into a preset xgboost algorithm model, performing model classification prediction, and outputting by regression the predicted score grade of the article.
As an optional implementation of the present application, performing heat analysis on the preprocessed data source according to a preset analysis algorithm, by using an operator module in the analysis controller, to obtain heat ranking information of each article further includes:
presetting a heat value weighting adjustment formula;
according to the heat value weighting adjustment formula, carrying out weighting adjustment on the real-time heat of the article by utilizing the predicted score grade;
and after adjustment, outputting and displaying the heat analysis result and ranking information of each article.
In another aspect of the present application, a device for implementing the article popularity analysis method is provided, including:
the data acquisition module is used for collecting and extracting an online article data source through the data collector;
the format conversion module is used for carrying out format conversion on the extracted article data source and carrying out preprocessing on the article data source to obtain a preprocessed data source;
and the heat analysis module is used for carrying out heat analysis on the preprocessed data source according to a preset analysis algorithm by utilizing an operator module in the analysis controller to obtain heat value ranking information of each article.
In another aspect of the present application, a data processing architecture is further provided for executing the article popularity analysis method, including:
the data collector is used for collecting and extracting the online article data source;
the operator module, through which the analysis controller completes the processing, conversion, and extraction of the data to obtain a preprocessed data source;
and the analysis controller is used for performing word-level dimension analysis, paragraph article analysis and model classification prediction on the preprocessed data source by adopting the operator module to obtain the popularity value ranking information of each article.
In another aspect of the present application, an analysis system is further provided, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the article heat analysis method when executing the executable instructions.
The invention has the technical effects that:
Based on the implementation of the application, online article data sources are collected and extracted through a data collector; the extracted article data source undergoes format conversion and preprocessing to obtain a preprocessed data source; and an operator module in the analysis controller performs heat analysis on the preprocessed data source according to a preset analysis algorithm to obtain heat value ranking information of each article. This yields a personalized and accurate article popularity analysis method with configurable retrieval dimensions and association-based heat propagation, which supports enterprises in carrying out data-intelligent industry information analysis and internal and external corporate publicity work. The method integrates traditional heat analysis indicators as the basic labels of articles, thereby providing analysis capability consistent with existing methods, and also supplies initial dimension material for subsequent extension and optimization of the method.
The method automatically collects, in a personalized manner, website news, paper publications, and digital-format media articles from the industry and related fields, picks up fresh hotspots, and analyzes trend dynamics. It provides data-driven think-tank research reports for office personnel engaged in industry information analysis and in internal and external corporate publicity work at various enterprises, realizes hotspot analysis tailored to each industry and each user together with news aggregation for new events, and provides data-intelligence support for industry information collection and accurate manuscript reduction.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic flow chart of an embodiment of the method for analyzing the popularity of the article of the present invention;
FIG. 2 illustrates a timing diagram for an implementation of the present invention;
FIG. 3 is a block diagram of the data processing architecture of the present invention.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Example 1
This embodiment provides a personalized and accurate article popularity analysis method with configurable retrieval dimensions and association-based heat propagation, which supports enterprises in carrying out data-intelligent industry information analysis and internal and external corporate publicity work. The method integrates traditional heat analysis indicators as the basic labels of articles, thereby providing analysis capability consistent with existing methods; it also supplies initial dimension material for subsequent extension and optimization of the method, and offers the industry a solution for information search and extraction tailored to each industry and each user.
As shown in FIG. 1, in one aspect, the present application provides an article popularity analysis method, including the following steps:
S1, collecting and extracting an online article data source through a data collector;
Hot articles are acquired mainly through the data collector, and the way articles are acquired depends on the type of data collector. In this embodiment, the data collector may be configured to extract, driven by periodic task scheduling, structured data, CEB-format documents, OCR-scanned paper documents, and internet online article data sources through a program API, a file transfer service, a web crawler, and the like. This embodiment does not limit the type of article; the types of articles to be collected can be set by the user.
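The collection step can be pictured as several interchangeable collectors behind one interface. The following is a minimal sketch only: the `Collector` protocol, the `InMemoryCollector` stand-in, and `run_collection` are hypothetical names introduced for illustration, and a real deployment would replace the stand-in with API, file-transfer, or crawler clients triggered by the periodic task.

```python
from typing import Iterable, Protocol


class Collector(Protocol):
    """Hypothetical interface: any source that can yield article payloads."""

    def collect(self) -> Iterable[dict]: ...


class InMemoryCollector:
    """Stand-in for an API, file-transfer, or crawler collector (illustrative only)."""

    def __init__(self, source_name: str, articles: list[dict]):
        self.source_name = source_name
        self.articles = articles

    def collect(self) -> Iterable[dict]:
        for article in self.articles:
            # Tag each payload with the collector it came from.
            yield {"source": self.source_name, **article}


def run_collection(collectors: list[Collector]) -> list[dict]:
    """One scheduled run: pull from every configured collector in turn."""
    collected: list[dict] = []
    for collector in collectors:
        collected.extend(collector.collect())
    return collected


api = InMemoryCollector("api", [{"title": "Grid upgrade"}])
crawler = InMemoryCollector("crawler", [{"title": "Market report"}, {"title": "Policy note"}])
batch = run_collection([api, crawler])
print(len(batch), batch[0]["source"])
```

Keeping the collectors behind one interface is one way to let the user choose which article types are collected without touching the later conversion and analysis steps.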
S2, carrying out format conversion and preprocessing on the extracted article data source to obtain a preprocessed data source;
After the hot articles are acquired, the data collector converts the extracted heterogeneous article data into a standardized raw format.
FIG. 2 shows a timing diagram of the method.
Article collection is started by a timed task. After the user sets a timed task for collecting hot articles, the system executes article collection at the scheduled times.
As an optional embodiment of the present application, the format conversion of the extracted article data source includes:
presetting a standardized raw conversion format; because the collected articles are of different types, they must be converted uniformly, so the conversion format is preset. The raw conversion format is preferred in this embodiment.
Based on the preset standardized raw conversion format, carrying out format conversion on the extracted article data source to obtain a standardized and converted done file; specifically, the data collector converts the extracted heterogeneous data into the standardized raw format to obtain a done file (a file in the standard format). During conversion the articles are also preprocessed: incomplete, duplicate, and otherwise invalid data are removed. These operations may be performed manually or by an algorithm, which is not limited here.
The method also comprises a step of checking the format of the converted file, namely: detecting whether the done file is successfully converted in a standardized way:
if yes, outputting the done file;
otherwise, entering the format conversion step again. The purpose of this check is to verify the format of the converted files so that they are uniform. If the converted file of an article does not conform to the agreed format, that is, the check fails, the article is standardized again and reconverted according to the format.
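The convert, check, and reconvert loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the field names in `REQUIRED_FIELDS`, the toy converter `simple_convert`, and the bound `max_attempts` are all hypothetical (the text does not say how many times conversion is re-entered or what the check inspects).

```python
from typing import Callable, Optional

# Hypothetical subset of standard fields that the format check inspects.
REQUIRED_FIELDS = ("source", "title", "body")


def is_valid(done: dict) -> bool:
    """Check that every required field is present and non-empty."""
    return all(done.get(field) for field in REQUIRED_FIELDS)


def convert_with_retry(
    article: dict,
    convert: Callable[[dict], dict],
    max_attempts: int = 3,  # assumption: the source does not bound the retries
) -> Optional[dict]:
    """Convert, validate, and re-enter conversion when the check fails."""
    for _ in range(max_attempts):
        done = convert(article)
        if is_valid(done):
            return done
    return None  # give up so one malformed article cannot stall the batch


def simple_convert(article: dict) -> dict:
    """Toy converter: maps heterogeneous keys to standard names and trims whitespace."""
    return {
        "source": article.get("origin", "").strip(),
        "title": article.get("headline", "").strip(),
        "body": article.get("text", "").strip(),
    }


ok = convert_with_retry({"origin": "portal", "headline": " Hot topic ", "text": "..."}, simple_convert)
bad = convert_with_retry({"origin": "portal", "headline": "", "text": "..."}, simple_convert)
print(ok["title"], bad)
```

Returning `None` after a bounded number of attempts is a design choice of this sketch; it also gives a natural place to drop incomplete or invalid records during preprocessing.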
The converted raw format is defined to meet the system standard and records the static dimension information of the data:
[ information source | classification label | access amount | reference amount | update time | title | summary | chapter paragraph | evaluation score | original document ].
The fields of these dimensions integrate traditional heat analysis indicators as the basic labels of an article, which provides analysis capability consistent with existing methods and supplies initial dimension material for subsequent extension and optimization of the method.
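As an illustration, the ten static dimensions can be held in a typed record. The sketch below assumes, purely for the example, that a record is serialized as one line with fields in the listed order separated by "|", and that the update time is an ISO 8601 timestamp; the attribute names and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RawRecord:
    """The ten static dimensions listed for the standardized raw format."""

    source: str
    label: str
    access_count: int
    reference_count: int
    updated_at: datetime
    title: str
    summary: str
    paragraphs: str
    score: float
    original_document: str

    @classmethod
    def from_line(cls, line: str) -> "RawRecord":
        # Assumption: fields appear in the listed order, separated by "|".
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 10:
            raise ValueError(f"expected 10 fields, got {len(parts)}")
        return cls(
            source=parts[0],
            label=parts[1],
            access_count=int(parts[2]),
            reference_count=int(parts[3]),
            updated_at=datetime.fromisoformat(parts[4]),
            title=parts[5],
            summary=parts[6],
            paragraphs=parts[7],
            score=float(parts[8]),
            original_document=parts[9],
        )


record = RawRecord.from_line(
    "portal | energy | 1200 | 15 | 2022-08-22T09:00:00 | Grid upgrade | "
    "Short summary | Paragraph text | 4.5 | docs/grid.ceb"
)
print(record.access_count, record.updated_at.year)
```

The access and reference counts are kept as integers because, per the description, they later feed the traditional heat indicators.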
S3, carrying out heat analysis on the preprocessed data source according to a preset analysis algorithm by using an operator module in the analysis controller to obtain heat value ranking information of each article.
After the article data in the standardized raw format is obtained from the data collector, the analysis controller processes it in the following stages:
word-level dimension analysis;
paragraph article analysis;
model classification prediction;
In a specific implementation, the analysis controller completes the processing, conversion, and extraction of the data through the configured operator module. As shown in FIG. 2, the operator module mainly includes basic operators, content operators, and association operators; each operator works on different data objects with a different scheme and produces different results. The operator types and functions are listed below:
[Table images BDA0003809179080000071, BDA0003809179080000081, BDA0003809179080000091, BDA0003809179080000101: operator types and functions; not reproduced in this text]
The following describes in detail how the analysis controller performs word-level dimension analysis, paragraph article analysis, and model classification prediction through the operator module.
1. Word-level dimension analysis
As an optional implementation of the present application, performing heat analysis on the preprocessed data source according to a preset analysis algorithm, by using an operator module in the analysis controller, to obtain heat ranking information of each article includes:
acquiring the preprocessed data source;
performing word-level dimension analysis on the preprocessed data source by using an operator module configured in the analysis controller, and respectively calculating heat information of basic hot words and new words;
and adjusting the heat ranking of the basic hot words and the new words by using a change rate operator, and outputting corresponding heat values.
First, word-level dimension analysis performs hot word calculation, so that personalized hotspot material open to manual intervention can be mined from the atomic level of article content. The full text is segmented into words to obtain the part of speech, word frequency, and TF-IDF value of each word. During this process an externally maintained dictionary file is linked in, so that industry words and business keywords (red-listed and black-listed words) can be adjusted manually: the weight of such a word in the heat score algorithm is raised or lowered, and the heat ranking information of the basic words is calculated. Meanwhile, a mutual information algorithm is used for new word discovery on the article. A word-stacking operator (a part-of-speech combination pattern) is configured and pattern-matched against the candidate new words. The operator captures the parts of speech by which compound words are formed: frequent part-of-speech combination patterns are mined from historical articles with the classical Apriori algorithm, new words that conform to a pattern receive an added confidence weight, TF-IDF values are calculated for the new words that exceed a set threshold, and the heat ranking information of the new words is calculated.
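The two scoring ideas in this paragraph, TF-IDF with dictionary intervention and mutual information for new word candidates, can be sketched in plain Python. This is an illustrative sketch, not the patented operators: it assumes the text is already segmented into tokens, uses a smoothed IDF variant chosen for the example, and treats the dictionary weights as simple multipliers. The part-of-speech pattern matching and Apriori mining are not shown.

```python
import math
from collections import Counter


def tf_idf(doc_tokens: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """Term frequency times a smoothed log inverse document frequency."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = (count / len(doc_tokens)) * idf
    return scores


def apply_dictionary(scores: dict[str, float], weights: dict[str, float]) -> dict[str, float]:
    """Raise or lower word heat with an externally maintained dictionary.

    Assumption: red-listed words carry a multiplier above 1, black-listed below 1.
    """
    return {word: score * weights.get(word, 1.0) for word, score in scores.items()}


def pmi(corpus: list[list[str]], left: str, right: str) -> float:
    """Pointwise mutual information of the adjacent token pair (left, right)."""
    unigrams = Counter(tok for doc in corpus for tok in doc)
    bigrams = Counter(pair for doc in corpus for pair in zip(doc, doc[1:]))
    if bigrams[(left, right)] == 0:
        return float("-inf")
    p_xy = bigrams[(left, right)] / sum(bigrams.values())
    p_x = unigrams[left] / sum(unigrams.values())
    p_y = unigrams[right] / sum(unigrams.values())
    return math.log(p_xy / (p_x * p_y))


corpus = [
    ["smart", "grid", "upgrade"],
    ["smart", "grid", "policy"],
    ["market", "policy", "report"],
]
scores = tf_idf(corpus[0], corpus)
weighted = apply_dictionary(scores, {"grid": 2.0, "upgrade": 0.5})
print(sorted(weighted, key=weighted.get, reverse=True))
print(pmi(corpus, "smart", "grid") > 0)
```

In this toy corpus "smart" and "grid" always appear together, so their mutual information is positive and the pair would be a new word candidate; the dictionary multiplier then moves "grid" ahead of the rarer word "upgrade".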
After word-level dimension analysis yields the heat information of the basic words and new words, the change rate operator retrieves the heat values of each word over its historical acquisition time windows and calculates the change rate (heat difference / number of acquisition windows). Positive and negative change rates raise or lower the heat of the basic words and new words, and the re-adjusted ranking information provides the input parameters for paragraph article analysis.
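A small worked sketch of the change rate operator follows. Two points are assumptions of the sketch, since the text does not fix them: the heat difference is taken between the newest and the oldest window, and the rate is simply added to the current heat. The function names and sample values are hypothetical.

```python
def change_rate(window_heats: list[float]) -> float:
    """Heat difference divided by the number of acquisition windows.

    Assumption: the difference is newest window minus oldest window.
    """
    if len(window_heats) < 2:
        return 0.0
    return (window_heats[-1] - window_heats[0]) / len(window_heats)


def adjust_ranking(history: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Shift each word's current heat by its change rate, then re-rank.

    Assumption: the adjustment is additive; a positive rate raises the heat
    and a negative rate lowers it.
    """
    adjusted = {
        word: heats[-1] + change_rate(heats) for word, heats in history.items()
    }
    return sorted(adjusted.items(), key=lambda item: item[1], reverse=True)


history = {
    "policy": [6.0, 5.5, 5.2],  # cooling down
    "grid": [2.0, 3.0, 5.0],    # heating up
}
ranking = adjust_ranking(history)
print([word for word, _ in ranking])
```

Here "policy" has the higher current heat (5.2 against 5.0), but its rate is negative while the rate of "grid" is +1.0, so "grid" moves to the top after adjustment.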
2. Paragraph article analysis
Paragraph article analysis mainly uses a statistical analysis method to calculate the heat value of the collected articles.
As an optional implementation of the present application, performing heat analysis on the preprocessed data source according to a preset analysis algorithm, by using an operator module in the analysis controller, to obtain heat ranking information of each article further includes:
collecting the heat values of the basic hot words and the new words;
calculating the heat value of each collected article by a statistical analysis method, that is, performing paragraph article analysis to obtain the real-time heat ranking information of the article;
the heat value calculation formula is as follows:
[Formula image BDA0003809179080000121: heat value calculation formula; not reproduced in this text]
wherein lev is the article level weight; a is the exposure operator, calculated from the access amount and reference amount indicators; k and j are the indices of the basic hot words and new words contained in the article, respectively; loc is the position weight operator; poc is the part-of-speech weight; and coo is a composite operator that is added after all articles have been analyzed for heat: for articles whose hot words satisfy the co-occurrence operator, the heat values of the other articles are propagated and accumulated. The specific algorithm is as follows:
[Formula image BDA0003809179080000122: algorithm for the coo operator; not reproduced in this text]
3. Model classification prediction
In model classification prediction, the article is first one-hot encoded based on hot words to construct the label vector of the article; the vector is then input into an xgboost model pre-trained on a manually labeled data set, which outputs by regression the predicted score grade of the article.
As an optional implementation of the present application, performing heat analysis on the preprocessed data source according to a preset analysis algorithm, by using an operator module in the analysis controller, to obtain heat ranking information of each article further includes:
presetting an encoding rule; the encoding rule is set by the user and is not limited herein.
Based on the encoding rule, performing one-hot encoding on the article by using the basic hot words to construct a label vector corresponding to the article; the label vector is constructed according to the encoding rule from the sentence vectors, word vectors, and the like preset in the operator module.
Inputting the label vector into a preset xgboost algorithm model, performing model classification prediction, and outputting by regression the predicted score grade of the article. The operator module stores the algorithm model used to process the label vector; the xgboost model and the like can be obtained by pre-training with the FineTune model and data. Model training is a known technique and is not described in detail in this embodiment.
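The label vector construction can be sketched without any model library. The sketch assumes the simplest possible encoding rule, one position per basic hot word in a fixed vocabulary order, set to 1 when the word occurs in the article; the vocabulary and tokens are invented for the example. The resulting vector is what would be passed to the pre-trained model, which is not shown here.

```python
def label_vector(article_tokens: list[str], hot_words: list[str]) -> list[int]:
    """Encode an article as presence flags over a fixed hot-word vocabulary.

    Assumption: the encoding rule is one position per hot word, in the order
    given by `hot_words`; an article containing several hot words therefore
    has several positions set to 1.
    """
    present = set(article_tokens)
    return [1 if word in present else 0 for word in hot_words]


hot_words = ["grid", "policy", "market", "storage"]  # hypothetical vocabulary
article = ["smart", "grid", "policy", "grid"]
vector = label_vector(article, hot_words)
print(vector)
```

Keeping the vocabulary order fixed matters: the model sees only positions, so the same order must be used at training time and at prediction time.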
As an optional implementation of the present application, performing heat analysis on the preprocessed data source according to a preset analysis algorithm, by using an operator module in the analysis controller, to obtain heat ranking information of each article further includes:
presetting a heat value weighting adjustment formula;
according to the heat value weighting adjustment formula, carrying out weighting adjustment on the real-time heat of the article by utilizing the predicted score grade;
and after adjustment, outputting and displaying the heat analysis result and ranking information of each article.
After the score grade of each article is obtained, the generated real-time heat is weighted and adjusted by multiplying it by a coefficient. This links the statistics-based instantaneous value to the historical dimension, makes the result more objective and extensible, and finally yields the heat analysis result and ranking information of each article.
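The coefficient-product adjustment can be sketched as follows. The mapping from predicted score grade to coefficient in `GRADE_COEFFICIENT` is entirely hypothetical, as are the article identifiers and heat values; the text states only that the real-time heat is multiplied by a coefficient derived from the grade.

```python
# Hypothetical mapping from predicted score grade to weighting coefficient.
GRADE_COEFFICIENT = {1: 0.8, 2: 1.0, 3: 1.2}


def adjust_and_rank(
    realtime_heat: dict[str, float], predicted_grade: dict[str, int]
) -> list[tuple[str, float]]:
    """Multiply each article's real-time heat by its grade coefficient and rank."""
    adjusted = {
        article: heat * GRADE_COEFFICIENT[predicted_grade[article]]
        for article, heat in realtime_heat.items()
    }
    return sorted(adjusted.items(), key=lambda item: item[1], reverse=True)


realtime_heat = {"a1": 10.0, "a2": 9.0, "a3": 4.0}
predicted_grade = {"a1": 1, "a2": 3, "a3": 2}
final_ranking = adjust_and_rank(realtime_heat, predicted_grade)
print([article for article, _ in final_ranking])
```

Article a1 leads on real-time heat alone, but its low predicted grade scales it down to 8.0 while a2 is scaled up to about 10.8, so a2 ranks first after adjustment.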
After the heat analysis result and ranking information of the articles are formed, they can be displayed on different terminals. The method automatically collects, in a personalized manner, website news, paper publications, and digital-format media articles from the industry and related fields, picks up fresh hotspots, and analyzes trend dynamics. It provides data-driven think-tank research reports for office personnel engaged in industry information analysis and in internal and external corporate publicity work at various enterprises, realizes hotspot analysis tailored to each industry and each user together with news aggregation for new events, and provides data-intelligence support for industry information collection and accurate manuscript reduction.
Based on the implementation principle of the method, a hardware architecture for implementing the method is correspondingly provided herein, so as to support the implementation and execution of the method.
As shown in fig. 3, in another aspect of the present application, a data processing architecture is further provided for executing the article popularity analysis method, including:
a data collector, used for collecting and extracting the online article data source;
an operator module, used by the analysis controller to process, convert and extract the data so as to obtain a preprocessed data source;
and an analysis controller, used for performing word-level dimension analysis, paragraph and article analysis, and model classification prediction on the preprocessed data source by means of the operator module, so as to obtain the heat value ranking information of each article.
For the specific structure and functional design of the architecture, refer to the implementation principle of the article popularity analysis method. The architecture is designed around the data interaction and data processing functions of that method, and the details are not repeated here.
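How the three components might cooperate can be sketched in Python as follows. This is only an illustrative skeleton under stated assumptions: the class and method names, the in-memory data source, and the simplified word-level scoring are all hypothetical stand-ins and do not reproduce the operators of the application.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Article:
    article_id: str
    text: str


class DataCollector:
    """Collects raw online articles; here it simply wraps an in-memory list."""

    def __init__(self, raw_items: Iterable[Dict[str, str]]):
        self._raw_items = list(raw_items)

    def collect(self) -> List[Dict[str, str]]:
        return list(self._raw_items)


class OperatorModule:
    """Holds the processing steps that the analysis controller calls."""

    def preprocess(self, raw: Dict[str, str]) -> Article:
        # Stand-in for format conversion and cleaning of one raw record.
        return Article(raw["id"], raw["body"].strip().lower())

    def word_heat(self, article: Article, hot_words: Dict[str, float]) -> float:
        # Stand-in for word-level analysis: sum the weights of hot words present.
        tokens = article.text.split()
        return sum(hot_words.get(token, 0.0) for token in tokens)


class AnalysisController:
    """Drives the operator module and produces a heat ranking."""

    def __init__(self, operators: OperatorModule, hot_words: Dict[str, float]):
        self.operators = operators
        self.hot_words = hot_words

    def run(self, collector: DataCollector) -> List[Tuple[str, float]]:
        articles = [self.operators.preprocess(raw) for raw in collector.collect()]
        scored = [
            (a.article_id, self.operators.word_heat(a, self.hot_words))
            for a in articles
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Keeping the operators in a separate module means the controller only decides the order of the steps, so individual operators can be replaced without touching the collection or ranking logic.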
The method thus offers personalized and accurate article heat analysis, with configurable retrieval dimensions and heat that can be transmitted between articles, as a solution for searching and extracting information tailored to each industry and each user.
It should be noted that, although the foregoing heat analysis scheme is described using the Apriori algorithm and the xgboost model as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, the user can flexibly choose each algorithm and model according to the actual application scenario, as long as the technical function of the present application can be realized according to the above method.
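Because the algorithms are interchangeable, the classification step can be written against a generic model interface. The Python sketch below builds the label vector from the basic hot words (one position per hot word, set to 1 when the word occurs in the article) and passes it to any callable model. `ThresholdModel`, its thresholds, and the `GradeModel` type are hypothetical stand-ins used only so the example runs without external libraries; an xgboost model or another classifier could take its place.

```python
from typing import Callable, List, Sequence

# Assumption: a model is any callable mapping a label vector to a score grade.
GradeModel = Callable[[Sequence[int]], int]


def one_hot_vector(article_words: Sequence[str], hot_words: Sequence[str]) -> List[int]:
    """Return 1 for each basic hot word that occurs in the article, else 0."""
    present = set(article_words)
    return [1 if word in present else 0 for word in hot_words]


class ThresholdModel:
    """Hypothetical stand-in model: the grade grows with the number of hot words hit."""

    def __call__(self, vector: Sequence[int]) -> int:
        hits = sum(vector)
        if hits >= 3:
            return 3
        return 2 if hits >= 1 else 1


def predict_grade(
    article_words: Sequence[str],
    hot_words: Sequence[str],
    model: GradeModel,
) -> int:
    """Encode the article and let the supplied model predict its score grade."""
    return model(one_hot_vector(article_words, hot_words))


if __name__ == "__main__":
    hot = ["grid", "storage", "tariff", "solar"]
    print(predict_grade(["solar", "grid", "news"], hot, ThresholdModel()))
```

Swapping the model only requires passing a different callable to `predict_grade`; the encoding step stays unchanged.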
Embodiment 2
Based on the implementation principle of embodiment 1, in another aspect of the present application, a device for implementing the article popularity analysis method is provided, including:
a data acquisition module, used for collecting and extracting an online article data source through the data collector;
a format conversion module, used for performing format conversion and preprocessing on the extracted article data source to obtain a preprocessed data source;
and a heat analysis module, used for performing heat analysis on the preprocessed data source according to a preset analysis algorithm by means of the operator module in the analysis controller, so as to obtain heat value ranking information of each article.
The functional principle of each module and the interaction between modules are described in detail in embodiment 1 and are not repeated in this embodiment.
It should be apparent to those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. The modules or steps of the invention described above can be implemented by a general-purpose computing device; they can be centralized on a single computing device or distributed over a network of multiple computing devices. Alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they can each be fabricated as a separate integrated circuit module, or several of the modules or steps can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.
Embodiment 3
Still further, in another aspect of the present application, an analysis system is further provided, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the article heat analysis method when executing the executable instructions.
Embodiments of the present disclosure provide an analysis system that includes a processor and a memory for storing processor-executable instructions, wherein the processor is configured to execute the executable instructions to implement any of the article popularity analysis methods described above.
Here, it should be noted that the number of processors may be one or more. Meanwhile, in the analysis system of the embodiment of the present disclosure, an input device and an output device may be further included. The processor, the memory, the input device, and the output device may be connected by a bus, or may be connected by other means, and are not limited specifically herein.
The memory, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and various modules, such as the programs or modules corresponding to the article heat analysis method of the embodiments of the present disclosure. The processor carries out the various functional applications and data processing of the analysis system by running the software programs or modules stored in the memory.
The input device may be used to receive input numbers or signals. The signal may be a key signal generated in connection with user settings and function control of the device, terminal or server. The output device may include a display device such as a display screen.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (9)

1. An article popularity analysis method is characterized by comprising the following steps:
collecting and extracting online article data sources through a data collector;
carrying out format conversion and preprocessing on the extracted article data source to obtain a preprocessed data source;
and carrying out heat analysis on the preprocessed data source according to a preset analysis algorithm by utilizing an operator module in the analysis controller to obtain heat value ranking information of each article.
2. The article popularity analysis method of claim 1, wherein carrying out format conversion on the extracted article data source comprises:
presetting a standardized raw conversion format;
based on the preset standardized raw conversion format, carrying out format conversion on the extracted article data source to obtain a standardized and converted done file;
detecting whether the done file has been successfully converted to the standardized format:
if yes, outputting the done file;
otherwise, entering the format conversion step again.
3. The article popularity analysis method of claim 1, wherein performing heat analysis on the preprocessed data source according to a preset analysis algorithm by using an operator module in the analysis controller to obtain heat value ranking information of each article comprises:
acquiring the preprocessed data source;
performing word-level dimension analysis on the preprocessed data source by using an operator module configured in an analysis controller, and respectively calculating heat information of basic hot words and new words;
and adjusting the heat ranking of the basic hot words and the new words by using a change rate operator, and outputting corresponding heat values.
4. The article popularity analysis method of claim 3, wherein performing heat analysis on the preprocessed data source according to a preset analysis algorithm by using an operator module in the analysis controller to obtain heat value ranking information of each article further comprises:
acquiring the heat values of the basic hot words and the new words;
calculating the heat value of each collected article by a statistical analysis method, performing paragraph and article analysis, and checking the real-time heat ranking information of the articles;
the heat value calculation formula is as follows:
Figure FDA0003809179070000021
wherein lev is the article-level weight; a is a presentation degree operator calculated from an index reference quantity of the access count; k and j are the subscripts of the basic hot words and the new words contained in the article, respectively; loc is a position weight operator; poc is a part-of-speech weight; and coo is a composite operator that is added after heat analysis has been performed on all articles: where the hot words contained in an article satisfy the co-occurrence operator, the heat values of the other articles are transmitted to that article and accumulated. The specific algorithm is as follows:
Figure FDA0003809179070000022
5. The article popularity analysis method of claim 4, wherein performing heat analysis on the preprocessed data source according to a preset analysis algorithm by using an operator module in the analysis controller to obtain heat value ranking information of each article further comprises:
presetting a coding rule;
based on the encoding rule, performing one-hot encoding on the article by using the basic hot words to construct a label vector corresponding to the article;
and inputting the label vector into a preset xgboost algorithm model, performing model classification prediction, and outputting the prediction score grade corresponding to the article in a regression manner.
6. The article popularity analysis method of claim 5, wherein the using an operator module in the analysis controller to perform popularity analysis on the preprocessed data sources according to a preset analysis algorithm to obtain popularity ranking information of each article, further comprises:
presetting a heat value weighting adjustment formula;
according to the heat value weighting adjustment formula, carrying out weighting adjustment on the real-time heat of the article by utilizing the prediction score grade;
and after adjustment, outputting and displaying the heat analysis result and ranking information of each article.
7. An apparatus for implementing the article popularity analysis method of any one of claims 1-6, comprising:
the data acquisition module is used for collecting and extracting an online article data source through the data collector;
the format conversion module is used for carrying out format conversion on the extracted article data source and carrying out preprocessing on the article data source to obtain a preprocessed data source;
and the heat analysis module is used for carrying out heat analysis on the preprocessed data source according to a preset analysis algorithm by utilizing an operator module in the analysis controller to obtain heat value ranking information of each article.
8. A data processing architecture for performing the article popularity analysis method of any one of claims 1-6, comprising:
a data collector, used for collecting and extracting an online article data source;
an operator module, used by the analysis controller to process, convert and extract the data so as to obtain a preprocessed data source;
and an analysis controller, used for performing word-level dimension analysis, paragraph and article analysis, and model classification prediction on the preprocessed data source by means of the operator module, so as to obtain the heat value ranking information of each article.
9. An analysis system, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the article popularity analysis method of any one of claims 1 to 6 when executing the executable instructions.
CN202211006785.4A 2022-08-22 2022-08-22 Article heat analysis method, device, data processing architecture and analysis system Active CN115455269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211006785.4A CN115455269B (en) 2022-08-22 2022-08-22 Article heat analysis method, device, data processing architecture and analysis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211006785.4A CN115455269B (en) 2022-08-22 2022-08-22 Article heat analysis method, device, data processing architecture and analysis system

Publications (2)

Publication Number Publication Date
CN115455269A true CN115455269A (en) 2022-12-09
CN115455269B CN115455269B (en) 2023-08-29

Family

ID=84299414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211006785.4A Active CN115455269B (en) 2022-08-22 2022-08-22 Article heat analysis method, device, data processing architecture and analysis system

Country Status (1)

Country Link
CN (1) CN115455269B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060167784A1 (en) * 2004-09-10 2006-07-27 Hoffberg Steven M Game theoretic prioritization scheme for mobile ad hoc networks permitting hierarchal deference
US20140081793A1 (en) * 2003-02-05 2014-03-20 Steven M. Hoffberg System and method
CN103793503A (en) * 2014-01-24 2014-05-14 北京理工大学 Opinion mining and classification method based on web texts
CN111858934A (en) * 2015-12-04 2020-10-30 杭州数梦工场科技有限公司 Method and device for predicting article popularity
CN112749341A (en) * 2021-01-22 2021-05-04 南京莱斯网信技术研究院有限公司 Key public opinion recommendation method, readable storage medium and data processing device
CN113051484A (en) * 2019-12-27 2021-06-29 北京国双科技有限公司 Method and device for determining hot social information
CN113569129A (en) * 2021-02-02 2021-10-29 腾讯科技(深圳)有限公司 Click rate prediction model processing method, content recommendation method, device and equipment

Also Published As

Publication number Publication date
CN115455269B (en) 2023-08-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant