CN108399257B - Personalized news clue recommendation method based on intelligent manuscript analysis - Google Patents


Info

Publication number
CN108399257B
Authority
CN
China
Prior art keywords
information
news
clue
data
reporter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810189147.8A
Other languages
Chinese (zh)
Other versions
CN108399257A (en)
Inventor
顾建国
苏琦
吴昊
马晨阳
王亮
许辰铭
侯方天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Broadcasting Corp
Original Assignee
Jiangsu Broadcasting Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Broadcasting Corp
Priority to CN201810189147.8A
Publication of CN108399257A
Application granted
Publication of CN108399257B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A clue automatic recommendation system based on reporter work attributes intelligently analyzes the clues, reports, manuscripts and broadcast serial lists in a news manuscript system, extracts the corresponding tags and stores them in the system. Whole-network information is collected through an internet acquisition system and combined with the manuscript system of a television station or broadcast station. Autonomy creates a new information layer, the Intelligent Data Operating Layer (IDOL), which automatically unifies the various information sources and file formats under one intelligent information operation system. That system provides various information operation functions, including automatic linking of information, automatic information classification, content summarization and information clustering.

Description

Personalized news clue recommendation method based on intelligent manuscript analysis
Technical Field
The invention relates to an information release technology, in particular to automatic clue recommendation based on cloud reporter work attributes (news work attributes).
Background
Big data has entered every aspect of human society and daily life. The media industry is the mainstay of "information consumption" and holds a wealth of audiovisual and user data in its own right. As it integrates ever more tightly with the internet, it has become another main battlefield for big data applications. Media big data mainly comes from the following three sources:
1. User behavior data generated by new media services
2. Media resource data accumulated over a long time by traditional broadcast stations
3. Media big data from the internet
The strategic significance of media big data lies not in how much data is held but in the specialized processing of the data that carries meaning. For a long time, research has focused on monetizing big data and on exploring how news spreads: how to position reports accurately for media research and market operations, how to support precise media advertising, how to detect and monitor the influence of dissemination, and so on. News clues, the most fundamental requirement of news production, have been neglected. Anyone who works in a newsroom will often see reporters busy with online searches and telephone inquiries just to find a suitable clue. Production tools are an important component of productivity development; if reporters could use the powerful tool of "big data", the efficiency of news production would improve greatly.
Disclosure of Invention
The invention aims to provide a personalized news clue recommendation method based on intelligent manuscript analysis that reuses the internal resources of a television station, deeply mines reporter attributes from the news manuscript system, pushes news clues in real time and improves the efficiency with which reporters find news clues.
To achieve this purpose, the technical scheme of the invention is as follows: in a personalized news clue recommendation method based on intelligent manuscript analysis, the clues, reports, manuscripts and broadcast serial lists in a news manuscript system (of a television station, broadcast station or the like) are intelligently analyzed, and the corresponding tags are extracted and stored in the system;
whole-network information (including websites, microblogs, WeChat, forums, apps, PGC, UGC, local communication, telephone hotline systems and the like) is collected through an internet collection system; the clue information is fed into a data analysis engine through a connector, and clue tags are formed through data preprocessing, semantic analysis and cluster analysis and stored in the system; meanwhile, the news work attributes of each reporter are analyzed using the manuscript system of the television station or broadcast station; the "entries" of the reporter's news work attributes, namely the news work attribute tags, are extracted, and matching clues are recommended to the reporter by comparing the reporter's news work attribute tags with the clue tags;
the data analysis engine uses the HP Autonomy engine combined with Spark open-source algorithms. The HP Autonomy core provides a basic understanding of information in any form, whether text files, speech or video, based on the actual unstructured or structured content. Autonomy creates a new information layer, the Intelligent Data Operating Layer (IDOL), which automatically unifies all information sources and file formats under one intelligent information operation system; this system provides various information operation functions, including automatic linking of information, automatic information classification, content summarization, information clustering and the like;
by creating this new layer, the Intelligent Data Operating Layer (IDOL), in the system, Autonomy makes the media unit (that is, the enterprise system) "data-centric". The back end of the operating platform connects to various data sources; content can be searched in any language and format, and wherever the content is stored, summaries of and links to similar information are presented automatically in real time. Because Autonomy's technology is built on probabilistic modeling, it does not rely on any particular language for analysis and does not need to maintain a cumbersome vocabulary. IDOL treats a word as an abstract symbol of meaning and forms its understanding of the word from the context in which it occurs rather than from a strict grammatical definition, and can thereby identify the linguistic characteristics of any data that enters the Autonomy framework. In addition, Autonomy provides classification and clustering functions.
IDOL automatically classifies information according to the concepts in unstructured text, which is intended to ensure accurate classification of all data by content. Automatic clustering takes a large amount of document data or user profile information and automatically identifies the major categories within it, allowing IDOL to calculate automatically and consistently which category new information should belong to.
IDOL's characteristics give it powerful cluster analysis capabilities, but it is weak at stream computing. Spark's stream computing capability is therefore combined with it to process information that is bursty, real-time and unordered.
The processing flow of the data engine is as follows:
Data cleansing: internet information contains many invalid advertisements and promotional columns, all of which the system removes.
Deduplication: much of the information on the internet is reposted; duplicated content is removed and only one copy is kept as a clue.
Clustering: similar contents are grouped into one category according to the word segmentation result. One item is kept as the main title, and the remaining similar items are displayed as recommendation results.
Content word segmentation: the aggregated content is segmented into words, and the keywords in each article are extracted and recorded as tags.
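The four steps above can be sketched in a few lines of Python. This is only an illustrative outline, not the HP Autonomy or Spark implementation: the advertisement markers, the stop-word list, the whitespace tokenizer (standing in for Chinese word segmentation), the Jaccard similarity and the 0.5 threshold are all assumptions introduced here. Because clustering uses the segmentation result, the sketch segments each item before grouping it.

```python
import re
from collections import Counter

# Hypothetical markers for invalid advertising content (assumption).
AD_MARKERS = ("advertisement", "promotion", "sponsored")
# Hypothetical stop-word list; a real system would use a Chinese segmenter.
STOPWORDS = {"the", "a", "an", "on", "in", "of", "and", "to", "for", "at", "is"}


def is_advertisement(text):
    """Data cleansing: flag items that look like ads or promotions."""
    lowered = text.lower()
    return any(marker in lowered for marker in AD_MARKERS)


def normalize(text):
    """Lower-case and collapse punctuation so reposts compare equal."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def deduplicate(items):
    """Deduplication: keep only the first copy of each reposted item."""
    seen, unique = set(), []
    for text in items:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def extract_tags(text, top_n=5):
    """Content word segmentation: keep the most frequent keywords as tags."""
    words = [w for w in normalize(text).split() if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_n)]


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(items, min_similarity=0.5):
    """Clustering: first item of a group is the main title, the rest are similar."""
    clusters = []
    for text in items:
        tags = extract_tags(text)
        for group in clusters:
            if jaccard(tags, group["tags"]) >= min_similarity:
                group["similar"].append(text)
                break
        else:
            clusters.append({"main": text, "tags": tags, "similar": []})
    return clusters


def process(raw_items):
    cleaned = [t for t in raw_items if not is_advertisement(t)]
    return cluster(deduplicate(cleaned))
```

Calling `process` on a list of raw headlines returns one dictionary per cluster, with the retained main title, its tags and the similar items grouped under it.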
The reporter's news attribute tags are compared with the news clue tags; if the matching degree exceeds 70%, the clue is considered an effective clue and is pushed to the reporter through the Litchi Cloud report app.
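The text does not define how the matching degree is computed. As a minimal sketch, one possible reading is assumed here: the percentage of a clue's tags that also appear among the reporter's tags. The function names and the sample tags are hypothetical.

```python
def matching_degree(reporter_tags, clue_tags):
    """Percentage of the clue's tags that also occur in the reporter's tags.

    This definition is an assumption; the source only states a threshold.
    """
    clue_tags = set(clue_tags)
    if not clue_tags:
        return 0.0
    return 100.0 * len(clue_tags & set(reporter_tags)) / len(clue_tags)


def select_effective_clues(reporter_tags, clues, threshold=70.0):
    """Return the ids of clues whose matching degree exceeds the threshold."""
    return [
        clue_id
        for clue_id, tags in clues.items()
        if matching_degree(reporter_tags, tags) > threshold
    ]


reporter = {"drunk driving", "highway", "police", "traffic"}
clues = {
    "c1": ["drunk driving", "highway", "police", "accident"],
    "c2": ["marathon", "highway"],
    "c3": ["police", "traffic", "highway"],
}
print(select_effective_clues(reporter, clues))
```

Other definitions, such as a weighted overlap that uses the reporter's tag weights, would fit the description equally well; only the comparison against a threshold is taken from the text.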
Objects collected by the internet acquisition system: the whole-network information comprises the internet and the news intranet. The internet information comprises the major mainstream authoritative websites, V-verified official microblog accounts, WeChat official accounts, the main news apps and the main local forums. The news intranet information comprises reporter stations, PGC (professionally generated content) submitted by all-media reporters, UGC (user-generated content from citizen reporters), local communication, hotline telephones and television station reporters; for news intranet information, the user can be required to enter the corresponding news attribute tags.
Taking a news system as an example, the manuscript system is roughly divided into a clue publishing platform, the iNEWS manuscript system and a broadcast serial list. On the clue publishing platform, clue entry personnel enter clues from sources such as telephone, fax and the internet, and the clues can be stored in a news clue library according to different levels of confidentiality. The platform supports tag-based management of clues and management of the data of the persons providing clues, and the clue attributes of a reporter can be recorded directly as tags.
The manuscript system and the broadcast serial list both use an Avid system, which supports exporting manuscript information as XML.
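Once manuscripts are available as XML, they can be loaded and grouped by reporter before any tag extraction. The sketch below uses Python's standard `xml.etree.ElementTree`; the element names (`manuscripts`, `manuscript`, `reporter`, `title`, `body`) and the sample content are hypothetical and do not reflect the actual Avid export schema, which the text does not describe.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical export format, for illustration only.
SAMPLE_EXPORT = """<manuscripts>
  <manuscript id="m1">
    <reporter>Reporter A</reporter>
    <title>Highway checkpoint</title>
    <body>Police stopped a drunk driving suspect on the highway.</body>
  </manuscript>
  <manuscript id="m2">
    <reporter>Reporter B</reporter>
    <title>City marathon</title>
    <body>Registration for the city marathon opened today.</body>
  </manuscript>
  <manuscript id="m3">
    <reporter>Reporter A</reporter>
    <title>Night patrol</title>
    <body>Policemen carried out drunk driving checks overnight.</body>
  </manuscript>
</manuscripts>"""


def load_manuscripts(xml_text):
    """Parse the exported XML into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.iter("manuscript"):
        records.append({
            "id": node.get("id"),
            "reporter": node.findtext("reporter", default="").strip(),
            "title": node.findtext("title", default="").strip(),
            "body": node.findtext("body", default="").strip(),
        })
    return records


def group_by_reporter(records):
    """Collect manuscript ids per reporter, preserving export order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["reporter"]].append(record["id"])
    return dict(grouped)


records = load_manuscripts(SAMPLE_EXPORT)
print(group_by_reporter(records))
```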
The invention uses a big data engine combined with the television station's news manuscript system and, through autonomous analysis, achieves automatic clue recommendation based on reporters' work attributes. It thereby provides reporters with news report clues and indicates where news is occurring and which interview directions to pursue, so that reporters can obtain the clues they need from numerous clue sources.
In conclusion, the invention has the following beneficial effects: it reuses the internal resources of media organizations such as television stations, fully mines their media value, and deeply analyzes the work attributes and news attributes of reporters. Combining the HP Autonomy engine with the Spark architecture effectively improves the ability of media organizations such as television stations to analyze their internal resource data, and pushing effective clues in real time through the Litchi Cloud mobile cloud report app as soon as they appear can improve the timeliness of news production in traditional media.
Drawings
FIG. 1 is a diagram of the algorithm framework based on HP Autonomy;
FIG. 2 is an information clustering architecture diagram;
FIG. 3 is a reporter attribute tag extraction diagram.
Detailed Description
In order to better understand the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawings.
The figures are not drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures may be represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing. Embodiments of various aspects of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
FIG. 1 shows the algorithm framework based on HP Autonomy.
The big data analysis system combined with HP Autonomy shown in FIG. 1 mainly comprises the Connectors, the CFS and the IDOL Server:
Connectors: the data collectors. They support multi-format acquisition, including internet data, audio and video, text documents, databases and so on;
CFS (Connector Framework Server): preprocesses the acquired data, for example by converting it into a uniform format;
IDOL (Intelligent Data Operating Layer): performs the intelligent analysis of the data and is the core of the whole system.
Data from the sources is collected by the Connectors and passed to the CFS. The CFS converts the multi-format data into one specific format, enriches it through steps such as keyword extraction and transcoding, and sends it to the IDOL Server. The IDOL Server then classifies, clusters and analyzes hot spots according to actual requirements. Finally, the analysis results are presented visually.
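The role of the preprocessing stage, converting records from different sources into one uniform format and enriching them with keywords, can be illustrated with plain Python. This is not the CFS or IDOL API: the `Document` class, the converter functions, the input field names and the naive keyword rule are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Uniform record handed on to the analysis stage (hypothetical shape)."""
    source: str
    title: str
    content: str
    keywords: list = field(default_factory=list)


def from_web_page(page):
    return Document("web", page["headline"], page["html_text"])


def from_hotline_call(call):
    # A hotline record has no title, so the start of the summary is used.
    return Document("hotline", call["summary"][:30], call["summary"])


# One converter per source type, mirroring "one connector per data source".
CONVERTERS = {"web": from_web_page, "hotline": from_hotline_call}


def preprocess(source_type, raw):
    """Convert to the uniform format, tidy whitespace and attach keywords."""
    doc = CONVERTERS[source_type](raw)
    doc.content = " ".join(doc.content.split())
    # Naive enrichment: words longer than four characters become keywords.
    doc.keywords = sorted(
        {w.lower().strip(".,") for w in doc.content.split() if len(w) > 4}
    )
    return doc


doc = preprocess("web", {"headline": "Crash", "html_text": "Police  closed the\nhighway."})
print(doc)
```

Keeping the converters in a dictionary means a new source type only needs a new converter function; the downstream analysis sees the same `Document` shape regardless of origin.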
FIG. 2 is an information clustering architecture diagram.
The IDOL Server can cluster information automatically and helps the user see trends and developments in the information. Clustering is the process of taking massive unstructured data and partitioning it automatically so that similar information is grouped together. Each cluster represents a conceptual region within the knowledge base that contains items sharing a set of attributes.
Clustering starts from a snapshot of the data stored in IDOL; the data in the snapshot can then be clustered in various ways. A snapshot represents the content of the data index at a particular time and supports the generation of clustering information and spectral analysis even if the data index has since changed. Generating the clustering information and the spectral analysis data from a single snapshot shortens the processing time. Ideally, the IDOL Server data index from which the snapshot is taken should contain at least several thousand high-quality documents.
FIG. 3 is a reporter work attribute tag extraction diagram. Even after the information has been preprocessed, the data volume is still large, and pushing around 6000 clues to a reporter each day would clearly be meaningless. The invention therefore proposes clue recommendation based on the in-station manuscript system. The manuscript systems of television and radio stations have accumulated many thousands of manuscripts, and each manuscript corresponds to a reporter. By analyzing the manuscript system with the IDOL Server, the system calculates the key segmented words of each manuscript. As shown in FIG. 3, the system calculates that the hot segmented words in three manuscripts by one reporter are "testless", "drunk driving", "high-speed" and "policemen". Assume that the weights corresponding to the four words are W1, W2, W3 and W4, where the count is incremented by 1 for each occurrence. According to the distribution of the keywords in FIG. 3, the counting result is W1 = 2, W2 = 5, W3 = 1 and W4 = 6.
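The counting rule, one increment per occurrence, maps directly onto Python's `collections.Counter`. In the sketch below the per-manuscript keyword lists are invented for illustration; only the four keywords and the resulting totals (2, 5, 1 and 6) come from the description.

```python
from collections import Counter


def count_keywords(manuscripts):
    """Add 1 to a keyword's weight for each occurrence across manuscripts."""
    weights = Counter()
    for keywords in manuscripts:
        weights.update(keywords)
    return weights


# Hypothetical distribution of keyword occurrences over a reporter's manuscripts.
manuscripts = [
    ["testless", "drunk driving", "drunk driving", "policemen", "policemen"],
    ["testless", "drunk driving", "drunk driving", "high-speed", "policemen", "policemen"],
    ["drunk driving", "policemen", "policemen"],
]

weights = count_keywords(manuscripts)
for keyword in ["testless", "drunk driving", "high-speed", "policemen"]:
    print(keyword, weights[keyword])
```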
All manuscripts in the manuscript system are analyzed in this way, and each reporter's final segmented words and their corresponding weights are obtained by weighted averaging. These words are stored in the system as the tags of the reporter's portrait. They are then matched against the segmented words of every clue in the system; when the matching degree reaches the set threshold, the clue is treated as an effective clue and recommended to the corresponding reporter, as shown in FIG. 3.
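The description does not say which weights enter the weighted average. As a sketch under that caveat, the code below assumes each manuscript carries a caller-supplied weight (for example, a recency factor) and averages the per-manuscript keyword counts accordingly; `build_profile` and its parameters are hypothetical names.

```python
def build_profile(manuscripts, top_n=10):
    """Build a reporter portrait from (manuscript_weight, keyword_counts) pairs.

    The per-manuscript weight is an assumption; the source only says that
    the final weights are obtained by weighted averaging.
    """
    total_weight = sum(weight for weight, _ in manuscripts)
    if total_weight == 0:
        return {}
    sums = {}
    for weight, counts in manuscripts:
        for keyword, count in counts.items():
            sums[keyword] = sums.get(keyword, 0.0) + weight * count
    averaged = {keyword: value / total_weight for keyword, value in sums.items()}
    # Keep the strongest tags; ties are broken alphabetically for stability.
    ranked = sorted(averaged.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:top_n])


manuscripts = [
    (1.0, {"police": 2, "highway": 1}),   # older manuscript, lower weight
    (3.0, {"police": 2, "court": 1}),     # recent manuscript, higher weight
]
print(build_profile(manuscripts, top_n=2))
```

The resulting dictionary can serve as the reporter's tag set when clues are compared against it.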
In this way, more accurate clue recommendation results are obtained. The clues are then sent to the mobile cloud report app on the reporter's phone through a Litchi Cloud interface call, so the reporter can check the recommended clues on a mobile phone.
Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention should be determined by the appended claims.

Claims (8)

1. A clue automatic recommendation system based on reporter work attributes, characterized in that the clue information, reports, manuscripts and broadcast serial lists in a news manuscript system are intelligently analyzed, and the corresponding tags are extracted and stored in the system; whole-network information, including websites, microblogs, WeChat, forums, apps, PGC, UGC, local communication and a telephone hotline system, is collected through an internet collection system, the clue information is fed into a data analysis engine through a connector, and clue tags are formed through data preprocessing, semantic analysis and cluster analysis and stored in the system; meanwhile, the news work attributes of the reporter are analyzed using the manuscript system of a television station or broadcast station; the "entries" of the reporter's news work attributes, namely the news work attribute tags, are extracted, and matching clues are recommended to the reporter by comparing the reporter's news work attribute tags with the clue tags; the data analysis engine uses the HP Autonomy engine combined with Spark open-source algorithms, and the HP Autonomy core provides a basic understanding of information in any form, whether text files, speech or video, based on the actual unstructured or structured content; Autonomy creates a new information layer, the Intelligent Data Operating Layer (IDOL), which automatically unifies the various information sources and file formats under one intelligent information operation system, and that system provides various information operation functions, including automatic linking of information, automatic information classification, content summarization and information clustering.
2. The auto recommendation system according to claim 1, wherein the Autonomy creates a new layer in the system, i.e. an intelligent data operating layer (i.e. an IDOL), so that the media unit (i.e. the enterprise system) is "data-centric"; the back end of the operation platform is connected with various data sources, content searching can be carried out according to any language and format, and no matter where the content is stored, summaries and links of similar information are automatically presented in real time.
3. The automatic recommendation system according to claim 2, wherein IDOL treats an entry as an abstract symbol of meaning and forms its understanding of the word from the context in which the entry appears rather than from a strict grammatical definition, thereby identifying the linguistic characteristics of any data that enters the Autonomy framework; the classification and clustering functions of Autonomy are used.
4. The automated recommendation system according to claim 2, wherein IDOL automatically classifies information according to terms in unstructured text; automatic clustering can collect a large amount of document data or user profile information and automatically identify the major categories within the information, allowing the IDOL to automatically and consistently calculate which category the new information should belong to.
5. The automated recommendation system according to claim 2, wherein for streaming computing, information of burstiness, real-time, non-orderliness is processed by combining streaming computing power of Spark.
6. The automated recommendation system according to claim 2, wherein the data engine processes data as follows: data cleansing: internet information contains many invalid advertisements and promotional columns, which the system removes; deduplication: for information reposted multiple times on the internet, duplicated content is removed and only one copy is kept as a clue; clustering: similar contents are clustered into one class according to the word segmentation result, one item is kept as the main title and the remaining similar items are displayed as recommendation results; content word segmentation: the aggregated content is segmented into words, and the keywords in each article are extracted and recorded as tags; the reporter's news attribute tags are compared with the news clue tags, and if the matching degree exceeds 70%, the clue is determined to be effective.
7. The automated recommendation system according to claim 2, wherein the objects collected by the internet collection system are as follows: the whole-network information comprises the internet and a news intranet; the internet information comprises the major mainstream authoritative websites, V-verified official microblog accounts, WeChat official accounts, the main news apps and the main local forums; the news intranet information comprises reporter stations, PGC (professionally generated content) submitted by all-media reporters, UGC (user-generated content from citizen reporters), local communication, hotline telephones and television station reporters, and for news intranet information the user can be required to enter the corresponding news attribute tags.
8. The automatic recommendation system of claim 2, wherein, when applied to the news production and broadcasting network system of a television station, the manuscript system is divided into a clue publishing platform, a manuscript system and a broadcast serial list; on the clue publishing platform, clue entry personnel enter clues from telephone, fax and internet sources, and the clues are stored in a news clue library according to different levels of confidentiality; tag-based management of clues is supported; the data of the persons providing clues can be managed; the system thus records the clue attributes of the reporter directly as tags; the manuscript system and the broadcast serial list both use the news production and broadcasting network system of the television station, which supports exporting manuscript information as XML.
CN201810189147.8A 2018-03-08 2018-03-08 Personalized news clue recommendation method based on intelligent manuscript analysis Active CN108399257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810189147.8A CN108399257B (en) 2018-03-08 2018-03-08 Personalized news clue recommendation method based on intelligent manuscript analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810189147.8A CN108399257B (en) 2018-03-08 2018-03-08 Personalized news clue recommendation method based on intelligent manuscript analysis

Publications (2)

Publication Number Publication Date
CN108399257A CN108399257A (en) 2018-08-14
CN108399257B true CN108399257B (en) 2021-05-18

Family

ID=63092595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810189147.8A Active CN108399257B (en) 2018-03-08 2018-03-08 Personalized news clue recommendation method based on intelligent manuscript analysis

Country Status (1)

Country Link
CN (1) CN108399257B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1072983A3 (en) * 1996-04-12 2003-11-12 Avid Technology, Inc. A multimedia system with improved data management mechanisms
CN102750390A (en) * 2012-07-05 2012-10-24 翁时锋 Automatic news webpage element extracting method
CN105656932A (en) * 2016-03-01 2016-06-08 中国传媒大学 Emergency news collecting method and system oriented to user-generated content
CN105706070A (en) * 2013-06-14 2016-06-22 T-数据系统(新加坡)有限公司 System and method for uploading, showcasing and selling news footage

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140358891A1 (en) * 2013-06-04 2014-12-04 Listener Driven Radio Llc System for collecting, calculating, and ranking interest in information in real time

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
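Topic clue detection algorithm in dynamic text streams; 曹月芹; Computer Engineering (《计算机工程》); 2011-12-31; Vol. 37 (No. 24); 45-49 *
Research on automatic discovery of news clues based on microblog user-generated content; 傅湘玲, 齐佳音, 高威; Journal of the China Society for Scientific and Technical Information (《情报学报》); 2016-10-31; Vol. 35 (No. 10); 1038-1047 *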

Also Published As

Publication number Publication date
CN108399257A (en) 2018-08-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant