CN108399257B

CN108399257B - Personalized news clue recommendation method based on intelligent manuscript analysis

Info

Publication number: CN108399257B
Application number: CN201810189147.8A
Authority: CN
Inventors: 顾建国; 苏琦; 吴昊; 马晨阳; 王亮; 许辰铭; 侯方天
Original assignee: Jiangsu Broadcasting Corp
Current assignee: Jiangsu Broadcasting Corp
Priority date: 2018-03-08
Filing date: 2018-03-08
Publication date: 2021-05-18
Anticipated expiration: 2038-03-08
Also published as: CN108399257A

Abstract

An automatic clue recommendation system based on reporters' work attributes intelligently analyzes the clues, topic reports, manuscripts, and broadcast serial lists (rundowns) in a news manuscript system, extracts the corresponding tags, and stores them in the system. Whole-network information is collected through an internet acquisition system. Meanwhile, in combination with the manuscript system of a television station or broadcast station, HP Autonomy provides an information layer, the Intelligent Data Operating Layer (IDOL), which automatically unifies the various information sources and file formats under one intelligent information operating system. That system provides a range of information operations, including automatic linking of related information, automatic classification, content summarization, and information clustering.

Description

Personalized news clue recommendation method based on intelligent manuscript analysis

Technical Field

The invention relates to an information release technology, in particular to automatic clue recommendation based on cloud reporter work attributes (news work attributes).

Background

Big data has entered every aspect of human society and daily life. The media industry is the mainstay of "information consumption" and holds a wealth of audiovisual and user data of its own. As it integrates ever more tightly with the internet, it has become another main battlefield for big data applications. Media big data comes mainly from the following three sources:

1. User behavior data generated by new media services

2. Media resource data accumulated over a long time by traditional broadcast stations

3. Media big data from the internet

The strategic significance of media big data lies not in how much data is held but in the specialized processing of the data that carries meaning. For a long time, research has concentrated on monetizing big data, and a great deal of effort has gone into exploring how news spreads. Studies commonly address how to position media audiences and operating markets accurately, how to support precisely targeted media advertising, and how to detect and monitor the influence of news dissemination, while neglecting news clues, which are the most fundamental requirement of news production. Anyone who works in news media often sees reporters spending a great deal of effort on online searches and telephone inquiries just to find a suitable clue. Production tools are an important component of productivity, and if reporters could use the powerful tool of "big data", the efficiency of news production would improve greatly.

Disclosure of Invention

The invention aims to provide a personalized news clue recommendation method based on intelligent manuscript analysis that reuses the internal resources of a television station, mines reporters' attributes in depth from the news manuscript system, pushes news clues in real time, and improves the efficiency with which reporters find news clues.

To achieve this purpose, the technical scheme of the invention is as follows. In a personalized news clue recommendation method based on intelligent manuscript analysis, the clues, topic reports, manuscripts, and broadcast serial lists (rundowns) in the manuscript system of a news organization (television station, broadcast station, and the like) are intelligently analyzed, the corresponding tags are extracted, and the extracted tags are stored in the system.

Whole-network information (including websites, microblogs, WeChat, forums, apps, PGC, UGC, local communication, telephone hotline systems, and the like) is collected through an internet collection system. The clue information is fed into a data analysis engine through a connector, and clue tags are formed through data preprocessing, semantic analysis, and cluster analysis and stored in the system. Meanwhile, reporters' news work attributes are analyzed using the manuscript system of the television station or broadcast station: the "entries" of a reporter's news work attributes, namely news work attribute tags, are extracted, and matching clues are recommended to the reporter by comparing the reporter's news work attribute tags with the clue tags.

The data analysis engine uses the HP Autonomy engine combined with open-source Spark algorithms. The HP Autonomy core provides a basic understanding of the actual content of information in any form, whether text files, speech, or video, and whether unstructured or structured. Autonomy provides an information layer, the Intelligent Data Operating Layer (IDOL), which automatically unifies all information sources and file formats under one intelligent information operating system. That system provides a range of information operations, including automatic linking of related information, automatic classification, content summarization, and information clustering.

By creating this new layer, the Intelligent Data Operating Layer (IDOL), in the system, Autonomy makes the media organization (that is, the enterprise system) "data-centric". The back end of the operating platform connects to a variety of data sources, content in any language and format can be searched, and summaries of and links to similar information are presented automatically in real time regardless of where the content is stored. Because Autonomy's technology is built on probabilistic modeling, it does not rely on any particular language for analysis and does not need to maintain a cumbersome vocabulary. IDOL treats a word as an abstract symbol of meaning and forms its understanding of the word from the context in which it occurs rather than from a strict grammatical definition, which lets it identify the linguistic characteristics of any data that enters and uses the Autonomy framework. Autonomy also provides classification and clustering functions.

IDOL automatically classifies information according to the concepts in unstructured text, which helps ensure that all data are classified accurately by content. Automatic clustering takes in large amounts of document data or user profile information and automatically identifies the major categories within it, so that IDOL can automatically and consistently calculate which category new information belongs to.

IDOL's design gives it strong cluster-analysis capabilities, but it is weak at stream computing. The streaming capability of Spark is therefore used to process information that is bursty, real-time, and unordered.

The processing flow of the data engine is as follows.

Data cleaning: internet information contains many invalid advertisements and promotional columns, all of which the system removes.

Deduplication: much internet content is reposted, so duplicated content is removed and only one copy is kept as a clue.

Clustering: similar content is grouped into one category according to the word segmentation results. One item is kept as the main headline, and the remaining similar items are displayed as recommendation results.

Content word segmentation: the aggregated content is segmented into words, and the keywords in each article are extracted and recorded as tags.

Matching: the reporter's news attribute tags are compared with the news clue tags. If the matching degree exceeds 70%, the clue is considered an effective clue and is pushed to the reporter through the Litchi Cloud reporting app.
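The cleaning, deduplication, clustering, and tagging steps can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the patent does not disclose how advertisements are detected, how similarity is measured, or how keywords are ranked, so the ad markers, the stopword list, the Jaccard similarity with a 0.5 cutoff, and the frequency-based tags below are stand-ins, and the simple regular-expression tokenizer stands in for real word segmentation.

```python
import re
from collections import Counter

# Hypothetical markers and stopwords; the patent does not specify any.
AD_MARKERS = ("advertisement", "sponsored", "promotion")
STOPWORDS = {"the", "a", "an", "on", "in", "of", "and", "for", "to", "at"}

def tokenize(text):
    """Stand-in for word segmentation: lowercase alphabetic tokens minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def clean(items):
    """Data cleaning: drop items that contain an ad or promotion marker."""
    return [t for t in items if not any(m in t.lower() for m in AD_MARKERS)]

def deduplicate(items):
    """Deduplication: keep the first copy of each item, comparing tokenized text."""
    seen, unique = set(), []
    for t in items:
        key = " ".join(tokenize(t))
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

def jaccard(a, b):
    """Jaccard similarity of two token lists (assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(items, threshold=0.5):
    """Greedy clustering: an item joins the first cluster whose main item is similar enough."""
    clusters = []  # each cluster: {"main": str, "similar": [str, ...]}
    for t in items:
        for c in clusters:
            if jaccard(tokenize(t), tokenize(c["main"])) >= threshold:
                c["similar"].append(t)
                break
        else:
            clusters.append({"main": t, "similar": []})
    return clusters

def tags(text, k=3):
    """Tag extraction: the k most frequent tokens of the text."""
    return [w for w, _ in Counter(tokenize(text)).most_common(k)]

# Invented sample items, not data from the patent.
raw = [
    "Police stop drunk driving on the highway",
    "Police stop drunk driving on the highway",    # repost
    "Sponsored: buy new tyres today",               # advertisement
    "Highway police stop drunk driving suspects",
    "City library opens new reading room",
]
clusters = cluster(deduplicate(clean(raw)))
for c in clusters:
    print(c["main"], "| similar:", len(c["similar"]), "| tags:", tags(c["main"]))
```

Each cluster keeps its first item as the main headline and lists later similar items under it, which mirrors the "one kept as the main title" behavior described in the clustering step.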

Collection targets of the internet acquisition system: the whole-network information covers the internet and the news intranet. Internet information includes the major mainstream authoritative websites, verified ("V"-badged) official microblog accounts, WeChat official accounts, major news apps, and major local forums. News intranet information includes reporter stations, PGC (professionally generated content) from all-media reporters, UGC (user-generated content from citizen reporters), local communication, hotline telephones, and television station reporters; for news intranet information, the user may be required to enter the corresponding news attribute tags.

Taking a news system as an example, the manuscript system is roughly divided into a clue publishing platform, the iNEWS manuscript system, and the broadcast serial list (rundown). On the clue publishing platform, clue-entry staff enter clues from sources such as telephone, fax, and the internet, and the clues can be stored in a news clue library according to different confidentiality levels. The platform supports tag-based management of clues and management of the personal data of those who provide clues, so the clue attributes of reporters can be recorded directly as tags at entry time.

The manuscript system and the broadcast serial list both use an Avid system, which supports exporting manuscript information as XML.
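As an illustration of how exported manuscripts might be grouped by reporter before analysis, the following Python sketch parses a small XML sample with the standard library. The element names (`manuscripts`, `manuscript`, `reporter`, `title`, `body`) and the sample content are hypothetical; the patent does not describe the export schema, and a real Avid export will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical export layout, invented for this sketch.
SAMPLE = """<manuscripts>
  <manuscript id="m1">
    <reporter>reporter_a</reporter>
    <title>Highway checkpoint</title>
    <body>Policemen stopped a drunk driving suspect.</body>
  </manuscript>
  <manuscript id="m2">
    <reporter>reporter_b</reporter>
    <title>Library opening</title>
    <body>The city library opened a new reading room.</body>
  </manuscript>
  <manuscript id="m3">
    <reporter>reporter_a</reporter>
    <title>Night patrol</title>
    <body>Policemen patrolled the high-speed road.</body>
  </manuscript>
</manuscripts>"""

def manuscripts_by_reporter(xml_text):
    """Return {reporter: [body text, ...]} from an exported XML string."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for m in root.iter("manuscript"):
        reporter = m.findtext("reporter", default="").strip()
        body = m.findtext("body", default="").strip()
        grouped.setdefault(reporter, []).append(body)
    return grouped

grouped = manuscripts_by_reporter(SAMPLE)
for reporter, bodies in grouped.items():
    print(reporter, len(bodies))
```

Grouping by reporter first keeps each reporter's manuscripts together, which is the input the attribute-tag extraction needs.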

The invention uses a big data engine together with the television station's news manuscript system to realize, through autonomous analysis, automatic clue recommendation based on reporters' work attributes. It thereby provides reporters with news report clues, indicates where news is occurring and suggests interview directions, and retrieves the clues a reporter needs from numerous news clue sources.

In conclusion, the invention has the following beneficial effects. It reuses the internal resources of media organizations such as television stations, fully exploits their media value, and analyzes reporters' work attributes and news attributes in depth. Combining the HP Autonomy engine with the Spark architecture effectively improves the ability of media organizations such as television stations to analyze their internal resource data, and pushing effective clues promptly and in real time through the Litchi Cloud mobile cloud reporting app can improve the timeliness of news production in traditional media.

Drawings

FIG. 1 is a framework diagram based on the HP Autonomy algorithm.

FIG. 2 is an information clustering architecture diagram.

FIG. 3 is a reporter attribute tag extraction diagram.

Detailed Description

In order to better understand the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawings.

The figures are not drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures may be represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing. Embodiments of various aspects of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 shows a framework diagram based on the HP Autonomy algorithm.

As shown in FIG. 1, the big data analysis system built on HP Autonomy mainly comprises Connectors, CFS, and the IDOL Server:

Connectors: data collectors that support multi-format acquisition, including internet data, audio and video, text documents, databases, and so on;

CFS (Connector Framework Server): preprocesses the acquired data, for example by converting it into a uniform format;

IDOL (Intelligent Data Operating Layer): performs intelligent analysis of the data and is the core of the whole system.

Data sources are collected into the CFS through the Connectors. The CFS converts the multi-format data into one specific format and, after enrichment such as keyword extraction and transcoding, passes the data to the IDOL server. The IDOL server then classifies and clusters the data and performs hot-spot analysis according to actual requirements. Finally, the analysis results are presented visually.
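The three-stage flow (collect, normalize and enrich, analyze) can be sketched in plain Python. This is not the Autonomy API: the function names `connector`, `cfs`, and `idol_index`, the `Document` type, and the keyword rule (words longer than four characters) are all hypothetical stand-ins that only show how data moves between the stages.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    source: str
    content: str
    keywords: list = field(default_factory=list)

def connector(source, raw_bytes, encoding):
    """Connector stage (stand-in): hand over raw data from one source."""
    return {"source": source, "raw": raw_bytes, "encoding": encoding}

def cfs(record):
    """CFS stage (stand-in): transcode to text and enrich with keywords."""
    text = record["raw"].decode(record["encoding"])
    # Assumed keyword rule for illustration: distinct words longer than four characters.
    keywords = sorted({w.lower() for w in text.split() if len(w) > 4})
    return Document(record["source"], text, keywords)

def idol_index(docs):
    """IDOL stage (stand-in): map each keyword to the sources that mention it."""
    index = {}
    for d in docs:
        for k in d.keywords:
            index.setdefault(k, []).append(d.source)
    return index

# Two invented inputs in different encodings, unified by the CFS stand-in.
records = [
    connector("web", "Heavy traffic on highway".encode("utf-8"), "utf-8"),
    connector("hotline", "Highway accident reported".encode("utf-16"), "utf-16"),
]
docs = [cfs(r) for r in records]
index = idol_index(docs)
print(index["highway"])
```

The point of the sketch is the separation of concerns: sources differ in format, the middle stage produces one uniform record type, and analysis only ever sees that uniform type.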

Fig. 2 is an information cluster architecture diagram.

The IDOL server can cluster information automatically and helps users observe trends and changes in the information. Clustering is the process of taking massive unstructured data and partitioning it automatically so that similar information is grouped together. Each cluster represents a conceptual region within the knowledge base that contains items sharing a set of attributes.

Clustering starts from a snapshot of the data stored in IDOL; the data in the snapshot can then be clustered in various ways. The snapshot represents the content of the data index at a particular time, so clustering information and spectral analysis can still be generated even if the data index has since changed. Clustering information and spectral analysis data are generated simultaneously from a single snapshot, which shortens processing time. Ideally, the IDOL server data index from which the snapshot is taken should contain at least several thousand high-quality documents.

FIG. 3 is a tag extraction diagram for reporter work attributes. Even after the information is preprocessed, the volume of data is still large, and pushing around 6000 clues to a reporter each day is clearly meaningless. The invention therefore proposes clue recommendation based on the station's internal manuscript system. The manuscript systems of television and radio stations have accumulated thousands of manuscripts, and these manuscripts correspond one-to-one with reporters. By analyzing the manuscript system with the IDOL Server, the system calculates the key segmented words of each manuscript. As shown in FIG. 3, for three manuscripts by one reporter the system finds the hot words "testless", "drunk driving", "high-speed", and "policemen". Let the weights of these four words be W1, W2, W3, and W4, where the count is incremented by 1 for each occurrence. According to the distribution of the keywords in FIG. 3, the counts are W1 = 2, W2 = 5, W3 = 1, and W4 = 6.
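The counting rule (add 1 to a word's weight each time it occurs) can be sketched as follows. FIG. 3 is not reproduced in the text, so the three segmented manuscripts below are invented purely so that the totals agree with the counts stated above; only the four hot words and the final counts come from the description.

```python
from collections import Counter

# Hot words W1..W4 as named in the description.
HOT_WORDS = ["testless", "drunk driving", "high-speed", "policemen"]

# Hypothetical segmented manuscripts; the real keyword distribution is in FIG. 3.
manuscripts = [
    ["policemen", "drunk driving", "testless", "policemen", "drunk driving"],
    ["drunk driving", "policemen", "policemen", "high-speed", "drunk driving"],
    ["policemen", "testless", "drunk driving", "policemen"],
]

def count_weights(docs, hot_words):
    """Increment a word's count by 1 for every occurrence across all manuscripts."""
    counts = Counter(w for doc in docs for w in doc if w in hot_words)
    return [counts[w] for w in hot_words]

weights = count_weights(manuscripts, HOT_WORDS)
print(dict(zip(["W1", "W2", "W3", "W4"], weights)))
```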

All manuscripts in the manuscript system are analyzed in this way, and the final segmented words for each user, together with their weights, are obtained by weighted averaging. These words are stored in the system as the tags of the reporter's profile. They are then matched against the segmented words of all clues in the system; when the matching degree reaches the set threshold, the clue is treated as an effective clue and recommended to the corresponding reporter, as shown in FIG. 3.
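A minimal sketch of the profile-and-match step follows. The description does not define the weighted average or the matching degree, so both formulas here are assumptions: the profile weight of a word is the average of its per-manuscript relative frequency (optionally weighted per manuscript), and the matching degree is the share of the profile's total weight carried by words that also appear among the clue's tags. The function names and sample data are hypothetical; the 0.7 threshold reflects the 70% figure given elsewhere in the description.

```python
from collections import Counter

def build_profile(manuscripts, doc_weights=None):
    """Assumed weighted average of per-manuscript keyword frequencies."""
    if doc_weights is None:
        doc_weights = [1.0] * len(manuscripts)
    pairs = [(words, dw) for words, dw in zip(manuscripts, doc_weights) if words]
    total = sum(dw for _, dw in pairs)
    profile = Counter()
    for words, dw in pairs:
        for word, n in Counter(words).items():
            profile[word] += dw * (n / len(words)) / total
    return dict(profile)

def matching_degree(profile, clue_tags):
    """Assumed measure: fraction of profile weight on words the clue also has."""
    total = sum(profile.values())
    if total == 0:
        return 0.0
    clue = set(clue_tags)
    return sum(w for word, w in profile.items() if word in clue) / total

def recommend(profile, clues, threshold=0.7):
    """Return the ids of clues whose matching degree reaches the threshold."""
    return [cid for cid, tags in clues.items()
            if matching_degree(profile, tags) >= threshold]

# Invented example data.
manuscripts = [
    ["drunk driving", "policemen", "policemen", "high-speed"],
    ["policemen", "drunk driving"],
]
clues = {
    "c1": ["policemen", "drunk driving", "accident"],
    "c2": ["high-speed", "weather"],
}
profile = build_profile(manuscripts)
print(recommend(profile, clues))
```

Measuring the match against the reporter's profile weight, rather than against the clue's tag count, means that a clue covering a reporter's dominant topics scores highly even if it also carries unrelated tags; other definitions are equally compatible with the description.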

In this way, more accurate clue recommendations are obtained. The clues are also sent to the mobile cloud reporting app on the reporter's phone through the Litchi Cloud interface, so the reporter can check the recommended clues on a mobile phone.

Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention should be determined by the appended claims.

Claims

1. An automatic clue recommendation system based on reporters' work attributes, characterized in that the clues, topic reports, manuscripts, and broadcast serial lists (rundowns) in a news manuscript system are intelligently analyzed, and the corresponding tags are extracted and stored in the system; whole-network information, including websites, microblogs, WeChat, forums, apps, PGC, UGC, local communication, and a telephone hotline system, is collected through an internet collection system, the clue information is fed into a data analysis engine through a connector, and clue tags are formed through data preprocessing, semantic analysis, and cluster analysis and stored in the system; meanwhile, reporters' news work attributes are analyzed using the manuscript system of a television station or broadcast station; the "entries" of a reporter's news work attributes, namely news work attribute tags, are extracted, and matching clues are recommended to the reporter by comparing the reporter's news work attribute tags with the clue tags; the data analysis engine uses the HP Autonomy engine combined with open-source Spark algorithms, and the HP Autonomy core provides a basic understanding of the actual content of information in any form, whether text files, speech, or video, and whether unstructured or structured; Autonomy provides an information layer, the Intelligent Data Operating Layer (IDOL), which automatically unifies the various information sources and file formats under one intelligent information operating system, and that system provides a range of information operations, including automatic linking of related information, automatic classification, content summarization, and information clustering.

2. The automatic recommendation system according to claim 1, wherein Autonomy creates a new layer in the system, namely the Intelligent Data Operating Layer (IDOL), so that the media organization (that is, the enterprise system) is "data-centric"; the back end of the operating platform connects to a variety of data sources, content in any language and format can be searched, and summaries of and links to similar information are presented automatically in real time regardless of where the content is stored.

3. The automatic recommendation system according to claim 2, wherein IDOL treats an entry as an abstract symbol of meaning and forms its understanding of the word from the context in which the entry appears rather than from a strict grammatical definition, thereby identifying the linguistic characteristics of any data that enters and uses the Autonomy framework; and the classification and clustering functions of Autonomy are used.

4. The automatic recommendation system according to claim 2, wherein IDOL automatically classifies information according to the terms in unstructured text; automatic clustering takes in large amounts of document data or user profile information and automatically identifies the major categories within it, so that IDOL can automatically and consistently calculate which category new information belongs to.

5. The automatic recommendation system according to claim 2, wherein, for stream computing, bursty, real-time, and unordered information is processed using the streaming capability of Spark.

6. The automatic recommendation system according to claim 2, wherein the processing flow of the data engine is as follows: data cleaning: internet information contains many invalid advertisements and promotional columns, which the system removes; deduplication: much internet content is reposted, so duplicated content is removed and only one copy is kept as a clue; clustering: similar content is grouped into one category according to the word segmentation results, one item is kept as the main headline, and the remaining similar items are displayed as recommendation results; content word segmentation: the aggregated content is segmented into words, and the keywords in each article are extracted and recorded as tags; the reporter's news attribute tags are compared with the news clue tags, and if the matching degree exceeds 70%, the clue is determined to be an effective clue.

7. The automatic recommendation system according to claim 2, wherein the collection targets of the internet collection system are as follows: the whole-network information covers the internet and the news intranet; internet information includes the major mainstream authoritative websites, verified ("V"-badged) official microblog accounts, WeChat official accounts, major news apps, and major local forums; news intranet information includes reporter stations, PGC (professionally generated content) from all-media reporters, UGC (user-generated content from citizen reporters), local communication, hotline telephones, and television station reporters, and for news intranet information the user may be required to enter the corresponding news attribute tags.

8. The automatic recommendation system according to claim 2, wherein, when applied to the news network production and broadcasting system of a television station, the manuscript system is divided into a clue publishing platform, a manuscript system, and a broadcast serial list (rundown); on the clue publishing platform, clue-entry staff enter clues from telephone, fax, and internet sources and store them in a news clue library according to different confidentiality levels; tag-based management of clues is supported; the personal data of those who provide clues can be managed; the entry system thus records the clue attributes of reporters directly as tags; and the manuscript system and the broadcast serial list both use the television station's news network production and broadcasting system, which supports exporting manuscript information as XML.