News writing style modeling method, writing style-influence analysis method and news quality evaluation method
Technical Field
The invention relates to a news writing style modeling method, a writing style-influence analysis method and a news quality evaluation method. The invention is applicable to the field of social media data mining.
Background
With the growing influence of the internet on people's lives and the widespread adoption of mobile devices, the data generated in human production and daily life has grown rapidly in recent years. As deep learning theory has matured, the intrinsic value of big data is being mined continuously, and that value has drawn great attention from governments, industry, and the scientific and technological community.
For the media industry, social media has become a main channel through which people learn about news events. Many traditional news media have therefore created social media accounts (for example, on Sina Weibo), where they post news and interact with readers. How to generate high-quality news content that meets user needs on social media has thus become an important research task.
The quality of news on social media is usually reflected in its influence, so existing work focuses mostly on predicting news influence and concentrates on three aspects: the user, information dissemination, and news content:
(1) Work on the user mainly studies how the personal influence of the user who publishes the news affects the influence of the news. News influence is usually predicted from the user's personal influence, such as the number of followers, combined with the user's social network, such as the follower network.
(2) Work on information dissemination mainly studies how to predict the future dissemination trend of news from the early structure of its dissemination network, and mines important nodes in that network to facilitate dissemination.
(3) Work on news content studies the effect of the news content itself on its dissemination influence. Such work can predict influence before release and help users make targeted revisions while the content is being written, so as to produce higher-quality news.
Work on news content falls largely into two categories: topic-based and language-style-based. Topic-based work studies how the topic of a news item affects its influence; however, authoritative news media must ensure topic diversity, so covering only certain specific topics is not an option for them. How to polish the writing style to produce high-quality news is therefore more important, yet it has been studied less, especially for Chinese. Most existing work considers only basic lexical information, such as word embeddings and n-grams. Such linguistic knowledge cannot accurately characterize the news writing style, and it makes it difficult to capture the effect of writing style on influence.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems, a news writing style modeling method, a writing style-influence analysis method and a news quality evaluation method are provided.
The technical scheme adopted by the invention is as follows: a news writing style modeling method, characterized in that the news writing style model is constructed from quantified readability, logicality, credibility, writing degree, interactivity, interestingness, humanity, and structural integrity.
The readability is measured in terms of the following features: the number of characters, the number of words, the number of sentences, the number of clauses, the average word length, Sentence_broken, RIX, LIX, LW and the number of professional (domain-specific) words;
wherein Sentence_broken represents the average number of clauses contained in a sentence in the news; RIX = LW / number of sentences; LIX = number of words / number of sentences + (100 × LW) / number of words, where LW refers to the number of long words in the news;
the logicality is measured in terms of the following features: Forward_reference and Conjs;
wherein Forward_reference refers to the number of pronouns and third-person pronouns in the news; Conjs refers to the number of conjunctions in the news.
The credibility is measured by the following features: @, Nums, Time, Places, Objects, Official_speech, and Uncertainty;
wherein @ refers to whether the news uses "@" to cite the source of the news message or the object of the news event; Nums refers to whether the news contains detailed numbers; Time refers to whether a time is contained in the news; Places refers to whether a place is contained in the news; Objects refers to whether a person or organization is contained in the news; Official_speech refers to whether the news contains a word indicating that the news comes from an official notice; Uncertainty refers to whether the news contains words expressing uncertainty.
The writing degree is measured by the following features: Noun, Adj, Prep, Pron, Verb, and Adv;
wherein Noun refers to the number of nouns in the news; Adj refers to the number of adjectives in the news; Prep refers to the number of prepositions in the news; Pron refers to the number of pronouns in the news; Verb refers to the number of verbs in the news; Adv refers to the number of adverbs in the news.
The interactivity is measured in terms of the following features: First_pron, Second_pron, Interrogative_pron and Que_mark;
wherein First_pron refers to the number of first-person pronouns in the news; Second_pron refers to the number of second-person pronouns in the news; Interrogative_pron refers to the number of interrogative pronouns in the news; Que_mark refers to the number of question marks in the news.
The interestingness is measured in terms of the following features: Rhetoric, Idiom, Adversative, Exc_mark, Emoticon, and Adj;
wherein Rhetoric refers to the number of metaphors in the news; Idiom refers to the number of idioms in the news; Adversative refers to the number of adversative (turn-of-meaning) words in the news; Exc_mark refers to the number of exclamation marks in the news; Emoticon refers to the number of emoticons in the news; Adj refers to the number of adjectives in the news.
The humanity is measured with the following features: Adv_of_degree, Modal_particle, First_pron, Second_pron, Exc_mark and Que_mark;
wherein Adv_of_degree refers to the number of degree adverbs in the news; Modal_particle refers to the number of modal particles in the news; First_pron refers to the number of first-person pronouns in the news; Second_pron refers to the number of second-person pronouns in the news; Exc_mark refers to the number of exclamation marks in the news; Que_mark refers to the number of question marks in the news.
The structural integrity is measured with the following characteristics: HasHead, HasImage, HasVideo, and HasTag;
wherein HasHead refers to whether a news headline is included; HasImage refers to whether a news picture is included; HasVideo refers to whether news video is included; HasTag refers to whether an event related topic is included.
A writing style-influence analysis method is characterized by comprising the following steps:
acquiring news on a plurality of media, and determining the news quality of the news on the media according to the influence of the media;
obtaining, based on news of different quality, the information gain IG of each feature used in the news writing style modeling method to quantify readability, logicality, credibility, writing degree, interactivity, interestingness, humanity and structural integrity;
the information gain IG is an index for measuring the influence of characteristics on the news quality;
computing, within each medium, the correlation between influence and each feature used to quantify readability, logicality, credibility, writing degree, interactivity, interestingness, humanity and structural integrity, and determining the Spearman correlation coefficient SRC;
the Spearman correlation coefficient SRC is an index that measures how much a feature has an effect on news quality.
The acquiring of the information gain IG of each feature includes:
the information gain IG of a feature T is calculated as follows:
$$IG(T) = H(C) - H(C \mid T)$$
$$H(C \mid T) = p(t)\,H(C \mid t) + p(t')\,H(C \mid t')$$
wherein H(C) is the information entropy of the news quality class C, H(C | T) is the conditional entropy of C given the feature T, p(t) is the probability that a news item contains the feature T, p(t′) is the probability that it does not, H(C | t) and H(C | t′) are the entropies of C in those two cases, and IG ∈ [0, 1].
A news quality assessment method is characterized in that:
for news X, obtaining the information gain IG and the Spearman correlation coefficient SRC of each feature based on the writing style-influence analysis method;
calculating a news quality assessment score Q_score:
$$W_i = IG_i \times SRC_i$$
wherein W_i indicates the magnitude of the effect of the i-th feature on news quality, F_i denotes the i-th feature, and f is the number of features.
The invention has the following beneficial effects: the invention constructs a news writing style model from quantified readability, logicality, credibility, writing degree, interactivity, interestingness, humanity and structural integrity. Because each of these eight criteria is quantified by multiple features, the model characterizes the writing style of high-quality news media with higher accuracy.
The invention analyzes the effect of writing style on influence from two perspectives, inter-media and intra-media, which removes other noise factors and better isolates the relationship between writing style and influence.
The invention evaluates news quality from writing style alone. It does not require information about the early dissemination of the news in the network, so the evaluation can be made before the news is released. It also gives an interpretable analysis of the evaluation result and can therefore provide news writers with meaningful guidance for revision.
Drawings
FIG. 1 is a diagram of a writing style-based news quality assessment model according to the present invention.
Detailed Description
Example 1: this embodiment provides a writing style modeling method based on news criteria. The model is built on the eight news criteria most relevant to the quality of social media news: readability, logicality, credibility, writing degree, interactivity, interestingness, humanity, and structural integrity. This embodiment quantifies these eight criteria with the following textual features:
1. Readability. Being clear and easy to read is a basic requirement of news, especially on a short-text platform such as Sina Weibo, where news content is limited to 140 characters and readers often spend little time skimming it.
In the present embodiment, 10 features can affect the readability of a piece of news: the number of characters, the number of words, the number of sentences, the number of clauses, the average word length, Sentence_broken, RIX, LIX, LW and the number of professional (domain-specific) words. Sentence_broken represents the average number of clauses contained in a sentence: Sentence_broken = number of clauses / number of sentences. RIX = LW / number of sentences. LIX = number of words / number of sentences + (100 × LW) / number of words. LW refers to the number of long words (a long word is a word containing more than 2 characters).
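The readability formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the word list is assumed to come from an external Chinese word segmenter, the sentence and clause delimiters are assumptions (the text does not name them), and the professional-word count is omitted because it would need a domain lexicon.

```python
import re

# Assumed delimiters; the source text does not specify how sentences
# and clauses are split.
SENTENCE_END = "。！？!?"
CLAUSE_END = SENTENCE_END + "，,；;：:"


def count_segments(text, delimiters):
    """Count non-empty segments obtained by splitting on any delimiter."""
    pattern = "[" + re.escape(delimiters) + "]"
    return sum(1 for seg in re.split(pattern, text) if seg.strip())


def readability_features(text, words):
    """text: raw news string; words: segmented words without punctuation
    (produced by an external segmenter, which is assumed here)."""
    n_sentences = max(count_segments(text, SENTENCE_END), 1)
    n_clauses = count_segments(text, CLAUSE_END)
    n_words = len(words)
    # A long word contains more than 2 characters.
    lw = sum(1 for w in words if len(w) > 2)
    return {
        "Characters": len(text),
        "Words": n_words,
        "Sentences": n_sentences,
        "Clauses": n_clauses,
        "Ave_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "Sentence_broken": n_clauses / n_sentences,
        "LW": lw,
        "RIX": lw / n_sentences,
        "LIX": (n_words / n_sentences + 100 * lw / n_words) if n_words else 0.0,
    }


example_text = "今天天气很好，我们去图书馆。你来吗？"
example_words = ["今天", "天气", "很", "好", "我们", "去", "图书馆", "你", "来", "吗"]
features = readability_features(example_text, example_words)
```

For the example, two sentences, three clauses and one long word give Sentence_broken = 1.5, RIX = 0.5 and LIX = 10/2 + 100 × 1/10 = 15.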
2. Logicality. Good news should be logical and coherent from one sentence to the next. Two features measure logicality: Forward_reference and Conjs. Forward_reference captures the logical links between different sentences in the news and is computed as the number of pronouns and third-person pronouns in the news. Because conjunctions such as "therefore" and "so" also contribute to the logicality of news, this embodiment introduces the feature Conjs, which counts the number of conjunctions.
3. Credibility. Seven features relate to credibility: @, Nums, Time, Places, Objects, Official_speech and Uncertainty. Official media on Sina Weibo often use "@" to cite the source of a news message or the object of a news event, which helps to enhance the credibility of the news. Detailed numbers (Nums), time (Time), place (Places) and object (Objects; a person or organization) information also helps to improve credibility. Words indicating that the news originates from an official notice (Official_speech), such as "advisory", are also counted. In addition, words that express uncertainty (Uncertainty), such as "possible", reduce the credibility of news.
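The seven credibility indicators are binary, so they can be sketched as simple presence checks. Everything specific below is an assumption made for illustration: the two word lists, the time pattern, and the entity labels, which are taken to come from an external named-entity recognizer that the text does not name.

```python
import re

# Hypothetical lexicons; the source gives only one example word for each.
OFFICIAL_WORDS = ("通报", "公告")
UNCERTAIN_WORDS = ("可能", "或许")

# Assumed pattern for time expressions such as "3月", "5日" or "14:30".
TIME_PATTERN = re.compile(r"\d+\s*(?:年|月|日|时|点|分)|\d{1,2}[:：]\d{2}")


def credibility_features(text, entity_types=frozenset()):
    """entity_types: labels found by an external NER tool, for example
    {"PLACE", "PERSON", "ORG"} (label names are assumptions)."""
    entity_types = set(entity_types)
    return {
        "@": int("@" in text),
        "Nums": int(bool(re.search(r"\d", text))),
        "Time": int(bool(TIME_PATTERN.search(text))),
        "Places": int("PLACE" in entity_types),
        "Objects": int(bool({"PERSON", "ORG"} & entity_types)),
        "Official_speech": int(any(w in text for w in OFFICIAL_WORDS)),
        "Uncertainty": int(any(w in text for w in UNCERTAIN_WORDS)),
    }


sample = credibility_features(
    "@人民日报 通报：3月5日北京新增2例，可能继续增加",
    entity_types={"PLACE"},
)
```

In practice the lexicons would be much larger, and Places and Objects depend entirely on the quality of the entity recognizer.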
4. Writing degree. News on social media tends to be more colloquial than news in traditional media. The writing degree of news is related to the use of different parts of speech, including nouns (Noun), adjectives (Adj), prepositions (Prep), pronouns (Pron), verbs (Verb) and adverbs (Adv), and is calculated as: writing degree = Noun + Adj + Prep − Pron − Verb − Adv − Sentence_broken.
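The writing-degree formula is a signed sum of part-of-speech counts. The sketch below assumes that a part-of-speech tagger has already mapped each word to one of the coarse labels used in the text; the label strings themselves are illustrative.

```python
from collections import Counter


def writing_degree(pos_tags, sentence_broken):
    """pos_tags: one coarse POS label per word, from an external tagger
    (assumed). sentence_broken: average number of clauses per sentence."""
    c = Counter(pos_tags)
    return (c["Noun"] + c["Adj"] + c["Prep"]
            - c["Pron"] - c["Verb"] - c["Adv"]
            - sentence_broken)


tags = ["Noun", "Noun", "Verb", "Adj", "Pron", "Prep", "Noun", "Adv"]
score = writing_degree(tags, sentence_broken=1.5)
```

With three nouns, one adjective and one preposition against one pronoun, one verb and one adverb, the example gives 5 − 3 − 1.5 = 0.5. Labels outside the six listed parts of speech are simply ignored.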
5. Interactivity. Appropriate interaction with readers helps to prompt them to think and to join the discussion. A sentence such as "What do you think?" achieves this kind of interaction with readers. This embodiment therefore counts the number of first-person pronouns (First_pron), second-person pronouns (Second_pron), interrogative pronouns (Interrogative_pron) and question marks (Que_mark) to measure interactivity.
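The four interactivity counts can be sketched as lexicon lookups over segmented words plus a punctuation count. The pronoun sets below are small hypothetical lexicons chosen for illustration, and the word list is assumed to come from an external segmenter.

```python
# Hypothetical lexicons; a real system would use fuller lists or POS tags.
FIRST_PRON = {"我", "我们", "咱们"}
SECOND_PRON = {"你", "你们", "您"}
INTERROGATIVE_PRON = {"谁", "什么", "哪", "哪里", "怎么", "怎样"}


def interactivity_features(text, words):
    """text: raw news string; words: segmented words (assumed given)."""
    return {
        "First_pron": sum(w in FIRST_PRON for w in words),
        "Second_pron": sum(w in SECOND_PRON for w in words),
        "Interrogative_pron": sum(w in INTERROGATIVE_PRON for w in words),
        # Count both half-width and full-width question marks.
        "Que_mark": text.count("?") + text.count("？"),
    }


inter = interactivity_features(
    "你怎么看？我们一起讨论！",
    ["你", "怎么", "看", "我们", "一起", "讨论"],
)
```

Counting over segmented words rather than raw substrings avoids counting the first-person pronoun inside a longer word twice.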
6. Interestingness. Interesting news is naturally more attractive to readers. The use of metaphors (Rhetoric), idioms (Idiom), adversative words (Adversative), exclamation marks (Exc_mark), emoticons (Emoticon) and adjectives (Adj) all makes news more interesting.
7. Humanity. Good news resonates emotionally with the reader. This embodiment considers that the use of degree adverbs (Adv_of_degree), modal particles (Modal_particle), first-person pronouns (First_pron), second-person pronouns (Second_pron), exclamation marks (Exc_mark) and question marks (Que_mark) makes news more vivid and emotionally engaging.
8. Structural integrity. News on Sina Weibo follows a specific canonical format, which includes a news headline (HasHead) enclosed in "【" and "】", multimedia content such as pictures or video (HasImage, HasVideo), and event-related topics (HasTag).
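The four structural indicators can be sketched as follows. Two conventions are assumed here rather than taken from the text: that a headline is enclosed in "【" and "】", and that an event-related topic is written as #topic#. Whether a post carries a picture or video is assumed to be available as metadata from the crawler.

```python
import re

HEADLINE = re.compile(r"【[^】]+】")   # assumed headline convention
TOPIC = re.compile(r"#[^#\s]+#")      # assumed topic convention: #topic#


def structure_features(text, has_image=False, has_video=False):
    """has_image / has_video: post metadata supplied by the caller."""
    return {
        "HasHead": int(bool(HEADLINE.search(text))),
        "HasImage": int(has_image),
        "HasVideo": int(has_video),
        "HasTag": int(bool(TOPIC.search(text))),
    }


struct = structure_features("【新闻标题】正文内容 #两会#", has_image=True)
```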
Example 2: this embodiment provides a multi-angle writing style-influence analysis method comprising an inter-media analysis and an intra-media analysis. In the inter-media analysis, the topics covered by the media overlap substantially, so the effect of topic is weakened; in the intra-media analysis, the publisher of the news does not change, so the effect of the user is eliminated. Finally, the relationship between writing style and news quality is obtained from the conclusions common to the two analyses.
1. Inter-media analysis experiment
a. Acquire news from a plurality of different media, and determine the news quality of the news on each medium according to the influence of that medium.
In this embodiment, according to the "two micros and one client" (Weibo, WeChat and mobile client) media convergence communication ranking published by People's Daily Online, People's Daily and CCTV News are among the most influential news media on Sina Weibo. All news that these two media published on Sina Weibo from 2012 to 2018 is therefore crawled first and taken as the high-quality news category (class 1). For contrast, two other news media of medium influence are selected, Xinhuanet and Xinhua Viewpoint, and all news they published on Sina Weibo is crawled as medium-quality news (class 0). The influence of class 1 news is consistently tens of times that of class 0 news.
b. Based on news of different quality, obtain the information gain IG of each feature used in the news writing style modeling method of Example 1 to quantify readability, logicality, credibility, writing degree, interactivity, interestingness, humanity and structural integrity.
Based on the writing style features proposed in this embodiment, a machine learning classifier (random forest) is used to classify medium-influence and high-influence news. The resulting classification accuracy, precision, recall and F1 score are 94%, 95%, 94% and 94% respectively, which shows that medium-quality and high-quality news differ in writing style in a clear and distinguishable way.
In this embodiment, the information gain (IG) of each feature is obtained. IG measures the effect of a feature on classification: the criterion is how much information the feature brings to the classification system, and the more information it brings, the more important the feature is. For a given feature, the amount of information in the system differs depending on whether the feature is present or absent, and the difference between the two is the amount of information that the feature brings to the system. The amount of information is measured by entropy, and the IG of a feature T is calculated as follows:
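$$IG(T) = H(C) - H(C \mid T)$$
$$H(C \mid T) = p(t)\,H(C \mid t) + p(t')\,H(C \mid t')$$
wherein H(C) is the information entropy of the news quality class C, H(C | T) is the conditional entropy of C given the feature T, p(t) is the probability that a news item contains the feature T, p(t′) is the probability that it does not, H(C | t) and H(C | t′) are the entropies of C in those two cases, and IG ∈ [0, 1].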
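The information gain defined above can be computed directly from labelled news. The sketch below is a minimal illustration that treats each feature as present or absent; how count-valued features such as LIX are turned into a presence indicator (for example by thresholding) is not stated in the text and would be an additional assumption.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(has_feature, labels):
    """has_feature: one boolean per news item (feature present or not);
    labels: the quality class of each item (1 = high, 0 = medium)."""
    with_t = [y for x, y in zip(has_feature, labels) if x]
    without_t = [y for x, y in zip(has_feature, labels) if not x]
    n = len(labels)
    # Conditional entropy: entropy of each group weighted by its share.
    conditional = (len(with_t) / n) * entropy(with_t) \
        + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - conditional


labels = [1, 1, 0, 0]
ig_perfect = information_gain([True, True, False, False], labels)
ig_useless = information_gain([True, False, True, False], labels)
```

A feature that separates the two classes perfectly reaches the maximum gain of 1 bit, while a feature that is equally common in both classes has a gain of 0, which matches the stated range IG ∈ [0, 1] for two classes.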
c. The information gain IG is used as an index for measuring the effect of a feature on news quality. In this embodiment, the features are ranked by IG, and the top 10 features are: LIX, Exc_mark, Clauses (number of clauses), Sentences (number of sentences), Ave_word_len (average word length), RIX, Sentence_broken, Words (number of words), Second_pron, Characters (number of characters). This shows that the difference in writing style between high-quality and medium-quality news is mainly reflected in readability, interactivity and interestingness.
2. Intra-media analysis experiment
The intra-media analysis computes, within each medium, the correlation between each writing style feature and influence, for example with Spearman's rank correlation coefficient (SRC).
SRC reflects the direction and strength of the monotonic trend between two variables. Its value ranges from −1 to +1: 0 indicates that the two variables are uncorrelated, a positive value indicates a positive correlation, and a negative value indicates a negative correlation.
SRC is a measure of the magnitude of the effect of a feature on news quality. After the SRC between every feature and influence has been calculated within each medium, the final SRC of each feature with respect to influence is obtained by averaging over the media. Ranking the features by SRC leads to a conclusion similar to that of the inter-media analysis experiment: the features most relevant to influence come mainly from readability, interestingness and interactivity.
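The per-medium correlation and the averaging step can be sketched in plain Python. Spearman's coefficient is computed here as the Pearson correlation of average ranks, which handles ties. The text does not say how influence is measured, so the influence values below are placeholders, and `mean_src` is a hypothetical helper name.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation of two equal-length sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def mean_src(per_media):
    """per_media: one (feature_values, influence_values) pair per medium."""
    scores = [spearman(f, inf) for f, inf in per_media]
    return sum(scores) / len(scores)


medium_a = ([1, 2, 3, 4], [10, 20, 30, 40])   # placeholder data
medium_b = ([1, 2, 3, 4], [40, 30, 20, 10])   # placeholder data
averaged = mean_src([medium_a, medium_b])
```

In the placeholder data the feature is perfectly positively correlated with influence in one medium and perfectly negatively correlated in the other, so the averaged SRC is 0. Averaging in this way keeps the sign, so a feature whose effect differs between media ends up with a small final SRC.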
Example 3: this embodiment provides a news quality evaluation method based on writing style:
the effect of each feature on news quality from the two analysis perspectives, inter-media and intra-media, namely the information gain (IG) and the Spearman correlation coefficient (SRC), is obtained according to Example 2.
Given news X, the quality assessment score (Q_score) of news X is calculated according to the following news quality assessment model:
$$W_i = IG_i \times SRC_i$$
wherein W_i indicates the magnitude of the effect of the i-th feature on news quality, F_i denotes the i-th feature, and f is the number of features.
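The text defines the weight W_i and names F_i and f, but the expression that combines them into Q_score is not reproduced. The sketch below therefore assumes a plain weighted sum over the f features, Q_score = W_1·F_1 + … + W_f·F_f; this aggregation, the function names and the sample numbers are illustrative assumptions, not the stated model.

```python
def feature_weights(ig, src):
    """W_i = IG_i * SRC_i for every feature name present in ig."""
    return {name: ig[name] * src[name] for name in ig}


def q_score(features, weights):
    """Assumed aggregation: weighted sum of the feature values of news X."""
    return sum(weights[name] * features[name] for name in weights)


# Illustrative numbers only; they are not results from the document.
ig = {"LIX": 0.5, "Exc_mark": 0.2}
src = {"LIX": -0.4, "Exc_mark": 0.5}
weights = feature_weights(ig, src)
score_x = q_score({"LIX": 10.0, "Exc_mark": 2}, weights)
```

Because IG is non-negative and SRC carries the sign of the correlation, each weight keeps the direction of the effect of its feature, and the per-feature products W_i·F_i show which features raise or lower the score. Features on very different scales would probably need normalization before such a sum is meaningful.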