CN112417088A

CN112417088A - Evaluation method and device for text value in community

Info

Publication number: CN112417088A
Application number: CN201910763287.6A
Authority: CN
Inventors: 刘垚; 邹更; 任钰欣; 黄梓杰
Original assignee: Wuhan Yujianwan Technology Co ltd
Current assignee: Wuhan Yujianwan Technology Co ltd
Priority date: 2019-08-19
Filing date: 2019-08-19
Publication date: 2021-02-26
Anticipated expiration: 2039-08-19
Also published as: CN112417088B

Abstract

The invention discloses a method for evaluating the text value in a community, which comprises the following steps: collecting all corpus texts in a community, constructing a corpus, and preprocessing the corpus texts; preprocessing a target text, taking x words which are linked in sequence as phrases, integrating the target text into a corpus, and updating a vocabulary database and a phrase database; calculating the probability of the phrases contained in the target text appearing in the updated phrase database; calculating the information content of each phrase according to the probability of the occurrence of the phrase in the target text; determining the propagation potential of the phrases in the community according to the coverage of the phrases in the community, wherein the coverage is inversely proportional to the propagation potential; and obtaining the corrected information amount of the phrases according to the information amount and the propagation potential of the phrases, and obtaining the information amount score of the target text according to the corrected information amounts of all the phrases contained in the target text. The method of the invention can improve the accuracy of the scoring and improve the evaluation effect.

Description

Evaluation method and device for text value in community

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for evaluating text value in a community.

Background

With the rapid development of the internet era, information networks among people are increasingly compact, people are gathered into intangible communities by different types of internet products, and information transmission is the most important subject in the internet communities.

In the prior art, the evaluation of the text value in the community, which is commonly used, mainly depends on the feedback of the users in the community. The value evaluation of the text content is formed through the feedback of the user and serves as an important basis for popularization and control of the text. In addition, for quality evaluation of text information, a machine learning-based method is commonly used at present to construct a text classification model through a high-quality text training set labeled manually, or to evaluate the quality of a text with respect to the number of language components, such as expressions and metaphors, which can represent the quality of the text.

The inventor of the present application finds that the method of the prior art has at least the following technical problems in the process of implementing the present invention:

Prior-art text evaluation systems that rely on user feedback suffer from hysteresis and an unavoidable time accumulation effect. Because of the hysteresis, information cannot be evaluated or controlled until it has already spread to some extent, and users are relied upon to screen and evaluate high-quality information, which increases their reading cost. The time accumulation effect lets the information that appears first keep accumulating a propagation advantage and occupy the users' information acquisition channels, so that later high-quality information is blocked; as a result, high-quality information is hard to expose effectively, and the information users receive becomes homogenized. As for quality evaluation of text information, the text is evaluated only in isolation, which results in a poor evaluation effect.

Therefore, the method in the prior art has the technical problem of poor evaluation effect.

Disclosure of Invention

In view of the above, the present invention provides a method and an apparatus for evaluating a value of a text in a community, so as to solve or at least partially solve the technical problem of poor evaluation effect of the method in the prior art.

The invention provides a method for evaluating the text value in a community, which comprises the following steps:

collecting all corpus texts in a community, constructing a corpus, preprocessing the corpus texts, taking x words which are linked in sequence as a phrase, forming a vocabulary database from all the words, and forming a phrase database from all the phrases, wherein x is a positive integer greater than or equal to 2;

preprocessing a target text (T), taking x words which are linked in sequence as phrases, integrating the target text (T) into a corpus, and updating a vocabulary database and a phrase database;

calculating the probability of the appearance of the phrases contained in the target text (T) in the updated phrase database;

calculating the information content of each phrase according to the occurrence probability of the phrase in the target text (T), specifically: h(phrase) = -log₂ p(phrase), wherein p(phrase) represents the probability of occurrence of the phrase, and h(phrase) represents the information content of the phrase;

determining the propagation potential of the phrases in the community according to the coverage of the phrases in the community, wherein the coverage is inversely proportional to the propagation potential;

and obtaining the corrected information amount of the phrases according to the information amount and the propagation potential of the phrases, and obtaining the original information amount score of the target text according to the corrected information amounts of all the phrases contained in the target text.

In one embodiment, after obtaining the raw information content score for the target text, the method further comprises:

normalizing the raw information content score to obtain an information content score whose value range is [0, 100), wherein the normalization is performed as follows:

NSH(T) = arctan(SH(T)) * 200/π,

where SH(T) represents the raw information content score and NSH(T) represents the information content score.

In one embodiment, the words of a phrase formed by x sequentially linked words are referred to, in order of appearance, as the 1st word, the 2nd word, ..., and the x-th word, and calculating the probability of occurrence of the phrases contained in the target text (T) includes:

for each phrase in the target text (T), calculating the frequency count of the x-tuple (word1, ..., wordx) formed by the 1st to x-th words in the updated phrase database, and dividing the frequency count by the total number of phrases in the phrase database to obtain the probability p(word1, ..., wordx) that the x words occur together;

when x is 2, namely the phrase comprises two words, calculating the frequency of the 1st word in the vocabulary database and dividing it by the total number of words in the vocabulary database to obtain the probability of the 1st word; then calculating, according to the conditional probability formula, the probability that the 2nd word occurs given that the 1st word occurs, which is the probability of occurrence of the phrase,

p(phrase) = p(word2|word1) = p(word1∩word2)/p(word1)

wherein p(phrase) represents the probability of occurrence of the phrase, word1 represents the 1st word, word2 represents the 2nd word, p(word1) represents the probability of occurrence of the 1st word, and p(word1∩word2) represents the probability that the 1st word and the 2nd word occur together;

when x is greater than 2, namely the phrase comprises more than two words, calculating the frequency of the 1st to (x-1)-th words of the phrase (word1_x-1) in the corresponding (x-1)-element phrase database and dividing it by the total number of phrases in that phrase database to obtain the probability of the 1st to (x-1)-th words; then calculating the probability that the x-th word occurs given that the 1st to (x-1)-th words occur, which is the probability of occurrence of the phrase,

p(phrase) = p(wordx|word1_x-1) = p(word1...wordx)/p(word1_x-1)

wherein p(phrase) represents the probability of occurrence of the phrase, word1 represents the 1st word, wordx represents the x-th word, p(word1_x-1) represents the probability of occurrence of the 1st to (x-1)-th words, and p(word1...wordx) represents the probability that the x words occur together.

In one embodiment, determining the propagation potential of the phrases in the communities according to the coverage of the phrases in the communities comprises:

acquiring the total number of text units in the community and the number of text units containing the phrase, wherein a text unit represents an individual in the community together with all text corpora of that individual;

calculating a coverage correction parameter of the information quantity of the single phrase according to the following formula, and taking the coverage correction parameter as the propagation potential of the phrase in the community:

S_index = log_N(N/n)

wherein S_index represents the coverage correction parameter of the phrase information content, N represents the total number of text units, and n represents the number of text units containing the phrase.

In one embodiment, the method further comprises:

performing garbled-text detection on the target text (T);

and correcting the information content score of the target text (T) according to the garbled-text detection result.

In one embodiment, the method further comprises:

judging the repetitive content of the target text (T);

and correcting the information content score of the target text (T) according to the result of the repetitive content judgment.

In one embodiment, the method further comprises:

detecting whether the target text (T) uses a preset expression word;

and correcting the information content score of the target text (T) according to the detection result.

Based on the same inventive concept, a second aspect of the present invention provides an apparatus for evaluating a text value in a community, comprising:

the corpus construction module is used for collecting all corpus texts in a community, constructing a corpus, preprocessing the corpus texts, taking x words which are sequentially linked as a phrase, forming a vocabulary database from all the words, and forming a phrase database from all the phrases, wherein x is a positive integer greater than or equal to 2;

the target text preprocessing module is used for preprocessing a target text (T), taking x words which are linked in sequence as phrases, integrating the target text (T) into a corpus and updating a vocabulary database and a phrase database;

the phrase occurrence probability calculation module is used for calculating the probability of the appearance of the phrases contained in the target text (T) in the updated phrase database;

the phrase information content calculation module is used for calculating the information content of each phrase according to the probability of occurrence of the phrase in the target text (T), specifically: h(phrase) = -log₂ p(phrase), wherein p(phrase) represents the probability of occurrence of the phrase, and h(phrase) represents the information content of the phrase;

the phrase propagation potential determination module is used for determining the propagation potential of a phrase in the community according to the coverage of the phrase in the community, wherein the coverage is inversely proportional to the propagation potential;

and the scoring module is used for obtaining the corrected information quantity of the phrases according to the information quantity and the propagation potential of the phrases, and obtaining the original information quantity score of the target text according to the corrected information quantities of all the phrases contained in the target text.

Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the method of the first aspect.

Based on the same inventive concept, a fourth aspect of the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.

One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:

the invention provides a method for evaluating text value in a community. First, all corpus texts in a community are collected, a corpus is constructed, the corpus texts are preprocessed, x words which are sequentially linked are taken as a phrase, a vocabulary database is formed from all the words, and a phrase database is formed from all the phrases. The target text (T) is then preprocessed, x words which are linked in sequence are taken as phrases, the target text (T) is integrated into the corpus, and the vocabulary database and the phrase database are updated. Next, the probability of occurrence of the phrases contained in the target text (T) is calculated, and the information content of each phrase is calculated from that probability. The propagation potential of each phrase in the community is then determined according to the coverage of the phrase in the community, wherein the coverage is inversely proportional to the propagation potential. Finally, the corrected information content of each phrase is obtained from its information content and propagation potential, and the information content score of the target text is obtained from the corrected information contents of all the phrases contained in the target text.

According to the method provided by the invention, the information quantity borne by the text is calculated based on the vocabulary and the organization sequence of the vocabulary, the propagation potential of the word group in the community is further determined according to the coverage degree of the word group in the community on the basis of the information quantity, then the corrected information quantity of the word group is obtained according to the information quantity and the propagation potential of the word group, and the information quantity score of the target text is obtained according to the corrected information quantity of all the word groups contained in the target text, namely, the text value in one community can be evaluated based on two dimensions of the information quantity and the propagation potential, so that the calculation result is more accurate, the valuable information of the text can be more favorably mined, and the evaluation effect is improved.

Furthermore, the invention also performs garbled-text detection on the target text (T) and corrects the information content score of the target text (T) according to the detection result, thereby further improving the evaluation effect.

Further, the invention also judges the repetitive content of the target text (T); and the information content score of the target text (T) is corrected according to the result of the repetitive content judgment, so that the evaluation effect is improved.

Further, the invention also detects whether the target text (T) uses the preset expression words; and the information content score of the target text (T) is corrected according to the detection result, so that the evaluation effect is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart illustrating a method for evaluating the value of a text in a community according to the present invention;

FIG. 2 is a flow diagram illustrating the preprocessing of corpus text within a community in accordance with an exemplary embodiment;

FIG. 3 is a schematic diagram illustrating a process for preprocessing a target text according to an embodiment;

FIG. 4 is a flow diagram illustrating an implementation of evaluating text value in an exemplary embodiment;

FIG. 5 is a flow chart of the calculation of the garbled text index (U_index) in an embodiment;

FIG. 6 is a flow chart of the calculation of the repetition index (R_index) in an embodiment;

FIG. 7 is a flowchart illustrating the calculation of the expressive word richness index (D_index) in accordance with an embodiment;

FIG. 8 is a flow diagram illustrating the modification of text quality scoring in accordance with an exemplary embodiment;

FIG. 9 is a block diagram of an apparatus for evaluating the value of a text in a community according to an embodiment of the present invention;

FIG. 10 is a block diagram of a computer-readable storage medium according to an embodiment of the present invention;

FIG. 11 is a block diagram of a computer device in an embodiment of the present invention.

Detailed Description

The invention aims to provide a method for evaluating text value in a community, aiming at the technical problem that the evaluation effect is poor due to the fact that the text quality is evaluated only in an isolated mode by the method in the prior art, and the text value is evaluated from two dimensions of information quantity and propagation potential so as to achieve the purposes of improving the evaluation accuracy and the evaluation effect.

In order to achieve the above purpose, the main concept of the invention is as follows:

the method comprises the steps of calculating the information quantity borne by a text based on vocabularies and the organization sequence of the vocabularies, further determining the propagation potential of the word group in the community according to the coverage degree of the word group in the community on the basis of the information quantity, then obtaining the corrected information quantity of the word group according to the information quantity and the propagation potential of the word group, and obtaining the information quantity score of a target text according to the corrected information quantity of all the word groups contained in the target text.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

The present embodiment provides a method for evaluating the value of a text in a community, please refer to fig. 1, the method includes:

Step S1: collecting all corpus texts in a community, constructing a corpus, preprocessing the corpus texts, taking x words which are linked in sequence as a phrase, forming a vocabulary database from all the words, and forming a phrase database from all the phrases, wherein x is a positive integer greater than or equal to 2.

Specifically, a community may represent a network of relationships. Please refer to fig. 2, which is a flowchart of an implementation of step S1. Preprocessing the corpus text includes: replacing the punctuation marks in each corpus text with line feeds to break the text into sentences, and then performing word segmentation on each sentence to obtain the individual words. Two or more words appearing in sequence are taken as a phrase. Depending on the value of x, the phrase database includes a binary phrase database, a ternary phrase database, and so on. No redundancy removal is performed on the vocabulary database or the phrase database. Each individual in the community, together with all text corpora of that individual, is called a text unit; the number of text units in which each phrase occurs is counted and stored in the phrase database.
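The preprocessing of step S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: whitespace splitting stands in for word segmentation (the text does not name a segmenter), the punctuation set is a guess, and the names `split_sentences`, `segment`, `ngrams` and `build_databases` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical punctuation set; the text only says that punctuation marks
# are replaced by line feeds.
PUNCTUATION = r"[.,;:!?。，；：！？]"


def split_sentences(text):
    """Replace punctuation with line feeds, then break the text into sentences."""
    replaced = re.sub(PUNCTUATION, "\n", text)
    return [line.strip() for line in replaced.split("\n") if line.strip()]


def segment(sentence):
    """Stand-in word segmentation: whitespace split (an assumption)."""
    return sentence.split()


def ngrams(words, x):
    """All runs of x consecutive words, duplicates kept (no redundancy removal)."""
    return [tuple(words[i:i + x]) for i in range(len(words) - x + 1)]


def build_databases(text_units, x=2):
    """Build the vocabulary and x-word phrase databases.

    text_units maps each individual to the list of that individual's texts.
    Returns (word occurrence counts, phrase occurrence counts,
             word text-unit counts, phrase text-unit counts).
    """
    word_counts, phrase_counts = Counter(), Counter()
    word_units, phrase_units = Counter(), Counter()
    for texts in text_units.values():
        seen_words, seen_phrases = set(), set()
        for text in texts:
            for sentence in split_sentences(text):
                words = segment(sentence)
                phrases = ngrams(words, x)
                word_counts.update(words)
                phrase_counts.update(phrases)
                seen_words.update(words)
                seen_phrases.update(phrases)
        # Each text unit contributes at most 1 to the text-unit count.
        word_units.update(seen_words)
        phrase_units.update(seen_phrases)
    return word_counts, phrase_counts, word_units, phrase_units
```

Phrases are built per sentence, so no phrase spans a sentence boundary; this follows from breaking sentences before segmentation, but whether the original method also forbids cross-sentence phrases is not stated.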

Step S2: preprocessing the target text (T), taking x words which are linked in sequence as phrases, integrating the target text (T) into a corpus, and updating a vocabulary database and a phrase database.

Specifically, referring to fig. 3, which is a flowchart of a specific implementation of step S2, the process of preprocessing the target text (T) is similar to the preprocessing of the text, including punctuation replacement, sentence segmentation and word segmentation.

Step S3: and calculating the probability of the appearance of the phrases contained in the target text (T) in the updated phrase database.

Specifically, the phrases contained in the target text (T) may be binary phrases, ternary phrases, or x-element phrases. The probability of occurrence of each word collocation (namely each phrase) is calculated here so that its information content can be calculated in the subsequent steps. The inventors of the present application found through extensive practice and research that information is carried by words and by the order in which the words are organized; because the number of words is limited, the order in which words are organized is an important feature of the carried information. The invention therefore considers that a user obtains new information when reading a new organized sequence of words.

For a binary phrase (x = 2), the probability of occurrence is:

p(phrase) = p(word2|word1) = p(word1∩word2)/p(word1)

For a phrase of more than two words (x > 2), it is:

p(phrase) = p(wordx|word1_x-1) = p(word1...wordx)/p(word1_x-1)

Specifically, the phrase formed by the 1st to x-th words may be a binary phrase, a ternary phrase, or in general an n-gram phrase. Tables 1 to 3 show examples of the vocabulary database, the binary phrase database, and the ternary phrase database.

TABLE 1 lexical database

Vocabulary	Number of occurrences	Number of text units
On the upper part	65320	2139
Topography	304	150
Complexity of	1823	717
...	...	...

TABLE 2 binary phrase database

Phrase	Number of occurrences	Number of text units
World + top	4795	1034
Is complicated and	5	5
the south-most end of	144	86
...	...	...

TABLE 3 ternary phrase database

When x > 2, a phrase includes more than two words. For example, when x = 3, the probability p(word1, ..., wordx) that the x words occur together is the probability that the three words occur together. The frequency of the 1st and 2nd words in the corresponding binary phrase database is calculated and divided by the total number of phrases in the binary phrase database to obtain the probability of the 1st and 2nd words; the probability that the 3rd word occurs given that the 1st and 2nd words occur is then calculated according to the conditional probability formula, which gives the probability of the ternary phrase.
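The conditional probability calculation of step S3 can be sketched as below. It is a minimal illustration that assumes the databases are held as `collections.Counter` objects keyed by word tuples (for x = 2 the vocabulary database is keyed by 1-tuples); the function name `phrase_probability` and this data layout are assumptions, not part of the described method.

```python
from collections import Counter


def phrase_probability(phrase, phrase_counts, prefix_counts):
    """Probability of the last word given the preceding words.

    phrase        -- tuple of x words
    phrase_counts -- Counter of x-word phrases (the x-element phrase database)
    prefix_counts -- Counter of (x-1)-word prefixes; for x == 2 this is the
                     vocabulary database keyed by 1-tuples
    """
    # p(word1...wordx): frequency of the x-tuple over all x-word phrases.
    joint = phrase_counts[phrase] / sum(phrase_counts.values())
    # p(word1_x-1): frequency of the prefix over all entries of its database.
    prefix = prefix_counts[phrase[:-1]] / sum(prefix_counts.values())
    return joint / prefix


# Toy databases (illustrative counts only).
vocabulary = Counter({("the",): 3, ("cat",): 2, ("sat",): 2, ("ran",): 1, ("dog",): 1})
bigrams = Counter({("the", "cat"): 2, ("cat", "sat"): 1, ("cat", "ran"): 1,
                   ("the", "dog"): 1, ("dog", "sat"): 1})
p = phrase_probability(("cat", "sat"), bigrams, vocabulary)  # (1/6) / (2/9)
```

Because the joint and prefix probabilities are normalized by different totals, the ratio is not guaranteed to stay at or below 1 on a very small corpus; the sketch leaves the value unclamped, since the text does not say how such cases are handled.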

Step S4: calculating the information content of each phrase according to the occurrence probability of the phrase in the target text (T), specifically: h(phrase) = -log₂ p(phrase), wherein p(phrase) represents the probability of occurrence of the phrase, and h(phrase) represents the information content of the phrase.

Specifically, the information content that each phrase brings to the reader can be calculated from the information content formula. Since the target text (T) has been integrated into the vocabulary database and the phrase database in step S2, every phrase appears at least once. The larger the information content of a phrase, the rarer the phrase, the more novel the word collocation, and the more information it provides to readers; conversely, the smaller the information content, the more common the phrase, the more fixed the word collocation, and the less information it provides to readers.
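As a small illustration of step S4, the sketch below computes the information content in bits. It assumes the standard self-information form h = -log2(p), with the negative sign, which is the form consistent with the statement that rarer phrases carry more information; the function name `information_content` is hypothetical.

```python
import math


def information_content(p):
    """Self-information of a phrase in bits: h = -log2(p)."""
    if p <= 0:
        # Every phrase occurs at least once after step S2, so p > 0 is expected.
        raise ValueError("probability must be positive")
    return -math.log2(p)


rare = information_content(1 / 1024)   # a rare collocation
common = information_content(1 / 2)    # a common collocation
```

With these inputs the rare phrase yields 10 bits and the common phrase 1 bit, which matches the qualitative description above.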

Step S5: determining the propagation potential of the phrases in the communities according to the coverage of the phrases in the communities, wherein the coverage is inversely proportional to the propagation potential.

Specifically, the inventors of the present application found through a great deal of practice that: in the prior art, a machine learning-based method constructs a text classification model through a high-quality text training set labeled manually, or evaluates the quality of a text according to the number of language components such as expressions and metaphors which can represent the quality of the text. However, these methods are used for evaluating the quality of the text in isolation, and do not consider the propagation condition of the text in the community, and cannot determine whether the text brings new information to the user, and further cannot determine the propagation potential of the text in the community.

However, within a community, the information value of a text lies not only in its information content but also in its propagation value. The propagation value of a piece of information is measured by its coverage in the community: the larger the coverage, the less residual propagation value the text has in the community; the smaller the coverage, the more residual propagation value it has.

S_index = log_N(N/n)

Specifically, suppose one phrase appears in 5 text units, 100 times in each, and another phrase appears in 500 text units, once in each. Although the calculated information content is the same, the residual propagation value for the community as a whole differs. The information content h(phrase) therefore needs to be corrected according to the propagation potential of the phrase. If the total number of text units in the community is N and the number of text units containing the phrase is n, the coverage correction parameter S_index of the information content of a single phrase is S_index = log_N(N/n). When n is equal to N, S_index = 0, that is, the phrase has no propagation value. Since the target text has been integrated into the vocabulary database and the phrase database, every phrase appears at least once, so the minimum of n is 1; when n = 1, S_index = 1 and the propagation value of the phrase is at its maximum.
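The coverage correction parameter can be sketched as follows, reading the formula as a logarithm to base N (total text units) of N divided by n (text units containing the phrase). The function name `coverage_correction` and the community size of 1000 used in the example are assumptions for illustration.

```python
import math


def coverage_correction(total_units, units_with_phrase):
    """S_index = log base N of (N / n), a value between 0 and 1."""
    if total_units < 2:
        raise ValueError("a logarithm base of at least 2 text units is needed")
    if not 1 <= units_with_phrase <= total_units:
        raise ValueError("n must satisfy 1 <= n <= N")
    return math.log(total_units / units_with_phrase, total_units)


# The 5-unit versus 500-unit comparison, in an assumed community of 1000 units.
narrow = coverage_correction(1000, 5)    # phrase concentrated in few text units
broad = coverage_correction(1000, 500)   # phrase already widely spread
```

The narrowly covered phrase receives the larger S_index, so its information content is discounted less than that of the widely spread phrase.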

Step S6: and obtaining the corrected information amount of the phrases according to the information amount and the propagation potential of the phrases, and obtaining the original information amount score of the target text according to the corrected information amounts of all the phrases contained in the target text.

Specifically, the S_index value of a phrase is multiplied by its information content h(phrase) to obtain the corrected information content SH(phrase); the corrected information contents SH(phrase) of all phrases obtained by segmenting the target text (T) are then averaged to obtain the raw information content score SH(T) of the text.
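Step S6 amounts to a weighted average, which the following sketch illustrates. The input format (a list of pairs of information content and S_index) and the function name `raw_score` are assumptions, as is returning 0.0 for a text with no phrases, a case the text does not cover.

```python
def raw_score(phrases):
    """SH(T): mean over all phrases of h(phrase) * S_index(phrase).

    phrases -- list of (h, s_index) pairs, one per phrase occurrence in T
    """
    if not phrases:
        return 0.0  # assumption: an empty text scores zero
    return sum(h * s_index for h, s_index in phrases) / len(phrases)


# Two phrases: corrected information contents 4.0 * 0.5 and 2.0 * 1.0.
score = raw_score([(4.0, 0.5), (2.0, 1.0)])
```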

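The raw information content score may then be normalized to the range [0, 100):

NSH(T) = arctan(SH(T)) * 200/π,

where SH(T) represents the raw information content score and NSH(T) represents the information content score.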
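The normalization can be illustrated with a short sketch using the arctangent, which maps non-negative raw scores into [0, 100). The function name `normalize_score` is hypothetical.

```python
import math


def normalize_score(sh):
    """NSH(T) = arctan(SH(T)) * 200 / pi."""
    return math.atan(sh) * 200 / math.pi


low = normalize_score(0.0)   # a raw score of 0 stays at 0
mid = normalize_score(1.0)   # arctan(1) = pi/4, so the result is about 50
high = normalize_score(1e6)  # approaches, but stays below, 100
```

The curve is steep near zero and flattens for large raw scores, so differences between very high raw scores are compressed. In floating-point arithmetic an extremely large input may round to exactly 100.0.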

In one embodiment, the method further comprises:

performing garbled-text detection on the target text (T);

Specifically, if a text consists of meaningless garbled characters, the probability of occurrence of its word collocations is very small, which easily makes the information content score NSH(T) of the text too large. To avoid this, garbled text must be detected. If a phrase occurs in fewer than 2 text units (every phrase occurs at least once because the target text (T) has been integrated into the vocabulary database and the phrase database), the phrase is defined as a suspicious phrase (uncertain_phrase).

In implementation, the garbled text index (U_index) of the target text (T) can be used as the measure. Please refer to fig. 5, which is a flowchart of calculating the garbled text index (U_index): the total number of phrases N(All_phrase) in the target text (T) is divided by the number of suspicious phrases N(uncertain_phrase) to obtain the garbled text index of the text, that is, U_index(T) = N(All_phrase)/N(uncertain_phrase).

The information content score of the target text (T) may be corrected according to the garbled-text detection result as follows: when U_index(T) < 2.0, the text is determined to be garbled and Y(T) = 0, where Y(T) represents the corrected score.
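The garbled-text check can be sketched as below. The sketch assumes that the phrase database exposes, for each phrase, the number of text units containing it (here a plain dict), and it treats a text with no suspicious phrases as having an infinite index, a case the text leaves undefined. The names `garbled_index` and `is_garbled` are hypothetical.

```python
def garbled_index(phrases, phrase_units):
    """U_index(T) = total phrases / suspicious phrases.

    phrases      -- list of phrases in the target text, duplicates kept
    phrase_units -- mapping from phrase to the number of text units containing it
    """
    # A phrase seen in fewer than 2 text units is suspicious.
    suspicious = sum(1 for p in phrases if phrase_units.get(p, 0) < 2)
    if suspicious == 0:
        return float("inf")  # assumption: no suspicious phrase means not garbled
    return len(phrases) / suspicious


def is_garbled(phrases, phrase_units, threshold=2.0):
    """The text is judged garbled when U_index(T) falls below the threshold."""
    return garbled_index(phrases, phrase_units) < threshold


units = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1, ("d", "e"): 5}
text = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
flag = is_garbled(text, units)  # 4 phrases / 3 suspicious is below 2.0
```

In other words, with a threshold of 2.0 a text is flagged when more than half of its phrases appear in fewer than two text units.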

In one embodiment, the method further comprises:

judging the repetitive content of the target text (T);

Specifically, if the same content appears repeatedly in the target text (T) and the NSH(T) of the repeated content is high, the score of the whole text is inflated. This embodiment therefore penalizes repeated content.

In a specific implementation, the repetition index (R_index) may be used as the measure; please refer to fig. 6, which is a flowchart for calculating the repetition index (R_index). The total number of phrases N(All_phrase) of the target text (T) is divided by the number of distinct phrases N(Nr_phrase) of the text to obtain the repetition index, that is: R_index(T) = N(All_phrase)/N(Nr_phrase).

The correction method can be as follows: when the text R _ index (T) >1.5, a penalty mechanism is started, and the penalty is calculated by the following method:

Y(T)＝NSH(T)-(R_index(T)-1.5)*10
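The repetition penalty can be sketched as follows. The function names are hypothetical, and returning 1.0 for a text with no phrases is an assumption, since that case is not described.

```python
def repetition_index(phrases):
    """R_index(T) = total phrases / distinct phrases."""
    if not phrases:
        return 1.0  # assumption: an empty text has no repetition
    return len(phrases) / len(set(phrases))


def apply_repetition_penalty(nsh, r_index, threshold=1.5):
    """Y(T) = NSH(T) - (R_index(T) - 1.5) * 10 when R_index(T) > 1.5."""
    if r_index > threshold:
        return nsh - (r_index - threshold) * 10
    return nsh


phrases = [("a", "b"), ("a", "b"), ("a", "b"), ("b", "c")]
r = repetition_index(phrases)               # 4 phrases, 2 distinct
penalized = apply_repetition_penalty(60.0, r)
```

Each 0.1 of repetition index above the threshold costs one point of the normalized score.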

In one embodiment, the method further comprises:

detecting whether the target text (T) uses a preset expression word;

Specifically, if rich expressive words (Eword), including verbs, adjectives and adverbs, are used in the target text (T), the text is considered to have better expressive tension; penalty correction is therefore required for text that does not use rich Eword.

In a specific implementation, the expressive word richness index (D_index) of the target text (T) may be used for measurement. Please refer to fig. 7, which is a flowchart for calculating the expressive word richness index (D_index). All Eword in the target text (T) are marked using a semantic analysis model. The total number of Eword N(All_Eword) is divided by the number of non-repeated Eword N(Nr_Eword) to obtain the expressive word richness index of the target text. The specific calculation formula is: D_index(T)＝N(All_Eword)/N(Nr_Eword).

The correction method can be as follows: when D_index(T) > 3.0 for the text, a penalty mechanism is started, and the penalized score is calculated as follows:

Y(T)＝NSH(T)-(D_index(T)-3.0)*10
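The richness index can be illustrated with a short Python sketch. It assumes the semantic analysis model has already produced (word, tag) pairs; the tag set "v", "a", "d" for verbs, adjectives and adverbs is a hypothetical convention, and the behaviour for a text with no expressive words at all is not specified in the text, so the sketch simply returns None in that case.

```python
# Hypothetical part-of-speech tags: verb, adjective, adverb.
EXPRESSIVE_TAGS = {"v", "a", "d"}


def richness_index(tagged_words):
    """D_index(T) = N(All_Eword) / N(Nr_Eword) over the expressive words."""
    ewords = [word for word, tag in tagged_words if tag in EXPRESSIVE_TAGS]
    if not ewords:
        return None  # assumption: this case is not defined in the text
    return len(ewords) / len(set(ewords))


# Hypothetical tagged text: four expressive words, three of them distinct.
tagged = [("wind", "n"), ("blows", "v"), ("hard", "d"), ("blows", "v"), ("cold", "a")]
d_index = richness_index(tagged)
```

A higher value means the same expressive words are reused more often, which is what the threshold of 3.0 penalizes.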

It should be noted that, in a specific implementation, the text value score may be corrected by combining one or more of the messy code determination, the repetitive content determination and the richness determination, so as to improve scoring accuracy. Please refer to fig. 8, which is a flowchart of a text quality score correction method in which the messy code index (U_index), the repetition index (R_index) and the expressive word richness index (D_index) are combined. First, it is determined whether the messy code index is less than 2; if so, the text is messy code and the score is set to 0. Otherwise, it is determined whether the repetition index is greater than 1.5; if so, the score is corrected by the repetition index. After that correction, or when the repetition index is not greater than 1.5, the expressive word richness index is evaluated. Further, when the penalty causes Y(T) < 0, Y(T) is set to 0.
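The combined correction flow can be sketched as follows. This is an illustrative reading of the flow rather than a definitive implementation: it assumes the two penalties are applied cumulatively to the same score, and that the richness penalty subtracts (D_index - 3.0) * 10 by analogy with the repetition penalty. The function name is hypothetical.

```python
def corrected_score(nsh, u_index, r_index, d_index):
    """Apply the messy code, repetition and richness checks in sequence."""
    # Step 1: a messy code index below 2 marks the text as messy code.
    if u_index < 2:
        return 0.0
    score = nsh
    # Step 2: repetition penalty above the 1.5 threshold.
    if r_index > 1.5:
        score -= (r_index - 1.5) * 10
    # Step 3: richness penalty above the 3.0 threshold (assumed form).
    if d_index > 3.0:
        score -= (d_index - 3.0) * 10
    # Step 4: the final score is never negative.
    return max(score, 0.0)


# Index values reported for Example one; no penalty is triggered.
y = corrected_score(88.32469777023694, 2.83333333333335,
                    1.4285714285714286, 1.1666666666666667)
```

With the index values reported for Example one, none of the thresholds is crossed, so the final score equals the normalized score.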

To illustrate the implementation flow of the present invention more clearly, several examples are described in detail below. Please refer to fig. 4, which is an implementation flowchart for evaluating text value.

First, sentence segmentation is performed on the target text to obtain sentences (sentence 1, sentence 2, …, sentence n), and word segmentation is then performed on each sentence. Taking sentence 1 as an example, different words (word1, word2, word3, …) are obtained after word segmentation; taking binary phrases as an example, word1 and word2 form a phrase. The probability of occurrence of the phrase is calculated from the probability of occurrence of word1 and the probability of the two words occurring together, and the information content h(phrase) that the phrase brings to the reader is then calculated. A coverage correction parameter S_index for the information content of the phrase is calculated from the number of books containing the phrase and the total number of text units in the community. The corrected information content Sh(phrase) is then calculated from h(phrase) and S_index, the corrected information contents of all phrases in the target text are averaged to obtain the information content score of the target text, and finally normalization is performed to obtain the final information content score.
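The per-phrase part of this flow can be sketched in Python. The sketch assumes that the information content is the negative base-2 logarithm of the phrase probability and that the corrected information content is the product h(phrase) * S_index; both forms are inferred from the numerical values reported in the examples of this section. The function names are hypothetical.

```python
import math


def phrase_probability(p_word1_word2, p_word1):
    """p(phrase) = p(word1 and word2 together) / p(word1)."""
    return p_word1_word2 / p_word1


def information_content(p_phrase):
    """h(phrase) = -log2 p(phrase)."""
    return -math.log2(p_phrase)


def corrected_information(h_phrase, s_index):
    """Sh(phrase) = h(phrase) * S_index (product form inferred from the examples)."""
    return h_phrase * s_index


# Input values taken from the first phrase calculation of Example one.
p = phrase_probability(0.00031978839945072423, 0.0013908097037316097)
h = information_content(p)
sh = corrected_information(h, 0.09578403120899699)
```

Run on the inputs of the first phrase calculation in Example one, the sketch gives values that agree with the p_phrase, h_phrase and Sh figures reported there.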

Example one:

Cape Horn is the southernmost tip of South America; the terrain is complex and the climate harsh, strong winds blow all year round, the waves are rough, and it can be called one of the channels with the worst sea conditions in the world.

Word list: ['Cape Horn', 'is', 'South America', 'of', 'southernmost end', 'terrain', 'complex', 'and', 'climate', 'bad', 'strong wind', 'constant', 'rough', 'can be called', 'world', 'on', 'sea state', 'most', 'bad', 'of', 'channel', 'one of']

Phrase list: ['Cape Horn + is', 'is + South America', 'South America + of', 'of + southernmost end', 'terrain + complex', 'complex + and', 'and + climate', 'climate + bad', 'strong wind + constant', 'can be called + world', 'world + on', 'on + sea state', 'sea state + most', 'most + bad', 'bad + of', 'of + channel', 'channel + one of']

The calculation for the phrase "world + on" is as follows:

p_word1_word2(world + on)＝0.00031978839945072423

p_word1(world)＝0.0013908097037316097

p_phrase(world + on)＝0.22992965794868736

h_phrase(world + on)＝2.1207355278488125

S_index(world + on)＝0.09578403120899699

Sh(world + on)＝0.20313259798549937

The calculation for the phrase "complex + and" is as follows:

p_word1_word2(complex + and)＝3.334602705429867e-07

p_word1(complex)＝9.321150288234715e-05

p_phrase(complex + and)＝0.0035774583633082805

h_phrase(complex + and)＝8.126849308597139

S_index(complex + and)＝0.7903415048925304

Sh(complex + and)＝6.422986312591483

The calculation for the phrase "and + climate" is as follows:

p_word1_word2(and + climate)＝6.669205410859734e-08

p_word1(and)＝0.00020938074838354993

p_phrase(and + climate)＝0.00031852046868429773

h_phrase(and + climate)＝11.616326293973737

S_index(and + climate)＝1.0

Sh(and + climate)＝11.616326293973737

The calculation for the phrase "South America + of" is as follows:

p_word1_word2(South America + of)＝1.0670728657375574e-06

p_word1(South America)＝5.828914607014577e-06

p_phrase(South America + of)＝0.18306544831750163

h_phrase(South America + of)＝2.4495685716101083

S_index(South America + of)＝0.647227317405562

Sh(South America + of)＝1.5854276954041848

The calculation for the phrase "terrain + complex" is as follows:

p_word1_word2(terrain + complex)＝1.4005331362805442e-06

p_word1(terrain)＝1.5543772285372205e-05

p_phrase(terrain + complex)＝0.09010252534377035

h_phrase(terrain + complex)＝3.4722886481101916

S_index(terrain + complex)＝0.6309225466580092

Sh(terrain + complex)＝2.190745196597378

The calculation for the phrase "can be called + world" is as follows:

p_word1_word2(can be called + world)＝2.2675298396923095e-06

p_word1(can be called)＝3.737663664673382e-05

p_phrase(can be called + world)＝0.06066703810521857

h_phrase(can be called + world)＝4.0429433121474645

S_index(can be called + world)＝0.5706574375390948

Sh(can be called + world)＝2.3071356706258928

The calculation for the phrase "climate + bad" is as follows:

p_word1_word2(climate + bad)＝3.334602705429867e-07

p_word1(climate)＝5.767557611151266e-05

p_phrase(climate + bad)＝0.005781654783963649

h_phrase(climate + bad)＝7.434301814956798

S_index(climate + bad)＝0.7903415048925304

Sh(climate + bad)＝5.875637284258225

The calculation for the phrase "bad + of" is as follows:

p_word1_word2(bad + of)＝7.536202114271499e-06

p_word1(bad)＝1.4418894027878165e-05

p_phrase(bad + of)＝0.5226615924703139

h_phrase(bad + of)＝0.9360509474289722

S_index(bad + of)＝0.4123786891864524

Sh(bad + of)＝0.3860074627124964

The calculation for the phrase "is + South America" is as follows:

p_word1_word2(is + South America)＝1.5339172444977387e-06

p_word1(is)＝0.015105274288269074

p_phrase(is + South America)＝0.00010154845355499411

h_phrase(is + South America)＝13.265544110033836

S_index(is + South America)＝0.6388200038250708

Sh(is + South America)＝8.47429493911346

The calculation for the phrase "most + bad" is as follows:

p_word1_word2(most + bad)＝1.0670728657375574e-06

p_word1(most)＝0.0021943306953915577

p_phrase(most + bad)＝0.00048628625939498525

h_phrase(most + bad)＝11.00590655248606

S_index(most + bad)＝0.6562148909536591

Sh(most + bad)＝7.222239768185802

The calculation for the phrase "of + southernmost end" is as follows:

p_word1_word2(of + southernmost end)＝9.603655791638017e-06

p_word1(of)＝0.06442218685332284

p_phrase(of + southernmost end)＝0.00014907373159349168

h_phrase(of + southernmost end)＝12.71168631802845

S_index(of + southernmost end)＝0.4197404301194033

Sh(of + southernmost end)＝5.335608682672196

The calculation for the phrase "of + channel" is as follows:

p_word1_word2(of + channel)＝6.002284869773761e-07

p_word1(of)＝0.06442218685332284

p_phrase(of + channel)＝9.31710822459323e-06

h_phrase(of + channel)＝16.71168631802845

S_index(of + channel)＝0.7137716250260632

Sh(of + channel)＝11.928327500144992

The calculation for the phrase "Cape Horn + is" is as follows:

p_word1_word2(Cape Horn + is)＝6.669205410859734e-08

p_word1(Cape Horn)＝4.090466390887423e-07

p_phrase(Cape Horn + is)＝0.1630426649077749

h_phrase(Cape Horn + is)＝2.6166785574453666

S_index(Cape Horn + is)＝1.0

Sh(Cape Horn + is)＝2.6166785574453666

The calculation for the phrase "strong wind + constant" is as follows:

p_word1_word2(strong wind + constant)＝6.669205410859734e-08

p_word1(strong wind)＝1.2271399172662268e-06

p_phrase(strong wind + constant)＝0.05434755496925829

h_phrase(strong wind + constant)＝4.201641058166523

S_index(strong wind + constant)＝1.0

Sh(strong wind + constant)＝4.201641058166523

The calculation for the phrase "on + sea state" is as follows:

p_word1_word2(on + sea state)＝6.669205410859734e-08

p_word1(on)＝0.0033398658081595805

p_phrase(on + sea state)＝1.996848314853336e-05

h_phrase(on + sea state)＝15.611915727894425

S_index(on + sea state)＝1.0

Sh(on + sea state)＝15.611915727894425

The calculation for the phrase "sea state + most" is as follows:

p_word1_word2(sea state + most)＝6.669205410859734e-08

p_word1(sea state)＝1.5339248965827835e-07

p_phrase(sea state + most)＝0.43478043975406633

h_phrase(sea state + most)＝1.2016410581665227

S_index(sea state + most)＝1.0

Sh(sea state + most)＝1.2016410581665227

The calculation for the phrase "channel + one of" is as follows:

p_word1_word2(channel + one of)＝6.669205410859734e-08

p_word1(channel)＝1.4827940666966906e-06

p_phrase(channel + one of)＝0.044977286871110314

h_phrase(channel + one of)＝4.4746595525729385

S_index(channel + one of)＝1.0

Sh(channel + one of)＝4.4746595525729385

After the corrected information content Sh(phrase) of each phrase is calculated, the values are averaged to obtain SH, and normalization is then performed. Messy code determination is carried out next; since the text is not messy code, the repetition index R_index and the richness index D_index are calculated and penalty correction is applied. The final score is Y＝88.32469777023694; the higher the score, the higher the value of the text.

SH＝5.3914356093241835

NSH＝88.32469777023694

U_index＝2.83333333333335

R_index＝1.4285714285714286

D_index＝1.1666666666666667

Y＝88.32469777023694
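The aggregation and normalization steps for Example one can be checked with a short Python sketch. It averages the seventeen Sh values listed in the phrase calculations above (rounded here to six decimal places) and applies the normalization NSH＝arctan(SH)*200/π; the function name is hypothetical.

```python
import math

# Sh values of the 17 phrases in Example one, rounded to six decimals.
SH_VALUES = [
    0.203133, 6.422986, 11.616326, 1.585428, 2.190745, 2.307136,
    5.875637, 0.386007, 8.474295, 7.222240, 5.335609, 11.928328,
    2.616679, 4.201641, 15.611916, 1.201641, 4.474660,
]


def text_score(sh_values):
    """Return (SH, NSH): the mean corrected information and its normalized form."""
    sh = sum(sh_values) / len(sh_values)
    nsh = math.atan(sh) * 200 / math.pi  # maps non-negative SH into [0, 100)
    return sh, nsh


sh, nsh = text_score(SH_VALUES)
```

The arctangent grows quickly for small SH and flattens for large SH, so the normalized score of a text with non-negative SH stays below 100.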

Example two:

The infinitely long curve is an abstraction of the universe: one end is connected to the infinite past, the other end is connected to the infinite future, and in between there are only irregular, lifeless random fluctuations.

Glossary of words [ ' Infinite ', ' Long ', ' curved ', ' just ', ' cosmic ', ' Abstract ', ' connected ', ' Infinite ', ' past ', ' another ', ' connected ', ' infinite ', ' in ' future ', ' intermediate ', ' only irregular ', ' no ', ' Life ', ' random ', ' fluctuating ' ]

List of binary phrases [ ' Infinite + Long ', ' Curve + is ', ' is + universe ', ' is + abstract ', ' is even + is ', ' is unlimited ', ' is even + is ', ' is even ', ' is unlimited ', ' is future ', ' is even ', ' is not

The phrase list of words of the three phrases [ 'infinite + long +', 'long + curve', 'curve + is + universe', 'universe + is + abstraction', 'link + infinite', 'link + future', 'infinite', 'link + future', 'intermediate + only + irregular', 'only irregular + no', 'irregular + no + life', 'no life + life', 'random + random', 'random + heave' ]

The probability of occurrence of each phrase, its information content, the coverage correction of the information content, and the corrected information content are then calculated for the binary phrases and the ternary phrases respectively.

p_word1_word2_word3(infinite + long + of)＝2.1658926488303964e-07

p_word1_word2(infinite + long)＝1.7184988843505244e-07

p_phrase(infinite + long + of)＝1.2603398632108838

h_phrase(infinite + long + of)＝-0.33381282329128703

S_index(infinite + long + of)＝1.0

Sh(infinite + long + of)＝-0.33381282329128703

p_word1_word2_word3(of + random + fluctuation)＝2.1658926488303964e-07

p_word1_word2(of + random)＝1.5466489959154718e-06

p_phrase(of + random + fluctuation)＝0.14003776257898712

h_phrase(of + random + fluctuation)＝2.836112178151025

S_index(of + random + fluctuation)＝1.0

Sh(of + random + fluctuation)＝2.836112178151025

The calculations for the other phrases are similar to the process described above and are not listed here. Finally, the information of all phrases is integrated and the target text is scored; the scoring result is as follows:

SH＝3.6597203716282865

NSH＝83.01919915574385

U_index＝3.83333333333335

R_index＝1.15

D_index＝1.25

Y＝83.01919915574385

Generally speaking, the method provided by the invention calculates the information content carried by a text based on its vocabulary and the order in which the vocabulary is organized. On this basis, it determines the propagation potential of each phrase in the community from the coverage of the phrase in the community, obtains the corrected information content of the phrase from its information content and propagation potential, and obtains the information content score of the target text from the corrected information contents of all phrases contained in the target text. In other words, the method evaluates the value of a text in a community along two dimensions, information content and propagation potential, which makes the calculation result more accurate, helps mine the valuable information in the text, and improves the evaluation effect.

Embodiment two

Based on the same inventive concept, the present embodiment provides an apparatus for evaluating the value of a text in a community. Referring to fig. 9, the apparatus includes:

a corpus construction module 201, configured to collect all corpus texts in a community, construct a corpus, and preprocess the corpus texts, taking x words linked in sequence as a phrase, wherein all words form a vocabulary database and all phrases form a phrase database, and x is a positive integer greater than or equal to 2;

the target text preprocessing module 202 is used for preprocessing a target text (T), taking x words which are linked in sequence as phrases, integrating the target text (T) into a corpus, and updating a vocabulary database and a phrase database;

a phrase occurrence probability calculation module 203, configured to calculate a probability that a phrase included in the target text (T) appears in the updated phrase database;

the phrase information amount calculation module 204 is configured to calculate the information content of each phrase according to the probability of occurrence of the phrase in the target text (T), specifically: h(phrase)＝-log₂p(phrase), wherein p(phrase) represents the probability of occurrence of the phrase and h(phrase) represents the information content of the phrase;

the phrase propagation information determining module 205 is configured to determine a propagation potential of the phrase in the community according to a coverage of the phrase in the community, where the coverage is inversely proportional to the propagation potential;

and the scoring module 206 is configured to obtain the corrected information amount of the phrase according to the information amount and the propagation potential of the phrase, and obtain the information amount score of the target text according to the corrected information amounts of all phrases contained in the target text.

In one embodiment, the apparatus further comprises a normalization processing module configured to, after the information content score of the target text is obtained, normalize the score as follows:

NSH(T)＝arctan(SH(T))*200/π,

where SH(T) represents the raw information content score, and NSH(T) represents the information content score after normalization.

In an embodiment, in a phrase formed by x words linked in sequence, the words are designated the 1st word, the 2nd word, …, the x-th word in order of appearance, and the phrase occurrence probability calculation module 203 is specifically configured to calculate:

p(phrase)＝p(word2|word1)＝p(word1∩word2)/p(word1)

p(phrase)＝p(wordx|word1_x-1)＝p(word1...wordx)/p(word1_x-1)

wherein p(phrase) represents the probability of occurrence of the phrase, word1 represents the 1st word, wordx represents the x-th word, p(word1_x-1) represents the probability that the 1st to (x-1)-th words occur together, and p(word1...wordx) represents the probability that the 1st to x-th words occur together.
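The probability formula above can be sketched in Python for phrases of any length x. The sketch assumes that each probability is estimated as a relative frequency within the table of n-grams of the same order (words among all words, binary phrases among all binary phrases, and so on), and that phrases do not cross sentence boundaries; these estimation details, the function names and the toy corpus are assumptions for illustration.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count all runs of n consecutive words inside each sentence."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def phrase_probability(phrase, sentences):
    """p(phrase) = p(word1...wordx) / p(word1...word(x-1))."""
    phrase = tuple(phrase)
    full = ngram_counts(sentences, len(phrase))
    prefix = ngram_counts(sentences, len(phrase) - 1)
    if full[phrase] == 0:
        # Cannot happen once the target text has been merged into the corpus.
        raise ValueError("phrase does not occur in the corpus")
    p_full = full[phrase] / sum(full.values())
    p_prefix = prefix[phrase[:-1]] / sum(prefix.values())
    return p_full / p_prefix


# Hypothetical segmented corpus of three sentences.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
p_bc = phrase_probability(["b", "c"], corpus)
```

In this toy corpus the binary phrase "b + c" is one of five binary phrases and "b" is two of eight words, so the sketch returns 0.2 / 0.25.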

In one embodiment, the phrase propagation information determining module 205 is specifically configured to:

S_index＝log_N(N/n)
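The coverage correction can be illustrated with a short Python sketch. It assumes that N is the total number of text units in the community, that n is the number of text units containing the phrase, and that the logarithm is taken to base N, which is consistent with the S_index values between 0 and 1 reported in the examples. The function and parameter names are hypothetical.

```python
import math


def coverage_index(total_units, units_with_phrase):
    """S_index = log_N(N / n), with N = total_units and n = units_with_phrase."""
    if total_units < 2:
        raise ValueError("a logarithm to base N needs N >= 2")
    if not 1 <= units_with_phrase <= total_units:
        raise ValueError("n must satisfy 1 <= n <= N")
    return math.log(total_units / units_with_phrase, total_units)


# Hypothetical community of 1000 text units.
rare = coverage_index(1000, 1)        # phrase found in a single unit
common = coverage_index(1000, 1000)   # phrase found in every unit
middle = coverage_index(1000, 10)     # phrase found in ten units
```

A phrase that occurs in only one text unit keeps its full information content, while a phrase that already occurs in every text unit contributes nothing, which matches the statement that coverage and propagation potential move in opposite directions.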

In one embodiment, the apparatus further includes a messy code determination module configured to:

carrying out messy code judgment on the target text (T);

In one embodiment, the apparatus further comprises a repetitive content determination module configured to:

judging the repetitive content of the target text (T);

In one embodiment, the apparatus further includes a preset expressive word detection module configured to:

detecting whether the target text (T) uses a preset expression word;

Since the apparatus described in the second embodiment of the present invention is a system for implementing the method for evaluating the value of the text in the community in the first embodiment of the present invention, those skilled in the art can understand the specific structure and modification of the system based on the method described in the first embodiment of the present invention, and thus the detailed description thereof is omitted here. All systems adopted by the method of the first embodiment of the present invention are within the intended protection scope of the present invention.

Embodiment three

Referring to fig. 10, based on the same inventive concept, the present application further provides a computer-readable storage medium 300, on which a computer program 311 is stored, which when executed implements the method according to the first embodiment.

Since the computer-readable storage medium described in the third embodiment of the present invention is a computer-readable storage medium used for implementing the method for evaluating the value of the text in the community in the first embodiment of the present invention, those skilled in the art can understand the specific structure and variations of the computer-readable storage medium based on the method described in the first embodiment, and thus details are not described herein again. Any computer-readable storage medium used in the method of the first embodiment of the present invention is within the scope of the present invention.

Embodiment four

Based on the same inventive concept, the present application further provides a computer device. Referring to fig. 11, the computer device includes a memory 401, a processor 402, and a computer program 403 stored in the memory and executable on the processor; when the processor 402 executes the program, the method in the first embodiment is implemented.

Since the computer device described in the fourth embodiment of the present invention is a computer device used for implementing the method for evaluating the value of the text in the community in the first embodiment of the present invention, those skilled in the art can understand the specific structure and variations of the computer device based on the method described in the first embodiment, and thus the detailed description thereof is omitted. All computer devices used in the method of the first embodiment of the present invention are within the scope of the present invention.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims

1. A method for evaluating a text value in a community, comprising:

2. The method of claim 1, wherein after obtaining the raw information content score for the target text, the method further comprises:

NSH(T)＝arctan(SH(T))*200/π,

where SH(T) represents the raw information content score and NSH(T) represents the information content score.

3. The method according to claim 1, wherein, in a phrase consisting of x words linked in sequence, the words are designated the 1st word, the 2nd word, …, the x-th word in order of occurrence, and calculating the probability of occurrence of a phrase contained in the target text (T) comprises:

p(phrase)＝p(word2|word1)＝p(word1∩word2)/p(word1)

p(phrase)＝p(wordx|word1_x-1)＝p(word1...wordx)/p(word1_x-1)

4. The method of claim 1, wherein determining the propagation potential of the phrase within the community based on the coverage of the phrase within the community comprises:

S_index＝log_N(N/n)

5. The method of claim 1, wherein the method further comprises:

carrying out messy code judgment on the target text (T);

6. The method of claim 1, wherein the method further comprises:

judging the repetitive content of the target text (T);

7. The method of claim 1, wherein the method further comprises:

detecting whether the target text (T) uses a preset expression word;

8. An apparatus for evaluating a value of a text in a community, comprising:

9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed, implements the method of any one of claims 1 to 7.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the program.