CN112417088A - Evaluation method and device for text value in community - Google Patents

Evaluation method and device for text value in community

Info

Publication number
CN112417088A
Authority
CN
China
Prior art keywords
phrase
probability
phrases
target text
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910763287.6A
Other languages
Chinese (zh)
Other versions
CN112417088B (en)
Inventor
刘垚
邹更
任钰欣
黄梓杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Yujianwan Technology Co ltd
Original Assignee
Wuhan Yujianwan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Yujianwan Technology Co ltd
Priority to CN201910763287.6A
Publication of CN112417088A
Application granted
Publication of CN112417088B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for evaluating the text value in a community, which comprises the following steps: collecting all corpus texts in a community, constructing a corpus, and preprocessing the corpus texts; preprocessing a target text, taking x words which are linked in sequence as phrases, integrating the target text into a corpus, and updating a vocabulary database and a phrase database; calculating the probability of the phrases contained in the target text appearing in the updated phrase database; calculating the information content of each phrase according to the probability of the occurrence of the phrase in the target text; determining the propagation potential of the phrases in the community according to the coverage of the phrases in the community, wherein the coverage is inversely proportional to the propagation potential; and obtaining the corrected information amount of the phrases according to the information amount and the propagation potential of the phrases, and obtaining the information amount score of the target text according to the corrected information amounts of all the phrases contained in the target text. The method of the invention can improve the accuracy of the scoring and improve the evaluation effect.

Description

Evaluation method and device for text value in community
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for evaluating text value in a community.
Background
With the rapid development of the internet era, information networks among people are increasingly compact, people are gathered into intangible communities by different types of internet products, and information transmission is the most important subject in the internet communities.
In the prior art, the evaluation of the text value in the community, which is commonly used, mainly depends on the feedback of the users in the community. The value evaluation of the text content is formed through the feedback of the user and serves as an important basis for popularization and control of the text. In addition, for quality evaluation of text information, a machine learning-based method is commonly used at present to construct a text classification model through a high-quality text training set labeled manually, or to evaluate the quality of a text with respect to the number of language components, such as expressions and metaphors, which can represent the quality of the text.
The inventor of the present application finds that the method of the prior art has at least the following technical problems in the process of implementing the present invention:
Prior-art text evaluation systems that rely on user feedback suffer from lag and from an unavoidable time-accumulation effect. Because of the lag, information cannot be evaluated and controlled until it has already spread to some extent, and users themselves must screen and evaluate high-quality information, which raises their reading cost. The time-accumulation effect lets information that appears first keep accumulating a propagation advantage and occupy the users' information channels, so that later high-quality information is blocked; as a result, high-quality information is hard to expose effectively, and the information users receive becomes homogeneous. As for quality evaluation of text information, existing methods evaluate each text only in isolation, which leads to a poor evaluation effect.
Therefore, the method in the prior art has the technical problem of poor evaluation effect.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for evaluating a value of a text in a community, so as to solve or at least partially solve the technical problem of poor evaluation effect of the method in the prior art.
The invention provides a method for evaluating the text value in a community, which comprises the following steps:
collecting all corpus texts in a community, constructing a corpus, preprocessing the corpus texts, and taking x words which are linked in sequence as a phrase, wherein all the words form a vocabulary database and all the phrases form a phrase database, and x is a positive integer greater than or equal to 2;
preprocessing a target text (T), taking x words which are linked in sequence as phrases, integrating the target text (T) into a corpus, and updating a vocabulary database and a phrase database;
calculating the probability of the appearance of the phrases contained in the target text (T) in the updated phrase database;
calculating the information content of each phrase according to the probability of occurrence of the phrase in the target text (T), specifically: $h(\text{phrase}) = -\log_2 p(\text{phrase})$, wherein $p(\text{phrase})$ represents the probability of occurrence of the phrase and $h(\text{phrase})$ represents the information content of the phrase;
determining the propagation potential of the phrases in the community according to the coverage of the phrases in the community, wherein the coverage is inversely proportional to the propagation potential;
and obtaining the corrected information amount of the phrases according to the information amount and the propagation potential of the phrases, and obtaining the original information amount score of the target text according to the corrected information amounts of all the phrases contained in the target text.
In one embodiment, after obtaining the raw information content score for the target text, the method further comprises:
carrying out normalization processing on the raw information amount score to obtain an information amount score whose value range is [0, 100), wherein the normalization is:
$NSH(T) = \arctan(SH(T)) \cdot 200 / \pi$,
where $SH(T)$ represents the raw information amount score and $NSH(T)$ represents the information amount score.
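The normalization above can be sketched as follows. This is a minimal illustration that assumes the raw score SH(T) is non-negative; the function name `normalize_score` is hypothetical and does not come from the patent.

```python
import math

def normalize_score(raw_score: float) -> float:
    """Map a non-negative raw score SH(T) to NSH(T) = arctan(SH(T)) * 200 / pi."""
    if raw_score < 0:
        # Assumption: the raw score is treated as non-negative so the result stays in [0, 100).
        raise ValueError("raw_score is assumed to be non-negative")
    return math.atan(raw_score) * 200 / math.pi

# A raw score of 0 maps to 0; larger raw scores approach, but mathematically never reach, 100.
for raw in (0.0, 0.5, 1.0, 10.0):
    print(raw, normalize_score(raw))
```

Because arctan saturates, differences between very large raw scores are compressed; in floating point, an extremely large raw score may round to 100.0.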
In one embodiment, the words of a phrase formed by x words linked in sequence are called, in order of appearance, the 1st word, the 2nd word, ..., and the x-th word, and calculating the probability of occurrence of the phrases contained in the target text (T) comprises:
for each phrase in the target text (T), counting the frequency of the x-tuple (word1, ..., wordx) formed by its 1st to x-th words in the updated phrase database, and dividing that frequency by the total number of phrases in the phrase database to obtain the probability p(word1, ..., wordx) that the x words occur together;
when x is 2, namely the phrase comprises two words, counting the frequency of the 1st word in the vocabulary database and dividing it by the total number of words in the vocabulary database to obtain the probability of the 1st word; then calculating, according to the conditional probability formula, the probability that the 2nd word occurs given that the 1st word occurs, which is the probability of occurrence of the phrase:
$p(\text{phrase}) = p(\text{word}_2 \mid \text{word}_1) = p(\text{word}_1 \cap \text{word}_2) / p(\text{word}_1)$
wherein $p(\text{phrase})$ represents the probability of occurrence of the phrase, $\text{word}_1$ represents the 1st word, $\text{word}_2$ represents the 2nd word, $p(\text{word}_1)$ represents the probability of occurrence of the 1st word, and $p(\text{word}_1 \cap \text{word}_2)$ represents the probability that the 1st word and the 2nd word occur together;
when x is greater than 2, namely the phrase comprises more than two words, counting the frequency of the 1st to (x-1)-th words of the phrase (word1_x-1) in the corresponding (x-1)-element phrase database and dividing it by the total number of phrases in that phrase database to obtain the probability of the 1st to (x-1)-th words; then calculating the probability that the x-th word occurs given that the 1st to (x-1)-th words occur, which is the probability of occurrence of the phrase:
$p(\text{phrase}) = p(\text{word}_x \mid \text{word}_{1..x-1}) = p(\text{word}_1, \ldots, \text{word}_x) / p(\text{word}_{1..x-1})$
wherein $p(\text{phrase})$ represents the probability of occurrence of the phrase, $\text{word}_1$ represents the 1st word, $\text{word}_x$ represents the x-th word, $p(\text{word}_{1..x-1})$ represents the probability of occurrence of the 1st to (x-1)-th words, and $p(\text{word}_1, \ldots, \text{word}_x)$ represents the probability that the x words occur together.
In one embodiment, determining the propagation potential of the phrases in the communities according to the coverage of the phrases in the communities comprises:
acquiring the total number of text units in a community and the number of text units containing phrases phrase, wherein the text units represent each individual in the community and all text corpora corresponding to the individual;
calculating a coverage correction parameter of the information quantity of the single phrase according to the following formula, and taking the coverage correction parameter as the propagation potential of the phrase in the community:
$S_{index} = \log_N(N/n)$
wherein S_index represents the coverage correction parameter of the phrase information amount, N represents the total number of text units, and n represents the number of text units containing the phrase.
In one embodiment, the method further comprises:
performing garbled-text detection on the target text (T);
and correcting the information amount score of the target text (T) according to the result of the garbled-text detection.
In one embodiment, the method further comprises:
judging the repetitive content of the target text (T);
and correcting the information content score of the target text (T) according to the result of the repetitive content judgment.
In one embodiment, the method further comprises:
detecting whether the target text (T) uses a preset expression word;
and correcting the information content score of the target text (T) according to the detection result.
Based on the same inventive concept, a second aspect of the present invention provides an apparatus for evaluating a text value in a community, comprising:
the corpus construction module is used for collecting all corpus texts in a community, constructing a corpus, preprocessing the corpus texts, and taking x words which are linked in sequence as a phrase, wherein all the words form a vocabulary database and all the phrases form a phrase database, and x is a positive integer greater than or equal to 2;
the target text preprocessing module is used for preprocessing a target text (T), taking x words which are linked in sequence as phrases, integrating the target text (T) into a corpus and updating a vocabulary database and a phrase database;
the phrase occurrence probability calculation module is used for calculating the probability of the appearance of the phrases contained in the target text (T) in the updated phrase database;
the phrase information amount calculation module is used for calculating the information content of each phrase according to the probability of occurrence of the phrase in the target text (T), specifically: $h(\text{phrase}) = -\log_2 p(\text{phrase})$, wherein $p(\text{phrase})$ represents the probability of occurrence of the phrase and $h(\text{phrase})$ represents the information content of the phrase;
the phrase propagation information determining module is used for determining the propagation potential of a phrase in a community according to the coverage of the phrase in the community, wherein the coverage is inversely proportional to the propagation potential;
and the scoring module is used for obtaining the corrected information quantity of the phrases according to the information quantity and the propagation potential of the phrases, and obtaining the original information quantity score of the target text according to the corrected information quantities of all the phrases contained in the target text.
Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the method of the first aspect.
Based on the same inventive concept, a fourth aspect of the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
The invention provides a method for evaluating text value in a community. First, all corpus texts in a community are collected, a corpus is constructed, the corpus texts are preprocessed, and x words which are linked in sequence are taken as a phrase, wherein all the words form a vocabulary database and all the phrases form a phrase database. The target text (T) is then preprocessed, x words which are linked in sequence are taken as a phrase, the target text (T) is integrated into the corpus, and the vocabulary database and the phrase database are updated. Next, the probability of occurrence of the phrases contained in the target text (T) is calculated, and the information content of each phrase is calculated from that probability. The propagation potential of each phrase in the community is then determined according to the coverage of the phrase in the community, wherein the coverage is inversely proportional to the propagation potential. Finally, the corrected information amount of each phrase is obtained from its information amount and propagation potential, and the information amount score of the target text is obtained from the corrected information amounts of all the phrases contained in the target text.
According to the method provided by the invention, the information quantity borne by the text is calculated based on the vocabulary and the organization sequence of the vocabulary, the propagation potential of the word group in the community is further determined according to the coverage degree of the word group in the community on the basis of the information quantity, then the corrected information quantity of the word group is obtained according to the information quantity and the propagation potential of the word group, and the information quantity score of the target text is obtained according to the corrected information quantity of all the word groups contained in the target text, namely, the text value in one community can be evaluated based on two dimensions of the information quantity and the propagation potential, so that the calculation result is more accurate, the valuable information of the text can be more favorably mined, and the evaluation effect is improved.
Furthermore, the invention also performs garbled-text detection on the target text (T) and corrects the information amount score of the target text (T) according to the detection result, thereby further improving the evaluation effect.
Further, the invention also judges the repetitive content of the target text (T); and the information content score of the target text (T) is corrected according to the result of the repetitive content judgment, so that the evaluation effect is improved.
Further, the invention also detects whether the target text (T) uses the preset expression words; and the information content score of the target text (T) is corrected according to the detection result, so that the evaluation effect is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart illustrating a method for evaluating the value of a text in a community according to the present invention;
FIG. 2 is a flow diagram illustrating the preprocessing of corpus text within a community in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating a process for preprocessing a target text according to an embodiment;
FIG. 4 is a flow diagram illustrating an implementation of evaluating text value in an exemplary embodiment;
FIG. 5 is a flow chart of the calculation of the garbled-text index (U_index) in an embodiment;
FIG. 6 is a flow chart of the calculation of the repetition index (R_index) in an embodiment;
FIG. 7 is a flow chart of the calculation of the term richness index (D_index) in an embodiment;
FIG. 8 is a flow diagram illustrating the modification of text quality scoring in accordance with an exemplary embodiment;
FIG. 9 is a block diagram of an apparatus for evaluating the value of a text in a community according to an embodiment of the present invention;
FIG. 10 is a block diagram of a computer-readable storage medium according to an embodiment of the present invention;
FIG. 11 is a block diagram of a computer device in an embodiment of the present invention.
Detailed Description
The invention aims to provide a method for evaluating text value in a community, aiming at the technical problem that the evaluation effect is poor due to the fact that the text quality is evaluated only in an isolated mode by the method in the prior art, and the text value is evaluated from two dimensions of information quantity and propagation potential so as to achieve the purposes of improving the evaluation accuracy and the evaluation effect.
In order to achieve the above purpose, the main concept of the invention is as follows:
the method comprises the steps of calculating the information quantity borne by a text based on vocabularies and the organization sequence of the vocabularies, further determining the propagation potential of the word group in the community according to the coverage degree of the word group in the community on the basis of the information quantity, then obtaining the corrected information quantity of the word group according to the information quantity and the propagation potential of the word group, and obtaining the information quantity score of a target text according to the corrected information quantity of all the word groups contained in the target text.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The present embodiment provides a method for evaluating the value of a text in a community, please refer to fig. 1, the method includes:
Step S1: collecting all corpus texts in a community, constructing a corpus, preprocessing the corpus texts, and taking x words which are linked in sequence as a phrase, wherein all the words form a vocabulary database and all the phrases form a phrase database, and x is a positive integer greater than or equal to 2.
Specifically, a community may represent a network of relationships. Please refer to fig. 2, which is a flowchart of an implementation of step S1. Preprocessing the corpus texts includes: replacing the punctuation marks in each corpus text with line feeds to break the text into sentences, and then performing word segmentation on the sentences to obtain individual words. Two or more words appearing in sequence are taken as a phrase. Depending on the value of x, the phrase database includes a binary phrase database, a ternary phrase database, and so on. Neither the vocabulary database nor the phrase database is de-duplicated. Each individual in the community, together with all of that individual's text corpora, is called a text unit; the number of text units in which each phrase appears is counted and stored in the phrase database.
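The preprocessing of step S1 can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: the punctuation set is chosen for the example, whitespace splitting stands in for a real word segmenter, phrases are assumed not to cross sentence boundaries, and all function names are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: an illustrative punctuation set; the patent does not list one.
PUNCTUATION = r"[,.;:!?，。；：！？、]"

def split_sentences(text: str) -> List[str]:
    """Replace punctuation marks with line feeds, then break the text into sentences."""
    replaced = re.sub(PUNCTUATION, "\n", text)
    return [line.strip() for line in replaced.split("\n") if line.strip()]

def segment(sentence: str) -> List[str]:
    """Stand-in word segmenter: split on whitespace (a real system would use a proper segmenter)."""
    return sentence.split()

def ngrams(words: List[str], x: int) -> List[Tuple[str, ...]]:
    """Return every run of x consecutive words; duplicates are kept (no de-duplication)."""
    return [tuple(words[i:i + x]) for i in range(len(words) - x + 1)]

def preprocess(text: str, x: int = 2) -> Tuple[List[str], List[Tuple[str, ...]]]:
    """Return the word list and the x-word phrase list for one text."""
    words: List[str] = []
    phrases: List[Tuple[str, ...]] = []
    for sentence in split_sentences(text):
        tokens = segment(sentence)
        words.extend(tokens)
        phrases.extend(ngrams(tokens, x))
    return words, phrases

words, phrases = preprocess("the cat sat. the cat ran", x=2)
print(words)
print(phrases)
```

Keeping duplicates in both lists matches the statement that the databases are not de-duplicated, so that frequencies can be counted later.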
Step S2: preprocessing the target text (T), taking x words which are linked in sequence as phrases, integrating the target text (T) into a corpus, and updating a vocabulary database and a phrase database.
Specifically, referring to fig. 3, which is a flowchart of a specific implementation of step S2, the process of preprocessing the target text (T) is similar to the preprocessing of the text, including punctuation replacement, sentence segmentation and word segmentation.
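One possible way to hold the vocabulary and phrase databases and to integrate a target text into them is sketched here. The class `CountDatabase`, its methods, and the unit identifiers are hypothetical; the sketch only assumes that each entry needs an occurrence count and the number of text units in which it appears, as in Tables 1 and 2.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Set

class CountDatabase:
    """Occurrence counts plus, per item, the set of text units in which it appears."""

    def __init__(self) -> None:
        self.occurrences: Counter = Counter()
        self.units: Dict[Hashable, Set[str]] = defaultdict(set)

    def add(self, items: Iterable[Hashable], unit_id: str) -> None:
        """Integrate the items of one text, attributed to the text unit unit_id."""
        for item in items:
            self.occurrences[item] += 1
            self.units[item].add(unit_id)

    def total(self) -> int:
        """Total number of entries, counting duplicates."""
        return sum(self.occurrences.values())

    def unit_count(self, item: Hashable) -> int:
        """Number of text units that contain the item."""
        return len(self.units.get(item, ()))

phrase_db = CountDatabase()
phrase_db.add([("the", "cat"), ("cat", "sat")], unit_id="user_1")  # existing corpus text
phrase_db.add([("the", "cat"), ("cat", "ran")], unit_id="user_2")  # target text (T) integrated
print(phrase_db.occurrences[("the", "cat")], phrase_db.unit_count(("the", "cat")))
```

Integrating the target text before any probability is computed means that every phrase of the target text is present in the database at least once.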
Step S3: and calculating the probability of the appearance of the phrases contained in the target text (T) in the updated phrase database.
Specifically, the phrase included in the target text (T) may be a binary phrase, a ternary phrase, or an x-element phrase. The probability of occurrence of each word collocation (namely, each phrase) is calculated here so that its information content can be calculated in the subsequent steps. The inventors of the present application found through extensive practice and research that information is carried by words and by the order in which words are organized; because the number of words is limited, the order in which words are organized is an important feature of the information carried. The invention therefore considers that a user obtains new information when the user reads a new organized sequence of words.
In one embodiment, the words of a phrase formed by x words linked in sequence are called, in order of appearance, the 1st word, the 2nd word, ..., and the x-th word, and calculating the probability of occurrence of the phrases contained in the target text (T) comprises:
for each phrase in the target text (T), counting the frequency of the x-tuple (word1, ..., wordx) formed by its 1st to x-th words in the updated phrase database, and dividing that frequency by the total number of phrases in the phrase database to obtain the probability p(word1, ..., wordx) that the x words occur together;
when x is 2, namely the phrase comprises two words, counting the frequency of the 1st word in the vocabulary database and dividing it by the total number of words in the vocabulary database to obtain the probability of the 1st word; then calculating, according to the conditional probability formula, the probability that the 2nd word occurs given that the 1st word occurs, which is the probability of occurrence of the phrase:
$p(\text{phrase}) = p(\text{word}_2 \mid \text{word}_1) = p(\text{word}_1 \cap \text{word}_2) / p(\text{word}_1)$
wherein $p(\text{phrase})$ represents the probability of occurrence of the phrase, $\text{word}_1$ represents the 1st word, $\text{word}_2$ represents the 2nd word, $p(\text{word}_1)$ represents the probability of occurrence of the 1st word, and $p(\text{word}_1 \cap \text{word}_2)$ represents the probability that the 1st word and the 2nd word occur together;
when x is greater than 2, namely the phrase comprises more than two words, counting the frequency of the 1st to (x-1)-th words of the phrase (word1_x-1) in the corresponding (x-1)-element phrase database and dividing it by the total number of phrases in that phrase database to obtain the probability of the 1st to (x-1)-th words; then calculating the probability that the x-th word occurs given that the 1st to (x-1)-th words occur, which is the probability of occurrence of the phrase:
$p(\text{phrase}) = p(\text{word}_x \mid \text{word}_{1..x-1}) = p(\text{word}_1, \ldots, \text{word}_x) / p(\text{word}_{1..x-1})$
wherein $p(\text{phrase})$ represents the probability of occurrence of the phrase, $\text{word}_1$ represents the 1st word, $\text{word}_x$ represents the x-th word, $p(\text{word}_{1..x-1})$ represents the probability of occurrence of the 1st to (x-1)-th words, and $p(\text{word}_1, \ldots, \text{word}_x)$ represents the probability that the x words occur together.
Specifically, the phrase formed by the 1st to x-th words may be a binary phrase, a ternary phrase, an n-gram phrase, or the like. Tables 1 to 3 show examples of the vocabulary database, the binary phrase database, and the ternary phrase database.
TABLE 1 Vocabulary database
Word                    Number of occurrences   Number of text units
On the upper part       65320                   2139
Topography              304                     150
Complexity of           1823                    717
...                     ...                     ...
TABLE 2 Binary phrase database
Phrase                  Number of occurrences   Number of text units
World + top             4795                    1034
Is complicated and      5                       5
the south-most end of   144                     86
...                     ...                     ...
TABLE 3 ternary phrase database
[Table 3 appears as an image in the original publication; its contents are not reproduced here.]
When x > 2, a phrase includes more than two words. For example, when x = 3, the probability p(word1, ..., wordx) is the probability that three words occur together. The frequency of the pair formed by the 1st and 2nd words in the corresponding binary phrase database is counted and divided by the total number of phrases in the binary phrase database to obtain the probability of that pair; the probability that the 3rd word occurs given that the 1st and 2nd words occur is then calculated according to the conditional probability formula, which gives the probability of the ternary phrase.
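The conditional probability calculation for x = 2 and x > 2 can be sketched with plain counters. The function name `phrase_probability` and the toy counts are hypothetical; the sketch assumes the vocabulary database is keyed by single words and each phrase database by word tuples.

```python
from collections import Counter
from typing import Tuple

def phrase_probability(phrase: Tuple[str, ...],
                       phrase_counts: Counter,
                       prefix_counts: Counter) -> float:
    """Return p(word_x | word_1..word_{x-1}) as joint probability / prefix probability."""
    # Joint probability: frequency of the x-tuple over all entries of the x-element database.
    p_joint = phrase_counts[phrase] / sum(phrase_counts.values())
    # For x = 2 the prefix is a single word looked up in the vocabulary database;
    # for x > 2 it is an (x-1)-tuple looked up in the (x-1)-element phrase database.
    prefix = phrase[:-1]
    key = prefix[0] if len(prefix) == 1 else prefix
    if prefix_counts[key] == 0:
        raise ValueError("prefix not found; integrate the target text first")
    p_prefix = prefix_counts[key] / sum(prefix_counts.values())
    return p_joint / p_prefix

# Toy databases built from the two sentences "the cat sat" and "the cat ran".
vocabulary = Counter({"the": 2, "cat": 2, "sat": 1, "ran": 1})
bigrams = Counter({("the", "cat"): 2, ("cat", "sat"): 1, ("cat", "ran"): 1})
trigrams = Counter({("the", "cat", "sat"): 1, ("the", "cat", "ran"): 1})

print(phrase_probability(("cat", "sat"), bigrams, vocabulary))       # x = 2
print(phrase_probability(("the", "cat", "sat"), trigrams, bigrams))  # x = 3
```

Because the joint probability and the prefix probability are normalized by different totals, the quotient is not strictly bounded by 1 on very small corpora; the sketch follows the quotient as described and does not clamp it.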
Step S4: calculating the information content of each phrase according to the probability of occurrence of the phrase in the target text (T), specifically: $h(\text{phrase}) = -\log_2 p(\text{phrase})$, wherein $p(\text{phrase})$ represents the probability of occurrence of the phrase and $h(\text{phrase})$ represents the information content of the phrase.
Specifically, the information amount that each phrase brings to the reader can be calculated according to the information amount formula. Since the target text (T) has been integrated into the vocabulary database and the phrase database in step S2, every phrase appears at least once, so its probability is nonzero. The larger the information content of a phrase, the rarer the combination and the more novel the word collocation, and the more information it provides to readers; conversely, the smaller the information content, the more common the combination and the more fixed the word collocation, and the less information it provides to readers.
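As a small illustration of step S4, the sketch below assumes the standard self-information form, h = -log2 p, measured in bits; the function name `information_content` is hypothetical.

```python
import math

def information_content(p: float) -> float:
    """Return h(phrase) = -log2 p(phrase), in bits."""
    if p <= 0:
        # Cannot happen once the target text has been integrated, since every phrase then occurs.
        raise ValueError("probability must be positive")
    return -math.log2(p)

# Rarer phrases (smaller p) carry more information than common ones.
for p in (0.5, 0.25, 0.001):
    print(p, information_content(p))
```

Halving the probability of a phrase adds exactly one bit to its information content.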
Step S5: determining the propagation potential of the phrases in the communities according to the coverage of the phrases in the communities, wherein the coverage is inversely proportional to the propagation potential.
Specifically, the inventors of the present application found through a great deal of practice that: in the prior art, a machine learning-based method constructs a text classification model through a high-quality text training set labeled manually, or evaluates the quality of a text according to the number of language components such as expressions and metaphors which can represent the quality of the text. However, these methods are used for evaluating the quality of the text in isolation, and do not consider the propagation condition of the text in the community, and cannot determine whether the text brings new information to the user, and further cannot determine the propagation potential of the text in the community.
However, within a community, the information value of a text lies not only in its information content but also in its propagation value. The propagation value of information is measured by its coverage in the community: the larger the coverage, the less propagation value the text has left in the community; the smaller the coverage, the more propagation value remains.
In one embodiment, determining the propagation potential of the phrases in the communities according to the coverage of the phrases in the communities comprises:
acquiring the total number of text units in a community and the number of text units containing phrases phrase, wherein the text units represent each individual in the community and all text corpora corresponding to the individual;
calculating a coverage correction parameter of the information quantity of the single phrase according to the following formula, and taking the coverage correction parameter as the propagation potential of the phrase in the community:
S_index=logN(N/n)
wherein, S _ index represents a coverage correction parameter of the phrase information amount, N represents the total number of text units, and N represents the number of text units containing the phrase.
Specifically, suppose one phrase appears in 5 text units, 100 times in each, and another phrase appears in 500 text units, once in each. Although the calculated information amounts are the same, the residual propagation value for the whole community differs. Therefore, the information amount h(phrase) needs to be corrected according to the propagation capability of the phrase. For example, if the total number of text units in the community is N and the number of text units containing the phrase is n, the coverage correction parameter of the information amount of a single phrase is S_index = logN(N/n). When n equals N, S_index equals 0, that is, the phrase has no propagation value. Since the target text has been integrated into the vocabulary database and the phrase database, any phrase appears at least once, so n is at least 1; when n equals 1, S_index equals 1 and the propagation value of the phrase is at its maximum.
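The coverage correction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `coverage_correction` and the community size of 1000 text units are hypothetical and do not come from the description.

```python
import math

def coverage_correction(total_units: int, units_with_phrase: int) -> float:
    """Return S_index = log_N(N / n) for N text units, n of which contain the phrase."""
    if total_units < 2:
        raise ValueError("at least two text units are needed for a base-N logarithm")
    if not 1 <= units_with_phrase <= total_units:
        raise ValueError("n must satisfy 1 <= n <= N")
    return math.log(total_units / units_with_phrase, total_units)

# Hypothetical community of 1000 text units.
narrow = coverage_correction(1000, 5)    # phrase concentrated in 5 text units
broad = coverage_correction(1000, 500)   # phrase spread over 500 text units
print(narrow, broad)
```

A phrase concentrated in few text units keeps a larger correction factor than one that is already widespread, which matches the 5-unit versus 500-unit comparison above.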
Step S6: obtaining the corrected information amount of the phrases according to the information amount and the propagation potential of the phrases, and obtaining the original information amount score of the target text according to the corrected information amounts of all the phrases contained in the target text.
Specifically, the S_index value of the phrase is multiplied by the information amount h(phrase) to obtain the corrected information amount Sh(phrase); the corrected information amounts Sh(phrase) of all phrases obtained by segmenting the target text (T) are then averaged to obtain the original information amount score SH(T) of the text.
In one embodiment, after obtaining the raw information content score for the target text, the method further comprises:
carrying out normalization processing on the original information quantity score to obtain an information quantity score, and controlling the value range of the information quantity score to be [0, 100), wherein the normalization processing mode is as follows:
NSH(T)=arctan(SH(T))*200/π,
where SH(T) represents the original information quantity score and NSH(T) represents the normalized information quantity score.
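A minimal sketch of the normalization step follows. The function name `normalize_score` and the sample raw scores are illustrative assumptions; only the arctan formula itself comes from the text.

```python
import math

def normalize_score(raw_score: float) -> float:
    """Map a raw score SH(T) to NSH(T) = arctan(SH(T)) * 200 / pi."""
    return math.atan(raw_score) * 200 / math.pi

# Sample raw scores (hypothetical) to show how the mapping saturates below 100.
for raw in (0.0, 1.0, 5.0, 50.0):
    print(raw, round(normalize_score(raw), 2))
```

Because arctan approaches π/2 without reaching it, any non-negative raw score is mapped into [0, 100), with diminishing gains as the raw score grows.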
In one embodiment, the method further comprises:
carrying out garbled-text judgment on the target text (T);
and correcting the information content score of the target text (T) according to the garbled-text judgment result.
Specifically, if a text is composed of meaningless garbled characters, the word collocations in it have a very small probability of occurrence, which tends to inflate the information content score NSH(T) of the text. To avoid this, garbled text must be identified. If a phrase (phrase) occurs in fewer than 2 text units (any phrase occurs at least once because the target text (T) has been integrated into the vocabulary database and the phrase database), the phrase is defined as a suspicious phrase (uncartiain_phrase).
In the implementation process, the garbled index (U_index) of the target text (T) can be used for measurement. Please refer to Fig. 5, which is a flowchart of calculating the garbled index (U_index): the total number of phrases N(All_phrase) in the target text (T) is divided by the number of suspicious phrases N(uncartiain_phrase) to obtain the garbled index of the text, that is, U_index(T) = N(All_phrase)/N(uncartiain_phrase).
The information content score of the target text (T) may be corrected according to the garbled-text judgment result as follows: when U_index(T) < 2.0, the text is determined to be garbled and Y(T) = 0, where Y(T) represents the corrected score.
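The garbled-text check can be sketched as follows. The function names are hypothetical, and the handling of a text with no suspicious phrases (treated here as an infinite index, hence not garbled) is an assumption, since the description does not cover that case.

```python
import math

def garbled_index(total_phrases: int, suspicious_phrases: int) -> float:
    """U_index(T) = N(All_phrase) / N(uncartiain_phrase)."""
    if suspicious_phrases == 0:
        # Assumption: with no suspicious phrase the text cannot be garbled.
        return math.inf
    return total_phrases / suspicious_phrases

def is_garbled(total_phrases: int, suspicious_phrases: int, threshold: float = 2.0) -> bool:
    """A text is treated as garbled when its U_index falls below the threshold."""
    return garbled_index(total_phrases, suspicious_phrases) < threshold

print(is_garbled(10, 8))   # most phrases occur in a single text unit
print(is_garbled(17, 6))   # only about a third of the phrases are suspicious
```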
In one embodiment, the method further comprises:
judging the repetitive content of the target text (T);
and correcting the information content score of the target text (T) according to the result of the repetitive content judgment.
Specifically, if the same content appears repeatedly in the target text (T), the score of the whole text is inflated when the NSH(T) of the repeated content is high. Therefore, this embodiment applies a penalty correction for repeated content.
In a specific implementation, the repetition index (R_index) may be used for measurement. Please refer to Fig. 6, which is a flowchart for calculating the repetition index (R_index). The total number of phrases N(All_phrase) of the target text (T) is divided by the number of non-repeated phrases N(Nr_phrase) of the text to obtain the repetition index of the text, that is, R_index(T) = N(All_phrase)/N(Nr_phrase).
The correction method can be as follows: when R_index(T) > 1.5, a penalty mechanism is started, and the penalty is calculated as follows:
Y(T)=NSH(T)-(R_index(T)-1.5)*10
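The repetition penalty can be illustrated with a short sketch. The function names and the sample phrase list are hypothetical; the threshold 1.5 and the factor 10 are taken from the formula above.

```python
def repetition_index(phrases):
    """R_index(T) = N(All_phrase) / N(Nr_phrase), using distinct phrases as Nr_phrase."""
    if not phrases:
        raise ValueError("the target text must contain at least one phrase")
    return len(phrases) / len(set(phrases))

def apply_repetition_penalty(nsh, r_index, threshold=1.5, weight=10.0):
    """Y(T) = NSH(T) - (R_index(T) - 1.5) * 10 when R_index(T) > 1.5, else NSH(T)."""
    if r_index > threshold:
        return nsh - (r_index - threshold) * weight
    return nsh

# Hypothetical text that repeats the same two phrases three times.
phrases = ["a+b", "b+c"] * 3
r = repetition_index(phrases)
print(r, apply_repetition_penalty(80.0, r))
```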
In one embodiment, the method further comprises:
detecting whether the target text (T) uses a preset expression word;
and correcting the information content score of the target text (T) according to the detection result.
Specifically, if the target text (T) uses rich expressive words (Eword), which include verbs, adjectives, and adverbs, the text is considered to have stronger expressive power. Therefore, a penalty correction is required for text that does not use rich Eword.
In a specific implementation, the expressive word richness index (D_index) of the target text (T) may be used for measurement. Please refer to Fig. 7, which is a flowchart for calculating the expressive word richness index (D_index). All Eword in the target text (T) are marked using a semantic analysis model. The total number of Eword, N(All_Eword), is divided by the number of non-repeated Eword, N(Nr_Eword), to obtain the expressive word richness index of the target text, that is, D_index(T) = N(All_Eword)/N(Nr_Eword).
The correction method can be as follows: when D_index(T) > 3.0, a penalty mechanism is started, and the penalty is calculated as follows:
Y(T)=NSH(T)-(D_index(T)-3.0)*10
It should be noted that, in a specific implementation, the text value score may be corrected by combining one or more of the garbled-text judgment, the repetitive-content judgment, and the expressive-word richness judgment, so as to improve scoring accuracy. Please refer to Fig. 8, which is a flowchart of a text quality score correction method that combines the garbled index (U_index), the repetition index (R_index), and the expressive word richness index (D_index). First, it is determined whether the garbled index is less than 2; if so, the text is garbled and the score is set to 0. Otherwise, it is determined whether the repetition index is greater than 1.5; if so, the score is corrected by the repetition index. After the repetition correction, or when the repetition index is not greater than 1.5, the expressive word richness index is evaluated. Further, when the penalties cause Y(T) < 0, Y(T) is set to 0.
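The combined correction flow of Fig. 8 might be sketched as below. Two points are assumptions rather than statements of the description: the richness penalty is taken to be (D_index - 3.0) * 10, and the repetition and richness penalties are treated as cumulative. The function name `corrected_score` is hypothetical.

```python
def corrected_score(nsh: float, u_index: float, r_index: float, d_index: float) -> float:
    """Apply the garbled, repetition, and richness checks in the order described."""
    if u_index < 2.0:
        return 0.0                          # garbled text scores zero
    score = nsh
    if r_index > 1.5:
        score -= (r_index - 1.5) * 10       # repetition penalty
    if d_index > 3.0:
        score -= (d_index - 3.0) * 10       # richness penalty (assumed to use D_index)
    return max(score, 0.0)                  # the corrected score is floored at zero

print(corrected_score(50.0, 3.0, 2.5, 4.0))
```

Floor handling is done once at the end so that a large penalty from either index cannot produce a negative score.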
To illustrate the implementation flow of the present invention more clearly, a detailed description is given below using several examples. Please refer to Fig. 4, which is an implementation flowchart for evaluating text value.
First, the target text is split into sentences (sentence 1, sentence 2, ..., sentence n), and each sentence is then segmented into words. Taking sentence 1 as an example, segmentation yields the words word1, word2, word3, and so on; taking binary phrases as an example, word1 and word2 form a phrase. The probability of occurrence of the phrase is calculated from the probability of occurrence of word1 and the probability of the two words occurring together, and the information amount h(phrase) that the phrase brings to the reader is then calculated. The coverage correction parameter S_index of the phrase information amount is calculated from the number of books (text units) containing the phrase and the total number of text units in the community. Then, the corrected information amount Sh(phrase) is calculated from h(phrase) and S_index, the corrected information amounts of all phrases in the target text are averaged to obtain the original information amount score of the target text, and normalization finally yields the final information amount score.
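The flow just described can be sketched end to end for binary phrases. Everything in the toy data is hypothetical: the token lists stand in for already segmented text units, and the helper names (`bigrams`, `corrected_information`) are illustrative only.

```python
import math
from collections import Counter

# Hypothetical, already segmented community: one token list per text unit.
text_units = [
    ["climate", "bad", "wind", "strong"],
    ["climate", "bad", "terrain", "complex"],
    ["climate", "mild", "terrain", "flat"],
]
target = ["terrain", "complex", "climate", "bad"]
units = text_units + [target]            # the target text is merged into the corpus

def bigrams(tokens):
    """Binary phrases: every pair of adjacent words."""
    return list(zip(tokens, tokens[1:]))

word_counts = Counter(w for u in units for w in u)                # vocabulary database
phrase_counts = Counter(p for u in units for p in bigrams(u))     # phrase database
total_words = sum(word_counts.values())
total_phrases = sum(phrase_counts.values())

def corrected_information(phrase):
    """Sh(phrase) = S_index * h(phrase) for one binary phrase."""
    p_joint = phrase_counts[phrase] / total_phrases       # p(word1 and word2)
    p_first = word_counts[phrase[0]] / total_words        # p(word1)
    h = -math.log2(p_joint / p_first)                     # information amount
    n = sum(1 for u in units if phrase in bigrams(u))     # text units containing the phrase
    s_index = math.log(len(units) / n, len(units))        # coverage correction
    return s_index * h

scores = [corrected_information(p) for p in bigrams(target)]
raw_score = sum(scores) / len(scores)                     # SH(T)
final_score = math.atan(raw_score) * 200 / math.pi        # NSH(T)
print(round(raw_score, 4), round(final_score, 2))
```

The sketch skips sentence splitting and word segmentation, which in practice would be done by a segmentation tool before counting.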
Example one:
Cape Horn is the southernmost end of South America; the terrain is complex, the climate is severe, strong winds blow throughout the year, and the waves are heavy, so it can be called one of the channels with the worst sea conditions in the world.
Word list: [ 'Cape Horn', 'is', 'South America', 'of', 'southernmost end', 'terrain', 'complex', 'and', 'climate', 'bad', 'strong wind', 'constant', 'rough', 'can be called', 'world', 'up', 'sea state', 'most', 'bad', 'of', 'channel', 'one of' ]
Phrase list: [ 'Cape Horn + is', 'is + South America', 'South America + of', 'of + southernmost end', 'terrain + complex', 'complex + and', 'and + climate', 'climate + bad', 'strong wind + constant', 'can be called + world', 'world + up', 'up + sea state', 'sea state + most', 'most + bad', 'bad + of', 'of + channel', 'channel + one of' ]
The calculation for the phrase "world + up" is as follows:
p_word1_word2(world + up) = 0.00031978839945072423
p_word1(world) = 0.0013908097037316097
p_phrase(world + up) = 0.22992965794868736
h_phrase(world + up) = 2.1207355278488125
S_index(world + up) = 0.09578403120899699
Sh(world + up) = 0.20313259798549937
The calculation for the phrase "complex + and" is as follows:
p_word1_word2(complex + and) = 3.334602705429867e-07
p_word1(complex) = 9.321150288234715e-05
p_phrase(complex + and) = 0.0035774583633082805
h_phrase(complex + and) = 8.126849308597139
S_index(complex + and) = 0.7903415048925304
Sh(complex + and) = 6.422986312591483
The calculation for the phrase "and + climate" is as follows:
p_word1_word2(and + climate) = 6.669205410859734e-08
p_word1(and) = 0.00020938074838354993
p_phrase(and + climate) = 0.00031852046868429773
h_phrase(and + climate) = 11.616326293973737
S_index(and + climate) = 1.0
Sh(and + climate) = 11.616326293973737
The calculation for the phrase "South America + of" is as follows:
p_word1_word2(South America + of) = 1.0670728657375574e-06
p_word1(South America) = 5.828914607014577e-06
p_phrase(South America + of) = 0.18306544831750163
h_phrase(South America + of) = 2.4495685716101083
S_index(South America + of) = 0.647227317405562
Sh(South America + of) = 1.5854276954041848
The calculation for the phrase "terrain + complex" is as follows:
p_word1_word2(terrain + complex) = 1.4005331362805442e-06
p_word1(terrain) = 1.5543772285372205e-05
p_phrase(terrain + complex) = 0.09010252534377035
h_phrase(terrain + complex) = 3.4722886481101916
S_index(terrain + complex) = 0.6309225466580092
Sh(terrain + complex) = 2.190745196597378
The calculation for the phrase "can be called + world" is as follows:
p_word1_word2(can be called + world) = 2.2675298396923095e-06
p_word1(can be called) = 3.737663664673382e-05
p_phrase(can be called + world) = 0.06066703810521857
h_phrase(can be called + world) = 4.0429433121474645
S_index(can be called + world) = 0.5706574375390948
Sh(can be called + world) = 2.3071356706258928
The calculation for the phrase "climate + bad" is as follows:
p_word1_word2(climate + bad) = 3.334602705429867e-07
p_word1(climate) = 5.767557611151266e-05
p_phrase(climate + bad) = 0.005781654783963649
h_phrase(climate + bad) = 7.434301814956798
S_index(climate + bad) = 0.7903415048925304
Sh(climate + bad) = 5.875637284258225
The calculation for the phrase "bad + of" is as follows:
p_word1_word2(bad + of) = 7.536202114271499e-06
p_word1(bad) = 1.4418894027878165e-05
p_phrase(bad + of) = 0.5226615924703139
h_phrase(bad + of) = 0.9360509474289722
S_index(bad + of) = 0.4123786891864524
Sh(bad + of) = 0.3860074627124964
The calculation for the phrase "is + South America" is as follows:
p_word1_word2(is + South America) = 1.5339172444977387e-06
p_word1(is) = 0.015105274288269074
p_phrase(is + South America) = 0.00010154845355499411
h_phrase(is + South America) = 13.265544110033836
S_index(is + South America) = 0.6388200038250708
Sh(is + South America) = 8.47429493911346
The calculation for the phrase "most + bad" is as follows:
p_word1_word2(most + bad) = 1.0670728657375574e-06
p_word1(most) = 0.0021943306953915577
p_phrase(most + bad) = 0.00048628625939498525
h_phrase(most + bad) = 11.00590655248606
S_index(most + bad) = 0.6562148909536591
Sh(most + bad) = 7.222239768185802
The calculation for the phrase "of + southernmost end" is as follows:
p_word1_word2(of + southernmost end) = 9.603655791638017e-06
p_word1(of) = 0.06442218685332284
p_phrase(of + southernmost end) = 0.00014907373159349168
h_phrase(of + southernmost end) = 12.71168631802845
S_index(of + southernmost end) = 0.4197404301194033
Sh(of + southernmost end) = 5.335608682672196
The calculation for the phrase "of + channel" is as follows:
p_word1_word2(of + channel) = 6.002284869773761e-07
p_word1(of) = 0.06442218685332284
p_phrase(of + channel) = 9.31710822459323e-06
h_phrase(of + channel) = 16.71168631802845
S_index(of + channel) = 0.7137716250260632
Sh(of + channel) = 11.928327500144992
The calculation for the phrase "Cape Horn + is" is as follows:
p_word1_word2(Cape Horn + is) = 6.669205410859734e-08
p_word1(Cape Horn) = 4.090466390887423e-07
p_phrase(Cape Horn + is) = 0.1630426649077749
h_phrase(Cape Horn + is) = 2.6166785574453666
S_index(Cape Horn + is) = 1.0
Sh(Cape Horn + is) = 2.6166785574453666
The calculation for the phrase "strong wind + constant" is as follows:
p_word1_word2(strong wind + constant) = 6.669205410859734e-08
p_word1(strong wind) = 1.2271399172662268e-06
p_phrase(strong wind + constant) = 0.05434755496925829
h_phrase(strong wind + constant) = 4.201641058166523
S_index(strong wind + constant) = 1.0
Sh(strong wind + constant) = 4.201641058166523
The calculation for the phrase "up + sea state" is as follows:
p_word1_word2(up + sea state) = 6.669205410859734e-08
p_word1(up) = 0.0033398658081595805
p_phrase(up + sea state) = 1.996848314853336e-05
h_phrase(up + sea state) = 15.611915727894425
S_index(up + sea state) = 1.0
Sh(up + sea state) = 15.611915727894425
The calculation for the phrase "sea state + most" is as follows:
p_word1_word2(sea state + most) = 6.669205410859734e-08
p_word1(sea state) = 1.5339248965827835e-07
p_phrase(sea state + most) = 0.43478043975406633
h_phrase(sea state + most) = 1.2016410581665227
S_index(sea state + most) = 1.0
Sh(sea state + most) = 1.2016410581665227
The calculation for the phrase "channel + one of" is as follows:
p_word1_word2(channel + one of) = 6.669205410859734e-08
p_word1(channel) = 1.4827940666966906e-06
p_phrase(channel + one of) = 0.044977286871110314
h_phrase(channel + one of) = 4.4746595525729385
S_index(channel + one of) = 1.0
Sh(channel + one of) = 4.4746595525729385
After the corrected information amount Sh(phrase) of each phrase is calculated, the values are averaged to obtain SH, which is then normalized to obtain NSH. The garbled-text judgment is then performed; since the text is not garbled (U_index is not less than 2), the repetition index R_index and the richness index D_index are calculated and the penalty correction is applied. Neither index exceeds its threshold (1.5 and 3.0, respectively), so no penalty is deducted and the final score is Y = 88.32469777023694. The higher the score, the higher the value of the text.
SH=5.3914356093241835
NSH=88.32469777023694
U_index=2.83333333333335
R_index=1.4285714285714286
D_index=1.1666666666666667
Y=88.32469777023694
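As a cross-check of Example one, the 17 corrected information amounts listed above can be averaged and normalized directly. The variable names are illustrative; the count of 6 suspicious phrases is read off from the phrases whose S_index equals 1.0, which is an interpretation of the listing rather than a value stated in the text.

```python
import math

# Sh(phrase) values exactly as listed for the 17 phrases of Example one.
sh_values = [
    0.20313259798549937, 6.422986312591483, 11.616326293973737,
    1.5854276954041848, 2.190745196597378, 2.3071356706258928,
    5.875637284258225, 0.3860074627124964, 8.47429493911346,
    7.222239768185802, 5.335608682672196, 11.928327500144992,
    2.6166785574453666, 4.201641058166523, 15.611915727894425,
    1.2016410581665227, 4.4746595525729385,
]

raw = sum(sh_values) / len(sh_values)        # SH: mean corrected information amount
nsh = math.atan(raw) * 200 / math.pi         # NSH: normalized score
suspicious = 6                               # phrases listed with S_index = 1.0 (assumed reading)
u_index = len(sh_values) / suspicious        # U_index: garbled index

print(raw, nsh, u_index)
```

The printed values can be compared with the SH, NSH, and U_index figures listed for this example.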
Example two:
the infinite curve is the abstraction of the universe, one end is connected with the infinite past, the other end is connected with the infinite future, and only irregular and non-life random fluctuation exists in the middle.
Glossary of words [ ' Infinite ', ' Long ', ' curved ', ' just ', ' cosmic ', ' Abstract ', ' connected ', ' Infinite ', ' past ', ' another ', ' connected ', ' infinite ', ' in ' future ', ' intermediate ', ' only irregular ', ' no ', ' Life ', ' random ', ' fluctuating ' ]
List of binary phrases [ ' Infinite + Long ', ' Curve + is ', ' is + universe ', ' is + abstract ', ' is even + is ', ' is unlimited ', ' is even + is ', ' is even ', ' is unlimited ', ' is future ', ' is even ', ' is not
List of ternary phrases [ 'infinite + long +', 'long + curve', 'curve + is + universe', 'universe + is + abstraction', 'link + infinite', 'link + future', 'infinite', 'link + future', 'intermediate + only + irregular', 'only irregular + no', 'irregular + no + life', 'no life + life', 'random + random', 'random + heave' ]
The occurrence probability of each phrase, the information amount of the phrase, the coverage correction of the information amount, the corrected information amount, and so on are then calculated for the binary and ternary phrases, for example:
p_word1_word2_word3(infinite + long + of) = 2.1658926488303964e-07
p_word1_word2(infinite + long) = 1.7184988843505244e-07
p_phrase(infinite + long + of) = 1.2603398632108838
h_phrase(infinite + long + of) = -0.33381282329128703
S_index(infinite + long + of) = 1.0
Sh(infinite + long + of) = -0.33381282329128703
p_word1_word2_word3(of + random + fluctuation) = 2.1658926488303964e-07
p_word1_word2(of + random) = 1.5466489959154718e-06
p_phrase(of + random + fluctuation) = 0.14003776257898712
h_phrase(of + random + fluctuation) = 2.836112178151025
S_index(of + random + fluctuation) = 1.0
Sh(of + random + fluctuation) = 2.836112178151025
The other phrases are calculated in the same way and are not listed here. Finally, the information of all phrases is combined to score the target text, with the following result:
SH=3.6597203716282865
NSH=83.01919915574385
U_index=3.83333333333335
R_index=1.15
D_index=1.25
Y=83.01919915574385
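For the two ternary phrases listed in Example two, the relationship between the probabilities and the information amount can be checked with a few lines. The variable names are illustrative. The check also shows that, with the listed values, the first quotient exceeds 1, so its information amount h = -log2(p) comes out negative.

```python
import math

p_triple = 2.1658926488303964e-07      # listed p_word1_word2_word3 (same for both phrases)
p_prefix_a = 1.7184988843505244e-07    # listed p_word1_word2 for the first phrase
p_prefix_b = 1.5466489959154718e-06    # listed p_word1_word2 for the second phrase

p_phrase_a = p_triple / p_prefix_a     # quotient of the two relative frequencies
p_phrase_b = p_triple / p_prefix_b
h_a = -math.log2(p_phrase_a)           # negative because p_phrase_a > 1
h_b = -math.log2(p_phrase_b)

print(p_phrase_a, h_a)
print(p_phrase_b, h_b)
```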
In general, the method provided by the invention calculates the information amount carried by a text based on its vocabulary and the order in which the vocabulary is organized. On this basis, it determines the propagation potential of each phrase in the community according to the coverage of the phrase in the community, obtains the corrected information amount of the phrase from its information amount and propagation potential, and obtains the information amount score of the target text from the corrected information amounts of all phrases contained in the target text. In other words, the method evaluates the value of a text in a community along two dimensions, information amount and propagation potential, which makes the calculation result more accurate, helps mine the valuable information in the text, and improves the evaluation effect.
Embodiment two
Based on the same inventive concept, the present embodiment provides an apparatus for evaluating the value of a text in a community, please refer to fig. 9, the apparatus includes:
a corpus construction module 201, configured to collect all corpus texts in a community, construct a corpus, preprocess the corpus texts, take x words linked in sequence as a phrase, form a vocabulary database from all words, and form a phrase database from all phrases, wherein x is a positive integer greater than or equal to 2;
the target text preprocessing module 202 is used for preprocessing a target text (T), taking x words which are linked in sequence as phrases, integrating the target text (T) into a corpus, and updating a vocabulary database and a phrase database;
a phrase occurrence probability calculation module 203, configured to calculate a probability that a phrase included in the target text (T) appears in the updated phrase database;
the phrase information amount calculation module 204 is configured to calculate the information amount of each phrase according to the probability of occurrence of the phrase in the target text (T), specifically: h(phrase) = -log2 p(phrase), wherein p(phrase) represents the probability of occurrence of the phrase, and h(phrase) represents the information amount of the phrase;
the phrase propagation information determining module 205 is configured to determine a propagation potential of the phrase in the community according to a coverage of the phrase in the community, where the coverage is inversely proportional to the propagation potential;
and the scoring module 206 is configured to obtain the corrected information amount of the phrase according to the information amount and the propagation potential of the phrase, and obtain the information amount score of the target text according to the corrected information amounts of all phrases contained in the target text.
In one embodiment, the apparatus further comprises a normalization processing module for, after obtaining the information content score of the target text:
carrying out normalization processing on the original information quantity score to obtain an information quantity score, and controlling the value range of the information quantity score to be [0, 100), wherein the normalization processing mode is as follows:
NSH(T)=arctan(SH(T))*200/π,
where SH(T) represents the original information quantity score, and NSH(T) represents the normalized information quantity score.
In an embodiment, in a phrase formed by x words linked in sequence, the words are designated the 1st word, the 2nd word, ..., the x-th word according to their order of appearance, and the phrase occurrence probability calculation module 203 is specifically configured to:
for each phrase (phrase) in the target text (T), calculating the frequency count of the x-tuple (word1, ..., wordx) formed by the 1st to x-th words in the updated phrase database, and dividing the frequency count by the total number of phrases in the phrase database to obtain the probability p(word1, ..., wordx) of the x words occurring together;
when x is 2, namely the phrase comprises two words, calculating the frequency count of the 1st word in the vocabulary database, and dividing it by the total number of words in the vocabulary database to obtain the probability of the 1st word; and calculating, according to the conditional probability formula, the probability of the 2nd word occurring given that the 1st word occurs, which is the probability of occurrence of the phrase,
p(phrase)=p(word2|word1)=p(word1∩word2)/p(word1)
wherein p(phrase) represents the probability of occurrence of the phrase, word1 represents the 1st word, word2 represents the 2nd word, p(word1) represents the probability of occurrence of the 1st word, and p(word1∩word2) represents the probability of the 1st and 2nd words occurring together;
when x is greater than 2, namely the phrase comprises more than two words, calculating the frequency count of the 1st to (x-1)-th words (word1_x-1) of the phrase in the corresponding (x-1)-element phrase database, and dividing it by the total number of phrases in that phrase database to obtain the probability of the 1st to (x-1)-th words; and calculating the probability of the x-th word occurring given that the 1st to (x-1)-th words occur, which is the probability of occurrence of the phrase,
p(phrase)=p(wordx|word1_x-1)=p(word1...wordx)/p(word1_x-1)
wherein p(phrase) represents the probability of occurrence of the phrase, word1 represents the 1st word, wordx represents the x-th word, p(word1_x-1) represents the probability of occurrence of the 1st to (x-1)-th words, and p(word1...wordx) represents the probability of the 1st to x-th words occurring together.
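The probability calculation for a phrase of any length x can be sketched with simple counting. The corpus, the helper names (`ngrams`, `phrase_probability`), and the choice to rebuild the counters on every call are illustrative assumptions; a real system would presumably keep the phrase databases precomputed. The sketch assumes the phrase occurs in the corpus, which holds once the target text has been merged in.

```python
from collections import Counter

def ngrams(tokens, size):
    """All runs of `size` adjacent words in one token list."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def phrase_probability(phrase, corpus):
    """p(phrase) = p(word1...wordx) / p(word1_x-1), each taken from its own database."""
    x = len(phrase)
    full = Counter(g for tokens in corpus for g in ngrams(tokens, x))        # x-element database
    prefix = Counter(g for tokens in corpus for g in ngrams(tokens, x - 1))  # (x-1)-element database
    p_full = full[tuple(phrase)] / sum(full.values())
    p_prefix = prefix[tuple(phrase[:-1])] / sum(prefix.values())
    return p_full / p_prefix

# Hypothetical segmented corpus with the target text already merged in.
corpus = [
    ["the", "sea", "is", "rough"],
    ["the", "sea", "is", "calm"],
    ["the", "wind", "is", "rough"],
]
print(phrase_probability(("is", "rough"), corpus))          # x = 2
print(phrase_probability(("sea", "is", "rough"), corpus))   # x = 3
```

For x = 2 the (x-1)-element database reduces to the vocabulary database, so the same function covers both cases described above.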
In one embodiment, the phrase propagation information determining module 205 is specifically configured to:
acquiring the total number of text units in a community and the number of text units containing phrases phrase, wherein the text units represent each individual in the community and all text corpora corresponding to the individual;
calculating a coverage correction parameter of the information quantity of the single phrase according to the following formula, and taking the coverage correction parameter as the propagation potential of the phrase in the community:
S_index=logN(N/n)
wherein S_index represents the coverage correction parameter of the phrase information amount, logN denotes the logarithm to base N, N represents the total number of text units, and n represents the number of text units containing the phrase.
In one embodiment, the apparatus further includes a garbled-text judgment module configured to:
carrying out garbled-text judgment on the target text (T);
and correcting the information content score of the target text (T) according to the garbled-text judgment result.
In one embodiment, the apparatus further comprises a repetitive content determination module configured to:
judging the repetitive content of the target text (T);
and correcting the information content score of the target text (T) according to the result of the repetitive content judgment.
In one embodiment, the apparatus further includes a preset expressive word detection module configured to:
detecting whether the target text (T) uses a preset expression word;
and correcting the information content score of the target text (T) according to the detection result.
Since the apparatus described in the second embodiment of the present invention is a system for implementing the method for evaluating the value of the text in the community in the first embodiment of the present invention, those skilled in the art can understand the specific structure and modification of the system based on the method described in the first embodiment of the present invention, and thus the detailed description thereof is omitted here. All systems adopted by the method of the first embodiment of the present invention are within the intended protection scope of the present invention.
Embodiment three
Referring to fig. 10, based on the same inventive concept, the present application further provides a computer-readable storage medium 300, on which a computer program 311 is stored, which when executed implements the method according to the first embodiment.
Since the computer-readable storage medium described in the third embodiment of the present invention is the computer-readable storage medium used for implementing the method for evaluating the value of the text in the community in the first embodiment of the present invention, those skilled in the art can understand its specific structure and modifications based on the method described in the first embodiment, and details are therefore not described herein again. Any computer-readable storage medium used in the method of the first embodiment of the present invention is within the scope of the present invention.
Embodiment four
Based on the same inventive concept, the present application further provides a computer device, please refer to fig. 11, which includes a storage 401, a processor 402, and a computer program 403 stored in the storage and running on the processor, and when the processor 402 executes the above program, the method in the first embodiment is implemented.
Since the computer device described in the fourth embodiment of the present invention is a computer device used for implementing the method for evaluating the value of the text in the community in the first embodiment of the present invention, based on the method described in the first embodiment of the present invention, those skilled in the art can understand the specific structure and the deformation of the computer device, and thus the detailed description thereof is omitted. All the computer devices used in the method in the first embodiment of the present invention are within the scope of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A method for evaluating a text value in a community, comprising:
collecting all corpus texts in a community, constructing a corpus, preprocessing the corpus texts, taking x words which are linked in sequence as a phrase, forming a vocabulary database from all the words, and forming a phrase database from all the phrases, wherein x is a positive integer greater than or equal to 2;
preprocessing a target text (T), taking x words which are linked in sequence as phrases, integrating the target text (T) into a corpus, and updating a vocabulary database and a phrase database;
calculating the probability of the appearance of the phrases contained in the target text (T) in the updated phrase database;
calculating the information content of each phrase according to the probability of occurrence of the phrase in the target text (T), specifically: h(phrase) = -log2 p(phrase), wherein p(phrase) represents the probability of occurrence of the phrase, and h(phrase) represents the information content of the phrase;
determining the propagation potential of the phrases in the community according to the coverage of the phrases in the community, wherein the coverage is inversely proportional to the propagation potential;
and obtaining the corrected information amount of the phrases according to the information amount and the propagation potential of the phrases, and obtaining the original information amount score of the target text according to the corrected information amounts of all the phrases contained in the target text.
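The last three steps of claim 1 can be illustrated with a minimal Python sketch. The function names are hypothetical. The claim only states that the corrected information content is obtained "according to" the information content and the propagation potential, so the product-and-sum combination below is an assumption made for illustration, not the claimed rule.

```python
import math
from typing import Iterable, Tuple


def information_content(p: float) -> float:
    """Self-information of a phrase in bits: h = -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)


def raw_information_score(phrases: Iterable[Tuple[float, float]]) -> float:
    """Raw score of a text from (probability, propagation_potential) pairs.

    ASSUMPTION: the corrected information content of a phrase is the
    product h * potential, and the text score is the sum over phrases.
    The claim does not fix either choice.
    """
    return sum(information_content(p) * s for p, s in phrases)
```

Under these assumptions, a rare phrase (small p) that few text units contain (large propagation potential) contributes the most to the score.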
2. The method of claim 1, wherein after obtaining the raw information content score for the target text, the method further comprises:
normalizing the raw information content score to obtain an information content score whose value range is [0, 100), wherein the normalization is performed as follows:
NSH(T) = arctan(SH(T)) * 200/π,
where SH(T) represents the raw information content score and NSH(T) represents the information content score.
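As a rough illustration of this normalization, the following Python sketch applies the arctangent mapping. The function name is hypothetical, and the [0, 100) range holds only for non-negative raw scores.

```python
import math


def normalize_score(raw_score: float) -> float:
    """Map a raw score SH(T) to NSH(T) = arctan(SH(T)) * 200 / pi.

    arctan maps [0, +inf) onto [0, pi/2), so non-negative raw scores
    land in [0, 100); a negative raw score would map below zero.
    """
    return math.atan(raw_score) * 200.0 / math.pi
```

The mapping is monotonic, so the ranking of texts by raw score is preserved, while very large raw scores are compressed toward 100.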
3. The method according to claim 1, wherein the words of a phrase consisting of x consecutive words are labeled, in order of occurrence, as the 1st word, the 2nd word, ..., and the x-th word, and calculating the occurrence probability of a phrase contained in the target text (T) comprises:
for each phrase in the target text (T), counting the frequency of the x-tuple (word1, ..., wordx) formed by the 1st to x-th words in the updated phrase database, and dividing that frequency by the total number of phrases in the phrase database to obtain the probability p(word1, ..., wordx) that the x words occur together;
when x is 2, namely the phrase comprises two words, counting the frequency of the 1st word in the vocabulary database and dividing it by the total number of words in the vocabulary database to obtain the probability of the 1st word; and calculating, according to the conditional probability formula, the probability that the 2nd word occurs given that the 1st word occurs, which is the occurrence probability of the phrase:
p(phrase)=p(word2|word1)=p(word1∩word2)/p(word1)
wherein p(phrase) represents the occurrence probability of the phrase, word1 represents the 1st word, word2 represents the 2nd word, p(word1) represents the occurrence probability of the 1st word, and p(word1∩word2) represents the probability that the 1st word and the 2nd word occur together;
when x is greater than 2, namely the phrase comprises more than two words, counting the frequency of the sequence formed by the 1st to (x-1)-th words of the phrase (word1_x-1) in the corresponding (x-1)-word phrase database, and dividing it by the total number of phrases in the phrase database to obtain the probability of the 1st to (x-1)-th words; and calculating the probability that the x-th word occurs given that the 1st to (x-1)-th words occur, which is the occurrence probability of the phrase:
p(phrase)=p(wordx|word1_x-1)=p(word1...wordx)/p(word1_x-1)
wherein p(phrase) represents the occurrence probability of the phrase, word1 represents the 1st word, wordx represents the x-th word, p(word1_x-1) represents the occurrence probability of the 1st to (x-1)-th words, and p(word1...wordx) represents the probability that the x words occur together.
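For the x = 2 case, the calculation can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes, for simplicity, that the whole corpus is a single token sequence; a real corpus would count phrases per text so that no phrase spans two texts.

```python
from collections import Counter
from typing import Sequence


def bigram_probability(tokens: Sequence[str], word1: str, word2: str) -> float:
    """Estimate p(word2 | word1) = p(word1 ∩ word2) / p(word1)."""
    unigrams = Counter(tokens)                  # stands in for the vocabulary database
    bigrams = Counter(zip(tokens, tokens[1:]))  # stands in for the phrase database
    total_words = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    if total_bigrams == 0 or unigrams[word1] == 0:
        return 0.0
    p_joint = bigrams[(word1, word2)] / total_bigrams  # p(word1 ∩ word2)
    p_word1 = unigrams[word1] / total_words            # p(word1)
    return p_joint / p_word1
```

Because the joint and marginal probabilities are divided by different totals (number of phrases versus number of words), the ratio can slightly exceed the plain count ratio, and on a very small corpus it can exceed 1; the two agree more closely as the corpus grows.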
4. The method of claim 1, wherein determining the propagation potential of the phrase within the community based on the coverage of the phrase within the community comprises:
acquiring the total number of text units in the community and the number of text units containing the phrase, wherein a text unit represents an individual in the community together with all text corpora corresponding to that individual;
calculating a coverage correction parameter for the information content of a single phrase according to the following formula, and taking the coverage correction parameter as the propagation potential of the phrase in the community:
S_index = log_N(N/n)
wherein S_index represents the coverage correction parameter of the phrase information content, N represents the total number of text units, and n represents the number of text units containing the phrase.
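The coverage correction parameter can be computed as in the following Python sketch. The function name and the input validation are illustrative assumptions; the claim itself only gives the formula.

```python
import math


def coverage_correction(total_units: int, units_with_phrase: int) -> float:
    """S_index = log_N(N / n), the logarithm of N / n to base N.

    Yields 1.0 when a single text unit contains the phrase (n = 1)
    and 0.0 when every text unit contains it (n = N).
    """
    if total_units < 2:
        raise ValueError("a base-N logarithm needs at least 2 text units")
    if not 1 <= units_with_phrase <= total_units:
        raise ValueError("units_with_phrase must be between 1 and total_units")
    return math.log(total_units / units_with_phrase, total_units)
```

For example, with N = 100 text units and n = 10 units containing the phrase, S_index is log_100(10) = 0.5, so a phrase that is already widespread in the community receives a smaller weight.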
5. The method of claim 1, wherein the method further comprises:
performing garbled-text detection on the target text (T);
and correcting the information content score of the target text (T) according to the garbled-text detection result.
6. The method of claim 1, wherein the method further comprises:
judging the repetitive content of the target text (T);
and correcting the information content score of the target text (T) according to the result of the repetitive content judgment.
7. The method of claim 1, wherein the method further comprises:
detecting whether the target text (T) uses a preset expression word;
and correcting the information content score of the target text (T) according to the detection result.
8. An apparatus for evaluating a value of a text in a community, comprising:
a corpus construction module, used for collecting all corpus texts in a community, constructing a corpus, preprocessing the corpus texts, taking each run of x consecutive words as a phrase, forming a vocabulary database from all the words, and forming a phrase database from all the phrases, wherein x is a positive integer greater than or equal to 2;
a target text preprocessing module, used for preprocessing a target text (T), taking each run of x consecutive words as a phrase, integrating the target text (T) into the corpus, and updating the vocabulary database and the phrase database;
a phrase occurrence probability calculation module, used for calculating the occurrence probability, in the updated phrase database, of each phrase contained in the target text (T);
a phrase information content calculation module, used for calculating the information content of each phrase in the target text (T) according to the occurrence probability of the phrase, specifically: h(phrase) = -log_2 p(phrase), wherein p(phrase) represents the occurrence probability of the phrase and h(phrase) represents the information content of the phrase;
a phrase propagation potential determination module, used for determining the propagation potential of each phrase in the community according to the coverage of the phrase in the community, wherein the coverage is inversely proportional to the propagation potential;
and a scoring module, used for obtaining the corrected information content of each phrase according to the information content and the propagation potential of the phrase, and obtaining the raw information content score of the target text according to the corrected information content of all the phrases contained in the target text.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed, implements the method of any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the program.
CN201910763287.6A 2019-08-19 2019-08-19 Evaluation method and device for text value in community Active CN112417088B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910763287.6A CN112417088B (en) 2019-08-19 2019-08-19 Evaluation method and device for text value in community

Publications (2)

Publication Number Publication Date
CN112417088A true CN112417088A (en) 2021-02-26
CN112417088B CN112417088B (en) 2022-07-05

Family

ID=74778956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910763287.6A Active CN112417088B (en) 2019-08-19 2019-08-19 Evaluation method and device for text value in community

Country Status (1)

Country Link
CN (1) CN112417088B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007018234A (en) * 2005-07-07 2007-01-25 National Institute Of Information & Communication Technology Automatic feeling-expression word and phrase dictionary generating method and device, and automatic feeling-level evaluation value giving method and device
CN104463603A (en) * 2014-12-05 2015-03-25 中国联合网络通信集团有限公司 Credit assessment method and system
CN104615772A (en) * 2015-02-16 2015-05-13 重庆大学 Text evaluation data specialization level analyzing method for electronic commerce

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822521A (en) * 2021-06-15 2021-12-21 腾讯云计算(北京)有限责任公司 Method and device for detecting quality of question library questions and storage medium
CN113822521B (en) * 2021-06-15 2024-05-24 腾讯云计算(北京)有限责任公司 Method, device and storage medium for detecting quality of question library questions
CN116681056A (en) * 2023-05-24 2023-09-01 人民网股份有限公司 Text value calculation method and device based on value scale
CN116681056B (en) * 2023-05-24 2024-01-26 人民网股份有限公司 Text value calculation method and device based on value scale

Also Published As

Publication number Publication date
CN112417088B (en) 2022-07-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant