CN108073571B

CN108073571B - Multi-language text quality evaluation method and system and intelligent text processing system

Info

Publication number: CN108073571B
Application number: CN201810028932.5A
Authority: CN
Inventors: 宋俊平; 程国艮
Original assignee: Global Tone Communication Technology Co ltd
Current assignee: Global Tone Communication Technology Co ltd
Priority date: 2018-01-12
Filing date: 2018-01-12
Publication date: 2021-08-13
Anticipated expiration: 2038-01-12
Also published as: CN108073571A

Abstract

The invention belongs to the technical field of intelligent text data processing and discloses a multilingual text quality evaluation method and system and an intelligent text processing system. The invention focuses on text quality evaluation: texts are ranked by the quality of their grammatical and semantic expression, so that high-quality texts are placed first and low-quality texts, such as those containing garbled characters, are placed last, and the ranking is evaluated against manual annotation. The accuracy exceeds 95 percent, an improvement of about 8 percent over the traditional method; in addition, the invention solves the problem of balanced evaluation across multiple languages, which the prior art does not address.

Description

Multi-language text quality evaluation method and system and intelligent text processing system

Technical Field

The invention belongs to the technical field of intelligent text data, and particularly relates to a multilingual text quality evaluation method and system and an intelligent text processing system.

Background

The prior art commonly used in the industry is as follows. With continuing globalization and the rapid development of the internet, text data is growing explosively, but it comes from heterogeneous sources, which lowers the efficiency with which the information can be used. How to evaluate the quality of data acquired in real time and recommend high-quality text to the user is therefore an important basic problem in intelligent text processing against the background of big data. At present, text quality assessment is usually treated as one functional item of data cleaning. The objects to be cleaned include garbled text, text mixed with JS code, text whose title and content are inconsistent, and spam ("water-filled") text (e.g., a sentence repeated many times, or randomly entered meaningless words and sentences). Existing methods of filtering such dirty data generally fall into two types: rule-based methods and word frequency statistical methods. The basic idea of the rule-based method is to enumerate filtering rules for the different data formats, for example detecting garbled text with encoding rules or a dictionary, and detecting JS code with a dictionary of JS syntax keywords. The basic idea of the word frequency statistical method is to count word collocation frequencies on a data set of relatively high text quality and to treat low-frequency collocations as dirty data to be filtered.
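The rule-based prior art described above can be illustrated with a minimal sketch. The keyword dictionary, the character classes treated as garbled, and the threshold are all hypothetical choices made for illustration; the source does not list the actual rules.

```python
import re

# Hypothetical JS keyword dictionary; the source does not give the real entries.
JS_KEYWORDS = {"function", "var", "return", "document", "window", "typeof"}


def js_keyword_ratio(text: str) -> float:
    """Fraction of alphabetic tokens that appear in the JS keyword dictionary."""
    tokens = re.findall(r"[A-Za-z_]+", text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in JS_KEYWORDS)
    return hits / len(tokens)


def looks_garbled(text: str) -> bool:
    """Flag text containing U+FFFD or C0 control characters (an assumed encoding rule)."""
    return bool(re.search(r"[\ufffd\x00-\x08\x0b\x0c\x0e-\x1f]", text))


def is_dirty(text: str, js_threshold: float = 0.3) -> bool:
    """Apply both rules; js_threshold is an illustrative value, not from the source."""
    return looks_garbled(text) or js_keyword_ratio(text) >= js_threshold
```

Every new data format needs another rule of this kind, which is the coverage problem the following summary points out.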

In summary, the prior art has the following problems:

The existing rule-based and frequency-statistics-based methods cause a great deal of information loss, and they struggle to cover the ever-growing variety of data sources and formats.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a multilingual text quality evaluation method and system and an intelligent text processing system.

The invention is realized as follows: the multilingual text quality evaluation method uses a bigram model to split a sentence or a chapter into a set of consecutive word pairs, calculates pair by pair the cosine similarity between the vectors of two adjacent words to obtain a semantic quality score for each pair, and obtains sentence-level and chapter-level quality scores by averaging.

Further, the quality score is calculated by:

in the bigram model, the current character depends only on the previous character, which is expressed by the formula:

$$p(s) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$$;

the conditional probabilities are obtained by counting word frequencies;

using the semantic similarity between the word vectors of characters instead of the word frequency, the new formula for calculating the conditional probability becomes:

combining the two formulas gives the quality score of a sentence; the quality score of the input document is obtained by averaging over all sentences.

Further, before the sentence or chapter is split into a set of consecutive word pairs using the bigram model, the following steps are required: the text corpus is given a distributed representation at the character level through a neural network language model; a sliding window captures the context of each character within its sentence, the semantic collocation of characters is modeled, and each character is mapped to an N-dimensional floating-point vector.
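As an illustration of the sliding-window step, the sketch below only generates the (center, context) training pairs that a neural language model such as word2vec would consume; it does not train any vectors. The function name `context_pairs` and the default window size are assumptions made for this example.

```python
from typing import Iterator, Sequence, Tuple


def context_pairs(units: Sequence[str], window: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for every unit within `window` positions.

    `units` is a list of characters for languages such as Chinese,
    or a list of words for alphabetic languages.
    """
    for i, center in enumerate(units):
        lo = max(0, i - window)
        hi = min(len(units), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, units[j]


# Character-level units need no word segmentation: list("文本质量") is enough.
chinese_pairs = list(context_pairs(list("文本质量"), window=1))
english_pairs = list(context_pairs("text quality evaluation".split(), window=1))
```

Feeding characters rather than segmented words is what lets the method skip a segmentation pass for each language.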

Further, after the sentence or chapter is divided into a continuous word pair set by using the bigram model, the following steps are required:

(1) measuring the matching degree between the title and the text content using the maximum repeated substring and the semantic similarity: the maximum repeated substring of the title and the text content is calculated with the Knuth-Morris-Pratt algorithm and represents the degree to which the wording of the title is reproduced in the text content; for the semantic similarity, vector representations of the title and of the text content are obtained as weighted averages of word vectors, and the cosine similarity between the title vector and the text content vector represents the semantic similarity of the title and the text;

(2) calculating the average quality score of each language on a large-scale, multilingual, high-quality text corpus, and balancing the scores of the languages by means of a user-defined reference value.
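One possible reading of step (1) is sketched below: the KMP prefix function is run once per suffix of the title to find the longest title substring that also occurs in the body. The normalization in `reproduction_degree` (substring length divided by title length) is an assumption; the source does not say how the substring is turned into a score.

```python
def prefix_function(s: str) -> list:
    """Standard KMP prefix (failure) function."""
    pi = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def longest_title_substring_in_body(title: str, body: str, sep: str = "\x00") -> str:
    """Longest substring of `title` that occurs in `body`.

    Assumes `sep` appears in neither string.
    """
    best = ""
    for start in range(len(title)):
        suffix = title[start:]
        if len(suffix) <= len(best):
            break  # no remaining suffix can beat the current best
        pi = prefix_function(suffix + sep + body)
        # Values over the body part give the longest prefix of `suffix` ending there.
        longest = max(pi[len(suffix) + 1:], default=0)
        if longest > len(best):
            best = suffix[:longest]
    return best


def reproduction_degree(title: str, body: str) -> float:
    """Assumed normalization: matched length relative to title length."""
    if not title:
        return 0.0
    return len(longest_title_substring_in_body(title, body)) / len(title)
```

This runs in time proportional to the title length times the combined length, which is acceptable because titles are short.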

Further, the matching degree between the title and the text content is determined by building distributed representations of the title and of the body and measuring the semantic similarity between them; the semantic vectors of the title and the body are obtained by calculating a weighted average of the word vectors of their characters.

Further comprising:

firstly, the characters in the text are ranked by importance using the TextRank algorithm, and the character score is calculated as follows:

where d is the damping coefficient, with a value of 0.85; In(W_i) is the set of characters pointing to the current character; Out(W_j) is the set of characters pointed to by the character W_j; and ω_ji is the co-occurrence weight of the two characters; the text semantic vector is then obtained as a weighted average and is expressed by the formula:

and finally, the cosine similarity of the title and the text is calculated to obtain the matching degree.

Another object of the present invention is to provide an intelligent text processing system for implementing the multilingual text quality assessment method.

Another object of the present invention is to provide a multilingual text quality evaluation system implementing the multilingual text quality evaluation method, the multilingual text quality evaluation system including:

the distributed representation acquisition module is used for carrying out distributed representation on the word level through the neural network language model;

the splitting module is used for splitting the sentence or the chapter into a continuous word pair set, calculating the cosine similarity between vectors of two adjacent words in the sentence or the chapter pair by pair to obtain the semantic quality score between the two words, and obtaining the quality scores of a sentence level and a chapter level through an average sum algorithm;

the matching module is used for measuring the matching degree between the title and the text content by utilizing the maximum repeated substring and the semantic similarity;

and the average quality score calculating module is used for calculating the average quality scores of all languages respectively, and setting a reference value by user self-definition to balance the scores of all languages.

Another object of the present invention is to provide an intelligent text processing system using the multilingual text quality-assessment system.

In summary, the advantages and positive effects of the invention are as follows. The invention uses text quality assessment to monitor data in real time on top of data cleaning. The method can identify garbled text and text containing JS scripts, and it can also identify spam text and calculate the matching degree between the title and the body. To balance the distribution of multilingual results obtained during retrieval, the score distribution histogram of each language can be computed and the scores adjusted accordingly.

The intermediate model trained by the invention is language independent and embeds a naive Bayes-based language identification algorithm, so that a model for a further language can be added easily without changing the existing models, which greatly improves the flexibility of the method.
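The source names naive Bayes language identification without giving details. The following is a minimal sketch under stated assumptions: character unigram features, add-one smoothing with one extra slot for unseen characters, and a hypothetical class name `NaiveBayesLangId`. Adding a language only adds counts and leaves existing ones untouched, which mirrors the flexibility described above.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesLangId:
    """Multinomial naive Bayes over single characters (illustrative only)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # language -> character counts
        self.docs = Counter()               # language -> number of samples
        self.vocab = set()

    def add_language(self, lang, samples):
        """Register samples for a language without touching other languages."""
        for text in samples:
            self.docs[lang] += 1
            for ch in text:
                self.counts[lang][ch] += 1
                self.vocab.add(ch)

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best_lang, best_lp = None, -math.inf
        for lang, counter in self.counts.items():
            lp = math.log(self.docs[lang] / total_docs)  # class prior
            # Add-one smoothing; the extra +1 reserves mass for unseen characters.
            denom = sum(counter.values()) + len(self.vocab) + 1
            for ch in text:
                lp += math.log((counter[ch] + 1) / denom)
            if lp > best_lp:
                best_lang, best_lp = lang, lp
        return best_lang
```

A real system would train on far larger samples and probably on character n-grams; the toy data used to exercise this sketch is not from the source.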

The invention focuses on text quality evaluation: texts are ranked by the quality of their grammatical and semantic expression, so that high-quality texts are placed first and low-quality texts, such as those containing garbled characters, are placed last, and the ranking is evaluated against manual annotation. The accuracy exceeds 95 percent, an improvement of about 8 percent over the traditional method; in addition, the invention solves the problem of balanced evaluation across multiple languages, which the prior art does not address.

Drawings

Fig. 1 is a flowchart of a multilingual text quality assessment method according to an embodiment of the present invention.

FIG. 2 is a schematic structural diagram of a multilingual text quality assessment system according to an embodiment of the present invention;

in the figure: 1. distributed representation acquisition module; 2. splitting module; 3. matching module; 4. average quality score calculation module.

FIG. 3 is a diagram illustrating statistical distributions in corpus of languages according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention focuses on text quality evaluation, that is, texts are ranked by the quality of their grammatical and semantic expression, so that good-quality texts are placed first and poor-quality texts, such as those with garbled characters, are placed last, which addresses the problems of the existing algorithms.

As shown in fig. 1, the method for evaluating the quality of a multilingual text according to the embodiment of the present invention includes the following steps:

S101: building a character-level distributed representation of the text corpus (word-level for alphabetic languages) through a neural network language model; capturing the context of each character within its sentence with a sliding window, modeling the semantic collocation of characters, and mapping each character to an N-dimensional floating-point vector so that mathematical operations can be performed between characters;

S102: based on a bigram model, splitting a sentence or a chapter into a set of consecutive word pairs, calculating pair by pair the cosine similarity between the vectors of two adjacent words to obtain a semantic quality score (at the grammatical and semantic level) for each pair, and obtaining sentence-level and chapter-level quality scores by averaging;

S103: measuring the matching degree between the title and the text content using the maximum repeated substring and the semantic similarity: the maximum repeated substring of the title and the text content is calculated with the KMP (Knuth-Morris-Pratt) algorithm and represents the degree to which the wording of the title is reproduced in the text content; for the semantic similarity, vector representations of the title and of the text content are obtained as weighted averages of word vectors, and the cosine similarity between the title vector and the text content vector represents the semantic similarity of the title and the text;

S104: calculating the average quality score of each language on a large-scale, multilingual, high-quality text corpus, and balancing the scores of the languages by means of a user-defined reference value, so that the score of a good-quality text is close to the reference value.

The multilingual text quality evaluation method based on deep representation provided by the embodiment of the invention relies on a tf-idf algorithm based on deep representation and on sentence-level and chapter-level vector calculation; the most basic algorithm on which the technique depends is word vector training. Informally, a word vector is a distributed representation of a word: abstract words in natural language are converted into N-dimensional vectors that are easy to compute with, and the deep semantic association between words can be obtained by calculating the similarity between their vectors. Existing word vector training methods mainly include Google's word2vec model and Stanford's global vector model (GloVe); the invention adopts the word2vec model. Because text quality evaluation requires strong real-time performance, the invention skips preprocessing such as word segmentation and trains the vectors directly at the character level (the word level for alphabetic writing systems).

The N-gram model is a probabilistic grammar model based on the Markov model, which judges the grammatical plausibility of a sentence from the probability that N consecutive characters occur together in natural language. The most common is the bigram model, whose basic idea is that the current character depends only on the previous character, formulated as:

$$p(s) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$$

The conditional probabilities in the above formula can be obtained by counting word frequencies. For text quality evaluation, however, a higher statistical frequency does not mean better quality; on the contrary, the higher the frequency, the less information the collocation carries.
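For reference, the count-based bigram model that the invention departs from can be sketched as follows. The add-alpha smoothing and the function names are assumptions added so that unseen pairs do not produce a zero probability; the source only says that the conditional probabilities come from frequency counts.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count single units and adjacent pairs over an iterable of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams


def sentence_log_prob(sent, unigrams, bigrams, alpha=1.0):
    """log p(s) = log p(w_1) + sum_i log p(w_i | w_{i-1}), with add-alpha smoothing."""
    if not sent:
        raise ValueError("sentence must contain at least one unit")
    vocab = len(unigrams)
    total = sum(unigrams.values())
    logp = math.log((unigrams[sent[0]] + alpha) / (total + alpha * vocab))
    for prev, cur in zip(sent, sent[1:]):
        logp += math.log((bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab))
    return logp
```

Under this model a frequent pair always scores higher than a rare one, which is exactly the behavior the passage above argues is a poor proxy for quality.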

The invention therefore uses the tf-idf algorithm in place of raw word frequency counts. In addition, to better represent the semantic relation between characters, it uses the semantic similarity between the word vectors of characters instead of the word frequency, so the new formula for calculating the conditional probability becomes:

combining the above two formulas, the quality score of a sentence can be obtained, and then the quality score of the input document can be obtained by averaging all sentences.
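A minimal sketch of the similarity-based scoring follows. It assumes that each conditional probability is replaced directly by the cosine similarity of the two adjacent vectors and that a sentence score is the mean over its pairs; the tf-idf weighting mentioned above is left out, and pairs with an unknown unit are assumed to contribute zero. The exact formula is not reproduced in the source, so this is one plausible reading rather than the patented computation.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sentence_score(units: Sequence[str], vectors: Dict[str, Sequence[float]]) -> float:
    """Mean cosine similarity over adjacent pairs (assumed aggregation)."""
    pairs = list(zip(units, units[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for prev, cur in pairs:
        if prev in vectors and cur in vectors:  # unknown units contribute 0
            total += cosine(vectors[prev], vectors[cur])
    return total / len(pairs)


def document_score(sentences, vectors) -> float:
    """Average of the sentence scores, as described in the text."""
    scores = [sentence_score(s, vectors) for s in sentences]
    return sum(scores) / len(scores) if scores else 0.0
```

Garbled text tends to pair characters that rarely share contexts, so its adjacent-pair similarities, and hence its score, would be low under this reading.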

The matching degree between the title and the text is also an important factor in text quality. The solution of the invention is to build distributed representations of the title and of the text separately and to measure the matching degree by calculating the semantic similarity between them. The semantic vectors of the title and the text are obtained as weighted averages of the word vectors of their characters, in the following steps. First, the characters in the text are ranked by importance using the TextRank algorithm, and the character score is calculated as follows:

where d is the damping coefficient (generally 0.85), In(W_i) is the set of characters pointing to the current character, Out(W_j) is the set of characters pointed to by the character W_j, and ω_ji is the co-occurrence weight of the two characters. Then, the text semantic vector is obtained as a weighted average, expressed by the formula:

Finally, the matching degree between the title and the text is obtained by calculating the cosine similarity of the title vector and the text vector.
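The three steps can be sketched as below. The sketch uses the standard weighted TextRank update, WS(W_i) = (1 - d) + d · Σ_j [ω_ji / Σ_k ω_jk] · WS(W_j), on an undirected co-occurrence graph, and assumes that the semantic vector is the TextRank-weighted mean of the unit vectors. The window size, iteration count, and function names are illustrative assumptions, not taken from the source.

```python
import math
from collections import defaultdict


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def textrank(units, window=2, d=0.85, iters=30):
    """Weighted TextRank over a co-occurrence graph of the given units."""
    weights = defaultdict(lambda: defaultdict(float))
    for i, u in enumerate(units):
        for j in range(i + 1, min(len(units), i + window + 1)):
            v = units[j]
            if u != v:
                weights[u][v] += 1.0  # co-occurrence weight, symmetric
                weights[v][u] += 1.0
    scores = {u: 1.0 for u in set(units)}
    for _ in range(iters):
        new_scores = {}
        for u in scores:
            rank = 0.0
            for v, w_vu in weights[u].items():
                rank += w_vu / sum(weights[v].values()) * scores[v]
            new_scores[u] = (1 - d) + d * rank
        scores = new_scores
    return scores


def weighted_vector(units, vectors, scores):
    """TextRank-weighted mean of the unit vectors (assumed form of the average)."""
    if not vectors:
        return []
    dim = len(next(iter(vectors.values())))
    acc, total = [0.0] * dim, 0.0
    for u in units:
        if u in vectors:
            w = scores.get(u, 0.0)
            total += w
            for k, x in enumerate(vectors[u]):
                acc[k] += w * x
    return [x / total for x in acc] if total else acc


def title_body_match(title_units, body_units, vectors):
    """Cosine similarity between the weighted title and body vectors."""
    tv = weighted_vector(title_units, vectors, textrank(title_units))
    bv = weighted_vector(body_units, vectors, textrank(body_units))
    return cosine(tv, bv)
```

Weighting by TextRank score keeps frequent but uninformative units from dominating the body vector, which a plain average would allow.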

As shown in fig. 2, the multilingual text quality evaluation system provided in the embodiment of the present invention includes:

The distributed representation acquisition module 1 is used for building a character-level distributed representation (word-level for alphabetic languages) through the neural network language model.

The splitting module 2 is used for splitting the sentence or the chapter into a set of consecutive word pairs, calculating pair by pair the cosine similarity between the vectors of two adjacent words to obtain a semantic quality score for each pair, and obtaining sentence-level and chapter-level quality scores by averaging.

The matching module 3 is used for measuring the matching degree between the title and the text content using the maximum repeated substring and the semantic similarity.

The average quality score calculating module 4 is used for calculating the average quality score of each language and balancing the scores of the languages by means of a user-defined reference value.

Models are trained independently for each language, and the quality scores are likewise computed per language, so the score distributions of different languages may be unbalanced. To observe this imbalance, the invention computes the score distribution over the training corpus of each language, as shown in fig. 3: the quality scores of the languages are severely unbalanced. To balance the score distributions, the invention takes the average score of each language from these statistics as a reference value and weights the quality scores of that language accordingly, thereby minimizing the imbalance.
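The source does not give the weighting formula. One simple possibility, sketched here purely as an assumption, is to divide each raw score by the mean score of its language and scale by a user-defined target value, so that a typical high-quality text in any language lands near the same reference. The names `language_references`, `balance`, and `target` are hypothetical.

```python
from statistics import mean


def language_references(corpus_scores):
    """Mean quality score per language, measured on a high-quality corpus."""
    return {lang: mean(scores) for lang, scores in corpus_scores.items() if scores}


def balance(score, lang, references, target=1.0):
    """Rescale a raw score so that the language mean maps to `target`.

    `target` plays the role of the user-defined reference value (assumed form).
    """
    ref = references[lang]
    return score * target / ref if ref else 0.0
```

With per-language means of 0.7 and 0.4, for example, raw scores of 0.7 and 0.4 in the respective languages both map to 1.0, so results from the two languages become comparable in a single ranking.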

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A multilingual text quality assessment method is characterized in that a sentence or chapter is divided into a continuous word pair set by adopting a bigram model, cosine similarity between vectors of two adjacent words in the sentence is calculated pair by pair to obtain semantic quality scores between the two words, and the quality scores of the sentence level and the chapter level are obtained through an average sum algorithm;

the quality score is calculated by the following method:

$$p(s) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$$;

the conditional probabilities are obtained by counting word frequencies;

2. The multilingual text quality-assessment method of claim 1, wherein the splitting of a sentence or chapter into a set of consecutive word pairs using a bigram model requires: performing distributed representation of characters on the text corpus at a character level through a neural network language model; capturing context information of the words and the words in sentences by utilizing a sliding window, modeling semantic collocation of the words, and mapping each word into an N-dimensional floating point vector.

3. The multilingual text quality-assessment method of claim 1, wherein said splitting a sentence or chapter into a set of consecutive word pairs using a bigram model entails:

4. The multilingual text quality-assessment method of claim 3, wherein said degree of matching between the headline and body contents is determined by distributively characterizing the headline and body, and measuring the degree of matching between the headline and body by calculating semantic similarity therebetween; semantic vectors for the headlines and the body are obtained by calculating a weighted average sum of word vectors for the characters.

5. The multilingual text quality-assessment method of claim 4, further comprising:

wherein d is the damping coefficient, with a value of 0.85; In(W_i) is the set of characters pointing to the current character; Out(W_j) is the set of characters pointed to by the character W_j; and ω_ji is the co-occurrence weight of the two characters; the text semantic vector is obtained as a weighted average and is expressed by the formula:

6. An intelligent text processing system for implementing the multilingual text-quality-assessment method according to any one of claims 1 to 5.

7. A multilingual text quality-evaluating system of the multilingual text quality-evaluating method of claim 1, wherein the multilingual text quality-evaluating system comprises:

the splitting module is used for splitting the sentence or the chapter into a continuous word pair set, calculating the cosine similarity between vectors of two adjacent words in the sentence or the chapter pair by pair to obtain the semantic quality score between the two words, and obtaining the quality scores of a sentence level and a chapter level through an average sum algorithm; the quality score is calculated by the following method:

$$p(s) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$$;

the conditional probabilities are obtained by counting word frequencies;

combining the two formulas gives the quality score of a sentence; the quality score of the input document is obtained by averaging over all sentences;

8. An intelligent text processing system to which the multilingual text-quality-evaluating system of claim 7 is applied.