CN112989816B - Text content quality evaluation method and system - Google Patents


Info

Publication number
CN112989816B
CN112989816B CN202110422185.5A
Authority
CN
China
Prior art keywords
text
speech
evaluated
byte
training set
Prior art date
Legal status
Active
Application number
CN202110422185.5A
Other languages
Chinese (zh)
Other versions
CN112989816A (en)
Inventor
Zhang Liwen (张力文)
Current Assignee
Glabal Tone Communication Technology Co ltd
Original Assignee
Glabal Tone Communication Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Glabal Tone Communication Technology Co ltd
Priority to CN202110422185.5A
Publication of CN112989816A
Application granted
Publication of CN112989816B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/40: Processing or translation of natural language
    • G06F40/42: Data-driven translation
    • G06F40/44: Statistical methods, e.g. probability models
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063: Operations research, analysis or management
    • G06Q10/0639: Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395: Quality analysis or management

Abstract

The invention discloses a text content quality evaluation method and system. Applied as part of the preprocessing of an intelligent data-mining system, the method removes non-valuable information while preserving the valuable information in the text body to the greatest extent, yielding valuable text for downstream tasks; it can also effectively save system storage resources and improve the reading quality for users.

Description

Text content quality evaluation method and system
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text content quality evaluation method and system.
Background
With the continuous, deep development of internet applications, every aspect of production and life has been profoundly affected, and the text data generated along the way has grown explosively: the volume of text has become enormous and contains much important information. However, the quality of these network data varies widely; storing large amounts of useless data consumes valuable system resources, and reading invalid information costs users a great deal of effort and time. How to accurately and quickly extract valuable information from text data is therefore a problem urgently awaiting a solution.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention aims to provide a text content quality evaluation method and system that can evaluate the quality of a text accurately and quickly.
In order to achieve the above object, the present invention provides a text content quality evaluation method, which includes: acquiring a sample data set, and selecting first type text information and second type text information from the sample data set as a training set, wherein the first type text information is information irrelevant to the theme and the second type text information is information relevant to the theme; segmenting the training set into a plurality of natural sentences, performing part-of-speech tagging on the phrases in each natural sentence, and composing the parts of speech tagged in the training set into part-of-speech texts according to the order in which the phrases appear in the natural sentences, wherein each natural sentence corresponds to one part-of-speech text; calculating, based on an N-Gram algorithm, the probability that each byte segment of each part-of-speech text corresponding to the training set appears among all byte segments corresponding to the training set; acquiring the S byte segments at the beginning and the R byte segments at the end of each part-of-speech text corresponding to the first type text information, counting the number of occurrences of identical byte segments, and taking the first Q byte segments, ordered from the highest number of occurrences to the lowest, as feature items, wherein S, R, and Q are integers; converting each part-of-speech text corresponding to the training set into a feature vector and inputting all the converted feature vectors into a binary classifier for training to obtain a trained binary classifier, wherein part of the elements in each feature vector relate to the probability that each byte segment of the part-of-speech text appears among all byte segments corresponding to the training set, and the other part of the elements relate to the feature items; preprocessing a text to be evaluated and
converting each part-of-speech text corresponding to the text to be evaluated into a feature vector, wherein part of the elements in each feature vector corresponding to the text to be evaluated relate to the probability that each byte segment of the part-of-speech text appears among all byte segments corresponding to the training set, and the other part of the elements relate to the feature items; and inputting each feature vector corresponding to the text to be evaluated into the trained binary classifier to obtain a plurality of output results, judging the proportion of first type text information in the text to be evaluated according to the output results, and, if the proportion of first type text information is larger than a first preset threshold, judging the text to be evaluated to be a low-quality text.
In an embodiment of the present invention, selecting the first type text information and the second type text information from the sample data set as a training set includes: selecting first type text information from the sample data set, marking the first type text information as a negative example, randomly selecting second type text information from the residual data in the sample data set, marking the second type text information as a positive example, and taking the first type text information and the second type text information as a training set.
In an embodiment of the present invention, converting each part-of-speech text corresponding to the training set into a feature vector includes: obtaining, for each byte segment of each part-of-speech text corresponding to the training set, the probability of that byte segment appearing among all byte segments corresponding to the training set, selecting the maximum value and/or the minimum value of all the probability values corresponding to each part-of-speech text, and calculating the mean value and/or the variance of all the probability values corresponding to each part-of-speech text; judging whether a certain feature item exists among the first S byte segments and the last R byte segments of each part-of-speech text corresponding to the training set, and assigning the feature value corresponding to that feature item to 1 if it exists and to 0 if it does not; and taking the maximum value and/or minimum value, the mean value and/or variance, and each feature value respectively as elements of a feature vector, so as to convert each part-of-speech text into a feature vector.
In an embodiment of the present invention, preprocessing a text to be evaluated includes: performing part-of-speech tagging on the phrases in each natural sentence in the text to be evaluated; and combining the parts of speech marked in the text to be evaluated into a part of speech text according to the sequence of the appearance of each phrase in the natural sentence.
In an embodiment of the present invention, preprocessing the text to be evaluated further includes: before part-of-speech tagging is performed on the phrases in each natural sentence of the text to be evaluated, extracting plain-text information from the open-source text to be evaluated, segmenting the plain-text information into a plurality of natural sentences, and storing the plurality of natural sentences in structured form.
In an embodiment of the present invention, converting each part-of-speech text corresponding to a text to be evaluated into a feature vector includes: obtaining, based on an N-Gram algorithm, the probability of each byte segment of each part-of-speech text corresponding to the text to be evaluated appearing among all byte segments corresponding to the training set, selecting the maximum value and/or the minimum value of all the probability values corresponding to each part-of-speech text, and calculating the mean value and/or the variance of all the probability values corresponding to each part-of-speech text; judging whether a certain feature item exists among the first S byte segments and the last R byte segments of each part-of-speech text corresponding to the text to be evaluated, and assigning the feature value corresponding to that feature item to 1 if it exists and to 0 if it does not; and taking the maximum value and/or minimum value, the mean value and/or variance, and each feature value respectively as elements of a feature vector, so as to convert each part-of-speech text into a feature vector.
In an embodiment of the present invention, judging the proportion of first type text information in the text to be evaluated according to the output results, and judging the text to be evaluated to be a low-quality text if that proportion is larger than a first preset threshold, includes: comparing each output result with a second preset threshold, and judging that the natural sentence corresponding to an output result belongs to first type text information if that output result is smaller than the second preset threshold; and determining the proportion of the natural sentences belonging to first type text information among all natural sentences in the text to be evaluated, and judging the text to be evaluated to be a low-quality text if this proportion is greater than the first preset threshold.
Based on the same inventive concept, the invention also provides a text content quality evaluation system, which comprises: a training set acquisition module for acquiring a sample data set and selecting first type text information and second type text information from the sample data set as a training set, wherein the first type text information is information irrelevant to the theme and the second type text information is information relevant to the theme; a part-of-speech text generation module, coupled with the training set acquisition module, for segmenting the training set into a plurality of natural sentences, performing part-of-speech tagging on the phrases in each natural sentence, and composing the parts of speech tagged in the training set into part-of-speech texts according to the order in which the phrases appear in the natural sentences, wherein each natural sentence corresponds to one part-of-speech text; a probability determination module, coupled with the part-of-speech text generation module, for calculating, based on an N-Gram algorithm, the probability that each byte segment of each part-of-speech text corresponding to the training set appears among all byte segments corresponding to the training set; a feature item determination module, coupled with the part-of-speech text generation module, configured to obtain the S byte segments at the beginning and the R byte segments at the end of each part-of-speech text corresponding to the first type text information, count the number of occurrences of identical byte segments among them, and take the first Q byte segments, ordered from the highest number of occurrences to the lowest, as feature items, where S, R, and Q are integers; a
first feature vector conversion module, coupled to the probability determination module and the feature item determination module, configured to convert each part-of-speech text corresponding to the training set into a feature vector, where part of the elements in each feature vector relate to the probability that each byte segment of the part-of-speech text appears among all byte segments corresponding to the training set, and the other part of the elements relate to the feature items; a binary classifier training module, coupled with the first feature vector conversion module, for inputting all the feature vectors converted from the part-of-speech texts corresponding to the training set into a binary classifier for training, so as to obtain a trained binary classifier; a second feature vector conversion module for preprocessing a text to be evaluated and converting each part-of-speech text corresponding to the text to be evaluated into a feature vector, where part of the elements in each such feature vector relate to the probability that each byte segment of the part-of-speech text appears among all byte segments corresponding to the training set, and the other part of the elements relate to the feature items; and an evaluation module, coupled with the second feature vector conversion module and the binary classifier training module, for inputting each feature vector corresponding to the text to be evaluated into the trained binary classifier to obtain a plurality of output results, judging the proportion of first type text information in the text to be evaluated according to the output results, and, if the proportion of first type text information is larger than a first preset threshold, judging the text to be evaluated to be a low-quality text.
Based on the same inventive concept, the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the text content quality assessment method according to any one of the above embodiments.
Based on the same inventive concept, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the text content quality assessment method according to any one of the above embodiments.
Compared with the prior art, the text quality evaluation method and system of the invention can identify invalid information mixed into a text that impairs reading, such as advertisements, subscription information, and rich-text information, and score the text content as a whole.
Drawings
FIG. 1 illustrates a text content quality evaluation method according to an embodiment of the present invention;
FIG. 2 illustrates the composition of a text content quality evaluation system according to an embodiment of the present invention.
Detailed Description
The following detailed description of the present invention is provided in conjunction with the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the specific embodiments.
Throughout the specification and claims, unless explicitly stated otherwise, the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element or component but not the exclusion of any other element or component.
In order to extract valuable information from text data accurately and quickly, the inventor observed that text data on currently open sites often contains advertisements, subscription information, or rich-text information that is unrelated to the subject of the text. Such content lowers the quality of the whole text, disturbs the user's reading, and increases the cost of system storage. Invalid text information cannot be distinguished from genuine text content merely by character-level comparison such as rules; it can, however, be separated effectively through deep semantic information, because invalid text information follows certain regularities in writing format and in the usage of certain words. Combining these observations, the invention proposes constructing feature engineering based on N-Grams of part-of-speech tags and establishing a classification model to effectively identify the invalid information in a text.
FIG. 1 shows a text content quality evaluation method according to an embodiment of the present invention; the method includes steps S1-S7.
In step S1, a sample data set is obtained, and a first type of text information and a second type of text information are selected from the sample data set as a training set, where the first type of text information indicates that the text information is information unrelated to the theme, and the second type of text information indicates that the text information is information related to the theme.
Specifically, first type text information, such as advertisements and useless information, may be manually selected from the sample data set and labeled as negative examples; second type text information is then randomly selected from the remaining data in the sample data set and labeled as positive examples, and the first type text information and the second type text information together form the training set. Some examples of first type text information are: 1. "statement: this textual view represents only the author himself"; 2. "go back to home page, see more"; 3. "edit/Zhang Sanzhuang/Liquan"; 4. "reporter/Wangwu"; 5. "manuscript originated from a reporter group Commission".
In step S2, the training set is segmented into a plurality of natural sentences, each phrase in each natural sentence is tagged with a part of speech, and the parts of speech tagged in the training set are composed into part-of-speech texts according to the order in which each phrase appears in its natural sentence, wherein each natural sentence corresponds to one part-of-speech text. Optionally, the part-of-speech tagging may be performed with a Peking University ("PKU") word segmenter. Parts of speech include verbs, nouns, prepositions, adjectives, and the like. For example, for the natural sentence "manuscript is from the group committee of a reporter", the corresponding part-of-speech text is "noun verb preposition noun"; in other embodiments, the part-of-speech text may be expressed by other means, such as English abbreviations of the tags.
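Step S2 can be sketched as follows. A toy phrase-to-part-of-speech lookup stands in for a real segmenter/tagger (such as the PKU segmenter mentioned above); the lexicon below is invented for illustration only.

```python
# Stand-in lexicon for a real part-of-speech tagger (illustration only).
TOY_LEXICON = {
    "manuscript": "noun", "is": "verb", "from": "preposition",
    "committee": "noun", "reporter": "noun",
}

def to_pos_text(phrases, lexicon):
    """Replace each phrase with its part of speech, preserving sentence order."""
    return " ".join(lexicon.get(p, "unk") for p in phrases)
```

For instance, `to_pos_text(["manuscript", "is", "from", "committee"], TOY_LEXICON)` yields "noun verb preposition noun", mirroring the example sentence in the text.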
In step S3, the probability of each byte segment of each part-of-speech text corresponding to the training set appearing among all byte segments corresponding to the training set is calculated based on an N-Gram algorithm. The N-Gram can be set to 3-Gram, 2-Gram, etc., according to the actual situation. For example, if it is set to 3-Gram, the byte segments in the part-of-speech text "noun verb preposition noun" are "noun verb preposition", "verb preposition noun", "preposition noun", respectively.
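A minimal sketch of step S3. A strict sliding-window convention is assumed here (the example above also lists a trailing two-tag segment, which this sketch would omit), and the probability of a segment is estimated as its count divided by the total number of segments in the training set:

```python
from collections import Counter

def segments(tags, n):
    """All length-n windows ("byte segments") of a part-of-speech sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def segment_probabilities(pos_texts, n):
    """P(seg) = count(seg) / total number of segments over the training set."""
    counts = Counter()
    for tags in pos_texts:
        counts.update(segments(tags, n))
    total = sum(counts.values())
    return {seg: c / total for seg, c in counts.items()}
```

With n = 3, `segments("noun verb preposition noun".split(), 3)` gives the two strict trigrams of the example part-of-speech text.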
In step S4, the S byte segments at the beginning and the R byte segments at the end of each part-of-speech text corresponding to the first type text information are acquired, where S and R are integers; the number of occurrences of identical byte segments among the beginning S byte segments and the ending R byte segments of each such part-of-speech text is counted, and the first Q byte segments, ordered from the highest number of occurrences to the lowest, are taken as feature items, where Q is an integer. The values of S, R, and Q can be set according to the actual situation: the larger the values, the more accurate the final evaluation result; the smaller the values, the faster the evaluation process.
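Step S4 can be sketched as below. The treatment of texts with fewer than S + R segments is not specified in the patent; in this sketch such a segment may simply be counted once per slice it falls in.

```python
from collections import Counter

def feature_items(negative_pos_texts, n=3, S=3, R=3, Q=4):
    """Top-Q most frequent segments among the first S and the last R n-gram
    segments of each first-type (negative-example) part-of-speech text."""
    counts = Counter()
    for tags in negative_pos_texts:
        segs = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
        # Short texts may contribute a segment via both slices; the patent
        # does not specify this edge case.
        counts.update(segs[:S])
        counts.update(segs[-R:])
    return [seg for seg, _ in counts.most_common(Q)]
```

`Counter.most_common` orders equal counts by first encounter, so ties resolve deterministically in training-set order.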
In step S5, each part-of-speech text corresponding to the training set is converted into a feature vector, where part of the elements in each feature vector relate to the probability that each byte segment of the part-of-speech text appears among all byte segments corresponding to the training set, and the other part of the elements relate to the feature items; all the feature vectors converted from the part-of-speech texts corresponding to the training set are then input into a binary classifier for training to obtain a trained binary classifier. Optionally, the binary classifier is a support vector machine.
In step S6, the text to be evaluated is preprocessed, and each part-of-speech text corresponding to the text to be evaluated is converted into a feature vector, where a part of elements in each feature vector corresponding to the text to be evaluated is related to the probability of each byte segment of each part-of-speech text corresponding to the text to be evaluated appearing in all byte segments corresponding to the training set, and another part of elements is related to the feature item.
Optionally, the preprocessing the text to be evaluated includes: extracting plain text information from an open-source text to be evaluated, segmenting the plain text information into a plurality of natural sentences, and performing structured storage on the natural sentences; performing part-of-speech tagging on the phrases in each natural sentence in the text to be evaluated; and combining the parts of speech marked in the text to be evaluated into a part of speech text according to the sequence of the appearance of each phrase in the natural sentence.
In step S7, each feature vector corresponding to the text to be evaluated is input into the trained binary classifier to obtain a plurality of output results, the proportion of first type text information in the text to be evaluated is determined according to the output results, and, if the proportion of first type text information is greater than a first preset threshold, the text to be evaluated is judged to be a low-quality text. Specifically, each output result is compared with a second preset threshold; if an output result is smaller than the second preset threshold, the natural sentence corresponding to that output result is judged to belong to first type text information. The proportion of the natural sentences belonging to first type text information among all natural sentences in the text to be evaluated is then determined, and if this proportion is greater than the first preset threshold, the text to be evaluated is judged to be a low-quality text. It can be understood that each output result corresponds to one input feature vector, each feature vector corresponds to one part-of-speech text, and each part-of-speech text corresponds to one natural sentence.
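The two-threshold decision rule of step S7 reduces to a few lines; both threshold values are tunable and are assumptions here, not values given by the patent.

```python
def is_low_quality(outputs, sentence_threshold, doc_threshold):
    """An output below sentence_threshold marks its natural sentence as
    first-type (topic-irrelevant) text; the document is judged low quality
    when the fraction of such sentences exceeds doc_threshold."""
    flagged = sum(1 for o in outputs if o < sentence_threshold)
    return flagged / len(outputs) > doc_threshold
```

For example, with a sentence threshold of 0.5 and a document threshold of 0.4, a text in which half the sentences score below 0.5 is judged low quality.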
Therefore, the text quality evaluation method can identify invalid information mixed into the text that impairs reading, such as advertisements, subscription information, and rich-text information, and score the overall text content.
Preferably, in order to further improve the accuracy of the evaluation result, in an embodiment, converting each part-of-speech text corresponding to the training set into a feature vector includes: obtaining the probability of each byte segment of each part-of-speech text corresponding to the training set appearing among all byte segments corresponding to the training set, selecting the maximum value and/or the minimum value of all the probability values corresponding to each part-of-speech text, and calculating the mean value and/or the variance of all the probability values corresponding to each part-of-speech text; judging whether a certain feature item exists among the first S byte segments and the last R byte segments of each part-of-speech text corresponding to the training set, and assigning the feature value corresponding to that feature item to 1 if it exists and to 0 if it does not; and taking the maximum value and/or minimum value, the mean value and/or variance, and each feature value respectively as elements of a feature vector, so as to convert each part-of-speech text into a feature vector.
In this embodiment, converting each part-of-speech text corresponding to the text to be evaluated into a feature vector includes: obtaining, based on an N-Gram algorithm, the probability of each byte segment of each part-of-speech text corresponding to the text to be evaluated appearing among all byte segments corresponding to the training set, selecting the maximum value and/or the minimum value of all the probability values corresponding to each part-of-speech text, and calculating the mean value and/or the variance of all the probability values corresponding to each part-of-speech text; judging whether a certain feature item exists among the first S byte segments and the last R byte segments of each part-of-speech text corresponding to the text to be evaluated, and assigning the feature value corresponding to that feature item to 1 if it exists and to 0 if it does not; and taking the maximum value and/or minimum value, the mean value and/or variance, and each feature value respectively as elements of a feature vector, so as to convert each part-of-speech text into a feature vector.
Further, test analysis shows that a good evaluation effect can be achieved by setting the feature vector to a 12-dimensional vector, with S and R set to 3 and Q set to 4, and extracting features with both a 3-Gram model and a 2-Gram model; each feature vector thus contains 12 elements. The elements of the feature vector of each part-of-speech text are, respectively: the maximum value, minimum value, mean value, and variance of all probability values corresponding to the part-of-speech text obtained with the 3-Gram algorithm; the maximum value, minimum value, mean value, and variance of all probability values corresponding to the part-of-speech text obtained with the 2-Gram algorithm; and the feature values corresponding to the first, second, third, and fourth feature items.
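A plain-Python sketch of this 12-dimensional vector. The zero probability for unseen segments and the choice of matching feature items against the first S and last R 3-Gram segments are assumptions; the patent fixes only the 12 dimensions themselves.

```python
def _segments(tags, n):
    """All length-n windows of a part-of-speech sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def _stats(tags, n, probs):
    """Max, min, mean, variance of segment probabilities under one model;
    unseen segments get probability 0.0 (an assumption)."""
    p = [probs.get(seg, 0.0) for seg in _segments(tags, n)]
    mean = sum(p) / len(p)
    var = sum((x - mean) ** 2 for x in p) / len(p)
    return [max(p), min(p), mean, var]

def feature_vector(tags, probs3, probs2, items, S=3, R=3):
    """12-dim vector: 3-Gram max/min/mean/variance, 2-Gram max/min/mean/
    variance, and one 0/1 value per feature item found among the first S
    or last R 3-Gram segments."""
    edge = set(_segments(tags, 3)[:S] + _segments(tags, 3)[-R:])
    flags = [1 if item in edge else 0 for item in items]
    return _stats(tags, 3, probs3) + _stats(tags, 2, probs2) + flags
```

With Q = 4 feature items, the returned list always has 8 statistical elements plus 4 flags, i.e. 12 elements, ready to feed to a binary classifier such as a support vector machine.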
Based on the same inventive concept, an embodiment further provides a text content quality evaluation system, as shown in fig. 2, which includes: a training set acquisition module 10, a part-of-speech text generation module 11, a probability determination module 12, a feature item determination module 13, a first feature vector conversion module 14, a binary classifier training module 15, a second feature vector conversion module 16, and an evaluation module 17.
The training set obtaining module 10 is configured to obtain a sample data set, and select first type text information and second type text information from the sample data set as a training set, where the first type text information indicates that the text information is information unrelated to a topic, and the second type text information indicates that the text information is information related to the topic.
The part-of-speech text generation module 11 is coupled to the training set acquisition module 10, and is configured to segment the training set into a plurality of natural sentences, perform part-of-speech tagging on each phrase in each natural sentence, and compose part-of-speech texts from the parts-of-speech tagged in the training set according to an order in which each phrase appears in a natural sentence, where each natural sentence corresponds to one part-of-speech text.
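As a concrete illustration of the segmentation and tagging performed by the part-of-speech text generation module 11, the following Python sketch uses a toy word-to-tag lexicon and English example text as stand-ins for a real part-of-speech tagger (e.g. jieba.posseg for Chinese); the lexicon, tag names, and sentences are illustrative assumptions.

```python
import re

# Toy lexicon standing in for a real part-of-speech tagger (illustrative).
TOY_LEXICON = {
    "click": "v", "here": "adv", "to": "p", "subscribe": "v",
    "the": "det", "model": "n", "works": "v", "well": "adv",
}

def split_sentences(text):
    """Segment raw text into natural sentences on terminal punctuation."""
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def pos_text(sentence):
    """Replace each phrase by its part-of-speech tag, preserving order."""
    return " ".join(TOY_LEXICON.get(w.lower(), "x") for w in sentence.split())

corpus = "Click here to subscribe. The model works well."
pos_texts = [pos_text(s) for s in split_sentences(corpus)]
print(pos_texts)  # → ['v adv p v', 'det n v adv']
```

Each natural sentence yields one part-of-speech text, matching the one-to-one correspondence stated above.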
The probability determination module 12 is coupled to the part-of-speech text generation module 11, and configured to calculate, based on an N-Gram algorithm, a probability that each byte segment of each part-of-speech text corresponding to the training set appears in all byte segments corresponding to the training set.
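The probability computed by the probability determination module 12 can be sketched as a relative-frequency estimate over the byte segments (n-grams of part-of-speech tags) of the training set; the toy part-of-speech texts below are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token segments ("byte segments") of a part-of-speech text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_probabilities(pos_texts, n):
    """P(segment) = count(segment) / total number of segments in the training set."""
    counts = Counter()
    for text in pos_texts:
        counts.update(ngrams(text.split(), n))
    total = sum(counts.values())
    return {seg: c / total for seg, c in counts.items()}

probs = ngram_probabilities(["v adv p v", "det n v adv"], n=2)
print(probs[("v", "adv")])  # 2 occurrences out of 6 bigrams ≈ 0.333
```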
The feature item determining module 13 is coupled to the part-of-speech text generating module 11, and is configured to obtain the S byte segments at the beginning and the R byte segments at the end of each part-of-speech text corresponding to the first type of text information, where S and R are integers; to count the number of times each same byte segment occurs among the first S byte segments and the last R byte segments of each such part-of-speech text; and to take the first Q byte segments, ordered from the highest occurrence count to the lowest, as feature items, where Q is an integer.
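A minimal sketch of the feature item selection performed by module 13, assuming 2-Gram byte segments; the example texts and the values of S, R, and Q are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def feature_items(first_type_pos_texts, S=3, R=3, Q=4, n=2):
    """Tally the segments among the first S and last R byte segments of each
    first-type (off-topic) part-of-speech text; the Q most frequent
    become the feature items."""
    counts = Counter()
    for text in first_type_pos_texts:
        segs = ngrams(text.split(), n)
        counts.update(segs[:S])
        counts.update(segs[-R:])
    return [seg for seg, _ in counts.most_common(Q)]

items = feature_items(["v adv p v n", "v adv n p v"], S=2, R=2, Q=2)
print(items)
```

Because advertisement-like boilerplate tends to open and close sentences with recurring tag patterns, these head and tail segments are the most discriminative positions to mine.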
The first feature vector conversion module 14 is coupled to both the probability determination module 12 and the feature item determination module 13, and configured to convert each part of speech text corresponding to the training set into a feature vector, where a part of elements in each feature vector is related to a probability that each byte fragment of each part of speech text corresponding to the training set appears in all byte fragments corresponding to the training set, and another part of elements is related to the feature item.
The binary classifier training module 15 is coupled to the first feature vector conversion module 14, and is configured to input all feature vectors converted from all part-of-speech texts corresponding to the training set into a binary classifier for training, so as to obtain a trained binary classifier.
The second feature vector conversion module 16 is configured to pre-process a text to be evaluated, and convert each part-of-speech text corresponding to the text to be evaluated into a feature vector, where a part of elements in each feature vector corresponding to the text to be evaluated is related to a probability that each byte segment of each part-of-speech text corresponding to the text to be evaluated appears in all byte segments corresponding to the training set, and another part of elements is related to the feature item.
The evaluation module 17 is coupled to the second feature vector conversion module 16 and the binary classifier training module 15, and is configured to input each feature vector corresponding to the text to be evaluated into the trained binary classifier to obtain a plurality of output results; to judge, from the output results, the proportion of first-type text information in the text to be evaluated; and, if that proportion is larger than a first preset threshold, to judge the text to be evaluated to be a low-quality text.
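The thresholding logic of the evaluation module 17 can be sketched as follows; the threshold values and the classifier scores are illustrative.

```python
def evaluate_text(outputs, second_threshold=0.5, first_threshold=0.3):
    """An output below the second preset threshold marks its natural sentence
    as first-type (off-topic); the text is judged low-quality when the
    proportion of such sentences exceeds the first preset threshold."""
    first_type = sum(1 for score in outputs if score < second_threshold)
    ratio = first_type / len(outputs)
    return ratio, ratio > first_threshold

ratio, low_quality = evaluate_text([0.9, 0.2, 0.1, 0.8])
print(ratio, low_quality)  # → 0.5 True
```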
Therefore, the text quality evaluation system of this embodiment can identify invalid information mixed into a text that hinders reading, such as advertisements, subscription information, and rich-text information, and can score the text content as a whole.
Preferably, in order to further improve the accuracy of the evaluation result, in an embodiment the first feature vector conversion module 14 is configured to: obtain the probability value of each byte segment of each part-of-speech text corresponding to the training set among all byte segments corresponding to the training set, select the maximum value and/or the minimum value of all probability values corresponding to each part-of-speech text, and calculate the mean value and/or the variance of those probability values; judge, for each feature item, whether it exists among the first S byte segments and the last R byte segments of each part-of-speech text corresponding to the training set, assigning the corresponding feature value 1 if it does and 0 if it does not; and take the maximum value and/or the minimum value, the mean value and/or the variance, and each feature value as elements of a feature vector, thereby converting each part-of-speech text into a feature vector.
In this embodiment, the second feature vector conversion module 16 is configured to: obtain, based on an N-Gram algorithm, the probability value of each byte segment of each part-of-speech text corresponding to the text to be evaluated in the training set, select the maximum value and/or the minimum value of all probability values corresponding to each part-of-speech text, and calculate the mean value and/or the variance of those probability values; judge, for each feature item, whether it exists among the first S byte segments and the last R byte segments of each part-of-speech text corresponding to the text to be evaluated, assigning the corresponding feature value 1 if it does and 0 if it does not; and take the maximum value and/or the minimum value, the mean value and/or the variance, and each feature value as elements of a feature vector, thereby converting each part-of-speech text into a feature vector.
Further, test analysis shows that a good evaluation effect is achieved when S and R are set to 3, Q is set to 4, and features are extracted with both a 3-Gram model and a 2-Gram model, so that each feature vector has the following 12 dimensions. The 12 elements of the feature vector of each part-of-speech text are: the maximum value, minimum value, mean value, and variance of all probability values obtained for that part-of-speech text with the 3-Gram algorithm; the maximum value, minimum value, mean value, and variance of all probability values obtained with the 2-Gram algorithm; and the feature values corresponding to the first, second, third, and fourth feature items.
Based on the same inventive concept, an embodiment further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the text content quality assessment method according to any one of the above embodiments.
Based on the same inventive concept, an embodiment also provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the text content quality assessment method according to any one of the above embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (7)

1. A text content quality evaluation method is characterized by comprising the following steps:
acquiring a sample data set, and selecting first type text information and second type text information from the sample data set as a training set, wherein the first type text information represents that the text information is information irrelevant to a theme, and the second type text information represents that the text information is information relevant to the theme;
segmenting the training set into a plurality of natural sentences, performing part-of-speech tagging on the phrases in each natural sentence, and forming part-of-speech texts from the parts of speech tagged in the training set according to the order in which each phrase appears in a natural sentence, wherein each natural sentence corresponds to one part-of-speech text;
calculating the probability of each byte segment of each part of speech text corresponding to the training set appearing in all byte segments corresponding to the training set based on an N-Gram algorithm;
acquiring S byte segments at the beginning and R byte segments at the end of each part-of-speech text corresponding to the first type text information, counting the occurrence times of the same byte segments, and taking the first Q byte segments as feature items according to the sequence of the occurrence times of the same byte segments from high to low, wherein S, R, Q are integers;
converting each part-of-speech text corresponding to the training set into a feature vector, inputting all the converted feature vectors into a two-classifier for training to obtain a trained two-classifier, wherein part of elements in each feature vector are related to the probability of each byte segment of each part-of-speech text corresponding to the training set appearing in all the byte segments corresponding to the training set, and the other part of elements are related to the feature items;
preprocessing a text to be evaluated, and converting each part-of-speech text corresponding to the text to be evaluated into a feature vector, wherein part of elements in each feature vector corresponding to the text to be evaluated are related to the occurrence probability of each byte fragment of each part-of-speech text corresponding to the text to be evaluated in all byte fragments corresponding to the training set, and the other part of elements are related to the feature items;
inputting each feature vector corresponding to the text to be evaluated into the trained classifier to obtain a plurality of output results, judging the proportion of first type text information in the text to be evaluated according to the output results, and if the proportion of the first type text information is larger than a first preset threshold value, judging the text to be evaluated as a low-quality text;
converting each part-of-speech text corresponding to the training set into a feature vector, including:
obtaining the probability value of each byte segment of each part-of-speech text corresponding to the training set among all byte segments corresponding to the training set, selecting the maximum value and/or the minimum value of all the probability values corresponding to each part-of-speech text, and calculating the mean value and/or the variance of all the probability values corresponding to each part-of-speech text,
judging whether a certain characteristic item exists in the first S byte segments and the last R byte segments in each part of speech text corresponding to the training set, if so, assigning the characteristic value corresponding to the certain characteristic item to be 1, if not, assigning the characteristic value corresponding to the certain characteristic item to be 0,
respectively taking the maximum value and/or the minimum value, the mean value and/or the variance and each characteristic value as an element in a characteristic vector so as to convert each part of speech text into the characteristic vector;
converting each part-of-speech text corresponding to the text to be evaluated into a feature vector, wherein the converting comprises the following steps:
obtaining, based on an N-Gram algorithm, the probability value of each byte segment of each part-of-speech text corresponding to the text to be evaluated in the training set, selecting the maximum value and/or the minimum value of all the probability values corresponding to each part-of-speech text, and calculating the mean value and/or the variance of all the probability values corresponding to each part-of-speech text,
judging whether a certain characteristic item exists in the first S byte segments and the last R byte segments in each part-of-speech text corresponding to the text to be evaluated, if so, assigning the characteristic value corresponding to the certain characteristic item to be 1, and if not, assigning the characteristic value corresponding to the certain characteristic item to be 0,
respectively taking the maximum value and/or the minimum value, the mean value and/or the variance and each characteristic value as an element in a characteristic vector so as to convert each part of speech text into the characteristic vector;
judging the proportion of first type text information in the text to be evaluated according to the output result, and if the proportion of the first type text information is greater than a first preset threshold, judging the text to be evaluated as a low-quality text, wherein the method comprises the following steps:
comparing each output result with a second preset threshold, if a certain output result is smaller than the second preset threshold, judging that the natural sentence corresponding to the certain output result belongs to the first type text information,
determining the proportion of the number of all natural sentences belonging to the first type text information in the text to be evaluated in all natural sentences in the text to be evaluated, and if the proportion is greater than the first preset threshold value, judging the text to be evaluated as a low-quality text.
2. The method of claim 1, wherein selecting a first type of textual information and a second type of textual information from the sample data set as a training set comprises:
selecting first type text information from the sample data set, marking the first type text information as a negative example, randomly selecting second type text information from the residual data in the sample data set, marking the second type text information as a positive example, and taking the first type text information and the second type text information as a training set.
3. The text content quality assessment method according to claim 1, wherein preprocessing the text to be assessed comprises:
performing part-of-speech tagging on the phrases in each natural sentence in the text to be evaluated;
and combining the parts of speech marked in the text to be evaluated into a part of speech text according to the sequence of the appearance of each phrase in the natural sentence.
4. The text content quality assessment method according to claim 3, wherein preprocessing the text to be assessed further comprises:
before the part-of-speech tagging is carried out on the word group in each natural sentence in the text to be evaluated, extracting plain text information from the open-source text to be evaluated, segmenting the plain text information into a plurality of natural sentences, and carrying out structured storage on the plurality of natural sentences.
5. A text content quality evaluation system, comprising:
the training set acquisition module is used for acquiring a sample data set and selecting first type text information and second type text information from the sample data set as a training set, wherein the first type text information represents that the text information is information irrelevant to a theme, and the second type text information represents that the text information is information relevant to the theme;
a part-of-speech text generation module, coupled to the training set acquisition module, configured to segment the training set into multiple natural sentences, perform part-of-speech tagging on each phrase in each natural sentence, and compose part-of-speech texts from the parts-of-speech tagged in the training set according to an order in which each phrase appears in a natural sentence, where each natural sentence corresponds to one part-of-speech text;
a probability determination module, coupled to the part-of-speech text generation module, configured to calculate, based on an N-Gram algorithm, a probability that each byte segment of each part-of-speech text corresponding to the training set appears in all byte segments corresponding to the training set;
a feature item determining module, coupled to the part-of-speech text generating module, configured to obtain S byte segments at the beginning and R byte segments at the end of each part-of-speech text corresponding to the first type of text information, count the number of times that the same byte segment occurs in the S byte segments at the beginning and R byte segments at the end of each part-of-speech text corresponding to the first type of text information, and take the first Q byte segments as feature items according to the order from high to low of the number of times that the same byte segment occurs, where S, R, Q is an integer;
a first feature vector conversion module, coupled to the probability determination module and the feature item determination module, and configured to convert each part-of-speech text corresponding to the training set into a feature vector, wherein a part of the elements in each feature vector is related to the probability that each byte segment of each part-of-speech text corresponding to the training set appears in all byte segments corresponding to the training set, that is, the maximum value and/or the minimum value and the mean value and/or the variance of all the probability values corresponding to each part-of-speech text form that part of the elements; and another part of the elements is related to the feature items, that is, to whether each feature item exists among the first S byte segments and the last R byte segments of each part-of-speech text, the feature value corresponding to the feature item being assigned 1 if it exists and 0 if it does not;
a binary classifier training module, coupled to the first feature vector conversion module and configured to input all feature vectors converted from all part-of-speech texts corresponding to the training set into a binary classifier for training, so as to obtain a trained binary classifier;
a second feature vector conversion module, configured to preprocess a text to be evaluated and convert each part-of-speech text corresponding to the text to be evaluated into a feature vector, wherein a part of the elements in each feature vector corresponding to the text to be evaluated is related to the probability that each byte segment of each part-of-speech text corresponding to the text to be evaluated appears in all byte segments corresponding to the training set, that is, the maximum value and/or the minimum value and the mean value and/or the variance of all the probability values corresponding to each part-of-speech text, obtained based on an N-Gram algorithm, form that part of the elements; and another part of the elements is related to the feature items, that is, to whether each feature item exists among the first S byte segments and the last R byte segments of each part-of-speech text corresponding to the text to be evaluated, the feature value corresponding to the feature item being assigned 1 if it exists and 0 if it does not;
an evaluation module, coupled to the second feature vector conversion module and the binary classifier training module, and configured to input each feature vector corresponding to the text to be evaluated into the trained binary classifier to obtain a plurality of output results; to compare each output result with a second preset threshold and, if a certain output result is smaller than the second preset threshold, judge that the natural sentence corresponding to that output result belongs to the first type text information; and to judge whether the text to be evaluated is a low-quality text according to the proportion, among all natural sentences in the text to be evaluated, of the natural sentences belonging to the first type text information, the text to be evaluated being judged to be a low-quality text if the proportion is larger than the first preset threshold.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the text content quality assessment method according to any one of claims 1 to 4 when executing the program.
7. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the text content quality assessment method according to any one of claims 1 to 4.
CN202110422185.5A 2021-04-20 2021-04-20 Text content quality evaluation method and system Active CN112989816B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110422185.5A CN112989816B (en) 2021-04-20 2021-04-20 Text content quality evaluation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110422185.5A CN112989816B (en) 2021-04-20 2021-04-20 Text content quality evaluation method and system

Publications (2)

Publication Number Publication Date
CN112989816A (en) 2021-06-18
CN112989816B (en) 2021-10-01

Family

ID=76341181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110422185.5A Active CN112989816B (en) 2021-04-20 2021-04-20 Text content quality evaluation method and system

Country Status (1)

Country Link
CN (1) CN112989816B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535952B (en) * 2021-07-13 2024-02-09 六棱镜(杭州)科技有限公司 Intelligent matching data processing method based on artificial intelligence
CN118095251B (en) * 2024-04-23 2024-06-18 北京国际大数据交易有限公司 Offline text data evaluation method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927297A (en) * 2014-04-13 2014-07-16 北京工业大学 Evidence theory based Chinese microblog credibility evaluation method
CN106372056A (en) * 2016-08-25 2017-02-01 久远谦长(北京)技术服务有限公司 Natural language-based topic and keyword extraction method and system
US9928234B2 (en) * 2016-04-12 2018-03-27 Abbyy Production Llc Natural language text classification based on semantic features
CN109783804A (en) * 2018-12-17 2019-05-21 北京百度网讯科技有限公司 Low-quality speech recognition methods, device, equipment and computer readable storage medium
CN112115703A (en) * 2020-09-03 2020-12-22 腾讯科技(深圳)有限公司 Article evaluation method and device


Also Published As

Publication number Publication date
CN112989816A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
Friedrich et al. Situation entity types: automatic classification of clause-level aspect
Furlan et al. Semantic similarity of short texts in languages with a deficient natural language processing support
CN101520802A (en) Question-answer pair quality evaluation method and system
Poostchi et al. BiLSTM-CRF for Persian named-entity recognition ArmanPersoNERCorpus: the first entity-annotated Persian dataset
CN112989816B (en) Text content quality evaluation method and system
CN113377927A (en) Similar document detection method and device, electronic equipment and storage medium
Atmadja et al. Comparison on the rule based method and statistical based method on emotion classification for Indonesian Twitter text
CN113672731B (en) Emotion analysis method, device, equipment and storage medium based on field information
Ljubešić et al. Discriminating between closely related languages on twitter
Megyesi Shallow Parsing with PoS Taggers and Linguistic Features.
CN114528919A (en) Natural language processing method and device and computer equipment
JP4534666B2 (en) Text sentence search device and text sentence search program
CN107451116B (en) Statistical analysis method for mobile application endogenous big data
Aida et al. A comprehensive analysis of PMI-based models for measuring semantic differences
CN107341142B (en) Enterprise relation calculation method and system based on keyword extraction and analysis
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
Venčkauskas et al. Problems of authorship identification of the national language electronic discourse
CN109241272B (en) Chinese text abstract generation method, computer readable storage medium and computer equipment
CN112711666B (en) Futures label extraction method and device
JP2019200784A (en) Analysis method, analysis device and analysis program
Subha et al. Quality factor assessment and text summarization of unambiguous natural language requirements
JP5226198B2 (en) XML-based architecture for rule induction systems
CN112905796A (en) Text emotion classification method and system based on re-attention mechanism
CN111815426A (en) Data processing method and terminal related to financial investment and research
CN110765762A (en) System and method for extracting optimal theme of online comment text under big data background

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant