CN111815188A - Method for evaluating expression presentation capacity of article - Google Patents

Method for evaluating expression presentation capacity of article

Publication number
CN111815188A
Authority
CN
China
Prior art keywords
article
result
weight coefficient
generate
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010676282.2A
Other languages
Chinese (zh)
Inventor
贲忠奇
蔡博克
冷若冰
阚野
张云
张京鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chaos Times Beijing Education Technology Co ltd
Original Assignee
Chaos Times Beijing Education Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chaos Times Beijing Education Technology Co ltd
Priority to CN202010676282.2A
Publication of CN111815188A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Strategic Management (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Educational Administration (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Tourism & Hospitality (AREA)
  • Game Theory and Decision Science (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Document Processing Apparatus (AREA)
  • General Business, Economics & Management (AREA)
  • Multimedia (AREA)

Abstract

The invention discloses a method for evaluating the expression presentation capability of an article, comprising the following steps: acquiring article basic format information of an article and judging the article basic format indexes to generate a first judgment result; obtaining readability information of the article and judging the article readability indexes to generate a second judgment result; obtaining semantic coherence information of the article and judging the article semantic coherence indexes to generate a third judgment result; setting a weight coefficient for each index; and comprehensively calculating the first judgment result, the second judgment result and the third judgment result according to the weight coefficient of each index to obtain a final evaluation result. By comprehensively evaluating the basic format, readability, semantic coherence and other aspects of the article and setting a reasonable weight for each index, the invention improves the detection and evaluation effect and ensures that the article is evaluated comprehensively and accurately.

Description

Method for evaluating expression presentation capacity of article
Technical Field
The invention relates to the technical field of article evaluation, in particular to an evaluation method for article expression presentation capacity.
Background
An existing method for automatically labeling discourse-coherence semantics uses the part-of-speech distribution rules of associated words to eliminate non-associated words and label candidate associated words, compares them against a pattern table in an associated-word library, and combines matching distance, strength and syntactic position to obtain valid discourse-coherence patterns, on the basis of which the semantic relations are labeled. In the prior art, OCR (Optical Character Recognition) is generally used to detect the continuity of chapters. OCR is the process in which an electronic device (such as a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer characters using a character recognition method. However, this detection proceeds line by line, so when a character string is arranged in a block, the block cannot be recognized as a whole. The prior art also uses the MD5 Message-Digest Algorithm to detect the continuity of chapters. MD5 is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value to ensure that a message is transmitted completely and consistently. However, it can only detect exactly identical content in a text and cannot detect similarity, so its detection effect is poor.
Disclosure of Invention
In order to overcome the above problems or at least partially solve the above problems, embodiments of the present invention provide an evaluation method for article expression presentation capability, which sets reasonable weights for each index by performing comprehensive evaluation on the basic format, readability, semantic coherence, and the like of an article, so as to improve the detection and evaluation effect and ensure that the article is comprehensively and accurately evaluated.
The embodiment of the invention is realized by the following steps:
a method for evaluating the expression presentation capability of an article comprises the following steps:
acquiring article basic format information of an article and judging the article basic format indexes to generate a first judgment result;
obtaining the readability information of the article and judging the readability indexes of the article to generate a second judgment result;
obtaining semantic coherence information of an article and judging semantic coherence indexes of the article to generate a third judgment result;
setting a weight coefficient of each index;
and performing comprehensive calculation on the first judgment result, the second judgment result and the third judgment result according to the weight coefficient of each index to obtain a final evaluation result.
When the expression presentation capability of an article is evaluated, the basic format information, readability information and semantic coherence information of the article are first obtained. The basic format information includes the word count of the text, the clarity of the text, the semantic content of the text and the like; the readability information refers to information on the text structure, the text paragraphs, and the images and text; and the semantic coherence information refers to information on the associated words, the parts of speech of the associated words and the collocation of the associated words. The article basic format indexes are then judged to generate a first judgment result, the article readability indexes are judged to generate a second judgment result, and the article semantic coherence indexes are judged to generate a third judgment result. The first judgment result includes the word count grade judgment result obtained by judging the word count, the OCR recognition judgment result obtained by judging the clarity of the text, and the simhash judgment result obtained by judging the semantic content of the text. The second judgment result includes the structure judgment result obtained by judging the text structure, the paragraph judgment result obtained by judging the text paragraphs, and the image-text judgment result obtained by judging the images and text. The third judgment result includes the associated word judgment result obtained by recognizing and judging the associated words, and the associated word collocation strength judgment result obtained by judging the collocation strength of the associated words. The weight coefficient of each index is set according to the actual situation; the weight coefficients include the weight coefficient of the article basic format, the weight coefficient of the article readability and the weight coefficient of the article semantic coherence, and all the weight coefficients sum to 1. The first, second and third judgment results are then comprehensively calculated according to the corresponding weight coefficients to obtain the final evaluation result. By comprehensively evaluating the basic format, readability and semantic coherence of the article and setting a reasonable weight for each index, the method improves the detection and evaluation effect and ensures that the article is evaluated comprehensively and accurately.
In some embodiments of the present invention, judging the article basic format indexes and generating the first judgment result comprises the following steps:
counting the word count of the article in the article basic format information to generate a count result;
expressing the count result as a quantile;
and obtaining the word count grade judgment result of the article according to a preset quantile grade standard.
In some embodiments of the present invention, judging the article basic format indexes and generating the first judgment result comprises the following steps:
converting the article text into images by OCR technology;
enhancing the image effect by adjusting sharpening, brightness, chroma, contrast and gray scale, and recognizing the images to obtain the characters recognized in each mode;
taking the union of the characters recognized in the various modes, and marking the recognized characters with coordinates;
and recognizing adjacent characters as a whole using a preset maximum-distance connectivity mode according to the coordinate marks of the characters, to generate an OCR recognition judgment result.
In some embodiments of the present invention, judging the article basic format indexes and generating the first judgment result comprises the following steps:
mapping the text content of the article into a plurality of binary strings using simhash technology;
comparing the differences between the plurality of binary strings to generate a difference result;
and identifying the semantically repeated content of the text according to the difference result to generate a simhash judgment result.
In some embodiments of the present invention, judging the article readability indexes and generating the second judgment result comprises the following steps:
acquiring the structural rich text format information in the article data according to the version;
matching the headline, subtitle and serial-number structures of the article using regular expressions according to the structural rich text format information, and generating and counting the matching results;
expressing the counted matching results as quantiles;
and judging the matching results according to a preset percentage threshold to generate a structure judgment result.
In some embodiments of the present invention, judging the article readability indexes and generating the second judgment result comprises the following steps:
acquiring the paragraph rich text format information in the article data according to the version;
matching the article paragraphs using regular expressions according to the paragraph rich text format information, and generating and counting the matching results;
expressing the counted matching results as quantiles;
and judging the matching results according to a preset percentage threshold to generate a paragraph judgment result.
In some embodiments of the present invention, judging the article readability indexes and generating the second judgment result comprises the following steps:
acquiring the paragraph rich text format information in the article data according to the version;
matching the article paragraphs using regular expressions according to the paragraph rich text format information, and generating and counting the matching results;
expressing the counted matching results as quantiles;
and judging the matching results according to a preset percentage threshold to generate an image-text judgment result.
In some embodiments of the present invention, judging the article semantic coherence indexes and generating the third judgment result comprises the following steps:
extracting the associated words in the article and labeling their parts of speech to generate a labeling result;
performing associated word recognition with a BiLSTM + CRF model according to the labeling result to generate an associated word recognition result;
and generating an associated word judgment result according to the associated word recognition result and a preset judgment standard.
In some embodiments of the present invention, judging the article semantic coherence indexes and generating the third judgment result comprises the following steps:
extracting the associated words in the article;
comparing the associated words in the article with an associated word list, and generating and counting the comparison results;
calculating the collocation strength of the associated words according to the formula
[formula image: Figure BDA0002584167970000061]
wherein E = f(a) × l × f(b),
[formula image: Figure BDA0002584167970000062]
z is the collocation strength of the associated words, f(a, b) is the co-occurrence frequency of the associated words a and b, f(a) and f(b) are the frequencies of occurrence of the associated words a and b respectively, E is the expected frequency between the two associated words, and l is the matching distance between the two associated words;
comparing the collocation strength of the associated words with a preset collocation strength threshold to generate a collocation strength comparison result;
and generating an associated word judgment result according to the collocation strength comparison result and a preset judgment standard.
In some embodiments of the present invention, comprehensively calculating the first judgment result, the second judgment result and the third judgment result according to the weight coefficient of each index to obtain the final evaluation result comprises the following steps:
taking the weight coefficient of the article basic format index as a first weight coefficient, the weight coefficient of the article readability index as a second weight coefficient, and the weight coefficient of the article semantic coherence index as a third weight coefficient;
calculating the product of the first judgment result and the first weight coefficient to obtain a first evaluation result;
calculating the product of the second judgment result and the second weight coefficient to obtain a second evaluation result;
calculating the product of the third judgment result and the third weight coefficient to obtain a third evaluation result;
and calculating the sum of the first evaluation result, the second evaluation result and the third evaluation result to obtain the final evaluation result.
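The weighted combination described in the steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name `final_evaluation`, the numeric scale of the judgment results and the example weights are assumptions and do not appear in the original text; only the rules that each result is multiplied by its weight, that the products are summed, and that the weights sum to 1 come from the description.

```python
from math import isclose


def final_evaluation(results, weights):
    """Return the weighted sum of the judgment results.

    results: numeric judgment results (basic format, readability, coherence).
    weights: weight coefficients, one per result; they must sum to 1.
    """
    if len(results) != len(weights):
        raise ValueError("one weight coefficient is needed per judgment result")
    if not isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weight coefficients must sum to 1")
    # Each product is one weighted evaluation result; their sum is the final result.
    return sum(r * w for r, w in zip(results, weights))


# Hypothetical scores and weights, chosen only for illustration.
score = final_evaluation([80.0, 60.0, 90.0], [0.3, 0.3, 0.4])
print(score)
```

With these hypothetical inputs the final evaluation result is approximately 78. How each judgment result is mapped to a number is not specified in the text and would have to be decided separately.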
The embodiment of the invention at least has the following advantages or beneficial effects:
The embodiment of the invention provides a method for evaluating the expression presentation capability of an article. When the expression presentation capability of an article is evaluated, the basic format information, readability information and semantic coherence information of the article are first obtained. The basic format information includes the word count of the text, the clarity of the text, the semantic content of the text and the like; the readability information refers to information such as the text structure, the text paragraphs, and the images and text; and the semantic coherence information refers to information such as the associated words, the part-of-speech categories of the associated words and the collocation of the associated words. The article basic format indexes are then judged to generate a first judgment result, the article readability indexes are judged to generate a second judgment result, and the article semantic coherence indexes are judged to generate a third judgment result. The weight coefficient of each index is set according to the actual situation; the weight coefficients include the weight coefficient of the article basic format, the weight coefficient of the article readability and the weight coefficient of the article semantic coherence, and all the weight coefficients sum to 1. The first, second and third judgment results are comprehensively calculated according to the corresponding weight coefficients to obtain the final evaluation result. By comprehensively evaluating the basic format, readability and semantic coherence of the article and setting a reasonable weight for each index, the method improves the detection and evaluation effect and ensures that the article is evaluated comprehensively and accurately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flowchart of a method for evaluating an article expression presentation capability according to an embodiment of the present invention;
FIG. 2 is a first flowchart of judging the article basic format indexes in the method for evaluating the expression presentation capability of an article according to an embodiment of the present invention;
FIG. 3 is a second flowchart of judging the article basic format indexes in the method for evaluating the expression presentation capability of an article according to an embodiment of the present invention;
FIG. 4 is a third flowchart of judging the article basic format indexes in the method for evaluating the expression presentation capability of an article according to an embodiment of the present invention;
FIG. 5 is a first flowchart of judging the article readability indexes in the method for evaluating the expression presentation capability of an article according to an embodiment of the present invention;
FIG. 6 is a second flowchart of judging the article readability indexes in the method for evaluating the expression presentation capability of an article according to an embodiment of the present invention;
FIG. 7 is a third flowchart of judging the article readability indexes in the method for evaluating the expression presentation capability of an article according to an embodiment of the present invention;
FIG. 8 is a flowchart of judging the article semantic coherence indexes in the method for evaluating the expression presentation capability of an article according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It is noted that, herein, relational terms such as first, second, third, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Examples
As shown in fig. 1, this embodiment provides a method for evaluating expression and presentation capabilities of an article, including the following steps:
S1, acquiring article basic format information of the article and judging the article basic format indexes to generate a first judgment result;
S2, obtaining readability information of the article and judging the article readability indexes to generate a second judgment result;
S3, obtaining semantic coherence information of the article and judging the article semantic coherence indexes to generate a third judgment result;
S4, setting a weight coefficient for each index;
and S5, comprehensively calculating the first judgment result, the second judgment result and the third judgment result according to the weight coefficient of each index to obtain a final evaluation result.
When the expression presentation capability of an article is evaluated, the basic format information, readability information and semantic coherence information of the article are first obtained. The basic format information includes the word count of the text, the clarity of the text, the semantic content of the text and the like; the readability information refers to information on the text structure, the text paragraphs, and the images and text; and the semantic coherence information refers to information on the associated words, the parts of speech of the associated words and the collocation of the associated words. The article basic format indexes are then judged to generate a first judgment result, the article readability indexes are judged to generate a second judgment result, and the article semantic coherence indexes are judged to generate a third judgment result. The first judgment result includes the word count grade judgment result obtained by judging the word count, the OCR recognition judgment result obtained by judging the clarity of the text, and the simhash judgment result obtained by judging the semantic content of the text. The second judgment result includes the structure judgment result obtained by judging the text structure, the paragraph judgment result obtained by judging the text paragraphs, and the image-text judgment result obtained by judging the images and text. The third judgment result includes the associated word judgment result obtained by recognizing and judging the associated words, and the associated word collocation strength judgment result obtained by judging the collocation strength of the associated words. The weight coefficient of each index is set according to the actual situation; the weight coefficients include the weight coefficient of the article basic format, the weight coefficient of the article readability and the weight coefficient of the article semantic coherence, and all the weight coefficients sum to 1. The first, second and third judgment results are then comprehensively calculated according to the corresponding weight coefficients to obtain the final evaluation result. By comprehensively evaluating the basic format, readability and semantic coherence of the article and setting a reasonable weight for each index, the method improves the detection and evaluation effect and ensures that the article is evaluated comprehensively and accurately.
In one embodiment, as shown in FIG. 2, judging the article basic format indexes and generating the first judgment result comprises the following steps:
S11, counting the word count of the article in the article basic format information to generate a count result;
S12, expressing the count result as a quantile;
and S13, obtaining the word count grade judgment result of the article according to a preset quantile grade standard.
The word count of the article in the article basic format information is counted to generate a count result, and the count result is expressed as a quantile. According to the preset quantile grade standard, a quantile below 30% is the first grade, a quantile of 30% to 60% is the second grade, and a quantile above 60% is the third grade. Whether the count result falls in the first grade is judged first; if so, a first-grade word count judgment result is generated. Otherwise, whether it falls in the second grade is judged; if so, a second-grade word count judgment result is generated, and if not, a third-grade word count judgment result is generated. Grading the word count of the article by quantile ensures the accuracy of the judgment.
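The quantile grading above can be sketched as follows. The text does not say which population the quantile is computed against, so this sketch assumes a reference list of word counts from other articles; the names `quantile_rank`, `word_count_grade` and `reference_counts` are hypothetical, and only the 30% and 60% boundaries come from the description.

```python
from bisect import bisect_left


def quantile_rank(value, reference):
    """Fraction of reference values strictly below `value` (0.0 to 1.0)."""
    if not reference:
        raise ValueError("reference must not be empty")
    ordered = sorted(reference)
    return bisect_left(ordered, value) / len(ordered)


def word_count_grade(word_count, reference_counts):
    """Grade a word count: 1 below the 30% quantile, 2 for 30%-60%, 3 above."""
    q = quantile_rank(word_count, reference_counts)
    if q < 0.30:
        return 1
    if q <= 0.60:
        return 2
    return 3


# Hypothetical reference corpus of article word counts.
reference_counts = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(word_count_grade(150, reference_counts))  # quantile 0.1
print(word_count_grade(450, reference_counts))  # quantile 0.4
print(word_count_grade(900, reference_counts))  # quantile 0.8
```

With this hypothetical reference list, the three calls fall in the first, second and third grade respectively.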
In one embodiment, as shown in FIG. 3, judging the article basic format indexes and generating the first judgment result comprises the following steps:
S14, converting the article text into images by OCR technology;
S15, enhancing the image effect by adjusting sharpening, brightness, chroma, contrast and gray scale, and recognizing the images to obtain the characters recognized in each mode;
S16, taking the union of the characters recognized in the various modes, and marking the recognized characters with coordinates;
and S17, recognizing adjacent characters as a whole using a preset maximum-distance connectivity mode according to the coordinate marks of the characters, to generate an OCR recognition judgment result.
The text is converted into a picture using OCR technology, and the picture is input and its effect enhanced. Recognition is performed in several modes, such as sharpening, brightness, chroma, contrast and gray level, and the characters recognized in these modes are merged to improve the OCR recognition rate. Each recognized character carries coordinates. Using the preset maximum-distance connectivity mode, characters whose distances above, below, left and right fall within the preset maximum are considered to belong to one whole. This avoids the problem of line-by-line recognition breaking up a block and making the recognized entity discontinuous. The OCR recognition judgment result is generated according to the recognition efficiency.
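One way to read the maximum-distance connectivity step is as connected-component grouping over character coordinates. The sketch below assumes each recognized character is given as a `(char, x, y)` tuple and that two characters are connected when both their horizontal and vertical offsets are within `max_dist`; the function name, the tuple layout and this particular distance rule are assumptions, not details taken from the patent text. No OCR library is called here; only the grouping is illustrated.

```python
def group_characters(chars, max_dist):
    """Group recognized characters into blocks by coordinate proximity.

    chars: list of (char, x, y) tuples from OCR output (assumed layout).
    max_dist: preset maximum distance for two characters to be connected.
    Returns a list of groups, each a list of characters in input order.
    """
    parent = list(range(len(chars)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(chars)):
        for j in range(i + 1, len(chars)):
            _, xi, yi = chars[i]
            _, xj, yj = chars[j]
            # Connect characters that are close both horizontally and vertically.
            if abs(xi - xj) <= max_dist and abs(yi - yj) <= max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i, (ch, _, _) in enumerate(chars):
        groups.setdefault(find(i), []).append(ch)
    return list(groups.values())


# Hypothetical coordinates: a 2x2 block of characters and a separate pair.
chars = [("A", 0, 0), ("B", 10, 0), ("C", 0, 10), ("D", 10, 10),
         ("X", 100, 100), ("Y", 110, 100)]
print(group_characters(chars, max_dist=12))
```

Because connectivity is transitive, a block spanning several lines stays in one group as long as neighbouring characters are within the preset distance, which is the behaviour the paragraph above contrasts with line-by-line recognition.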
In one embodiment, as shown in FIG. 4, judging the article basic format indexes and generating the first judgment result comprises the following steps:
S18, mapping the text content of the article into a plurality of binary strings using simhash technology;
S19, comparing the differences between the plurality of binary strings to generate a difference result;
and S110, identifying the semantically repeated content of the text according to the difference result, and generating a simhash judgment result.
SimHash technology is used to identify related semantics and remove repeated content. SimHash is a text similarity calculation method and a locality-sensitive hashing algorithm: it maps the original text content to a number (a hash signature), and the more similar two texts are, the more similar their hash signatures are. SimHash is an efficient algorithm used by Google for deduplicating massive numbers of web pages. The original text is mapped to a 64-bit binary string, and the difference between original texts is then represented by the difference between their binary strings. Semantically repeated content of the text is identified according to the difference result, the proportion of repeated content is determined, and the simhash judgment result is generated according to a preset judgment standard for the proportion of repeated content.
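A minimal sketch of a 64-bit SimHash and the bitwise difference comparison is given below. Several choices here are assumptions rather than details from the text: tokens are taken as an already-split list (Chinese text would need a word segmenter), every token has equal weight, the per-token hash is the first 8 bytes of MD5, and the near-duplicate threshold of 3 differing bits is only a commonly used illustrative value.

```python
import hashlib

BITS = 64  # signature length stated in the description


def simhash(tokens):
    """Map a list of tokens to a 64-bit SimHash signature (as an int)."""
    counts = [0] * BITS
    for tok in tokens:
        # Assumed token hash: first 8 bytes (64 bits) of the MD5 digest.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A signature bit is 1 where the majority of token hashes had a 1.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(tokens_a, tokens_b, max_distance=3):
    """Treat two texts as repeated content if their signatures are close."""
    return hamming_distance(simhash(tokens_a), simhash(tokens_b)) <= max_distance


first = "the article is evaluated on format readability and coherence".split()
print(format(simhash(first), "064b"))
print(is_near_duplicate(first, first))
```

Comparing the signatures of the segments of one article pairwise, and counting the share flagged as near duplicates, would give a proportion of repeated content of the kind the paragraph above judges against a preset standard.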
In one embodiment, as shown in fig. 5, the method for evaluating the readability index of the article and generating the second evaluation result comprises the following steps:
S21, acquiring the structural rich-text format information in the article data according to the version;
S22, matching the structure of the main titles, subtitles and serial numbers of the article with regular expressions according to the rich-text format information, and generating and counting the matching results;
S23, expressing the counted matching results as quantiles;
and S24, judging the matching results according to a preset percentage threshold, and generating a structure judgment result.
The corresponding rich-text information is obtained according to the version of the APP in which the article was recorded. The rich-text information is in HTML form, for example <h1></h1>, <title> and <p>. The corresponding formats are matched with regular expressions so as to match the structure of the main titles, subtitles and serial numbers of the article, and the matching results are generated, counted and judged. The title format is <h1></h1>.
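The regular-expression matching of titles, subtitles and serial numbers might look like the sketch below. The patterns, the treatment of <h1> as the main title and <h2> to <h6> as subtitles, and the serial-number formats are assumptions for illustration only.

```python
import re

# Matches <h1>..</h1> to <h6>..</h6> and <p>..</p> blocks (assumed markup).
BLOCK_RE = re.compile(r"<(h[1-6]|p)\b[^>]*>(.*?)</\1\s*>", re.IGNORECASE | re.DOTALL)
# Leading serial numbers such as "1.", "2)" or "三、" (assumed formats).
NUMBERED_RE = re.compile(r"^\s*(?:\d+[.、)]|[一二三四五六七八九十]+、)")

def count_structure(html):
    """Count main titles, subtitles and serially numbered blocks in rich text."""
    counts = {"main_titles": 0, "subtitles": 0, "numbered": 0}
    for tag, text in BLOCK_RE.findall(html):
        tag = tag.lower()
        if tag == "h1":
            counts["main_titles"] += 1
        elif tag.startswith("h"):
            counts["subtitles"] += 1
        if NUMBERED_RE.match(text):
            counts["numbered"] += 1
    return counts
```

Regular expressions are adequate here because only tag counts are needed; a full HTML parser would be the safer choice for nested or malformed markup.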
In one embodiment, as shown in fig. 6, the method for evaluating the readability index of the article and generating the second evaluation result includes the following steps:
S25, acquiring the paragraph rich-text format information in the article data according to the version;
S26, matching the article paragraphs with regular expressions according to the rich-text format information, and generating and counting the matching results;
S27, expressing the counted matching results as quantiles;
and S28, judging the matching results according to a preset percentage threshold, and generating a paragraph judgment result.
The corresponding rich-text information is obtained according to the version of the APP in which the article was recorded, where the paragraph format is <p></p>. Here it is mainly the paragraphs that are matched: the article paragraphs are matched with regular expressions according to the rich-text format information, and the matching results are generated and counted. The counted matching results are expressed as quantiles and judged according to a preset percentage threshold to generate a paragraph judgment result.
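The counting, quantile and threshold steps can be sketched as follows, assuming paragraphs are marked up as <p>…</p>. The reference sample of paragraph counts and the 50% threshold are hypothetical; the patent only states that a preset percentage threshold is used.

```python
import re
from bisect import bisect_right

PARAGRAPH_RE = re.compile(r"<p\b[^>]*>.*?</p\s*>", re.IGNORECASE | re.DOTALL)

def count_paragraphs(html):
    """Number of <p>...</p> blocks in the rich text."""
    return len(PARAGRAPH_RE.findall(html))

def percentile_rank(value, reference):
    """Percentage of reference values that are less than or equal to value."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)

def judge_paragraphs(html, reference_counts, threshold_pct=50.0):
    """Express the paragraph count as a quantile and compare it with a threshold."""
    rank = percentile_rank(count_paragraphs(html), reference_counts)
    return {"percentile": rank, "passed": rank >= threshold_pct}
```

For example, an article with three paragraphs, compared against reference counts of 1 to 10, sits at the 30th percentile and would not pass a 50% threshold.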
In one embodiment, as shown in fig. 7, the method for evaluating the readability index of the article and generating the second evaluation result comprises the following steps:
S29, acquiring the image-text rich-text format information in the article data according to the version;
S210, matching the pictures with regular expressions according to the image-text rich-text format information, and generating and counting the matching results;
and S211, judging the matching results according to a preset percentage threshold, and generating an image-text judgment result.
Pictures are matched with regular expressions according to the image-text rich-text format information, where the picture format is <img></img>. The number of pictures is counted, and the matching results are generated and counted for the subsequent image-text statistics. The matching results are then judged according to a preset percentage threshold to generate an image-text judgment result.
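As a rough sketch of the picture-counting step, the snippet below counts <img> tags and, as an assumed companion statistic, the number of non-whitespace text characters left after stripping tags. The function name and the returned fields are hypothetical.

```python
import re

IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

def image_text_stats(html):
    """Count pictures and visible text characters in rich text."""
    images = len(IMG_RE.findall(html))
    # Strip all tags, then all whitespace, to approximate the amount of text.
    text_chars = len(re.sub(r"\s+", "", TAG_RE.sub("", html)))
    return {"images": images, "text_chars": text_chars}
```

The picture count can then be placed on a reference distribution and compared with the preset percentage threshold in the same way as the other readability indicators.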
In one embodiment, as shown in fig. 8, the method for evaluating the semantic consistency index of an article and generating a third evaluation result includes the following steps:
S31, extracting the associated words in the article, labeling the part of speech of the associated words, and generating a labeling result;
S32, performing associated word recognition with a BiLSTM + CRF model according to the labeling result to generate an associated word recognition result;
and S33, generating an associated word judgment result according to the associated word recognition result and a preset judgment standard.
There are many kinds of associated words: inter-sentence conjunctions, associated adverbs, auxiliary words and hyperwords. The part of speech of the associated words in the article is labeled, and the associated words are treated as the entities to be recognized by a BiLSTM + CRF model (as in named entity recognition), which generates the associated word recognition result. The associated word judgment result is then generated according to the associated word recognition result and a preset judgment standard.
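The BiLSTM + CRF model itself is not reproduced here. The sketch below only illustrates a typical post-processing step, assuming the tagger emits one BIO label per token; the label names `B-CONN`, `I-CONN` and `O` are hypothetical and not taken from the patent.

```python
def decode_bio(tokens, tags):
    """Collect runs tagged B-CONN / I-CONN into associated-word strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-CONN":
            # A new associated word starts; close any span in progress.
            if current:
                spans.append("".join(current))
            current = [token]
        elif tag == "I-CONN" and current:
            current.append(token)
        else:
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans

tokens = ["因为", "下", "雨", "，", "所", "以", "取消"]
tags = ["B-CONN", "O", "O", "O", "B-CONN", "I-CONN", "O"]
found = decode_bio(tokens, tags)  # the two associated words in the sentence
```

The recognized words can then be counted or checked against the preset judgment standard.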
In one embodiment, the method for evaluating semantic consistency indexes of an article and generating a third evaluation result comprises the following steps:
extracting the associated words in the article;
comparing the associated words in the article with the associated word list, and generating and counting the comparison results;
according to the formula
Figure BDA0002584167970000151
calculating the collocation strength of the associated words, wherein E = f(a) × l × f(b),
Figure BDA0002584167970000152
Z is the collocation strength of the associated words, f(a, b) is the co-occurrence frequency of associated words a and b, f(a) and f(b) are the frequencies of occurrence of associated words a and b respectively, E is the expected frequency between the two associated words, l is the collocation distance between the two associated words, and SD is the distance between a and b;
comparing the matching strength of the associated words with a preset matching strength threshold value to generate a matching strength comparison result;
and generating a related word judgment result according to the collocation strength comparison result and a preset judgment standard.
The associated words are compared against a pattern table of associated words (the associated word list), for example "… because of …" and "…, but …". The collocation distances of associated words in the statistical data can be used to analyse potentially valid collocation patterns; unqualified words are filtered out and a statistical evaluation is carried out. According to statistics on the CCCS corpus, the collocation strength between associated words lies in the range 2.26 ≤ Z ≤ 1328.2, so the collocation strength threshold is set to 2.26. The collocation strength of the associated words is compared with the preset collocation strength threshold to generate a collocation strength comparison result, and the associated word judgment result is generated according to the collocation strength comparison result and a preset judgment standard.
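The two formula images are not reproduced in this text, so the following is only a sketch under a stated assumption: that Z takes the conventional z-score form Z = (f(a, b) − E) / SD, with E = f(a) × l × f(b) as given above. The SD term is passed in by the caller because its formula is not available here; the function names are hypothetical.

```python
Z_THRESHOLD = 2.26  # lower bound of the range reported in the text for the CCCS corpus

def collocation_strength(f_ab, f_a, f_b, l, sd):
    """Assumed z-score form: Z = (f(a, b) - E) / SD, with E = f(a) * l * f(b)."""
    expected = f_a * l * f_b
    return (f_ab - expected) / sd

def is_valid_collocation(z, threshold=Z_THRESHOLD):
    """A pair counts as a valid collocation when Z reaches the threshold."""
    return z >= threshold

# Illustrative numbers only: E = 0.1 * 2 * 0.2 = 0.04, so Z = (0.05 - 0.04) / 0.004.
z = collocation_strength(f_ab=0.05, f_a=0.1, f_b=0.2, l=2, sd=0.004)
```

With these illustrative inputs Z is about 2.5, which is above the 2.26 threshold, so the pair would be kept.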
In one embodiment, the method for comprehensively calculating the first evaluation result, the second evaluation result and the third evaluation result according to the weight coefficient of each index to obtain the final evaluation result comprises the following steps:
taking the weight coefficient of the article basic format index in the weight coefficient as a first weight coefficient, taking the weight coefficient of the article readability index in the weight coefficient as a second weight coefficient, and taking the weight coefficient of the article semantic coherence index in the weight coefficient as a third weight coefficient;
calculating the product of the first evaluation result and the first weight coefficient to obtain a first weighted result;
calculating the product of the second evaluation result and the second weight coefficient to obtain a second weighted result;
calculating the product of the third evaluation result and the third weight coefficient to obtain a third weighted result;
and calculating the sum of the first weighted result, the second weighted result and the third weighted result to obtain the final evaluation result.
Weights are assigned to the indexes, and a composite score is finally calculated as the measure of the expression presentation capacity of the article. The weight of each index is defined according to expert domain knowledge. In general, the weight coefficient of the article basic format is 0.3, that of article readability is 0.4, and that of article semantic coherence is 0.3. Within the basic format index, the weight coefficient of the word count is set to 0.5, the OCR weight coefficient to 0.2, and the SimHash weight coefficient to 0.3. Within readability, the weight coefficient of clear structure is set to 0.4, that of clear paragraphs to 0.3, and that of image-text to 0.3. Within semantic coherence, the weight coefficient of associated word recognition is set to 0.5 and that of associated word collocation strength to 0.5. By setting reasonable weight coefficients, the expression presentation capacity of the article is evaluated accurately and comprehensively.
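The weighted combination can be sketched as below, using the weight coefficients quoted above. The nested dictionary layout, the key names, and the assumption that sub-indicator scores are first combined by their own weights before the index-level weights are applied are illustrative choices rather than details fixed by the patent.

```python
# Index weight, followed by the weights of its sub-indicators (values from the text).
WEIGHTS = {
    "basic_format": (0.3, {"word_count": 0.5, "ocr": 0.2, "simhash": 0.3}),
    "readability": (0.4, {"structure": 0.4, "paragraph": 0.3, "image_text": 0.3}),
    "coherence": (0.3, {"recognition": 0.5, "collocation": 0.5}),
}

def final_score(scores, weights=WEIGHTS):
    """Weighted sum of the three index scores, each a weighted sum of sub-scores."""
    total = 0.0
    for index, (index_weight, sub_weights) in weights.items():
        index_score = sum(w * scores[index][name] for name, w in sub_weights.items())
        total += index_weight * index_score
    return total

scores = {
    "basic_format": {"word_count": 100, "ocr": 100, "simhash": 100},
    "readability": {"structure": 50, "paragraph": 50, "image_text": 50},
    "coherence": {"recognition": 0, "collocation": 0},
}
result = final_score(scores)  # 0.3 * 100 + 0.4 * 50 + 0.3 * 0
```

Because both the index weights and each group of sub-weights sum to 1, the final score stays on the same scale as the individual sub-scores.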
To sum up, the embodiment of the present invention provides a method for evaluating the expression presentation capacity of an article. When the expression presentation capacity of an article is evaluated, the basic format information, readability information and semantic coherence information of the article are first obtained. The basic format information includes the word count of the text, the clarity of the text and the semantic content of the text; the readability information refers to the text structure, the text paragraphs and the images; and the semantic coherence information refers to the associated words, their part-of-speech categories and their collocations.
The word count of the article in the basic format information is counted to generate a word count statistic, which is expressed as a quantile. According to a preset quantile level standard, quantiles below 30% are the first level, quantiles of 30% to 60% are the second level, and quantiles above 60% are the third level. Whether the word count statistic falls in the first level is judged; if so, a first-level judgment result for the article word count is generated; if not, whether it falls in the second level is judged; if so, a second-level judgment result is generated; if not, a third-level judgment result is generated. Grading the article word count by quantile ensures the accuracy of the judgment.
OCR technology is applied to characters embedded in pictures. The picture is input and enhanced in several ways, such as sharpening and adjusting brightness, chroma, contrast and gray level, and the union of the characters recognized under each enhancement is taken to improve the OCR recognition rate. Each recognized character carries coordinates, and characters lying within a preset maximum distance of their neighbours above, below, to the left and to the right are connected and considered to belong to one block. This avoids line-by-line recognition breaking up a whole block and leaving the recognized entity discontinuous. The OCR recognition judgment result is generated according to the recognition efficiency.
The SimHash technique is used to identify semantically related content and remove repeated content. SimHash is a text similarity calculation method based on locality-sensitive hashing: it maps the original text content to a number (a hash signature), and the more similar two texts are, the closer their hash signatures. Google uses SimHash as an efficient algorithm for de-duplicating massive numbers of web pages. The original text is mapped to a 64-bit binary string, and the difference between original texts is represented by the difference between their binary strings. Semantically repeated content is identified according to the difference result, the proportion of repeated content is determined, and the SimHash judgment result is generated according to a preset judgment standard for the proportion of repeated content.
The readability indexes of the article are judged to generate a second judgment result, and the semantic coherence indexes of the article are judged to generate a third judgment result. The first judgment result comprises the article word count level judgment result obtained by judging the word count of the text, the OCR recognition judgment result obtained by judging the clarity of the text, and the SimHash judgment result obtained by judging the semantic content of the text. The second judgment result comprises the structure judgment result obtained by judging the text structure, the paragraph judgment result obtained by judging the text paragraphs, and the image-text judgment result obtained by judging the images and text. The third judgment result comprises the associated word judgment result obtained by recognizing the associated words and the associated word collocation strength judgment result obtained by judging the collocation strength of the associated words.
The weight coefficients of the indexes are set according to the actual situation. They comprise the weight coefficient of the article basic format, the weight coefficient of article readability and the weight coefficient of article semantic coherence, and the sum of all the weight coefficients is 1. The first, second and third judgment results are comprehensively calculated according to the corresponding weight coefficients to obtain the final evaluation result. The method evaluates the basic format, readability and semantic coherence of the article together and sets a reasonable weight for each index, which improves the detection and evaluation effect and ensures that the article is evaluated comprehensively and accurately.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (10)

1. A method for evaluating the expression presentation capability of an article is characterized by comprising the following steps:
acquiring article basic format information of an article and judging the article basic format indexes to generate a first judgment result;
obtaining the readability information of the article and judging the readability indexes of the article to generate a second judgment result;
obtaining semantic coherence information of an article and judging semantic coherence indexes of the article to generate a third judgment result;
setting a weight coefficient of each index;
and performing comprehensive calculation on the first judgment result, the second judgment result and the third judgment result according to the weight coefficient of each index to obtain a final evaluation result.
2. The method for evaluating the expression and presentation capabilities of the article according to claim 1, wherein the method for evaluating the basic format index of the article and generating the first evaluation result comprises the following steps:
counting the word number of the article in the basic format information of the article to generate a counting result;
expressing the statistical result according to a quantile mode;
and obtaining the word number grade judgment result of the article according to a preset quantile grade standard.
3. The method for evaluating the expression and presentation capabilities of the article according to claim 1, wherein the method for evaluating the basic format index of the article and generating the first evaluation result comprises the following steps:
converting the article characters into images by an OCR technology;
enhancing the image effect by adjusting sharpening, brightness, chroma, contrast and gray scale, and identifying the image to obtain characters identified by various modes;
solving a union set of characters identified by various modes, and carrying out coordinate marking on the identified characters;
and recognizing adjacent characters as a whole by adopting a preset maximum distance connection mode according to the coordinate marks of the characters, to generate an OCR recognition judgment result.
4. The method for evaluating the expression and presentation capabilities of the article according to claim 1, wherein the method for evaluating the basic format index of the article and generating the first evaluation result comprises the following steps:
mapping the text content of the article into a plurality of binary number strings by adopting a simhash technology;
carrying out difference comparison on the plurality of binary digit strings to generate a difference result;
and recognizing the semantic repeated content of the text according to the difference result to generate a simhash judgment result.
5. The method of claim 1, wherein the step of evaluating the readability indicators of the article and generating the second evaluation result comprises the steps of:
acquiring the structural rich-text format information in the article data according to the version;
matching the structure of the main titles, subtitles and serial numbers of the article with regular expressions according to the structural rich-text format information, and generating and counting the matching results;
expressing the counted matching results as quantiles;
and judging the matching results according to a preset percentage threshold to generate a structure judgment result.
6. The method of claim 1, wherein the step of evaluating the readability indicators of the article and generating the second evaluation result comprises the steps of:
acquiring the paragraph rich-text format information in the article data according to the version;
matching the article paragraphs with regular expressions according to the paragraph rich-text format information, and generating and counting the matching results;
expressing the counted matching results as quantiles;
and judging the matching results according to a preset percentage threshold to generate a paragraph judgment result.
7. The method of claim 1, wherein the step of evaluating the readability indicators of the article and generating the second evaluation result comprises the steps of:
acquiring the image-text rich-text format information in the article data according to the version;
matching the pictures with regular expressions according to the image-text rich-text format information, and generating and counting the matching results;
expressing the counted matching results as quantiles;
and judging the matching results according to a preset percentage threshold to generate an image-text judgment result.
8. The method of claim 1, wherein the step of evaluating semantic consistency indicators of the articles to generate a third evaluation result comprises the steps of:
extracting the associated words in the article and labeling the part of speech of the associated words to generate a labeling result;
performing associated word recognition with a BiLSTM + CRF model according to the labeling result to generate an associated word recognition result;
and generating an associated word judgment result according to the associated word recognition result and a preset judgment standard.
9. The method of claim 1, wherein the step of evaluating semantic consistency indicators of the articles to generate a third evaluation result comprises the steps of:
extracting the associated words in the article;
comparing the associated words in the article with the associated word list, and generating and counting the comparison results;
according to the formula
Figure FDA0002584167960000041
calculating the collocation strength of the associated words, wherein E = f(a) × l × f(b),
Figure FDA0002584167960000042
Z is the collocation strength of the associated words, f(a, b) is the co-occurrence frequency of associated words a and b, f(a) and f(b) are the frequencies of occurrence of associated words a and b respectively, E is the expected frequency between the two associated words, l is the collocation distance between the two associated words, and SD is the distance between a and b;
comparing the collocation strength of the associated words with a preset collocation strength threshold to generate a collocation strength comparison result;
and generating an associated word collocation strength judgment result according to the collocation strength comparison result and a preset judgment standard.
10. The method of claim 1, wherein the step of comprehensively calculating the first, second and third evaluation results according to the weight coefficient of each index to obtain the final evaluation result comprises the steps of:
taking the weight coefficient of the article basic format index in the weight coefficient as a first weight coefficient, taking the weight coefficient of the article readability index in the weight coefficient as a second weight coefficient, and taking the weight coefficient of the article semantic coherence index in the weight coefficient as a third weight coefficient;
calculating the product of the first evaluation result and the first weight coefficient to obtain a first weighted result;
calculating the product of the second evaluation result and the second weight coefficient to obtain a second weighted result;
calculating the product of the third evaluation result and the third weight coefficient to obtain a third weighted result;
and calculating the sum of the first weighted result, the second weighted result and the third weighted result to obtain the final evaluation result.
CN202010676282.2A 2020-07-14 2020-07-14 Method for evaluating expression presentation capacity of article Pending CN111815188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010676282.2A CN111815188A (en) 2020-07-14 2020-07-14 Method for evaluating expression presentation capacity of article

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010676282.2A CN111815188A (en) 2020-07-14 2020-07-14 Method for evaluating expression presentation capacity of article

Publications (1)

Publication Number Publication Date
CN111815188A true CN111815188A (en) 2020-10-23

Family

ID=72864772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010676282.2A Pending CN111815188A (en) 2020-07-14 2020-07-14 Method for evaluating expression presentation capacity of article

Country Status (1)

Country Link
CN (1) CN111815188A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012230652A (en) * 2011-04-27 2012-11-22 Isuzu Motors Ltd Readability evaluation method, readability evaluation device and readability evaluation program
US20130179169A1 (en) * 2012-01-11 2013-07-11 National Taiwan Normal University Chinese text readability assessing system and method
CN107315736A (en) * 2017-06-22 2017-11-03 云天弈(北京)信息技术有限公司 A kind of assisted writing system and method
CN107506360A (en) * 2016-06-14 2017-12-22 科大讯飞股份有限公司 A kind of essay grade method and system
CN109543090A (en) * 2018-08-07 2019-03-29 宜人恒业科技发展(北京)有限公司 A kind of method and apparatus for evaluating web documents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姚双云: "关联词搭配的自动发现", 《计算机应用研究》, vol. 28, no. 12, pages 4426 - 4428 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination