CN117708545A

CN117708545A - Viewpoint contribution degree evaluation method and system integrating theme extraction and cosine similarity

Info

Publication number: CN117708545A
Application number: CN202410144330.1A
Authority: CN
Inventors: 段尧清; 凌榕; 曾江峰; 程征
Original assignee: Central China Normal University
Current assignee: Central China Normal University
Priority date: 2024-02-01
Filing date: 2024-02-01
Publication date: 2024-03-15
Anticipated expiration: 2044-02-01
Also published as: CN117708545B

Abstract

The invention discloses a viewpoint contribution degree evaluation method and system integrating topic extraction and cosine similarity. The method comprises the following steps: generating a corresponding topic for the viewpoint of each article in a target database, wherein each article corresponds to one viewpoint and the articles include a target article; comparing the similarity of each topic with the topic of the target viewpoint, the target viewpoint being the viewpoint of the target article; grouping the topics whose similarity to the topic of the target viewpoint is greater than a similarity threshold into one class and generating a document set of similar topics, wherein the document set includes the target article; and obtaining the contribution degree of the target viewpoint according to the chronological order of the articles in the document set. The invention fully considers the advancement and foresight of article viewpoints, thereby measuring the viewpoint contribution degree of academic representative works and improving the accuracy and objectivity of their evaluation.

Description

Viewpoint contribution degree evaluation method and system integrating theme extraction and cosine similarity

Technical Field

The invention relates to a data processing and evaluating technology, in particular to a viewpoint contribution evaluation method and system for fusion topic extraction and cosine similarity.

Background

At present, the representative-work evaluation system is applied in many settings, such as academic title evaluation for university teachers, discipline evaluation, fund applications and project evaluation, and the selection of high-level talent, and a research evaluation mechanism that prioritizes performance and encourages innovation and competition is gradually taking shape. Although the representative-work evaluation system is widely used, its evaluation standards are not yet supported by a mature body of theory.

At present, the most common method for evaluating representative works in academia is still "peer evaluation", that is, inviting several experts or peers in a field to evaluate the academic results of other scholars in that field. Peer evaluation was adopted relatively early both at home and abroad, mainly to evaluate the quality and influence of research results. In 1986, the UK education funding committee carried out the first Research Assessment Exercise (RAE); in 2014, the Research Excellence Framework (REF) replaced the RAE, and each scholar only needs to submit no more than 3 research results for grading. In essence, peer evaluation is an experience-based form of judgment: experts rely on their own understanding of the field, and on their experience and insight, to assess the research results of their peers. The experts invited to take part are usually scholars of high academic standing who know the discipline well, so their assessments have high reference value. However, peer evaluation has shortcomings that are difficult to overcome. On the one hand, academic research is a highly innovative activity in which new knowledge and new content are continuously generated, so reviewers may face blind spots in their knowledge or an ill-suited knowledge structure. On the other hand, personal feelings may introduce subjective factors into the evaluation process, and in recent years many scholars at home and abroad have questioned the peer evaluation system.
Because of this subjectivity, the reviewer's knowledge structure is likely to influence the evaluation result, and the reliability and fairness of the peer evaluation system are called into question.

A scholar's academic contribution usually refers to the research and contributions the scholar has made in a particular field, typically presented as new knowledge, theories, methods, or applications. A scholar can therefore be evaluated from the perspective of contribution, and the contribution in a field can be measured by evaluating the quality of the scholar's academic results. Numerous studies at home and abroad address the influence of academic results, and many scholars use citation analysis to evaluate author contribution and academic influence. For example, an American physicist proposed the h-index as early as 2005 to measure scholars' contribution and influence. Other researchers have proposed an index of a scholar's document influence by constructing a weighted document citation network model. An evaluation method for an author's academic influence has also been constructed from 4 aspects, namely citation strength, citation position, citation sentiment and author signature order, to calculate the author's contribution degree. In recent years, many scholars have used comprehensive methods to evaluate contribution and academic influence. For example, Altmetrics (alternative metrics, which supplement traditional metrics by considering the social influence of academic results) has been combined with citation analysis to construct an evaluation model for data papers. In another approach, the 7 indexes most closely related to the academic influence of papers are selected to form a comprehensive evaluation system, and a comprehensive evaluation value for each paper is calculated by principal component analysis. A method of "objective peer assessment", which combines citation analysis with peer assessment, has also been proposed to assess the academic influence of papers.
Although there are many studies at home and abroad on author contribution and academic influence, most of them start from the citation perspective, constructing an evaluation model or proposing a measurement index to analyze an author's academic influence and thereby measure the author's contribution. Citation-based evaluation models and measurement indexes are complex, which hinders objective and efficient evaluation of viewpoint contribution.

Disclosure of Invention

In view of the defects of, or the need to improve on, the prior art, some embodiments of the invention provide a viewpoint contribution degree evaluation method and system integrating topic extraction and cosine similarity, which fully consider the advancement and foresight of article viewpoints, thereby measuring the viewpoint contribution degree of academic representative works and improving the accuracy and objectivity of their evaluation.

The technical scheme adopted by the invention for solving the technical problems is as follows:

in some embodiments, a method for evaluating perspective contribution of fusion topic extraction and cosine similarity is provided, the method comprising:

generating a theme corresponding to the viewpoint of each article in the target database; wherein each article corresponds to a point of view, and the articles comprise target articles;

comparing the similarity of each theme with the theme of the target viewpoint; the target viewpoint is the viewpoint of the target article;

grouping the topics with the similarity value of the topics and the topics of the target views being greater than a similarity threshold into one class, and generating a document set of similar topics, wherein the document set comprises target articles;

and obtaining the contribution degree of the target viewpoint according to the time sequence corresponding to each article in the document set, wherein the contribution degree is used for representing the advancement and foresight of the target viewpoint.

In some embodiments, each of the topics is composed of a plurality of topic words, and the generating of one topic for each article's perspective in the target database includes: placing a plurality of subject words of each subject into a word bag, taking the plurality of subject words in the word bag as a set without considering the sequence of the plurality of subject words, and performing a duplication removing operation on each word bag.

In some embodiments, the comparing the similarity of each of the topics to the topics of the target view includes: and comparing the cosine similarity of each theme with the theme of the target viewpoint, and generating a similarity value of each theme with the theme of the target viewpoint.

In some embodiments, the perspectives of each article in the target database are extracted using a UniLM model to form a perspective dataset, the data format in the perspective dataset being short text;

the generating a theme corresponding to the viewpoint of each article in the target database includes: and processing the viewpoint data set by using a TextRank4ZH model, automatically extracting the subject words from the short text aiming at each viewpoint, selecting three subject words with the largest weight value according to the weight sequence, and placing the three subject words into a word bag to generate a subject.

In some embodiments, the deriving the contribution of the target view according to the chronological order corresponding to each article in the document set includes: evaluating the contribution degree of the target viewpoint by using a viewpoint contribution degree index formula, wherein the viewpoint contribution degree index formula is as follows:

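[Formula not reproduced in this text: P(t), an exponential decay function of t with constants a, b and k]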

wherein a, b and k are constants, the value of k is adjusted according to test results, t is the number of days from the start date to the publication date of the target article, and the value of P(t) decreases as t increases.

In some embodiments, the start date is the earliest date of the publication time range of all articles in the target database, the value of the constant b is set to 6, the value of a is set to 4, and the value of k is set to 0.002.
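The index formula itself is not reproduced in this text, so the following is only a minimal sketch under a stated assumption: it takes the form P(t) = a·e^(−kt) + b, which is one exponential decay shape consistent with the description (constants a, b and k, and P(t) decreasing as t increases). The function name `contribution_index` and the example dates are hypothetical; the default constants are the values quoted above.

```python
import math
from datetime import date


def contribution_index(publish_date, start_date, a=4.0, b=6.0, k=0.002):
    """Viewpoint contribution index under an ASSUMED decay form.

    Assumption (not confirmed by the source text): P(t) = a * exp(-k * t) + b,
    where t is the number of days from start_date to publish_date.
    """
    t = (publish_date - start_date).days
    if t < 0:
        raise ValueError("publish_date must not precede start_date")
    return a * math.exp(-k * t) + b


# Hypothetical dates: the start date is the earliest publication date in the database.
earliest = date(2010, 1, 1)
first_score = contribution_index(date(2010, 1, 1), earliest)  # t = 0
later_score = contribution_index(date(2015, 1, 1), earliest)  # larger t, lower score
```

Under this assumed form, an article published on the start date scores a + b, and the score approaches b as t grows, so earlier viewpoints in a set of similar viewpoints always score higher than later ones.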

In some embodiments, the target article is representative of the learner to be evaluated.

In some embodiments, there is also provided a perspective contribution evaluation system of fusion topic extraction and cosine similarity, the perspective contribution evaluation system including:

the theme generation module is used for correspondingly generating a theme for the viewpoint of each article in the target database; wherein each article corresponds to a viewpoint, the articles comprise target articles, and each topic comprises three topic words;

The similarity comparison module is used for comparing the similarity of each theme with the theme of the target viewpoint; the target viewpoint is the viewpoint of the target article;

the clustering module is used for gathering the topics with the similarity value of the topics and the targets larger than the similarity threshold value into one type, and generating a document set, wherein the document set comprises target articles;

and the contribution index calculation module is used for obtaining the contribution of the target view according to the time sequence corresponding to each article in the document set, wherein the contribution is used for representing the advancement and foresight of the target view.

In some embodiments, there is also provided an electronic device including:

a processor;

a memory storing processor-executable instructions, wherein:

a processor reads instructions from a memory to implement the steps of the method as claimed in any one of the preceding claims.

In some embodiments, there is also provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as claimed in any one of the preceding claims.

Compared with the prior art, the invention has at least the following advantages. The viewpoint contribution evaluation method combines the TextRank4ZH subject-word extraction algorithm, a bag-of-words model, a cosine similarity algorithm and an exponential decay function into an integrated machine model and system, and evaluates the viewpoint contribution of representative works along the viewpoint dimension according to the chronological order of publication. Based on natural language processing and similarity clustering, the model and system can automatically analyze text and generate an evaluation index for the viewpoint contribution of representative works. The contribution index generated by the method and system of the application was compared with manual scoring results, and the consistency with manual evaluation reached 86.85%, indicating high accuracy. The method and system thus perform well in evaluating academic representative works and generate evaluation indexes that are highly consistent with manual evaluation results. Moreover, the whole evaluation process runs automatically on a machine, which helps eliminate subjective interference in manual evaluation. This makes the method of the present application more scientific and objective and improves the reliability of the evaluation.

In addition, the viewpoint contribution evaluation method does not consider citation relationships among articles and does not need to build a citation network. It analyzes the quality of an article based on the single article alone, extracts the value of specific elements of the article, and obtains the viewpoint contribution index value through comparative analysis of the viewpoints of the articles. The method and system measure only the relative time position of similar viewpoints in the whole database, so that each viewpoint obtains one viewpoint contribution index; other factors that are easily affected by subjectivity or open to manual manipulation are not considered, so that the evaluation of representative works is fair, reasonable and effective.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments disclosed in the present specification, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only examples of the embodiments disclosed in the present specification, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart illustrating a method for evaluating a contribution degree of a viewpoint according to some embodiments of the present invention.

FIG. 2 is a functional schematic of a UniLM model of some embodiments of the invention.

FIG. 3 is a graph showing a trend of the opinion contribution index according to some embodiments of the present invention.

Fig. 4 is a schematic diagram of a perspective contribution evaluation system according to some embodiments of the present invention.

Fig. 5 is a schematic overall flow chart of a perspective contribution evaluation method according to some embodiments of the present invention.

Fig. 6 is a schematic diagram showing a visual result when the cosine similarity threshold is 70% according to some embodiments of the present invention.

FIG. 7 is a schematic diagram of the number of topics distributed over time in accordance with some embodiments of the present invention.

Fig. 8 is a schematic diagram of experimental verification flow according to some embodiments of the invention.

Fig. 9 is a schematic structural diagram of an electronic device according to some embodiments of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.

With the 'Kawuwei' proposal, reform of the academic evaluation system is receiving more and more attention from scholars, and management departments have successively issued a series of guidance documents and policy requirements, making the representative-work evaluation system an important measure in that reform. Addressing the problem of evaluating scholars' representative works, the present application establishes a contribution evaluation system for scholars' representative works that runs on machine algorithms throughout and is used to assist the peer evaluation system, further improving the accuracy and objectivity of peer evaluation. In some embodiments of the present application, a viewpoint contribution index is calculated by constructing a weighted exponential decay model and combining viewpoint comparison with time sequence analysis. For the viewpoint contribution evaluation method and system provided by some embodiments of the application, verification shows that the index values obtained are highly consistent with the results of peer evaluation, with consistency reaching 86.85%, so the model has high application value in assisting peer evaluation. By fusing the TextRank4ZH model with time sequence, the application builds a viewpoint contribution index model that can be run by machine algorithms throughout, which can be used to assist the peer evaluation system and improve the reliability and fairness of evaluation results.

As shown in fig. 1, in some embodiments of the present application, a method for evaluating perspective contribution of fusion topic extraction and cosine similarity is provided, where the method for evaluating perspective contribution includes:

generating a theme corresponding to the viewpoint of each article in the target database; wherein each article corresponds to a point of view, the article comprising a target article.

Comparing the similarity of each theme with the theme of the target viewpoint; the target viewpoint is a viewpoint of a target article.

And grouping the topics with the similarity value of the topics and the topics of the target views being greater than a similarity threshold into one class, and generating a document set of similar topics, wherein the document set comprises target articles.

In this embodiment of the present application, the contribution of the target article may be reflected by the advancement and foresight of the target viewpoint. In some embodiments, the viewpoint corresponding to each article is contained in the abstract of that article. The abstract is the core of an article and a condensation of the author's ideas, and it often contains the conclusions the researchers reached. The innovativeness of an academic paper depends largely on the value of its viewpoint, which can reflect the contribution of the article.

In addition, the advancement and foresight of a viewpoint can be reflected in the chronological order in which articles are published. Advancement and foresight are among the dimensions of innovation: "innovation" in an academic paper is defined as creating or developing new theories, new professions, new methods, new technologies and the like that are of value in the relevant academic field, and "new" here is closely tied to chronological order. Papers appearing at different times represent different academic value. The same viewpoint is advanced and forward-looking to different degrees at different times, so its academic contribution differs across the time sequence. In this embodiment of the present application, the chronological order of article publication represents the advancement and foresight of the viewpoint of a representative work, and the contribution degree of the target viewpoint is obtained in combination with this order, so as to assist the evaluation of representative works in the scientific literature.

The method for evaluating the contribution degree of the views can be used for evaluating academic representatives, and the views of the abstract part of an article are extracted by analyzing elements in the single article, so that the contribution degree of the views is analyzed. In some embodiments of the present application, the method for evaluating the viewpoint contribution degree does not consider the quotation relationship between articles, does not build a quotation network, only analyzes the quality of a target article based on a single article, and generates a document set of similar subject through similarity analysis of viewpoints among articles. In some embodiments of the present application, the method only measures the time sequence corresponding to each article, and does not consider other factors susceptible to subjective influence and manual operation, and each view obtains a view contribution index, so that objective, fair, reasonable and effective evaluation is made.

In some embodiments of the present application, each of the topics is composed of a plurality of topic words, and the generating a topic corresponding to the viewpoint of each article in the target database includes: placing a plurality of subject words of each subject into a word bag, taking the plurality of subject words in the word bag as a set without considering the sequence of the plurality of subject words, and performing a duplication removing operation on each word bag.

In this embodiment of the present application, the viewpoints in the article abstracts need to be compared. Because the viewpoint data is in short text form, viewpoint comparison first requires extracting subject words from the viewpoints. In some embodiments of the present application, scoring is performed on the basis of a single document, and the model generates only one topic per viewpoint, which keeps the method operable. In some embodiments of the present application, each topic is composed of multiple subject words, so that the viewpoint topic is represented as comprehensively and accurately as possible without losing topic elements. In some embodiments of the present application, a bag-of-words model (Bag of Words Model) is introduced: the subject words of each topic are put into a word bag, so that each topic yields one word bag and its subject words are treated as a set regardless of their order. A deduplication operation is performed on each word bag, which prevents the similarity result from changing merely because subject words are repeated or arranged differently, and avoids similarity errors caused by a word appearing repeatedly or a different number of times.
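The word-bag step can be sketched as follows. This is an illustration only, assuming that a topic arrives as a plain list of subject words and that the bag is later turned into a binary vector over a shared vocabulary; the helper names `topic_to_bag` and `bags_to_vectors` and the sample words are hypothetical.

```python
def topic_to_bag(topic_words):
    """Turn a list of subject words into an order-free, duplicate-free bag."""
    return frozenset(w.strip() for w in topic_words if w.strip())


def bags_to_vectors(bags):
    """Map each bag to a binary vector over the shared, sorted vocabulary."""
    vocab = sorted(set().union(*bags))
    vectors = [[1 if word in bag else 0 for word in vocab] for bag in bags]
    return vocab, vectors


# Repetition and ordering no longer matter once the words are in a bag.
bag_a = topic_to_bag(["data", "governance", "data"])
bag_b = topic_to_bag(["governance", "data"])
vocab, vectors = bags_to_vectors([bag_a, bag_b])
```

Because both lists collapse to the same set, the two topics receive identical vectors, which is the effect the deduplication step is meant to guarantee before any similarity is computed.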

In some embodiments of the present application, the comparing the similarity between each subject and the subject of the target viewpoint includes: and comparing the cosine similarity of each theme with the theme of the target viewpoint, and generating a similarity value of each theme with the theme of the target viewpoint.

In this embodiment of the present application, after the subject word list is output, each viewpoint corresponds to one topic. To compare the topics, each topic is put into a word bag and deduplicated, the word bag is converted into a vector representation, and the cosine similarity between each pair of topic vectors is calculated. The cosine of the angle between two vectors is used to measure the difference between the two individuals: the closer the cosine value is to 1, the closer the angle is to 0 and the more similar the two vectors are; conversely, the closer the cosine value is to 0, the more dissimilar they are. A threshold is set for the similarity value, and topic pairs whose similarity is greater than the threshold are screened out, giving the set of topic pairs with higher similarity. For the target article, the viewpoint documents with higher similarity are grouped into one class; that is, the viewpoints whose topic similarity to the topic of the target viewpoint is greater than the similarity threshold are grouped into one class to generate a document set, and the document set includes the target article.
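The screening step can be illustrated with a short sketch. It assumes each topic is already a deduplicated bag of subject words, so that the cosine similarity of the implied binary vectors reduces to |A ∩ B| / sqrt(|A| · |B|). The function names, article identifiers and sample words are hypothetical; the 0.7 threshold echoes the 70% value mentioned for Fig. 6.

```python
import math


def set_cosine(bag_a, bag_b):
    """Cosine similarity of the binary vectors implied by two word sets."""
    if not bag_a or not bag_b:
        return 0.0
    return len(bag_a & bag_b) / math.sqrt(len(bag_a) * len(bag_b))


def similar_document_set(topics, target_id, threshold):
    """Ids of articles whose topic similarity to the target exceeds the threshold.

    The target article itself is always part of the returned set.
    """
    target_bag = topics[target_id]
    members = [target_id]
    for doc_id, bag in topics.items():
        if doc_id != target_id and set_cosine(bag, target_bag) > threshold:
            members.append(doc_id)
    return members


topics = {
    "target": frozenset({"open data", "governance", "policy"}),
    "doc_a": frozenset({"open data", "governance", "privacy"}),
    "doc_b": frozenset({"citation", "network", "impact"}),
    "doc_c": frozenset({"policy", "open data", "governance"}),
}
doc_set = similar_document_set(topics, "target", threshold=0.7)
```

With three subject words per bag, the similarity between two topics can only be 0, 1/3, 2/3 or 1, so in this sketch a threshold of 0.7 keeps only articles whose three subject words all match those of the target.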

A similarity algorithm compares the degree of similarity between two objects and is widely used in text analysis, image processing, recommendation systems and the like. In this embodiment of the present application, cosine similarity (Cosine Similarity) is used for the similarity calculation. Cosine similarity measures the similarity between two vectors as the cosine of the angle between two n-dimensional vectors A and B in n-dimensional space, which equals the dot product of the two vectors divided by the product of their lengths. In some embodiments of the present application, the specific calculation formula is as follows:

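$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}$$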

wherein A represents the selected topic vector, B represents a topic vector except A in the document set, and in order to calculate cosine similarity between A and other vectors in the document set, the value of B needs to traverse all vectors except A in the document set in turn to obtain cosine similarity values between each non-A vector and A vector in the document set. In this embodiment of the present application, cosine similarity works effectively in high-dimensional space as well, being suitable for processing high-dimensional text data. In short text processing, the data is often sparse, cosine similarity only focuses on the part of non-zero vocabulary, and the method is suitable for processing the sparse data. The cosine similarity is adopted for similarity calculation, the calculation is simple and easy to realize, the similarity is not influenced by vector dimensions and vector absolute sizes, the similarity only depends on the direction of the vector, and the method is suitable for high-dimensional features of text data and large-scale data sets.
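As a small worked example of the calculation described above, the sketch below computes the cosine similarity between one selected topic vector A and every other vector in a set. The vectors are made-up binary bag-of-words vectors, and the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


# A is the selected topic vector; B traverses the remaining vectors in turn.
vector_a = [1, 1, 1, 0, 0]
others = [[1, 1, 0, 1, 0], [0, 0, 0, 1, 1], [2, 2, 2, 0, 0]]
similarities = [cosine_similarity(vector_a, b) for b in others]
```

The last vector is a scaled copy of A and so has similarity 1 (up to floating-point rounding), which illustrates that the measure depends only on direction and not on absolute size.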

In some embodiments of the present application, the views of each article in the target database are extracted using a UniLM model to form a view dataset, and the data format in the view dataset is short text.

In this embodiment of the present application, the UniLM model used is a pre-trained language model proposed by Microsoft Research on the basis of the BERT model, called the unified pre-trained language model (Unified Language Model), which combines the advantages of both AR (Auto-Regressive Language Modeling) and AE (Auto-Encoding Language Modeling) language models. The UniLM model can be applied to both natural language understanding (NLU) tasks and natural language generation (NLG) tasks. Its structure is consistent with that of BERT and consists of a multi-layer Transformer network, and the different prediction tasks are carried out by modifying the Mask matrix. The model is pre-trained on a large amount of unsupervised data, and the missing portions are predicted from the context.
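To make the role of the Mask matrix concrete, the sketch below builds the attention mask commonly described for UniLM's sequence-to-sequence objective: source tokens attend to the whole source segment, while each target token attends to the source segment and to target tokens up to and including itself. It is a plain-Python illustration based on the published description of UniLM rather than code from this application, and the function name is hypothetical.

```python
def seq2seq_attention_mask(source_len, target_len):
    """Build a mask where mask[i][j] == 1 means token i may attend to token j."""
    total = source_len + target_len
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < source_len:
                # Every token can see the full source segment (bidirectional).
                mask[i][j] = 1
            elif i >= source_len and j <= i:
                # Target tokens see only themselves and earlier target tokens.
                mask[i][j] = 1
    return mask


# Two source tokens (e.g. abstract) followed by two target tokens (e.g. viewpoint).
mask = seq2seq_attention_mask(source_len=2, target_len=2)
```

Swapping in a different mask (all ones for bidirectional prediction, lower-triangular for left-to-right prediction) is what lets one shared Transformer serve several pre-training objectives.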

FIG. 2 is a functional schematic of the UniLM model. It can be seen that the UniLM model can serve three pre-training objectives at the same time, and that a sequence-to-sequence training mode is added to the original model. Natural language processing models involve more and more parameters, and more training data is required to reduce the risk of overfitting. However, the difficulty of data processing grows with the amount of data, and in the field of natural language processing (Natural Language Processing, NLP) manual annotation is often the most time-consuming part of data processing. Specifically, in some embodiments, the article is an academic paper. Viewpoint generation refers to summarizing given information content in a sentence or paragraph of limited length. Automatic generation of the abstract viewpoint of an article can be understood as an automatic summarization task at the full-text level of the academic paper: the UniLM unified pre-trained language model is used to condense the abstract content of the article into short sentences that express the content of the full text. In natural language processing, there are two main approaches to viewpoint generation, namely the extractive approach and the generative approach. In practice, the extractive approach considers only the word frequency of the article and ignores its semantic content, so the extracted viewpoint rarely represents the central sentence of the article. The generative approach is closer to the thinking process of the human brain, and because machine learning here simulates the human brain, the results obtained are more satisfactory. In the embodiment of the present application, the automatic viewpoint generation for articles uses the generative approach.
In some embodiments, the UniLM model is an improvement on the bidirectional language model BERT that overcomes BERT's large number of pre-training parameters and poor performance in text generation. It is suitable for automatic text generation from long texts such as Chinese abstracts and effectively improves the quality and efficiency of text generation.

In the embodiment of the application, machine learning is used to pre-train on the viewpoints of academic paper abstracts, so that the machine can automatically extract the viewpoints of an academic paper abstract (namely, the academic abstract viewpoints). Specifically, the academic abstract is taken as a text sequence of length a, a sentence sequence of length b is generated through machine learning, and the generated sentence sequence is output as the academic abstract viewpoint sentence.

The specific flow comprises the following steps:

(1) Academic abstract data acquisition. The abstract information of academic papers is obtained from five main informatics journals in library and information science.

(2) Data preprocessing. The acquired documents are screened, and meeting records, journal annual certificates, English documents and the like are eliminated.

(3) Academic abstract classification. The collected academic abstracts are divided into three types, namely standard abstracts, semi-standard abstracts and non-standard abstracts.

(4) Viewpoint generation model construction and expert extraction of viewpoints. Two experts manually extract the academic abstract viewpoints in a back-to-back manner; the manually annotated viewpoints are then processed into sequence vector form with Python, and the UniLM unified pre-trained language model is used for machine learning.

(5) Automatic viewpoint generation. After the model has learned from a large number of data sets, the learned viewpoint generation rules are used to automatically extract the academic viewpoints from the academic abstracts.

In the embodiment of the application, the UniLM model performs well in pre-training, and its accuracy in extracting academic abstract viewpoints reaches 88 percent, so it can effectively complete automatic viewpoint generation. Using the UniLM model improves not only the efficiency of viewpoint generation but also the objectivity of the viewpoints, laying the groundwork for the subsequent viewpoint comparison.

In some embodiments of the present application, generating a topic corresponding to the viewpoint of each article in the target database includes: processing the viewpoint data set with the TextRank4ZH model, automatically extracting subject words from the short text of each viewpoint, selecting the three subject words with the largest weight values according to the weight ranking, and placing the three subject words into a bag of words to generate a topic.

Subject word extraction from text is an important part of natural language processing, a field that studies theories and methods for effective communication between humans and machines in natural language and aims to enable a computer to understand and analyze human language in order to perform tasks such as translation, text classification and sentiment analysis. The embodiment of the application is based on the TextRank algorithm, which is derived from Google's PageRank algorithm. TextRank does not depend on a corpus: it divides a text into constituent units, builds a graph model, and ranks the important components of the text by a voting mechanism, so keyword extraction can be performed using only the information in a single document. This makes it suitable for the short text data set used in the embodiment of the application. TextRank4ZH is a keyword extraction and text summarization tool for Chinese text based on the TextRank algorithm; it is better suited to the Chinese text of the viewpoint data set, and keywords can be extracted automatically with it.

Specifically, the process of extracting subject words with the TextRank4ZH algorithm comprises the following steps:

(1) Segment the given text T into complete sentences.

(2) Perform word segmentation and part-of-speech tagging on each sentence, filter out stop words, and retain only words with the required parts of speech.

(3) Construct a candidate keyword graph G = (V, E), where V is the node set and E is the set of edges between the nodes in G. Edges are built from co-occurrence relations: an edge exists between two nodes only when their corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur.

(4) Iteratively propagate the weight of each node according to the TextRank4ZH algorithm until convergence.

(5) Sort the node weights in descending order to obtain the N most important words as candidate keywords.

(6) Mark the N most important words in the original text and, if they form adjacent phrases, combine them into multi-word subject words.

In some embodiments of the present application, N = 3, i.e. the three subject words with the largest weight values are selected according to the weight ranking. In the embodiment of the application, TextRank4ZH is a graph model for Chinese text that is mainly used for text summarization and keyword extraction, in which each node represents a word and each edge represents a relationship between words. The algorithm is based on the PageRank algorithm and does not depend on a corpus: it converts a text into a graph structure, calculates the weight value of each node iteratively, and can extract subject words directly from a single document. The larger the weight value of a node, the more important the word or phrase. Compared with other methods, TextRank4ZH also considers the interrelationship between words and gives better results for subject word extraction from short text.
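The graph construction and weight propagation steps above can be illustrated with a minimal sketch. This is not the TextRank4ZH tool itself: it is a plain TextRank-style iteration written for illustration, and it assumes that sentence splitting, word segmentation, part-of-speech filtering and stop-word removal have already produced the input word list. The function name `textrank_keywords`, the damping factor 0.85 (the value conventionally used with PageRank), the iteration limit and the tolerance are assumptions that do not come from the application.

```python
from collections import defaultdict


def textrank_keywords(words, window=5, top_n=3, damping=0.85, iters=100, tol=1e-6):
    """Return the top_n words of a pre-segmented, pre-filtered word list.

    Hypothetical helper: damping, iters and tol are illustrative defaults.
    """
    # Step (3): undirected co-occurrence graph. Two different words are
    # linked when they fall inside the same window of `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != word:
                neighbors[word].add(words[j])
                neighbors[words[j]].add(word)
    if not neighbors:
        return []

    # Step (4): propagate node weights until they stop changing.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iters):
        updated = {}
        for word in neighbors:
            received = sum(scores[u] / len(neighbors[u]) for u in neighbors[word])
            updated[word] = (1 - damping) + damping * received
        delta = max(abs(updated[w] - scores[w]) for w in neighbors)
        scores = updated
        if delta < tol:
            break

    # Step (5): descending order of weight; ties broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]
```

With N = 3 as in the embodiment, `textrank_keywords(words, top_n=3)` would return up to three candidate subject words; a text that yields no co-occurring words returns an empty list.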

In the embodiment of the application, TextRank4ZH is used to extract the topic of each single article, clustering is performed according to cosine similarity, and weights are calculated according to the chronological order in which similar viewpoints were put forward to obtain the contribution index value. The viewpoint contribution degree evaluation method requires no complicated citation network and does not depend on the relations among network nodes, so the evaluation can be more objective and accurate.

In some embodiments of the present application, each topic is composed of the three subject words with the highest weight values, so that as few topic elements as possible are lost, accuracy is improved, and calculation remains easy. Specifically, the output is in the format "subject word 1-subject word 2-subject word 3". In some embodiments, when the text is too short or the proportion of stop words and common words is too high, fewer than three subject words are output; in that case the text "unextracted subject word" is used in place of the missing subject word, which keeps the format consistent and facilitates subsequent comparison and analysis.
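The output format described above can be sketched as follows. The helper name `format_topic` is hypothetical, and the sketch assumes that each missing slot is filled with the placeholder text; the application does not state whether the placeholder fills every missing slot or appears only once.

```python
PLACEHOLDER = "unextracted subject word"  # placeholder text quoted in the description


def format_topic(subject_words, size=3, placeholder=PLACEHOLDER):
    """Join up to `size` subject words with '-', padding short results.

    Assumption: every missing slot receives the placeholder.
    """
    padded = list(subject_words[:size])
    padded += [placeholder] * (size - len(padded))
    return "-".join(padded)
```

For example, `format_topic(["subject", "cross", "medical"])` yields the string "subject-cross-medical".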

In some embodiments of the present application, the deriving the contribution of the target view according to the chronological order corresponding to each article in the document set includes: evaluating the contribution degree of the target viewpoint by using a viewpoint contribution degree index formula, wherein the viewpoint contribution degree index formula is as follows:

$P(t) = a e^{-kt} + b$,

wherein a, b and k are constants, the value of k can be adjusted according to the test result, t is the number of days from the start date to the release date of the target article, and the value of P(t) decreases as t increases.

For a given topic, the viewpoint documents whose similarity is higher than the similarity threshold are gathered into one category and ordered by the time at which each viewpoint was released, generating a document set W. In some embodiments, the release date of the target article is converted into a time stamp and a start date is set; t is the number of days from the start date to the release date of the target article, i.e. t = (release date) - (start date). The viewpoint contribution index formula is set according to an exponential decay function, $P(t) = a e^{-kt} + b$, where P(t) is the value of the contribution index at time t, a is the initial value of the decaying term (so that P(0) = a + b), b is a positive constant, k is a positive number giving the decay rate, and t is the time variable. The feature of this function is that the function value becomes smaller and smaller as t increases, with the magnitude of the decrease determined by the value of k; the larger the t value, the later the publication time. In some embodiments of the present application, the viewpoint contribution index decreases as t increases, i.e. the earlier the viewpoint is presented, the greater the viewpoint contribution.

In some embodiments of the present application, the start date is the earliest date of the publication time range of all articles in the target database, the value of the constant b is set to 6, the value of the constant a is set to 4, and the value of the constant k is set to 0.002. Referring to fig. 3, by the above constant setting, the finally output viewpoint contribution index value is between 6 and 10 and the variation range is clear, so that evaluation and distinction of the viewpoint contribution are facilitated.
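A minimal sketch of the index calculation with these constants follows. It assumes the decay form P(t) = a·e^(-kt) + b, which is the form consistent with the stated constants and with the stated output range of 6 to 10; the function name `contribution_index` is hypothetical.

```python
import math


def contribution_index(t_days, a=4.0, b=6.0, k=0.002):
    """Viewpoint contribution index for an article published t_days after the start date.

    Assumed form: P(t) = a * exp(-k * t) + b (defaults are the constants of the embodiment).
    """
    if t_days < 0:
        raise ValueError("t_days must not be negative")
    return a * math.exp(-k * t_days) + b
```

Under these assumptions the value starts at a + b = 10 for t = 0 and approaches b = 6 as t grows, so an earlier publication date always receives the larger index.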

In some embodiments of the present application, the target article is a representative work of the scholar to be evaluated, and the academic contribution of the scholar can be represented by the degree of contribution made by the representative work. In some embodiments, representative works are determined by the scholar: they may be selected by self-assessment, with the scholar to be evaluated choosing the achievements believed to best represent his or her level. Representative papers have two fundamental features, high relevance and high quality. First, a representative paper should be an academic achievement highly relevant to the author's own research direction or topic; scattered, unrelated papers are not suitable as representative works. Second, the primary purpose of representative work evaluation is to de-emphasize quantity and emphasize quality, so a paper of ordinary level cannot be called a representative work; high quality is its fundamental characteristic. The quality of the representative works reflects the level of the scholar, and the scholar's academic contribution can be further evaluated by reasonably evaluating these academic achievements. Of course, in some embodiments, artificial intelligence algorithms may also be employed to automatically obtain the representative works of the scholars under evaluation based on a big data model. The method of obtaining representative works is not particularly limited in the present application.

Referring to fig. 4, in some embodiments of the present application, there is further provided a perspective contribution evaluation system for fusion topic extraction and cosine similarity, the perspective contribution evaluation system including:

the theme generation module is used for correspondingly generating a theme for the viewpoint of each article in the target database; wherein each article corresponds to a point of view, and the articles comprise target articles;

In some embodiments, the viewpoint contribution evaluation system integrating topic extraction and cosine similarity further includes modules capable of implementing each process and function of the method embodiments described above, and may be applied to the electronic device described below. Its effects are the same as those of the method and, to avoid repetition, are not described again here.

Referring to fig. 5, in some embodiments, the UniLM model is first used to extract viewpoints from the abstracts of representative works, TextRank4ZH is used to extract topics from the viewpoints, and a bag-of-words model is introduced to process the subject words so as to avoid similarity calculation errors caused by different subject word orders. Then, the cosine similarity among the topics is calculated, topics with high similarity values are gathered into one class, and the viewpoints from which these topics were derived form a similar viewpoint set. Combining the time factor, an exponential decay function is proposed, and weights are assigned to the articles in the similar viewpoint set according to the order in which their viewpoints were published: for the same viewpoint, the earlier the publication time, the larger the viewpoint contribution value. Finally, the viewpoint contribution index value of a single representative work is generated. Specifically, the steps of the viewpoint contribution degree evaluation method of the present application include the following.

1. Data preparation

Classical journals in the field of informatics are selected as research objects. A total of 10104 articles published from 2017 to 2023 are downloaded from seven main informatics journals ("Informatics Science", "Informatics Theory and Practice", "Informatics", "Informatics Data Work", "Book Informatics Work", "Chinese Librarian Informatics" and "Informatics Journal"), and the viewpoints in the article abstracts are extracted with the UniLM model proposed by team members to form the viewpoint data set. Each viewpoint is indexed with a number from 1 to 10104 so that subsequent topics can be matched to their viewpoints.

2. Viewpoint topic extraction

Chinese text is processed using TextRank4ZH: subject words are automatically extracted from the text, the three subject words with the largest weight values are selected according to the weight ranking, and a topic list is generated in which each topic consists of three subject words. The subject words are then processed with a bag-of-words model to avoid calculation errors caused by different arrangement orders of the subject words. The specific steps are as follows:

(1) The data format in the viewpoint data set is short text, and the original text data is processed first to facilitate extraction of the subject term, and the process includes text cleaning, i.e. removing special characters, punctuation marks and numbers in the text. In addition, common spelling errors and abbreviations in the text are checked to ensure accuracy of the text. Text cleaning helps to improve the accuracy of subsequent keyword extraction. Next, the text is segmented, the text is broken down into words or phrases, and common stop words are removed, which aids in the segmentation of the text into meaningful vocabulary units.

(2) The preprocessed text is converted into a graph model by constructing nodes and the edges between them. Each node in the graph represents a word and each edge represents a relationship between words. Such a graph model helps to establish associations between words and thus to better capture the semantic structure of the text.

(3) The weight value of each node is calculated iteratively; the larger the weight value of a node, the more important the word.

(4) The words are sorted by weight value to find the most important ones.

(5) The subject words with the highest weight values are extracted. Each topic consists of the three subject words with the highest weight values and is output in the format "subject word 1-subject word 2-subject word 3". When the text is too short or the proportion of stop words, common words and the like is too high, fewer than three subject words are output; in that case the text "unextracted subject word" is used in place of the missing subject word, which keeps the format consistent and facilitates subsequent comparison and analysis.

(6) The subject words are processed with a bag-of-words model: the three subject words of each topic are placed into one bag of words, so that each topic is regarded as one bag in which word order is ignored. This avoids calculation errors caused by different subject word orders.

(7) Each bag of words is de-duplicated, to avoid subsequent calculation errors caused by subject words that appear repeatedly or a different number of times.

(8) The de-duplicated bags of words are vectorized for the subsequent comparison calculation.
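Steps (6) to (8) can be sketched as follows. The function names are hypothetical, and the binary (0/1) vector encoding over a shared vocabulary is an assumption: the application only states that the de-duplicated bags of words are vectorized.

```python
def topic_to_bag(topic, sep="-"):
    """Steps (6)-(7): an unordered, de-duplicated bag of subject words."""
    return frozenset(word for word in topic.split(sep) if word)


def build_vocabulary(bags):
    """Map every subject word that occurs in any bag to a vector position."""
    return {word: i for i, word in enumerate(sorted(set().union(*bags)))}


def vectorize(bag, vocabulary):
    """Step (8): binary vector with a 1 for each subject word in the bag (assumed encoding)."""
    vector = [0] * len(vocabulary)
    for word in bag:
        vector[vocabulary[word]] = 1
    return vector
```

Because a bag ignores order, two topics that contain the same subject words in a different order map to the same vector, which is the property the bag-of-words step is meant to guarantee.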

3. Similarity clustering

Clustering is the basis of classification, and classification is based on similarity, so the similarity between topics must be calculated first. This study uses the cosine similarity algorithm to calculate the similarity between topics after the generated subject words have been put into bags of words and de-duplicated. By the end of step (5) above, a total of 10104 topics have been generated. n topics are randomly selected from this list; in some embodiments, these may be the topics corresponding to the target articles, that is, the target topics. For each of them, the topics with higher similarity to it are screened out. The screening criterion is that a topic must have a sufficiently high cosine similarity to one of the n topics (a target topic); only topics meeting this criterion are placed in the same class as that target topic, i.e. they are regarded as similar topics. In some embodiments, the cosine similarity threshold, denoted β, is set to a suitable value within a set range. In some embodiments, starting from a similarity of 60% and using a step size of 0.1, different similarity thresholds are tried in turn to obtain the clustering results at each threshold. The topic clustering results under the different thresholds are analyzed and compared using two indexes, diffusivity and convergence: the diffusivity index measures the difference or diversity among topics in a topic set, and the convergence index measures the similarity among different topics in the set.

Diffusivity is obtained by calculating the Euclidean distances between the center points of the different topic categories in the whole set and averaging these distances; the larger the diffusivity index, the larger the difference between categories, i.e. the better the clustering effect. Convergence is obtained by measuring the Euclidean distances between different topics within the same cluster and averaging them to obtain a convergence index for each topic category; the smaller the convergence, the more compact the cluster and the better the clustering effect. Both indexes are normalized to values between 0 and 1 for comparison. In some embodiments, the similarity threshold is set to 0.7. As shown in Table 1, the convergence differs little across thresholds, and the diffusivity is best when the similarity threshold is 70%.
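The two evaluation indexes can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of the application: clusters are given as lists of equal-length numeric vectors, the per-cluster convergence values are averaged into a single figure, and min-max scaling is used for the normalization to the range 0 to 1. All function names are hypothetical.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Center point of one cluster (component-wise mean)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def mean_pairwise_distance(points):
    """Average Euclidean distance over all pairs; 0.0 when there is no pair."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def diffusivity(clusters):
    """Mean distance between cluster centers: larger means better separated."""
    return mean_pairwise_distance([centroid(c) for c in clusters])


def convergence(clusters):
    """Mean within-cluster distance, averaged over clusters (assumed aggregation)."""
    return sum(mean_pairwise_distance(c) for c in clusters) / len(clusters)


def min_max_normalize(values):
    """Scale values to [0, 1] so results for different thresholds can be compared."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]
```

In a threshold sweep, `diffusivity` and `convergence` would be computed once per candidate threshold and the two resulting lists passed through `min_max_normalize` before comparison.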

TABLE 1 clustering result assessment index

The clustering result is reduced in dimension and visualized using t-SNE, a nonlinear dimension reduction algorithm suitable for high-dimensional data. Visualizing the clustering results makes the clustering effect under different thresholds intuitive; the visualization result for a cosine similarity threshold of 70% is shown in FIG. 6.

After the similarity threshold is selected, the cosine similarity between the vectors obtained from the bags of words in the previous stage can be calculated. Cosine similarity compares two vectors by calculating the cosine of the angle between them; the larger the value, the higher the similarity. 31 topics (target topics) are randomly extracted from the processed data, and the cosine similarity between every other topic and each of the 31 selected topics is calculated, so that each pair of topics receives a similarity measurement. Next, the generated cosine similarity values are compared: for each target topic, only the topics whose similarity reaches 70% are retained and classified into one category, and they are numbered for subsequent searching and analysis. Table 2 shows some examples of topics screened out as meeting the similarity requirement. It can be seen that the similarity of "subject-cross-medical" and "cross-medical-subject" is 1, which shows that the similarity value of topics is not affected by the order in which the subject words are arranged.
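The screening step can be sketched as follows. For binary bag-of-words vectors the cosine similarity reduces to the number of shared words divided by the square root of the product of the two bag sizes, which is what the sketch computes directly on sets. The function names are hypothetical, and treating the threshold as inclusive (similarity greater than or equal to 0.7) is an assumption.

```python
import math


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity of two de-duplicated bags, i.e. of their binary vectors."""
    if not bag_a or not bag_b:
        return 0.0
    return len(bag_a & bag_b) / math.sqrt(len(bag_a) * len(bag_b))


def similar_topics(target, candidates, threshold=0.7):
    """Return (index, similarity) for candidate topics at or above the threshold.

    `candidates` maps a viewpoint index to its topic string; the inclusive
    comparison is an assumption.
    """
    target_bag = frozenset(target.split("-"))
    kept = []
    for index, topic in candidates.items():
        similarity = cosine_similarity(target_bag, frozenset(topic.split("-")))
        if similarity >= threshold:
            kept.append((index, similarity))
    return sorted(kept)
```

Calling `similar_topics("subject-cross-medical", {1: "cross-medical-subject"})` keeps entry 1 with a similarity of 1.0, reproducing the order-independence noted above.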

TABLE 2 similar topic section presentation

31 topics and similar topics corresponding to the 31 topics are selected in the experiment, 557 topics are extracted in total, and the annual distribution corresponding to the topic set is shown in figure 7. Finally 31 sets of topics are generated, each set representing a class of topics. Inside each set of similar topics, ascending order is performed according to published time order. And each topic is corresponding to the viewpoint sentence to which the topic belongs according to the index value, and a data set with similar viewpoints ordered according to the published time sequence is generated.

4. Contribution index calculation

In an academic literature library with a set time range, on the premise that topics are similar, the earlier a viewpoint is proposed, the higher its viewpoint contribution degree. Since the "publication time" is in date format, the later the date, the smaller the viewpoint contribution index value. This accords with the trend of an exponential decay function. In some embodiments of the application, the exponential decay function is modified to some extent, and the viewpoint contribution index formula is set as follows: $P(t) = a e^{-kt} + b$, wherein a, b and k are constants, the value of b can be adjusted according to the test result, the value of k determines the decay speed of the function, and t is the independent variable. The value of P(t) decreases as t increases.

Because the publication time in the data set is in date format, it cannot conveniently be substituted for t directly, so the date data must first be converted into time stamps. In the present embodiment, the time range of the collected data is 2017-2023, and "2017-01-01" is set as the start date. Subtracting the start date from the publication date gives the number of days from the start date, which is taken as the t value, i.e. t = (publication date) - (start date). Under these experimental conditions the t values are always positive, and the smaller the t value, the earlier the publication. The resulting t values range from 4 to 2285, a span of 2281.
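The date conversion can be sketched with the Python standard library. The function name `days_since_start` and the ISO date string input are assumptions; the example dates in the usage note are chosen only because they reproduce the endpoint values 4 and 2285, and are not taken from the data set.

```python
from datetime import date

START_DATE = date(2017, 1, 1)  # start date used in this embodiment


def days_since_start(publication_date, start_date=START_DATE):
    """Return t, the number of days between the start date and an ISO date string."""
    published = date.fromisoformat(publication_date)
    t = (published - start_date).days
    if t < 0:
        raise ValueError("publication date precedes the start date")
    return t
```

For instance, `days_since_start("2017-01-05")` gives 4 and `days_since_start("2023-04-05")` gives 2285.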

In this embodiment, the articles are published journal papers that basically meet the standards. The value of the constant b is set to 6, the value of the constant a is set to 4, and the value of the constant k is set to 0.002, so the viewpoint contribution index formula is as follows:

$P(t) = 4e^{-0.002t} + 6$

The value of t is substituted into the formula and the result is kept to two decimal places; the final index values lie between 6 and 10 with a clear range of variation. The trend of the function as t increases is shown in fig. 3. In some embodiments, the contribution index values of different viewpoints are calculated by substitution into the formula, and the result is rounded to an integer and used as the scoring result.
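As a worked check, assuming the decay form $P(t) = a e^{-kt} + b$ with a = 4, b = 6 and k = 0.002, the two endpoints of the t range reported above (t = 4 and t = 2285) give:

```latex
\begin{aligned}
P(4)    &= 4e^{-0.002 \times 4} + 6    = 4e^{-0.008} + 6 \approx 9.97 \\
P(2285) &= 4e^{-0.002 \times 2285} + 6 = 4e^{-4.57} + 6  \approx 6.04
\end{aligned}
```

Both values fall inside the stated range of 6 to 10; rounded to integers they become 10 and 6 respectively.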

5. Result verification

Peer review is still the most common evaluation method today. Because this study aims to assist manual scoring with a machine algorithm, manual evaluation is chosen to verify the results of the machine model. The specific scheme is as follows: two experts in the informatics field are selected to evaluate the abstracts corresponding to the 557 extracted topics. Unlike the process above, similarity and time order are no longer considered: the abstracts are arranged in completely random order and are located by their index numbers. The two experts score back to back, without any communication. The ideas and standards of the study, namely evaluating representative works in terms of advancement and foresight, are explained to them in detail, and each expert scores the articles on advancement and foresight so that the evaluation dimensions are consistent. Advancement means that the article's viewpoint is novel and unique relative to existing research, or proposes new ideas that contribute to the field; foresight means that the article's viewpoint offers direction and inspiration for future research.

Scores range from 6 to 10 points. Each article receives two scores, whose average is taken as the final result of the article, rounded to an integer. The overall verification flow is shown in fig. 8.

After the experts' manual scores are obtained, the final scores of the two experts are compared; data on which the experts' scores are inconsistent are removed, and only the abstracts with consistent expert scores are retained. According to the statistics, of the 557 abstracts scored by both experts, 40 have inconsistent results, accounting for 7.18% of the total data set, and the remaining 517 usable abstracts account for about 92.82% of the total. The viewpoint contribution index obtained by the machine is likewise rounded to an integer and compared with the scores of the 517 abstracts. The viewpoint contribution index of 449 abstracts is completely consistent with the expert scores, about 86.85% of the usable abstracts, while the machine results of 68 abstracts are inconsistent with the expert review results, 13.15% of the usable abstracts. The statistics of the evaluation results are shown in table 3; the consistency between the model results and the expert scores is high.
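The comparison can be sketched as follows. The helper names are hypothetical, and rounding halves upward is an assumption, since the application only says that values are rounded to integers. Python's built-in `round` rounds halves to the nearest even integer, so an explicit half-up rule is used instead.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value):
    """Round to the nearest integer, with .5 rounded upward (assumed rule)."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def agreement_rate(model_indexes, expert_scores):
    """Share of abstracts whose rounded model index equals the expert score."""
    pairs = list(zip(model_indexes, expert_scores))
    matches = sum(round_half_up(m) == e for m, e in pairs)
    return matches / len(pairs)


def percent(part, whole):
    """Percentage with two decimal places, as reported in the statistics."""
    return round(100 * part / whole, 2)
```

Applied to the counts reported above, `percent(40, 557)` gives 7.18, `percent(517, 557)` gives 92.82, `percent(449, 517)` gives 86.85 and `percent(68, 517)` gives 13.15.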

Table 3 results statistics table

The viewpoint contribution evaluation method of the application combines the TextRank4ZH subject word extraction algorithm, a bag-of-words model, the cosine similarity algorithm and an exponential decay function into one integrated machine model, and evaluates the viewpoint contribution of representative works along the viewpoint dimension according to the order of publication time. The model is based on natural language processing and similarity clustering technology, and can automatically analyze texts and generate a viewpoint contribution evaluation index for representative works. The generated contribution index is compared with the manual scoring results, and the comparison shows that the model of the application has high accuracy: its consistency with manual review reaches 86.85 percent. The model therefore performs well on academic representative works and generates evaluation indexes that are highly consistent with manual evaluation results. Moreover, the whole evaluation process is carried out automatically by the machine, which helps to eliminate subjective interference in manual evaluation. This makes the method of the application more scientific and objective and improves the reliability of the evaluation.

In some embodiments of the present application, as shown in fig. 9, there is also provided an electronic device, including:

A processor 10;

a memory 20 storing processor-executable instructions, wherein:

the processor reads instructions from the memory to implement the steps of the perspective contribution evaluation method as described in any of the above.

In some embodiments, the electronic device may include, but is not limited to, a smart phone, tablet, wearable device, personal computer (personal computer, PC), netbook, personal digital assistant (personal digital assistant, PDA), smart watch, in-vehicle device, robot, desktop computer, and the like.

Some embodiments also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the viewpoint contribution degree evaluation method described in the above respective method embodiments.

The computer readable storage medium may be a memory that can be used to store a software program as well as various data. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store the above-mentioned computer program, and the data storage area may store model data of the above-described viewpoint contribution evaluation system integrating topic extraction and cosine similarity, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.

In some embodiments, the computer readable storage medium of the present application may include one or more databases, such as a key-value database, a MySQL database, etc.; the category of each database and its data storage manner are not described in detail herein. The one or more databases of some embodiments of the present application may be integrated with an electronic device, or may exist as a separate server or in cloud storage form, as determined by the system structure and application requirements of the application platform to which the present application is applied.

The processor is a control center of the electronic device, and connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or models stored in the memory, and calling data stored in the memory, thereby performing overall control of the electronic device. The processor may include one or more processing units; preferably, the processor may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, applications, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor.

In some embodiments, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the viewpoint contribution evaluation method of any of the embodiments described above.

Those of skill in the art will appreciate that in one or more of the above examples, the functions described in the various embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A viewpoint contribution degree evaluation method integrating theme extraction and cosine similarity is characterized by comprising the following steps:

2. The method of evaluating a degree of contribution to a viewpoint according to claim 1, wherein each of the topics is composed of a plurality of subject words, and wherein the generating of a topic for each article in the target database includes: placing a plurality of subject words of each subject into a word bag, taking the plurality of subject words in the word bag as a set without considering the sequence of the plurality of subject words, and performing a duplication removing operation on each word bag.

3. The viewpoint contribution evaluation method according to claim 2, wherein comparing the similarity of each topic with the topic of the target viewpoint includes: computing the cosine similarity between each topic and the topic of the target viewpoint, and generating a similarity value between each topic and the topic of the target viewpoint.
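As a hedged illustration of the comparison in claim 3, the sketch below computes cosine similarity between two deduplicated bags of subject words. It assumes each bag is encoded as a binary vector over the combined vocabulary, in which case the cosine reduces to |A ∩ B| / sqrt(|A| · |B|). The claim does not state the vector encoding, so this encoding and the function name `cosine_similarity` are assumptions.

```python
import math


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity of two word bags treated as binary vectors.

    Assumption of this sketch: each bag is a 0/1 vector over the union of
    both vocabularies, so the dot product is the number of shared words and
    each vector norm is the square root of the bag size.
    """
    a, b = set(bag_a), set(bag_b)
    if not a or not b:
        return 0.0  # an empty bag has no direction; treat it as dissimilar
    return len(a & b) / math.sqrt(len(a) * len(b))


if __name__ == "__main__":
    target = {"open data", "governance", "policy"}
    other = {"open data", "policy", "evaluation"}
    print(round(cosine_similarity(target, other), 3))
```

Topics whose similarity value with the target topic exceeds the similarity threshold would then be grouped into the document set of similar topics.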

4. The viewpoint contribution evaluation method according to claim 3, wherein the viewpoint of each article in the target database is extracted by using a unified language pre-training model to form a viewpoint data set, and the data format in the viewpoint data set is short text;

5. The viewpoint contribution evaluation method according to claim 4, wherein obtaining the contribution degree of the target viewpoint according to the chronological order corresponding to each article in the document set includes: evaluating the contribution degree of the target viewpoint by using a viewpoint contribution degree index formula, wherein the viewpoint contribution degree index formula is as follows:

，

6. The viewpoint contribution degree evaluation method according to claim 5, wherein the start date is the earliest date in the publication time range of all articles in the target database, the constant b is set to 6, a is set to 4, and k is set to 0.002.
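The start-date convention of claim 6 can be sketched as follows. The sketch only derives the start date and the time elapsed from it to a given article; it does not implement the viewpoint contribution degree index formula, which is not reproduced in this text. The function name `days_since_start`, the use of days as the time unit, and the constant names `A`, `B`, and `K` are assumptions.

```python
from datetime import date

# Parameter values stated in claim 6. How they enter the index formula is
# not reproduced in this text, so they are only recorded here.
A = 4
B = 6
K = 0.002


def days_since_start(publication_dates, article_date):
    """Days elapsed from the start date to the given article's date.

    The start date is the earliest publication date among all articles in
    the target database. Counting in days is an assumption of this sketch.
    """
    start_date = min(publication_dates)
    return (article_date - start_date).days


if __name__ == "__main__":
    all_dates = [date(2015, 3, 1), date(2012, 6, 15), date(2020, 1, 10)]
    print(days_since_start(all_dates, date(2015, 3, 1)))
```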

7. The viewpoint contribution evaluation method according to claim 1, wherein the target article is a representative work of a scholar to be evaluated.

8. A viewpoint contribution degree evaluation system integrating topic extraction and cosine similarity, characterized in that the viewpoint contribution degree evaluation system comprises:

9. An electronic device, the electronic device comprising:

a processor;

a memory storing processor-executable instructions, wherein:

the processor reads the instructions from the memory to implement the steps of the method according to any one of claims 1-7.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any of claims 1-7.