CN110298033B

CN110298033B - Keyword corpus labeling training extraction system

Info

Publication number: CN110298033B
Application number: CN201910455064.3A
Authority: CN
Inventors: 崔莹; 代翔; 黄细凤; 王侃; 杨拓; 余博; 朱宇涛; 李超; 李源源
Original assignee: Southwest Electronic Technology Institute No 10 Institute of Cetc
Current assignee: Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date: 2019-05-29
Filing date: 2019-05-29
Publication date: 2022-07-08
Anticipated expiration: 2039-05-29
Also published as: CN110298033A

Abstract

The invention discloses a keyword corpus labeling training extraction tool and aims to provide a labeling training tool that reduces the complexity of the manual labeling process and improves the efficiency and accuracy of labeling massive keyword corpora. The invention is realized by the following technical scheme: a keyword corpus labeling preparation module distinguishes massive corpus data from different sources; a semi-automatic corpus keyword labeling module creates a keyword labeling task, autonomously selects a suitable algorithm, performs automatic labeling based on the algorithm model, pre-labels the text corpus data to be labeled by integrating at least one of the keyword extraction algorithms CHI, LDA, TEXTRANK and TFIDF, and fuses the labeling results of multiple algorithms; a feedback-type keyword labeling model learning and training module trains the keyword labeling algorithm model after the labeling task is completed; and a keyword labeling model effect evaluation module automatically evaluates the labeling effect of the model against quantified indexes.

Description

Keyword corpus labeling training extraction system

Technical Field

The invention relates to the technical field of text mining, in particular to a keyword corpus semi-automatic labeling training extraction system.

Background

In natural language processing, the key to handling massive numbers of text files is to extract the issues users care about most. For long and short texts alike, the main idea of the whole text can often be glimpsed through a few keywords. Text-based recommendation and text-based search also depend heavily on text keywords, and the accuracy of keyword extraction directly affects the final performance of a recommendation or search system. Keyword extraction is therefore an important part of text mining. The rapid development of networks has given people a simple and convenient way to obtain information, and the number of electronic documents such as web pages, e-mails and e-books keeps growing. However, these explosively growing information resources lack structured content, so people must spend a great deal of time reading and organizing the information they obtain, which greatly reduces retrieval efficiency. Organizing these numerous, disordered resources and improving the efficiency of information use, so that people can obtain the key information of texts simply, quickly and accurately, has therefore become extremely important. Automatic keyword extraction is widely applied. Especially in knowledge mining, information retrieval, text clustering, text classification and similar fields, automatic keyword indexing is a fundamental, core technology. It also plays a critical role in relevance feedback, automatic filtering, event detection and tracking, and related fields. Automatic keyword indexing can be said to be the groundwork for all automatic text analysis and is essential to many text analysis tasks.
Keyword extraction plays an important role in automatic summarization, information retrieval, text classification, text clustering and the like. How to acquire information from the internet intelligently, quickly and effectively has become an urgent problem in computing, and keyword extraction is an important means of doing so. Keywords have certain characteristics: they are generally nouns or noun phrases; they generally do not start or end with stop words; and they are generally not very long. Developing and selecting features is both a key point and a difficulty of keyword extraction, because the quality of the features directly affects the judgment of whether a word is a keyword.

In recent years, with the rapid development of big data acquisition, text information on the network has grown explosively, making it increasingly difficult to find the information one needs. Mining the maximum value from the data has become particularly urgent, which places entirely new demands on intelligent analysis of big data. Manual processing is impractical for information resources that expand at such speed, so automatic processing methods are needed to help people manage and organize information effectively and to solve the problem of being rich in information but poor in knowledge. Against this background, machine learning and deep learning have developed rapidly and achieved great success in big data applications, and the underlying model algorithms rely on large amounts of labeled corpus data as basic training support. Most existing keyword extraction algorithms judge the importance of words from their statistical information and select the words that exceed a certain threshold as the keywords of an article. However, statistical methods are computationally expensive and require a large statistical corpus. A number of keyword measurement functions have been proposed on this basis, including TF/IDF, entropy functions and distribution coefficients. Many machine learning algorithms have also been applied to keyword extraction. For example, CHI and TFIDF can both be used for feature selection and weight calculation; the difference is that TFIDF can be applied to any text set, whereas CHI requires texts with classification labels. TextRank was originally proposed as a keyword extraction method and was later also tried as a weight calculation method, but its computational complexity is high.
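The threshold-style statistical scoring described above can be illustrated with a minimal TF-IDF sketch. It assumes pre-tokenized documents and the plain tf × log(N/df) weighting; the function name `tfidf_keywords` and the toy documents are illustrative assumptions, not part of the patent.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the words of docs[doc_index] by TF-IDF and return the top_k words."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    tokens = docs[doc_index]
    tf = Counter(tokens)
    # Score = relative term frequency * log(inverse document frequency).
    scores = {
        word: (count / len(tokens)) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical toy corpus, already segmented into words.
docs = [
    ["keyword", "extraction", "keyword", "corpus", "the"],
    ["the", "weather", "is", "fine"],
    ["the", "corpus", "is", "large"],
]
print(tfidf_keywords(docs, 0, top_k=2))
```

A word such as "the", which occurs in every document, receives a score of zero, which is the behavior the background attributes to words that are frequent everywhere.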

Labeling massive data corpora has an important influence on the training of algorithm models. As basic work in big data analysis, it mainly supports day-to-day research and development, algorithm tuning, and the demonstration and verification of big data, and it is a core foundation of big data mining and analysis. A keyword is a word that carries substantial meaning for the main content of a text; it is a word or phrase selected from the title, abstract or body to serve text indexing or retrieval. Keyword extraction is the process of selecting, from a single text or a text set, a suitable set of feature items that fully expresses the subject content, through statistical and semantic analysis of core words. Because keywords are the most basic units for representing the subject of a text, keyword extraction is widely used in natural language processing and Chinese information processing, for example in automatic summarization, information retrieval, text clustering, automatic question answering and topic tracking, and it also provides valuable clues for information monitoring and tracking. Part of speech is obtained through word segmentation and grammatical analysis. Most keywords are nouns or verbal nouns. The position at which a word appears is generally also of great value. For example, the title and abstract are the author's own summary of the central idea of an article, so words appearing there are fairly representative and more likely to be keywords. However, because authors differ in habits and writing style, the position of key sentences also varies, so position is only a very coarse way of obtaining keywords and is generally not used alone.
An obvious index for judging whether a word is important in an article is word frequency, since important words often appear in an article many times. Conversely, a word that appears frequently is not necessarily important: some words appear frequently in all kinds of articles, and they are certainly less important than words that appear frequently only in a particular article. Word frequency is the frequency with which a word appears in a text; it is generally believed that the more frequently a word appears, the more likely it is to be a core word of the article. Word frequency simply counts the occurrences of a word in a text, but keywords obtained from word frequency alone are highly uncertain, and for longer texts the method is very noisy. The crux of keyword extraction based on statistical features lies in computing the quantized feature indexes, and different indexes yield different results. Each index has its own advantages and disadvantages, so in practice the top-K words are obtained by combining several quantization indexes and are used as keywords. Keyword extraction is widely applied in text mining, but existing methods have certain problems. In the prior art, the effectiveness of statistical and machine learning methods depends heavily on manually labeled corpora: model parameters are trained on observed data (the labeled corpus), the model computes the probability of each candidate segmentation in the word segmentation stage, and the segmentation with the highest probability is taken as the final result.
Machine learning methods are feasible only if a sufficiently large knowledge base or training base has been established. Because the problem of knowledge learning has not yet been fundamentally solved, knowledge bases are updated slowly and cannot keep up with current scientific development. A labeled data set provides limited information, manual labeling of samples is time-consuming and laborious, and large-scale labeling is too expensive. Unlabeled samples that are easy to obtain (for example, web pages on the internet) are far more numerous than labeled samples and also approximate the distribution of data over the whole sample space. Providing as many labeled samples as possible requires hard, slow manual labor, which holds back the construction of the whole system and creates a labeling bottleneck. The idea behind keyword extraction algorithms based on statistical features is to extract the keywords of a document using statistical information about the words in it. Typically the text is preprocessed to obtain a set of candidate words, and keywords are then selected from the candidate set by quantizing feature values. Common sequence labeling models are HMM and CRF. Segmentation algorithms of this kind can handle ambiguous and unknown words and perform better than earlier segmentation algorithms, but they need a large amount of manually labeled data and segment more slowly. Supervised keyword extraction treats keyword extraction as a binary classification problem, judging whether each word or phrase in a document is a keyword. Because it is a classification problem, a labeled training corpus must be provided; a keyword extraction model is trained on it and then used to extract keywords from the documents to be processed.

Traditional keyword extraction methods fall into two types, unsupervised and supervised. Unsupervised methods include TF-IDF, chi-squared, TextRank, LDA and others. Supervised methods convert keyword extraction into a binary classification problem that judges whether each word is a keyword, and use supervised learners such as Naive Bayes and the C4.5 decision tree. Each type has advantages and disadvantages. Unsupervised methods need no manually labeled training set and are therefore faster, but because they cannot combine multiple kinds of information when ranking candidate words, their results may be inferior to those of supervised methods. Supervised methods can learn through training how much each kind of information should influence the keyword decision, so their results are better, but in the current data era labeling a training set is very time-consuming and laborious; the disadvantage of supervised keyword extraction is its high labor cost. A third category has the computer simulate human understanding of sentences in order to recognize words. Because Chinese semantics are complex, it is difficult to organize the various kinds of linguistic information into a machine-readable form, and because a large training corpus must be labeled, which is time-consuming and laborious by hand, such word segmentation systems are still at the experimental stage. In constructing a language network graph, the preprocessed words are used as nodes and the relations between words as edges. The weight of an edge in the language network graph is generally expressed by the degree of association between the words.
When keywords are obtained from a language network graph, the importance of each node is evaluated, the nodes are ranked by importance, and the words represented by the top-K nodes are selected as keywords. Chinese has no explicit word boundaries, which adds difficulty to automatic indexing of keyword strings and means that a keyword extraction model needs more training corpus to reach stable performance. In practice, because application environments are complex, the same keyword extraction method performs differently on different types of text, such as long and short texts. Different conditions call for different algorithms, and no single algorithm works well in every environment. Engineering practice also depends heavily on the accuracy of text preprocessing and word segmentation. Wrongly written characters, variant word forms and similar problems in the text must be handled in the preprocessing stage, and the choice of word segmentation algorithm and the recognition of unknown and ambiguous words considerably affect keyword extraction. Keyword extraction looks simple but is very tricky in practice. Because an annotated corpus of Chinese subjective opinion text contains a large amount of information such as word segmentation, part of speech, dependency relations, semantics, word concepts and opinions, completing the annotation is usually complex. To reduce the burden on annotators, improve the efficiency and accuracy of annotation and lower its error rate, it is necessary to develop an automatic annotation system for keyword corpora to assist annotators.
At present, keyword corpora in this field are relatively scarce, and keyword corpus labeling is done mainly by hand, so problems such as poor labeling quality, a cumbersome labeling process, low labeling efficiency and high human resource cost are widespread. Existing keyword corpus labeling systems also have shortcomings: their labeling methods are limited to a single approach and their labeling models are difficult to update automatically. A semi-automatic keyword labeling and training platform that assists manual corpus labeling is therefore urgently needed to solve these problems.
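The language-network-graph approach described in the background (words as nodes, association strength as edge weights, top-K nodes as keywords) can be sketched as follows. This is a simplified TextRank-style ranking under stated assumptions: co-occurrence within a small window is used as the association measure, and the function name, `window`, `damping` and `iterations` values are illustrative choices rather than anything specified by the patent.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by a weighted PageRank over a word co-occurrence graph."""
    # Edge weight = number of times two distinct words co-occur within the window.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0
    nodes = sorted(weights)
    out_sum = {n: sum(weights[n].values()) for n in nodes}
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Each node receives a share of its neighbors' scores,
        # proportional to the weight of the connecting edge.
        score = {
            n: (1 - damping)
            + damping * sum(weights[m][n] / out_sum[m] * score[m] for m in weights[n])
            for n in nodes
        }
    return sorted(nodes, key=lambda n: (-score[n], n))[:top_k]


# Hypothetical pre-segmented text in which "corpus" neighbors every other word.
tokens = ["keyword", "corpus", "label", "corpus", "model", "corpus", "train"]
print(textrank_keywords(tokens, window=1, top_k=2))
```

Because the graph is built from segmented words, the quality of the ranking depends on the word segmentation step, which is the dependency the background points out for Chinese text.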

Disclosure of Invention

The invention aims to overcome the shortcomings of keyword corpus labeling and of the use of corpora in training, and provides a semi-automatic keyword corpus labeling training system that reduces the complexity of the manual labeling process, lowers labor cost, and improves the efficiency and accuracy of labeling massive keyword corpora.

The above object of the invention is achieved by the following technical solution. A keyword corpus labeling training extraction system comprises a keyword corpus labeling preparation module, a semi-automatic corpus keyword labeling module, a feedback-type keyword labeling model learning and training module, and a keyword labeling model effect evaluation module, and is characterized as follows.

The keyword corpus labeling preparation module distinguishes massive corpus data from different sources, selects keyword corpus sources for keyword corpora of different purposes, and sets the selected sources as the corpora to be labeled for those purposes, namely the raw corpora.

The semi-automatic corpus keyword labeling module first creates a keyword labeling task, then, according to the labeling requirements and corpus characteristics, autonomously selects a suitable algorithm and performs automatic labeling based on the algorithm model. It integrates CHI, LDA, the graph-ranking-based keyword extraction algorithm TEXTRANK, and TFIDF to perform single-algorithm keyword pre-labeling of the text corpus data to be labeled; it can also perform single-algorithm keyword pre-labeling automatically through business rules; and it can select several keyword extraction algorithms to label keywords at the same time and fuse their labeling results. The fused results are then reviewed manually against the keyword labeling business standard and stored as cooked (labeled) corpus, which is managed by the keyword corpus labeling preparation module and used to train the labeling algorithm models. A unified keyword model access standard is provided to complete the corpus keyword labeling work.

After the labeling task is completed, the feedback-type keyword labeling model learning and training module, through the setting of keyword algorithm model parameters, provides learning and training for the internally integrated keyword labeling algorithm models and for external depth-enhanced labeling algorithm models. It retrains the keyword labeling algorithm model with the labeled keyword corpus, improves and updates the model through feedback, and, through continuous iteration between model updating and corpus labeling, automatically adjusts the model to complete new keyword labeling tasks.

The keyword labeling model effect evaluation module constructs keyword evaluation indexes according to the evaluation index standard for keywords, quantifies the indexes based on keyword index rules, establishes a comprehensive evaluation model for the labeling algorithms, automatically evaluates the labeling effect of each model against the quantified indexes, and automatically recommends the optimal labeling model for subsequent keyword labeling tasks.

Compared with the prior art, the invention has the following beneficial effects:

The complexity of the manual labeling process and the labor cost are reduced. The system consists mainly of four modules: keyword corpus labeling preparation, semi-automatic corpus keyword labeling, feedback-type keyword labeling model learning and training, and keyword labeling model effect evaluation. For different labeling requirements and corpus characteristics it provides automatic labeling based on an autonomously selected suitable algorithm and on multi-algorithm fusion. Multi-algorithm fusion uses a voting method to fuse the results of the individual algorithms; when correlation between the algorithms is neglected, the ensemble outperforms a single method. Pre-labeling performed in this way reduces the complexity of manual labeling and the labor cost, and offers a degree of flexibility and a high level of automation.
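The voting-based fusion of multi-algorithm results mentioned above can be sketched as follows. The patent does not give the voting rule, so this sketch assumes a simple majority-style threshold: a word is kept if at least `min_votes` algorithms proposed it. The function name and the sample algorithm outputs are hypothetical.

```python
from collections import Counter


def fuse_by_voting(results, min_votes=2):
    """Fuse keyword lists from several algorithms by counting votes per word.

    `results` maps an algorithm name to the keywords it proposed for one text.
    """
    votes = Counter()
    for keywords in results.values():
        # Each algorithm casts at most one vote per word.
        votes.update(set(keywords))
    fused = [word for word, n in votes.items() if n >= min_votes]
    # Most-voted words first; ties broken alphabetically.
    return sorted(fused, key=lambda w: (-votes[w], w))


# Hypothetical pre-labeling output of three algorithms for the same text.
results = {
    "TFIDF": ["corpus", "label", "model"],
    "TEXTRANK": ["corpus", "model", "graph"],
    "LDA": ["corpus", "topic"],
}
print(fuse_by_voting(results))
```

The fused list is only a pre-label; in the described system it would still go to the manual review step before being stored.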

Keyword corpus labeling efficiency is high. The invention manages keyword corpora by distinguishing data from different sources. Keyword extraction algorithms such as CHI, LDA, TEXTRANK and TFIDF are integrated in a real-time background service, and for different keyword corpora a library of trained keyword extraction models for the applicable labeling algorithms (CHI, LDA, TEXTRANK, TFIDF and the like) is offered for selection during labeling. The corpus to be labeled is pre-labeled with a single keyword method or with a fusion of several keyword methods, a manual review step is introduced, and the system supports automatic feedback adjustment of the background keyword algorithm models to complete new keyword labeling tasks. This greatly shortens the time needed to obtain information, improves the efficiency of obtaining it, and greatly improves corpus labeling efficiency.

According to the labeling requirements and corpus characteristics, a suitable algorithm is selected autonomously and automatic labeling is carried out: by integrating at least one of the keyword extraction algorithms CHI, LDA, TEXTRANK and TFIDF, the text corpus data to be labeled is pre-labeled with a single keyword method or with a fusion of several methods, and a unified keyword model access standard is provided to complete the corpus keyword labeling work. After a labeling task is finished, the keyword model is retrained with the labeled corpus. A comprehensive evaluation model of the labeling algorithms is established to evaluate the labeling effect of each model, and the result is fed back into keyword model learning and training so that the model reaches its best effect and the accuracy of the keyword labeling model improves. As new labeling tasks are added, continuous iteration between model updating and corpus labeling improves both the quality of corpus keyword labeling and the effect of the algorithm models, and reduces the keyword labeling error rate. Finally, a manual review step allows intervention in the labeling results, and a manual confirmation step is used to modify, confirm and submit the labeled keyword corpus, completing the corpus keyword labeling work and greatly improving the accuracy of keyword extraction. Experiments demonstrate the effectiveness of the keyword labeling training extraction system for labeling keyword corpora.

Through a friendly human-computer interactive labeling interface, the invention simplifies the user's labeling workflow and supports importing, training and using external models.

Drawings

FIG. 1 is a schematic diagram of a keyword corpus annotation training extraction system according to the present invention.

FIG. 2 is a flow diagram of the keyword model training process of FIG. 1.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.

Detailed Description

See fig. 1. In the preferred embodiment described below, a keyword corpus labeling training extraction system comprises a keyword corpus labeling preparation module, a semi-automatic corpus keyword labeling module, a feedback-type keyword labeling model learning and training module, and a keyword labeling model effect evaluation module, wherein:

The keyword corpus labeling preparation module distinguishes massive corpus data from different sources, selects keyword corpus sources for keyword corpora of different purposes, and sets the selected sources as the corpora to be labeled for those purposes, namely the raw corpora.

The semi-automatic corpus keyword labeling module first creates a keyword labeling task, then, according to the labeling requirements and corpus characteristics, autonomously selects a suitable algorithm and performs automatic labeling based on the algorithm model. It integrates CHI, LDA, the graph-ranking-based keyword extraction algorithm TEXTRANK, and TFIDF to perform single-algorithm keyword pre-labeling of the text corpus data to be labeled; it can also perform single-algorithm keyword pre-labeling through automatic labeling based on business rules; and it can select several keyword extraction algorithms at the same time for keyword labeling and fuse their labeling results. The fused results are then reviewed manually against the keyword labeling business standard and stored as cooked (labeled) corpus, which is managed by the keyword corpus labeling preparation module and used to train the labeling algorithm models. A unified keyword model access standard is provided to complete the corpus keyword labeling work.

After the labeling task is completed, the feedback-type keyword labeling model learning and training module, through the setting of keyword algorithm model parameters, provides learning and training for the internally integrated keyword labeling algorithm models and for external depth-enhanced labeling algorithm models. It retrains the keyword labeling algorithm model with the labeled keyword corpus, improves and updates the model through feedback, and, through continuous iteration between model updating and corpus labeling, automatically adjusts the model to complete new keyword labeling tasks.

The keyword labeling model effect evaluation module constructs keyword evaluation indexes according to the evaluation index standard for keywords, quantifies the indexes based on keyword index rules, establishes a comprehensive evaluation model for the labeling algorithms, automatically evaluates the labeling effect of each model against the quantified indexes, and automatically recommends the optimal labeling model for subsequent keyword labeling tasks.

This embodiment provides a text corpus labeling preparation module that manages the corpora to be labeled by source or theme and prepares for the labeling task. The semi-automatic corpus keyword labeling module autonomously selects a suitable algorithm and performs automatic labeling according to the labeling requirements and corpus characteristics, and allows intervention in the labeling results through a manual review step. The specific steps are as follows:

The semi-automatic corpus keyword labeling module creates keyword labeling tasks according to the different corpus sources and selects an algorithm model with suitable performance for each type of labeling task. For example, keyword extraction algorithms such as CHI, LDA, TEXTRANK and TFIDF can be selected in a keyword labeling task to complete automatic labeling. The specific labeling algorithm can be configured according to the automatic labeling effect on the corpus, and the module automatically recommends a default labeling algorithm based on the results of the keyword labeling model effect evaluation module. The module first creates a keyword labeling task, then, according to the labeling requirements and corpus characteristics, autonomously selects a suitable algorithm and performs automatic labeling based on the algorithm model: it performs single-algorithm keyword pre-labeling of the text corpus data to be labeled by integrating at least one of the keyword extraction algorithms CHI, LDA, the graph-ranking-based TEXTRANK, and TFIDF, and it can likewise perform single-algorithm keyword pre-labeling through automatic labeling based on business rules. For special labeling tasks the module creates and manages labeling business rules, which mainly comprise a business dictionary and regular expressions for matching character strings, for example key dates and times or geographic locations of interest. The regular expression is declared directly as a variable, for example reg, with Dim reg As RegExp; after the Microsoft Scripting Runtime reference is selected, the dictionary object is likewise declared directly as a variable, Dim d As Dictionary.
Roughly, a regular expression is matched by taking the characters of the expression and of the text in sequence and comparing them: if every character matches, the match succeeds; as soon as one character fails to match, the match fails. The annotator applies the labeling business rules to label the corpus automatically. The automatic labeling results based on the algorithm model and those based on the business rules are fused; several keyword extraction algorithms can be selected for keyword labeling and their results fused. The fused results are then reviewed, modified, confirmed and saved manually according to the keyword labeling business standard and stored as cooked (labeled) corpus, which the keyword corpus labeling preparation module manages for use in training the labeling algorithm models. A unified keyword model access standard is provided to complete the corpus keyword labeling work.
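The business-rule pre-labeling described above (a business dictionary plus regular expressions for strings such as dates and locations) can be sketched in a few lines. The date pattern, the dictionary entries and the function name `rule_based_prelabel` are assumptions made for illustration; the patent does not list its actual rules.

```python
import re

# Hypothetical business rule: ISO-style dates such as 2019-05-29.
DATE_RULE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Hypothetical business dictionary mapping terms of interest to a label.
BUSINESS_DICTIONARY = {"Chengdu": "location", "Beijing": "location"}


def rule_based_prelabel(text):
    """Return (matched string, label) pairs found by the business rules."""
    # Regular-expression rule: every date-like substring is labeled "date".
    labels = [(m.group(), "date") for m in DATE_RULE.finditer(text)]
    # Dictionary rule: label every dictionary term that occurs in the text.
    for term, tag in BUSINESS_DICTIONARY.items():
        if term in text:
            labels.append((term, tag))
    return labels


print(rule_based_prelabel("The filing in Chengdu is dated 2019-05-29."))
```

Rule output of this kind could then be merged with the algorithm-based pre-labels before the manual review step.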

See Fig. 2. Through the setting of keyword algorithm model parameters, the feedback-type keyword labeling model learning and training module provides learning and training for both the internally integrated keyword labeling algorithm models and external depth-enhanced labeling algorithm models. In the keyword model training flow, the module reads the labeled corpus used for training and selects the keyword algorithm to be trained. For a non-trainable algorithm there is no training process and the flow ends. For trainable algorithms such as CHI, LDA, TEXTRANK and TFIDF, the module performs off-line training with the labeled corpus and calls the uniform training model interface Train to generate a keyword model serialization file (Kryo), so that the model accuracy is as high as possible. After the keyword model serialization file is generated, the module judges whether the keyword model is to be stored. If not, the flow ends. If so, an external algorithm model is imported through the unified model access interface, the external algorithm model is updated or exported, a keyword model file comprising the algorithm name, the model name and the serialized model file is stored, and the keyword training model table is updated. The trained model is then used to update the model for keyword labeling in the platform, so that new keyword labeling tasks can be completed.

In the keyword model updating flow, the module starts the keyword service and selects the keyword algorithm to be updated. If the selected keyword algorithm is non-trainable, the flow ends. For a selected trainable algorithm such as CHI, LDA, TEXTRANK or TFIDF, the module parses the keyword-update switch in the configuration file to judge whether the keyword model is to be updated. If not, the flow ends. If so, the module reads the specified keyword model file according to the keyword model name and the keyword training model table, deserializes the read file to complete loading of the keyword model, and the program ends.
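The training, storage and updating flow can be sketched in Python as follows. This is a minimal illustration under stated assumptions: Python's `pickle` stands in for the Kryo serialization named above, only a toy TFIDF trainer (an inverse-document-frequency table) is implemented, and the names `train`, `save_model`, `load_model` and the configuration key `update_keyword_model` are hypothetical.

```python
import math
import pickle
from collections import Counter
from pathlib import Path

TRAINABLE = {"CHI", "LDA", "TEXTRANK", "TFIDF"}


def train(algorithm, labeled_corpus):
    """Stand-in for the uniform training interface.

    `labeled_corpus` is a list of (tokens, keywords) pairs. Returns None
    for a non-trainable algorithm, in which case the flow simply ends.
    """
    if algorithm not in TRAINABLE:
        return None
    if algorithm != "TFIDF":
        raise NotImplementedError("only a toy TFIDF trainer is sketched")
    doc_freq = Counter()
    for tokens, _keywords in labeled_corpus:
        doc_freq.update(set(tokens))
    n_docs = len(labeled_corpus)
    # Smoothed inverse document frequency for every term seen in training.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def save_model(model, algorithm, model_name, model_dir, model_table):
    """Serialize the model and record it in the keyword training model table."""
    path = Path(model_dir) / f"{model_name}.bin"
    path.write_bytes(pickle.dumps(model))  # pickle replaces Kryo in this sketch
    model_table[model_name] = {
        "algorithm": algorithm,
        "model_name": model_name,
        "file": str(path),
    }
    return path


def load_model(algorithm, model_name, model_table, config):
    """Load a stored model if the algorithm is trainable and the switch is on."""
    if algorithm not in TRAINABLE:
        return None  # non-trainable algorithm: nothing to update
    if not config.get("update_keyword_model", False):
        return None  # update switch is off: the flow ends
    record = model_table[model_name]
    return pickle.loads(Path(record["file"]).read_bytes())
```

Keeping the algorithm name, model name and file path together in one table record mirrors the stored keyword model file and model table described above, so a model can be located by name alone when the update switch is on.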

The labeling model effect evaluation module provides methods for constructing model evaluation indexes, including index labels, construction rules and index quantification, and supports evaluating the labeling effect of a model by automatically constructing a comprehensive evaluation model of the labeling algorithms. The specific steps are as follows: the labeling model effect evaluation module sets the algorithm for each single index according to the index standard; quantifies the indexes according to the index calculation rules; organizes the corresponding indexes according to the labeling task at hand to construct the comprehensive evaluation model of the labeling algorithms; and completes the calculation of the comprehensive index value and feeds back the effect of the labeling model.
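One simple way to combine quantified indexes into a comprehensive value is a weighted average, sketched below in Python. The weighted-average form, the function name `composite_score`, and the sample index names and weights are assumptions made for illustration; the way the indexes are combined is not fixed by the description above.

```python
def composite_score(index_values, weights):
    """Weighted average of quantified indexes, each assumed to lie in [0, 1].

    `weights` selects which indexes a labeling task uses and how strongly
    each one counts; indexes without a weight are ignored.
    """
    missing = set(weights) - set(index_values)
    if missing:
        raise ValueError(f"no value for indexes: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(index_values[name] * w for name, w in weights.items())
    return weighted / total_weight


# Hypothetical index values for one labeling model on one task.
values = {"P": 0.8, "R": 0.6, "MRR": 0.5}
# A task that cares about accuracy twice as much as the other indexes.
score = composite_score(values, {"P": 2, "R": 1, "MRR": 1})
```

Different labeling tasks can then be served by passing different weight tables over the same set of quantified indexes.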

There is no unified method, at home or abroad, for evaluating the quality of keyword extraction, and the selection of text data is highly subjective; the quality of keyword extraction is therefore evaluated in two ways, machine quantitative analysis and manual subjective judgment. The most common indexes of machine quantitative analysis are the accuracy P (precision), the recall R (recall), the F value, which is the harmonic mean of the keyword extraction accuracy and recall, and the E value, which weights the keyword extraction accuracy and recall according to application requirements, wherein

$$P = \frac{\text{number of correctly extracted keywords}}{\text{total number of extracted keywords}}, \qquad R = \frac{\text{number of correctly extracted keywords}}{\text{total number of standard keywords}}, \qquad F = \frac{2PR}{P + R}.$$

Accuracy and recall generally stand in an inverse relationship: raising the accuracy by some method tends to lower the recall, and vice versa. To reflect the different requirements that an application system places on accuracy and recall, the two can be weighted, which gives the value E that weights the keyword extraction accuracy and recall:

where b is the added weight: the larger b is, the greater the weight of the accuracy in the E value; otherwise, the greater the weight of the recall.
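For illustration, the accuracy P, the recall R and the F value of a single document can be computed as in the following Python sketch. It assumes the usual set-based definitions (P is the share of extracted keywords that are correct, R is the share of standard keywords that were extracted, F is their harmonic mean); the function name `precision_recall_f` and the sample keywords are hypothetical, and the weighted value E is not part of this sketch.

```python
def precision_recall_f(extracted, reference):
    """Return (P, R, F) for one document.

    `extracted` holds the keywords produced by the algorithm and
    `reference` holds the standard (manually confirmed) keywords.
    """
    extracted_set, reference_set = set(extracted), set(reference)
    correct = len(extracted_set & reference_set)
    p = correct / len(extracted_set) if extracted_set else 0.0
    r = correct / len(reference_set) if reference_set else 0.0
    # Harmonic mean of P and R; defined as 0 when both are 0.
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


p, r, f = precision_recall_f(
    ["radar", "target", "system", "range"],  # extracted keywords
    ["radar", "target", "antenna"],          # standard keywords
)
```

Here two of the four extracted keywords are correct and two of the three standard keywords are found, so P is 1/2, R is 2/3 and F is 4/7.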

In addition, two further indexes are in common use: the binary preference measure (Bpref) and the mean reciprocal rank (MRR), an index used to evaluate search algorithms. Bpref is an evaluation index that takes the ranking order into account. For a document, if R of the M extracted keywords are standard answers, with r denoting a correctly extracted keyword and n denoting an incorrectly extracted keyword, Bpref is calculated by the following formula:

$$\mathrm{Bpref} = \frac{1}{R} \sum_{r} \left( 1 - \frac{\lvert n \text{ ranked higher than } r \rvert}{M} \right)$$

The evaluation index MRR measures the rank of the first correctly recommended keyword of each document and is an evaluation index defined over a document set. For a document d, with $rank_d$ denoting the rank position of the first correctly recommended keyword, MRR is defined as:

$$\mathrm{MRR} = \frac{1}{\lvert D \rvert} \sum_{d \in D} \frac{1}{rank_d}$$

where D is the document set on which the keyword extraction test is performed.
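The two rank-aware indexes can be sketched in Python as follows. The sketch assumes the commonly used definitions: Bpref averages, over the R correctly extracted keywords, one minus the share of the M extracted keywords that are incorrect and ranked above the correct one, and MRR averages the reciprocal rank of the first correct keyword over the document set. The function names are hypothetical, ranked lists are assumed to contain no duplicates, and a document with no correct keyword is assumed to contribute 0 to MRR.

```python
def bpref(ranked_keywords, reference):
    """Binary preference measure for one ranked list of extracted keywords."""
    reference_set = set(reference)
    m = len(ranked_keywords)
    r_count = sum(1 for kw in ranked_keywords if kw in reference_set)
    if r_count == 0:
        return 0.0
    wrong_above = 0  # incorrect keywords seen so far (ranked higher)
    total = 0.0
    for kw in ranked_keywords:
        if kw in reference_set:
            total += 1 - wrong_above / m
        else:
            wrong_above += 1
    return total / r_count


def mean_reciprocal_rank(ranked_lists, references):
    """MRR over a document set; one ranked list and one reference per document."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, reference in zip(ranked_lists, references):
        reference_set = set(reference)
        for position, kw in enumerate(ranked, start=1):
            if kw in reference_set:
                total += 1 / position  # only the first correct keyword counts
                break
    return total / len(ranked_lists)


b = bpref(["a", "x", "b", "y"], ["a", "b", "c"])
mrr = mean_reciprocal_rank([["x", "a"], ["b", "y"], ["z"]], [["a"], ["b"], ["q"]])
```

In the Bpref example, keyword "a" has no incorrect keyword above it and "b" has one of four, so the value is (1 + 0.75) / 2. In the MRR example the first correct keywords sit at ranks 2 and 1, and the third document has none, giving (1/2 + 1 + 0) / 3.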

The invention manages the corpus to be labeled by source or topic in preparation for labeling tasks. It integrates keyword extraction algorithms such as CHI, LDA, TEXTRANK and TFIDF to complete semi-automatic labeling of the keyword corpus, offers a choice of applicable labeling algorithms during labeling, and performs keyword pre-labeling on the corpus to be labeled. Finally, the labeled corpus is modified, confirmed and submitted through a manual confirmation step, which completes the corpus labeling work. After a labeling task is finished, the model is retrained with the labeled corpus. The labeling effect of the model is evaluated by establishing a comprehensive evaluation model of the labeling algorithms, and the evaluation is fed back into model learning and training so that the model performs as well as possible on subsequent labeling tasks. Through continuous iteration between model updating and corpus labeling, both the corpus labeling quality and the effect of the algorithm models are improved.

The foregoing is directed to the preferred embodiment of the present invention and it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims

1. A keyword corpus annotation training extraction system, comprising a keyword corpus labeling preparation module, a semi-automatic corpus keyword labeling module, a feedback-type keyword labeling model learning and training module, and a keyword labeling model effect evaluation module, characterized in that: the keyword corpus labeling preparation module distinguishes mass corpus data from different sources, selects keyword corpus sources for keyword corpora of different purposes, and sets the selected keyword corpus sources as the corpora to be labeled for the different purposes, namely the raw corpora; the semi-automatic corpus keyword labeling module first creates a keyword labeling task, autonomously selects an applicable algorithm according to the labeling requirements and the corpus characteristics, and carries out automatic labeling based on an algorithm model, either performing keyword pre-labeling with a single algorithm on the text corpus data to be labeled by integrating CHI, LDA, the graph-ranking-based keyword extraction algorithm TEXTRANK, and TFIDF, or selecting several keyword extraction algorithms at once for keyword labeling and fusing the labeling results of the several algorithms; after the labeling task is completed, the feedback-type keyword labeling model learning and training module, through the setting of keyword algorithm model parameters, provides learning and training for the internally integrated keyword labeling algorithm models and the external depth-enhanced labeling algorithm models, retrains the keyword labeling algorithm model with the labeled keyword corpus, improves and updates the model by feedback, and automatically feeds back to and adjusts new keyword labeling tasks through continuous iteration between model updating and corpus labeling; the keyword labeling model effect evaluation module constructs keyword evaluation indexes according to the evaluation index standard for keywords, quantifies the evaluation indexes based on keyword index rules, establishes a comprehensive evaluation model of the labeling algorithms, automatically and quantitatively evaluates the labeling effect of the models by these indexes, and automatically recommends the optimal labeling for subsequent keyword labeling tasks; the feedback-type keyword labeling model learning and training module reads the labeled corpus used for training and selects the keyword algorithm to be trained; for a non-trainable algorithm there is no training process and the flow ends; for the trainable algorithms CHI, LDA, TEXTRANK and TFIDF, the module performs off-line training with the labeled corpus and calls the uniform training model interface Train to generate a keyword model serialization file (Kryo), so that the model accuracy is as high as possible; after the keyword model serialization file is generated, the module judges whether the keyword model is to be stored; if not, the flow ends; if so, an external algorithm model is imported through the unified model access interface, the external algorithm model is updated or exported, a keyword model file comprising the algorithm name, the model name and the serialized model file is stored, and the keyword training model table is updated; the trained model is used to update the model for keyword labeling in the platform so as to complete new keyword labeling tasks; in the updating of the keyword model, the module starts the keyword service and selects the keyword algorithm to be updated; if the selected keyword algorithm is a non-trainable algorithm, the operation ends; for a selected trainable algorithm among CHI, LDA, TEXTRANK and TFIDF, the module judges whether the keyword model is to be updated by parsing the keyword-update switch in the configuration file; if not, the operation ends; if so, the module reads the specified keyword model file according to the keyword model name and the keyword training model table and deserializes the read keyword model file to complete the loading of the keyword model, and the program ends.

2. The keyword corpus annotation training extraction system according to claim 1, wherein: the fused labeling result is further judged manually according to the keyword labeling business standard and is stored as cooked corpus; the cooked corpus is managed by the keyword corpus labeling preparation module and used for training the labeling algorithm models, and a uniform keyword model access standard is provided to complete the corpus keyword labeling work.

3. The keyword corpus annotation training extraction system according to claim 1, wherein: the semi-automatic corpus keyword labeling module creates labeling business rules for special labeling tasks and manages the labeling business rules, which mainly comprise a business dictionary and regular expressions used for matching character strings.

4. The keyword corpus annotation training extraction system according to claim 1, wherein: the labeling model effect evaluation module sets the algorithm for each single index according to the index standard; quantifies the indexes according to the index calculation rules; organizes the corresponding indexes according to the different labeling tasks to construct the comprehensive evaluation model of the labeling algorithms; and completes the calculation of the comprehensive index value and feeds back the effect of the labeling model.

5. The keyword corpus annotation training extraction system according to claim 1, wherein: the quality of keyword extraction is evaluated in two ways, machine quantitative analysis and manual subjective judgment.

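6. The keyword corpus annotation training extraction system according to claim 5, wherein: the machine quantitative analysis indexes are the accuracy P (precision), the recall R (recall), the F value and the E value, wherein:

accuracy $P = \dfrac{\text{number of correctly extracted keywords}}{\text{total number of extracted keywords}}$;

recall $R = \dfrac{\text{number of correctly extracted keywords}}{\text{total number of standard keywords}}$;

harmonic mean of the keyword extraction accuracy and recall $F = \dfrac{2PR}{P + R}$.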

7. The keyword corpus annotation training extraction system according to claim 1, wherein: in order to reflect the different requirements of an application system on the accuracy P and the recall R, a weight is applied to the accuracy P and the recall R, so that a weighted consideration value E of the accuracy and the recall is obtained: