CN110298033B - Keyword corpus labeling training extraction system - Google Patents


Info

Publication number
CN110298033B
Authority
CN
China
Prior art keywords
keyword
labeling
model
corpus
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910455064.3A
Other languages
Chinese (zh)
Other versions
CN110298033A (en
Inventor
崔莹
代翔
黄细凤
王侃
杨拓
余博
朱宇涛
李超
李源源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN201910455064.3A
Publication of CN110298033A
Application granted
Publication of CN110298033B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a keyword corpus labeling training extraction tool, and aims to provide a labeling and training tool that reduces the complexity of the manual labeling process and improves the efficiency and accuracy of labeling massive keyword corpora. The invention is realized by the following technical scheme: a keyword corpus labeling preparation module distinguishes massive corpus data from different sources; a semi-automatic corpus keyword labeling module creates a keyword labeling task, autonomously selects a suitable algorithm, performs automatic labeling based on an algorithm model, pre-labels the text corpus data to be labeled by integrating at least one of the keyword extraction algorithms CHI, LDA, TEXTRANK and TFIDF, and fuses the labeling results of multiple algorithms; after the labeling task is completed, a feedback-type keyword labeling model learning and training module trains the keyword labeling algorithm model; and a keyword labeling model effect evaluation module automatically evaluates the labeling effect of each model against quantified indexes.

Description

Keyword corpus labeling training extraction system
Technical Field
The invention relates to the technical field of text mining, in particular to a keyword corpus semi-automatic labeling training extraction system.
Background
In the field of natural language processing, the key to processing massive numbers of text files is to extract the issues that users care about most. Whether a text is long or short, its main idea can often be glimpsed from a few keywords. Meanwhile, both text-based recommendation and text-based search depend heavily on text keywords, and the accuracy of keyword extraction directly determines the final effect of a recommendation system or a search system. Keyword extraction is therefore an important part of the field of text mining. The rapid development of the network has given people a simple and convenient way to acquire information, and the number of electronic documents such as web pages, e-mails and electronic books keeps increasing. However, these explosively growing information resources lack structured content, so that people who obtain a large amount of information must also spend a great deal of time reading and organizing it, which greatly reduces retrieval efficiency. How to organize numerous and disordered resources, improve the efficiency of information utilization, and let people acquire the key information of texts simply, quickly and accurately has therefore become extremely important. Automatic keyword extraction is widely applied. In fields such as knowledge mining, information retrieval, text clustering and text classification in particular, automatic keyword indexing is a fundamental and core technology. It also plays a key role in fields such as relevance feedback, automatic filtering, and event detection and tracking. It can be said that automatic keyword indexing is the basis of all automatic text analysis and is indispensable in many text analysis tasks.
Keyword extraction plays an important role in automatic summarization, information retrieval, text classification, text clustering and the like. How to acquire information from the internet intelligently, quickly and effectively has become an urgent problem in the computer field, and keyword extraction is an important means of acquiring such information quickly and accurately. Keywords have certain characteristics: they are generally nouns or noun phrases; they generally do not begin or end with stop words; and they are generally not too long. Developing and selecting features is both a key point and a difficulty in keyword extraction, because the quality of the features directly affects whether keywords are judged correctly.
In recent years, with the rapid development of big data acquisition means, text information on the network has grown explosively, so that acquiring the required information has become increasingly difficult. Mining the maximum value from the data has become particularly urgent, which places entirely new demands on intelligent big data analysis. Manual processing is impractical for information resources that expand at high speed, so automatic processing methods are needed to help people manage and organize information effectively and to solve the problem of being rich in information but poor in knowledge. Against this background, technologies such as machine learning and deep learning have developed rapidly and achieved great success in big data applications, and the underlying model algorithms rely on a large amount of labeled corpus data as basic training support. Most existing keyword extraction algorithms judge the importance of words from their statistical information and select the words whose score exceeds a certain threshold as the keywords of an article. However, statistical methods are computationally expensive and require a large statistical corpus. A number of keyword measurement functions have been proposed on this basis, including TF/IDF, entropy functions and distribution coefficients. Many machine learning algorithms are also applied to keyword extraction. For example, CHI and TFIDF can both be used for feature selection and weight calculation; the difference is that TFIDF can be applied to any text set, whereas CHI can only be calculated on texts that carry classification labels. TextRank was originally proposed as a keyword extraction method and was later also tried as a weight calculation method, but its computational complexity is high.
Labeling massive corpora has an important influence on the training of algorithm models. As basic work in the big data analysis process, it mainly supports daily research and development, algorithm tuning, and demonstration and verification, and is a core foundation of big data mining and analysis. A keyword is a word that is substantially meaningful for the main content of a text; it is a word or phrase selected from the title, abstract and body to support text indexing or retrieval. Keyword extraction is the process of selecting, through statistical and semantic analysis of core words, a suitable set of feature items that can fully express the subject content of a single text or a text set. Because keywords are the most basic units for representing the subject of a text, keyword extraction is commonly used in natural language processing and Chinese information processing tasks such as automatic summarization, information retrieval, text clustering, automatic question answering and topic tracking, and it also provides valuable clues for information monitoring and tracking. Part of speech is obtained through word segmentation and syntactic analysis, and most existing keywords are nouns or verbal nouns. The position in which a word appears is generally also of great value. For example, titles and abstracts summarize the author's central ideas, so the words appearing there are representative and are more likely to be keywords. However, because authors differ in habit and writing style, the position of the key sentences also differs, so position is a rather coarse cue for obtaining keywords and is generally not used alone.
To judge whether a word is important in an article, an obvious measure is word frequency, since important words often appear in the article many times. On the other hand, a frequently appearing word is not necessarily important: some words appear frequently in all kinds of articles, and they are certainly less important than words that appear frequently only in a particular article. Word frequency indicates how often a word appears in the text, and it is generally believed that the more frequently a word appears in a text, the more likely it is to be a core word of the article. Word frequency simply counts the occurrences of a word in a text, so keywords obtained from word frequency alone are highly uncertain, and for longer texts the method is very noisy. The key to keyword extraction based on statistical features lies in calculating the quantized feature indexes, and different indexes give different results. Each index has its own advantages and disadvantages, so in practical applications the top-K words obtained by combining different indexes are used as keywords. Keyword extraction is widely applied in the field of text mining, but the existing methods have certain problems. In the prior art based on statistics and machine learning, the effect of machine learning depends heavily on manually labeled corpora: model parameters are trained on observed data (the labeled corpus), the model calculates the probability of each candidate segmentation in the word segmentation stage, and the segmentation with the highest probability is taken as the final result.
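The trade-off described above, where a word counts for more when it is frequent in one article but rare across articles, is the intuition behind TF-IDF. The following is a minimal sketch of that weighting, assuming documents are already segmented into word lists; the function name tfidf_top_k and the toy corpus are illustrative and do not come from the patent.

```python
import math
from collections import Counter

def tfidf_top_k(docs, doc_index, k=3):
    """Rank the words of docs[doc_index] by TF-IDF and return the top k."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    # TF-IDF = (relative frequency in this document) * log(N / document frequency).
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Toy corpus (hypothetical): "the" is frequent everywhere, "keyword" only in doc 0.
docs = [
    ["the", "keyword", "model", "labels", "the", "keyword", "corpus"],
    ["the", "weather", "is", "nice"],
    ["the", "model", "is", "large"],
]
top = tfidf_top_k(docs, 0, k=2)
```

In this toy corpus the word "the" occurs in every document, so its inverse document frequency is zero and it drops out of the ranking even though it is as frequent as "keyword" in the first document.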
A machine-learning-based method can be implemented only if a sufficiently large knowledge base or training set has been established. Because the problem of knowledge learning has not been fundamentally solved, knowledge bases are updated slowly and cannot keep up with current scientific development. A labeled data set provides limited information, manual labeling of samples is time-consuming and labor-intensive, and large-scale labeling is too expensive. Unlabeled samples that are easy to obtain (for example, web pages on the internet) are far more numerous than labeled samples and also approximate the distribution of data over the entire sample space. Providing as many labeled samples as possible requires hard and slow manual work, which affects the construction of the whole system and creates a labeling bottleneck. The idea of keyword extraction based on statistical features is to extract the keywords of a document by using statistical information about the words in the document. Generally, the text is preprocessed to obtain a set of candidate words, and keywords are then obtained from the candidate set by quantizing feature values. Common sequence labeling models are HMM and CRF. Word segmentation algorithms based on such models can handle ambiguous and unknown words and work better than earlier word segmentation algorithms, but they need a large amount of manually labeled data and segment slowly. A supervised keyword extraction algorithm treats keyword extraction as a binary classification problem and judges whether each word or phrase in a document is a keyword. Since this is a classification problem, a labeled training corpus must be provided; a keyword extraction model is trained on it and then used to extract keywords from new documents.
Traditional keyword extraction methods fall into two types, unsupervised and supervised. Unsupervised methods include TF-IDF, chi-squared, TextRank, LDA and others. Supervised methods convert keyword extraction into a binary classification problem of judging whether each word is a keyword, and use classifiers such as naive Bayes and the C4.5 decision tree. Each type has advantages and disadvantages: unsupervised methods do not need a manually labeled training set and are therefore faster, but because they cannot comprehensively use various kinds of information to rank the candidate words, their effect may be inferior to that of supervised methods; supervised methods can adjust, through training, how strongly each kind of information influences the keyword decision and therefore work better, but in the current data era labeling a training set is very time-consuming and labor-intensive, so supervised text keyword extraction has the disadvantage of high labor cost. A third category has the computer simulate human understanding of sentences in order to recognize words. Because of the complexity of Chinese semantics, it is difficult to organize the various kinds of linguistic information into a machine-readable form, and because a large amount of training corpus must be labeled manually, which is time-consuming and labor-intensive, such word segmentation systems are still at the experimental stage. In constructing a language network graph, the preprocessed words are used as nodes and the relations between words as edges, and the weight of an edge is generally expressed by the degree of association between the two words.
When keywords are obtained from a language network graph, the importance of each node is evaluated, the nodes are ranked by importance, and the words represented by the top-K nodes are selected as keywords. Because Chinese has no explicit word boundaries, automatically indexing keyword strings is more difficult, and a keyword extraction model needs more training corpus to achieve a stable effect. In practical applications, because of the complexity of the application environment, the same keyword extraction method gives different results for different types of text, such as long texts and short texts. Different conditions call for different algorithms, and no algorithm works well in all environments. Engineering practice also depends heavily on the accuracy of text preprocessing and word segmentation. Wrongly written characters, variant words and similar problems in the text must be handled in the preprocessing stage, and the choice of word segmentation algorithm and the recognition of unknown and ambiguous words strongly influence keyword extraction. Keyword extraction seems simple but is very tricky in practice. Because a labeled corpus of Chinese subjective opinion text contains a large amount of information such as word segmentation, part of speech, dependency relations, semantics, word concepts and opinions, completing the labeling is usually complex. To reduce the burden on annotators, improve the efficiency and accuracy of labeling, and reduce the labeling error rate, it is necessary to develop an automatic labeling system for keyword corpora to assist annotators.
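The language-network-graph procedure described above (words as nodes, association strength as edge weight, top-K nodes as keywords) can be illustrated with a TextRank-style ranking. This is only a sketch under stated assumptions: co-occurrence within a sliding window stands in for the "degree of association", and the function name, window size, damping factor and sample word list are all illustrative choices, not values from the patent.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iterations=50, k=3):
    """Rank words with a PageRank-style iteration over a co-occurrence graph."""
    # Undirected weighted graph: words within `window` positions are linked,
    # and the edge weight counts how often the pair co-occurs.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != word:
                weights[word][words[j]] += 1.0
                weights[words[j]][word] += 1.0
    nodes = list(weights)
    score = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        new_score = {}
        for node in nodes:
            # Each neighbour passes on its score in proportion to the edge weight.
            rank = sum(
                weights[nb][node] / sum(weights[nb].values()) * score[nb]
                for nb in weights[node]
            )
            new_score[node] = (1 - damping) + damping * rank
        score = new_score
    # Top-K nodes by importance; ties are broken alphabetically.
    return sorted(nodes, key=lambda n: (-score[n], n))[:k]

# Hypothetical pre-segmented text in which "keyword" is the most connected word.
words = ["keyword", "graph", "keyword", "node", "keyword", "edge", "keyword", "rank"]
top = textrank_keywords(words, k=2)
```

For Chinese text the input list would come from a word segmenter and a stop-word filter, which is exactly the preprocessing dependency the paragraph above warns about.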
At present, keyword corpora in this field are relatively scarce, and keyword corpus labeling is mainly done manually, so that problems such as poor labeling quality, a cumbersome labeling process, low labeling efficiency and high human resource cost are widespread. Existing keyword corpus labeling systems also have shortcomings such as a single labeling method and labeling models that are difficult to update automatically. A semi-automatic keyword labeling and training platform that can assist manual corpus labeling is therefore urgently needed to solve these problems.
Disclosure of Invention
The invention aims to overcome the above shortcomings in labeling keyword corpora and in using such corpora for training, and provides a semi-automatic keyword corpus labeling training system that reduces the complexity of the manual labeling process, reduces labor cost, and improves the efficiency and accuracy of labeling massive keyword corpora.
The above object of the present invention can be achieved by the following technical solution. A keyword corpus labeling training extraction system comprises a keyword corpus labeling preparation module, a semi-automatic corpus keyword labeling module, a feedback-type keyword labeling model learning and training module, and a keyword labeling model effect evaluation module, characterized in that: the keyword corpus labeling preparation module distinguishes massive corpus data from different sources, selects keyword corpus sources for keyword corpora of different purposes, and sets the selected sources as the corpora to be labeled for those purposes, namely the raw corpora; the semi-automatic corpus keyword labeling module first creates a keyword labeling task, then, according to different labeling requirements and corpus characteristics, autonomously selects a suitable algorithm and performs automatic labeling based on an algorithm model, pre-labeling the text corpus data to be labeled with a single keyword method by integrating CHI, LDA, the graph-ranking-based keyword extraction algorithm TEXTRANK, and TFIDF; it can also pre-label the text corpus data automatically with a single keyword method based on business rules, or select several keyword extraction algorithms at the same time and fuse their labeling results; the fused labeling results are then reviewed manually against the keyword labeling business standard and stored as a labeled ("cooked") corpus, which is managed by the keyword corpus labeling preparation module and used for training the labeling algorithm models, and a uniform keyword model access standard is provided to complete the corpus keyword labeling work; after the labeling task is completed, the feedback-type keyword labeling model learning and training module, through the setting of keyword algorithm model parameters, provides learning and training for the internally integrated keyword labeling algorithm models and for external depth-enhanced labeling algorithm models, retrains the keyword labeling algorithm model with the labeled keyword corpus, improves and updates the model through feedback, and, through continuous iteration between model updating and corpus labeling, automatically adjusts to complete new keyword labeling tasks; the keyword labeling model effect evaluation module constructs keyword evaluation indexes according to an evaluation index standard for keywords, quantifies the indexes based on keyword index rules, establishes a comprehensive evaluation model for the labeling algorithms, automatically evaluates the labeling effect of each model against the quantified indexes, and automatically recommends the optimal labeling model for subsequent keyword labeling tasks.
Compared with the prior art, the invention has the following beneficial effects:
The complexity of the manual labeling process and the labor cost are reduced. The invention adopts a system consisting mainly of four modules: keyword corpus labeling preparation, semi-automatic corpus keyword labeling, feedback-type keyword labeling model learning and training, and keyword labeling model effect evaluation. For different labeling requirements and corpus characteristics, it provides automatic labeling based on an autonomously selected suitable algorithm and on multi-algorithm fusion. Multi-algorithm fusion uses a voting method to fuse the results of several algorithms; if the correlation between the algorithms is ignored, such an ensemble performs better than any single method. Pre-labeling in this way reduces the complexity of the manual labeling process and the labor cost, and offers a degree of flexibility and a high level of automation.
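The voting-based fusion mentioned above could look roughly like the following sketch. The patent does not specify the voting rule, so a simple threshold on the number of algorithms that propose a word is assumed here; the function name fuse_by_voting, the min_votes parameter and the sample outputs are hypothetical.

```python
from collections import Counter

def fuse_by_voting(results, min_votes=2):
    """Keep the keywords proposed by at least `min_votes` algorithms."""
    votes = Counter()
    for keywords in results.values():
        # Each algorithm casts at most one vote per word.
        votes.update(set(keywords))
    fused = [word for word, count in votes.items() if count >= min_votes]
    # Most-voted words first; ties are broken alphabetically.
    return sorted(fused, key=lambda w: (-votes[w], w))

# Hypothetical pre-labeling output of three algorithms for one document.
results = {
    "TFIDF": ["corpus", "keyword", "labeling"],
    "TEXTRANK": ["keyword", "model", "corpus"],
    "LDA": ["keyword", "topic"],
}
fused = fuse_by_voting(results)
```

The fused list is only a pre-label: in the system described here it would still go to the manual review step before being stored.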
The keyword corpus labeling efficiency is high. The invention distinguishes data from different sources and thereby manages the keyword corpora. The real-time background integrates keyword extraction algorithms such as CHI, LDA, TEXTRANK and TFIDF, so that for different keyword corpora a library of applicable labeling algorithms and trained keyword extraction models (CHI, LDA, TEXTRANK, TFIDF and the like) is available for selection during labeling. The corpus to be labeled is pre-labeled with a single keyword method or with a fusion of several keyword methods, a manual review step is introduced, and the system supports automatic feedback adjustment of the background keyword algorithm models in real time to complete new keyword labeling tasks. This greatly shortens the time needed to obtain information, improves the efficiency of obtaining it, and greatly improves corpus labeling efficiency.
According to different labeling requirements and corpus characteristics, the invention autonomously selects a suitable algorithm and labels automatically: the text corpus data to be labeled is pre-labeled with a single keyword method or with a fusion of several keyword methods by integrating at least one of the keyword extraction algorithms CHI, LDA, TEXTRANK and TFIDF, and a uniform keyword model access standard is provided to complete the corpus keyword labeling work. After the labeling task is finished, the keyword model is retrained with the labeled corpus. The labeling effect of the model is evaluated by a comprehensive evaluation model for the labeling algorithms, and the evaluation is fed back into keyword model learning and training so that the model reaches its best effect and the accuracy of the keyword labeling model improves. As subsequent labeling tasks are added, continuous iteration between model updating and corpus labeling improves both the quality of corpus keyword labeling and the effect of the algorithm models, and reduces the keyword labeling error rate. Finally, a manual review step allows intervention in the labeling results, and a manual confirmation step is used to modify, confirm and submit the keyword-labeled corpus, which completes the corpus keyword labeling work and greatly improves the precision and accuracy of keyword extraction. Experiments confirm the effectiveness of the keyword labeling training extraction system when applied to labeling keyword corpora.
The invention simplifies the user's labeling operations through a friendly human-computer interactive labeling interface and supports the import, training and use of external models.
Drawings
FIG. 1 is a schematic diagram of a keyword corpus annotation training extraction system according to the present invention.
FIG. 2 is a flow diagram of the keyword model training process of FIG. 1.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
Detailed Description
See fig. 1. In a preferred embodiment described below, a keyword corpus labeling training extraction system includes a keyword corpus labeling preparation module, a semi-automatic corpus keyword labeling module, a feedback-type keyword labeling model learning and training module, and a keyword labeling model effect evaluation module, wherein: the keyword corpus labeling preparation module distinguishes massive corpus data from different sources, selects keyword corpus sources for keyword corpora of different purposes, and sets the selected sources as the corpora to be labeled for those purposes, namely the raw corpora; the semi-automatic corpus keyword labeling module first creates a keyword labeling task, then, according to different labeling requirements and corpus characteristics, autonomously selects a suitable algorithm and performs automatic labeling based on an algorithm model, pre-labeling the text corpus data to be labeled with a single keyword method by integrating CHI, LDA, the graph-ranking-based keyword extraction algorithm TEXTRANK, and TFIDF; it can also pre-label the text corpus data with a single keyword method through automatic labeling based on business rules, or select several keyword extraction algorithms at the same time and fuse their labeling results; the fused labeling results are then reviewed manually against the keyword labeling business standard and stored as a labeled ("cooked") corpus, which is managed by the keyword corpus labeling preparation module and used for training the labeling algorithm models, and a uniform keyword model access standard is provided to complete the corpus keyword labeling work; after the labeling task is completed, the feedback-type keyword labeling model learning and training module, through the setting of keyword algorithm model parameters, provides learning and training for the internally integrated keyword labeling algorithm models and for external depth-enhanced labeling algorithm models, retrains the keyword labeling algorithm model with the labeled keyword corpus, improves and updates the model through feedback, and, through continuous iteration between model updating and corpus labeling, automatically adjusts to complete new keyword labeling tasks; the keyword labeling model effect evaluation module constructs keyword evaluation indexes according to an evaluation index standard for keywords, quantifies the indexes based on keyword index rules, establishes a comprehensive evaluation model for the labeling algorithms, automatically evaluates the labeling effect of each model against the quantified indexes, and automatically recommends the optimal labeling model for subsequent keyword labeling tasks.
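The effect evaluation module is described only in general terms, so the following sketch makes its own assumptions: precision, recall and F1 against manually confirmed keywords are used as the quantified indexes, and the model with the highest F1 is recommended. The function names, the choice of F1 as the deciding index and the sample data are all hypothetical.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 of predicted keywords against confirmed ones."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def recommend_model(predictions, gold):
    """Score every model by F1 and return the best model name with all scores."""
    scores = {name: prf1(pred, gold)[2] for name, pred in predictions.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical data: keywords confirmed by an annotator vs. two models' output.
gold = ["keyword", "corpus", "labeling"]
predictions = {
    "TFIDF": ["keyword", "corpus", "model"],
    "TEXTRANK": ["keyword", "system"],
}
best, scores = recommend_model(predictions, gold)
```

A fuller evaluation would average such scores over many labeled documents before recommending a default algorithm for the next labeling task.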
This embodiment provides a text corpus labeling preparation module, which manages the corpora to be labeled by source or topic and prepares for the labeling task. The semi-automatic corpus keyword labeling module autonomously selects a suitable algorithm and labels automatically according to different labeling requirements and corpus characteristics, and allows intervention in the labeling results through a manual review step, specifically as follows:
The semi-automatic corpus keyword labeling module creates keyword labeling tasks according to the different corpus sources and selects an algorithm model with a suitable effect for each type of labeling task; for example, keyword extraction algorithms such as CHI, LDA, TEXTRANK and TFIDF are selected in a keyword labeling task to complete automatic labeling. The specific labeling algorithm can be configured according to the automatic labeling effect on the corpus, and the module automatically recommends a default labeling algorithm according to the results of the keyword labeling model effect evaluation module. The module first creates a keyword labeling task and then, according to different labeling requirements and corpus characteristics, autonomously selects a suitable algorithm and performs automatic labeling based on an algorithm model: it pre-labels the text corpus data to be labeled with a single keyword method by integrating at least one of the keyword extraction algorithms CHI, LDA, the graph-ranking-based TEXTRANK, and TFIDF, or pre-labels it with a single keyword method through automatic labeling based on business rules. For special labeling tasks the module creates and manages labeling business rules, which mainly consist of a business dictionary and regular expressions used to match character strings, for example key dates and times or geographic locations of interest. A regular expression object is declared directly as a variable, for example Dim reg As RegExp, and, once the Microsoft Scripting Runtime reference has been selected, a dictionary object is declared directly as a variable, for example Dim d As Dictionary.
Roughly, a regular expression is matched by taking characters from the expression and from the text in turn and comparing them: if every character matches, the match succeeds; as soon as one character fails to match, the match fails. The annotator labels the corpus automatically using the labeling business rules. The automatic labeling result based on the algorithm model is fused with the automatic labeling result based on the business rules; when several keyword extraction algorithms are selected for keyword labeling, their labeling results are fused as well. The fused labeling result is then manually reviewed, modified, confirmed and saved according to the keyword labeling business standard. The saved result is stored as labeled (cooked) corpus, which the keyword corpus labeling preparation module manages for use in training the labeling algorithm models, and a unified keyword model access standard is provided, completing the corpus keyword labeling work.
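The rule-based labeling and fusion steps can be illustrated with a minimal Python sketch. The dictionary entries, the two patterns, the sample sentence and the function names `label_by_rules` and `fuse` are all hypothetical, and the fusion shown (an order-preserving union) is only one possible strategy; the text does not specify how results are fused.

```python
import re

# Hypothetical business rules: a small business dictionary plus regular expressions.
BUSINESS_DICTIONARY = {"radar", "satellite"}
RULE_PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "time": re.compile(r"\b\d{2}:\d{2}\b"),
}

def label_by_rules(text):
    """Return keywords found by dictionary lookup and by regex matching."""
    found = []
    lowered = text.lower()
    for term in sorted(BUSINESS_DICTIONARY):
        if term in lowered:
            found.append(term)
    for pattern in RULE_PATTERNS.values():
        found.extend(pattern.findall(text))
    return found

def fuse(*label_lists):
    """Merge several label lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    fused = []
    for labels in label_lists:
        for label in labels:
            if label not in seen:
                seen.add(label)
                fused.append(label)
    return fused

text = "The satellite passed over the site on 2019-05-29 at 08:30."
model_labels = ["satellite", "site"]  # stand-in for CHI/LDA/TEXTRANK/TFIDF output
rule_labels = label_by_rules(text)
candidates = fuse(model_labels, rule_labels)  # handed to the manual review step
```

The fused `candidates` list corresponds to the pre-labeling result that an annotator would then modify, confirm and save.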
See Fig. 2. By setting the parameters of the keyword algorithm models, the feedback keyword labeling model learning and training module provides model learning and training for both the internally integrated keyword labeling algorithm models and the external depth-enhanced labeling algorithm models. In the keyword model training flow, the module reads the labeled corpus used for training and selects the keyword algorithm to be trained. If the selected algorithm is not trainable, there is no training process and the flow ends. For trainable algorithms such as CHI, LDA, TEXTRANK and TFIDF, the module performs offline training on the labeled corpus and calls the unified training interface Train to generate the serialized keyword model file Kryo, so that the model accuracy is as high as possible. After the serialized keyword model file Kryo is generated, the module decides whether to save the keyword model. If not, the flow ends. If so, external algorithm models are imported through the unified model access interface and updated or exported, the keyword model file, comprising the algorithm name, the model name and the serialized model file, is saved, and the keyword training model table is updated. The trained model then replaces the model used for keyword labeling in the platform so that new keyword labeling tasks can be completed. In the keyword model update flow, the module starts the keyword service and selects the keyword algorithm to be updated. If the selected algorithm is not trainable, the flow ends. For a selected trainable algorithm such as CHI, LDA, TEXTRANK or TFIDF, the module parses the keyword update switch in the configuration file to decide whether to update the keyword model; if not, the flow ends. 
If so, the module reads the specified keyword model file according to the keyword model name and the keyword training model table, deserializes the file to load the keyword model, and the program ends.
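The train, save and reload flow described above might look like the following Python sketch. It is an illustration under stated assumptions, not the patented implementation: Python's `pickle` stands in for the Kryo serialized file, the "model" is a toy keyword counter, the training model table is an in-memory dictionary, and the names `train`, `save_model` and `load_model` are hypothetical.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path

TRAINABLE = {"CHI", "LDA", "TEXTRANK", "TFIDF"}

def train(algorithm, labeled_corpus):
    """Toy stand-in for the unified Train interface: counts labeled keywords."""
    if algorithm not in TRAINABLE:
        return None  # non-trainable algorithm: no training process, flow ends
    counts = Counter(kw for _, keywords in labeled_corpus for kw in keywords)
    return {"algorithm": algorithm, "keyword_counts": dict(counts)}

def save_model(model, model_name, model_dir, model_table):
    """Serialize the model and record it in the training model table."""
    path = Path(model_dir) / f"{model_name}.bin"
    path.write_bytes(pickle.dumps(model))  # pickle used here in place of Kryo
    model_table[model_name] = {"algorithm": model["algorithm"], "file": str(path)}
    return path

def load_model(model_name, model_table, update_switch=True):
    """Deserialize the named model if the update switch is on, else return None."""
    if not update_switch or model_name not in model_table:
        return None
    return pickle.loads(Path(model_table[model_name]["file"]).read_bytes())

labeled_corpus = [("doc one text", ["radar", "signal"]), ("doc two text", ["radar"])]
model_table = {}
with tempfile.TemporaryDirectory() as model_dir:
    model = train("TFIDF", labeled_corpus)
    save_model(model, "tfidf_v1", model_dir, model_table)
    reloaded = load_model("tfidf_v1", model_table)
```

The `update_switch` argument mirrors the keyword update switch read from the configuration file: when it is off, no model file is read.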
The labeling model effect evaluation module provides methods such as label construction, construction rules and index quantification for the model evaluation indexes, and supports evaluation of the model labeling effect by automatically constructing a comprehensive evaluation model of the labeling algorithms. The specific steps are as follows: the module sets the algorithm for each single index according to the index standard; quantifies the indexes according to the index calculation rules and, for each labeling task, organizes the corresponding indexes to construct the comprehensive evaluation model; and calculates the comprehensive index value and feeds back the labeling model effect.
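As a rough illustration of a comprehensive index value, the sketch below combines single indexes into one score and recommends the best-scoring algorithm. The weighted average, the weights and the index values are all invented for the example; the text does not state how the single indexes are combined.

```python
def composite_score(indexes, weights):
    """Weighted average of single index values (assumed combination rule)."""
    total_weight = sum(weights.values())
    return sum(indexes[name] * w for name, w in weights.items()) / total_weight

def recommend(results, weights):
    """Return the algorithm whose composite score is highest."""
    return max(results, key=lambda alg: composite_score(results[alg], weights))

# Made-up index values for two algorithms on one labeling task.
results = {
    "TFIDF": {"P": 0.60, "R": 0.50, "F": 0.55},
    "TEXTRANK": {"P": 0.50, "R": 0.70, "F": 0.58},
}
weights = {"P": 1.0, "R": 1.0, "F": 2.0}  # hypothetical per-task weighting
best = recommend(results, weights)
```

A different labeling task could supply a different `weights` mapping, which is one way to read "organizing the corresponding indexes according to different labeling tasks".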
There is no unified method, in China or abroad, for evaluating the quality of keyword extraction, and because the selection of keywords from text data is highly subjective, the quality of keyword extraction is evaluated in two ways: machine quantitative analysis and manual subjective judgment. The most common indexes for machine quantitative analysis are the accuracy rate P (precision), the recall rate R (recall), the F value, which is the harmonic mean of keyword extraction accuracy and recall, and the E value, which weights keyword extraction accuracy and recall according to application requirements, where:
Accuracy rate: $$P = \frac{\text{number of correctly extracted keywords}}{\text{total number of extracted keywords}}$$
Recall rate: $$R = \frac{\text{number of correctly extracted keywords}}{\text{number of keywords in the standard answer}}$$
F value: $$F = \frac{2PR}{P + R}$$
Accuracy and recall generally have an inverse relationship: a method that increases accuracy tends to decrease recall, and vice versa. To express an application system's differing requirements on accuracy and recall, the two can be weighted, giving a value E that weights keyword extraction accuracy against recall:
Figure GDA0003617921880000084
where b is the added weight: the larger b is, the greater the weight of the accuracy rate in the E value; otherwise, the greater the weight of the recall rate.
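The quantitative indexes can be computed as in the Python sketch below. The accuracy rate, recall rate and F value follow their standard definitions. The formula image for E is not reproduced in this text, so `weighted_e` uses an assumed form, (1 + b^2)PR / (P + b^2 R), chosen only because it matches the stated behavior (larger b gives the accuracy rate more weight); the commonly used F-beta convention places b^2 on P in the denominator and behaves the opposite way. The keyword lists are invented.

```python
def precision_recall(extracted, reference):
    """Accuracy rate P and recall rate R of extracted keywords against a reference set."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(reference) if reference else 0.0
    return p, r

def f_value(p, r):
    """Harmonic mean of P and R."""
    return 2 * p * r / (p + r) if p + r else 0.0

def weighted_e(p, r, b):
    """ASSUMED form of E: larger b pulls the value toward P, smaller b toward R."""
    denominator = p + b * b * r
    return (1 + b * b) * p * r / denominator if denominator else 0.0

p, r = precision_recall(
    ["radar", "signal", "noise", "antenna"],  # extracted keywords
    ["radar", "signal", "target"],            # standard answer
)
f = f_value(p, r)
e_accuracy_heavy = weighted_e(p, r, 3.0)
e_recall_heavy = weighted_e(p, r, 1 / 3)
```

With b = 1 the assumed E reduces to the F value, which gives a quick sanity check on the weighting.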
In addition, two other indexes are commonly used: the binary preference measure (bpref) and the mean reciprocal rank (MRR), an index used for evaluating search algorithms. bpref is an evaluation index that takes ranking order into account. For a document, suppose R of the M extracted keywords are in the standard answer, with r denoting a correctly extracted keyword and n an incorrectly extracted one; bpref is then calculated by the following formula:
$$\mathrm{bpref} = \frac{1}{R}\sum_{r}\left(1 - \frac{\left|\,n \text{ ranked higher than } r\,\right|}{M}\right)$$
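A small Python sketch of bpref under the definition above, assuming the extracted keywords are given in ranked order without duplicates; the function name and sample data are hypothetical.

```python
def bpref(ranked_keywords, reference):
    """bpref for one document: penalize each correct keyword by the wrong ones above it."""
    reference = set(reference)
    m = len(ranked_keywords)                                     # M extracted keywords
    big_r = sum(1 for kw in ranked_keywords if kw in reference)  # R correct ones
    if m == 0 or big_r == 0:
        return 0.0
    wrong_above = 0
    total = 0.0
    for kw in ranked_keywords:
        if kw in reference:
            total += 1 - wrong_above / m
        else:
            wrong_above += 1
    return total / big_r

score = bpref(["radar", "noise", "signal", "antenna"], ["radar", "signal", "target"])
```

Here "radar" has no wrong keyword above it and "signal" has one, so the two terms are 1 and 1 - 1/4, averaged over R = 2.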
MRR measures the rank of the first correctly recommended keyword in each document and is an evaluation index defined over a document set. For a document d, let $\mathrm{rank}_d$ denote the rank of the first correctly recommended keyword; MRR is then defined as:
$$\mathrm{MRR} = \frac{1}{|D|}\sum_{d \in D}\frac{1}{\mathrm{rank}_d}$$
where D is the document set used for the keyword extraction test.
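MRR over a small test set can be sketched as follows. Treating a document with no correct keyword as contributing zero is an assumption made here, and the document identifiers and keywords are invented.

```python
def mean_reciprocal_rank(ranked_by_doc, reference_by_doc):
    """Average of 1/rank of the first correct keyword over all documents in D."""
    if not ranked_by_doc:
        return 0.0
    total = 0.0
    for doc_id, ranked in ranked_by_doc.items():
        reference = set(reference_by_doc.get(doc_id, ()))
        for position, keyword in enumerate(ranked, start=1):
            if keyword in reference:
                total += 1 / position
                break  # only the first correct keyword counts
    return total / len(ranked_by_doc)

ranked_by_doc = {
    "d1": ["radar", "noise", "signal"],  # first correct keyword at rank 1
    "d2": ["noise", "target", "radar"],  # first correct keyword at rank 2
}
reference_by_doc = {"d1": ["radar", "signal"], "d2": ["target"]}
mrr = mean_reciprocal_rank(ranked_by_doc, reference_by_doc)
```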
In summary, the corpus to be labeled is managed by source or topic in preparation for labeling tasks. Keyword extraction algorithms such as CHI, LDA, TEXTRANK and TFIDF are integrated to complete semi-automatic labeling of the keyword corpus; a selectable, applicable labeling algorithm is offered during labeling, and keyword pre-labeling is performed on the corpus to be labeled. Finally, the labeled corpus is modified, confirmed and submitted through a manual confirmation step, completing the corpus labeling work. After the labeling task is finished, the model is retrained with the labeled corpus. The labeling effect of the model is evaluated by establishing a comprehensive evaluation model of the labeling algorithms, and the evaluation is fed back into model learning and training so that the model performs as well as possible on subsequent labeling tasks; corpus labeling quality and algorithm model performance improve through continuous iteration between model updating and corpus labeling.
The foregoing is directed to the preferred embodiment of the present invention and it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (7)

1. A keyword corpus annotation training extraction system, comprising a keyword corpus labeling preparation module, a semi-automatic corpus keyword labeling module, a feedback keyword labeling model learning and training module, and a keyword labeling model effect evaluation module, characterized in that: the keyword corpus labeling preparation module distinguishes massive corpus data from different sources, selects keyword corpus sources for keyword corpora of different purposes, and sets the selected sources as the corpora to be labeled for those purposes, namely raw corpora; the semi-automatic corpus keyword labeling module first creates a keyword labeling task, autonomously selects a suitable algorithm according to the labeling requirements and corpus characteristics and carries out automatic labeling based on an algorithm model, performing single-keyword pre-labeling of the text corpus data to be labeled by integrating CHI, LDA, the graph-ranking-based keyword extraction algorithm TEXTRANK, and TFIDF, or selecting several keyword extraction algorithms at once for keyword labeling and fusing their labeling results; after the labeling task is completed, the feedback keyword labeling model learning and training module, by setting the parameters of the keyword algorithm models, provides learning and training for the internally integrated keyword labeling algorithm models and the external depth-enhanced labeling algorithm models, retrains the keyword labeling algorithm models with the labeled keyword corpus, improves and updates the models through feedback, and automatically feeds back to and adjusts new keyword labeling tasks through continuous iteration between model updating and corpus labeling; the keyword labeling model effect evaluation module constructs keyword evaluation indexes according to the evaluation index standard for keywords, quantifies the evaluation indexes based on the keyword index rules, establishes a comprehensive evaluation model of the labeling algorithms, automatically evaluates the quantified labeling effect of the models, and automatically recommends the optimal labeling for subsequent keyword labeling tasks; the feedback keyword labeling model learning and training module reads the labeled corpus used for training and selects the keyword algorithm to be trained, where for a non-trainable algorithm there is no training process and the flow ends, and for the trainable algorithms CHI, LDA, TEXTRANK and TFIDF the module performs offline training on the labeled corpus and calls the unified training interface Train to generate the serialized keyword model file Kryo, so that the model accuracy is as high as possible; after the serialized keyword model file Kryo is generated, the module decides whether to save the keyword model, and if not, the flow ends, and if so, external algorithm models are imported through the unified model access interface and updated or exported, the keyword model file, comprising the algorithm name, the model name and the serialized model file, is saved, and the keyword training model table is updated; the trained model replaces the model used for keyword labeling in the platform so that new keyword labeling tasks can be completed; in the keyword model update flow, the module starts the keyword service and selects the keyword algorithm to be updated, and if the selected algorithm is not trainable, the flow ends; for a selected trainable algorithm among CHI, LDA, TEXTRANK and TFIDF, the module parses the keyword update switch in the configuration file to decide whether to update the keyword model, and if not, the flow ends, and if so, the module reads the specified keyword model file according to the keyword model name and the keyword training model table, deserializes the file to load the keyword model, and the program ends.
2. The keyword corpus annotation training extraction system according to claim 1, wherein: the fused labeling result is further reviewed manually according to the keyword labeling business standard and saved as labeled (cooked) corpus, which is managed by the keyword corpus labeling preparation module and used for training the labeling algorithm models, and a unified keyword model access standard is provided, completing the corpus keyword labeling work.
3. The keyword corpus annotation training extraction system according to claim 1, wherein: the semi-automatic corpus keyword labeling module creates labeling business rules for special labeling tasks and manages them, the labeling business rules mainly comprising a business dictionary and regular expressions used for matching character strings.
4. The keyword corpus annotation training extraction system according to claim 1, wherein: the labeling model effect evaluation module sets the algorithm for each single index according to the index standard; quantifies the indexes according to the index calculation rules and, for each labeling task, organizes the corresponding indexes to construct a comprehensive evaluation model of the labeling algorithms; and calculates the comprehensive index value and feeds back the labeling model effect.
5. The keyword corpus annotation training extraction system according to claim 1, wherein: the quality of keyword extraction is evaluated in two ways, machine quantitative analysis and manual subjective judgment.
6. The keyword corpus annotation training extraction system according to claim 5, wherein: the machine quantitative analysis indexes are the accuracy rate P (precision), the recall rate R (recall), the F value and the E value, wherein:
Accuracy rate
$$P = \frac{\text{number of correctly extracted keywords}}{\text{total number of extracted keywords}}$$
Recall rate
$$R = \frac{\text{number of correctly extracted keywords}}{\text{number of keywords in the standard answer}}$$
Harmonic mean of keyword extraction accuracy and recall
$$F = \frac{2PR}{P + R}$$
7. The keyword corpus annotation training extraction system according to claim 1, wherein: in order to express the different requirements of an application system on the accuracy rate P and the recall rate R, a weight is introduced to weight the accuracy rate P against the recall rate R, giving the weighted value E:
Figure FDA0003617921870000024
wherein b is the added weight: the larger b is, the greater the weight of the accuracy rate in the E value; otherwise, the greater the weight of the recall rate.
CN201910455064.3A 2019-05-29 2019-05-29 Keyword corpus labeling training extraction system Active CN110298033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910455064.3A CN110298033B (en) 2019-05-29 2019-05-29 Keyword corpus labeling training extraction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910455064.3A CN110298033B (en) 2019-05-29 2019-05-29 Keyword corpus labeling training extraction system

Publications (2)

Publication Number Publication Date
CN110298033A CN110298033A (en) 2019-10-01
CN110298033B true CN110298033B (en) 2022-07-08

Family

ID=68027297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910455064.3A Active CN110298033B (en) 2019-05-29 2019-05-29 Keyword corpus labeling training extraction system

Country Status (1)

Country Link
CN (1) CN110298033B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781290A (en) * 2019-10-10 2020-02-11 南京摄星智能科技有限公司 Extraction method of structured text abstract of long chapter
CN110968695A (en) * 2019-11-18 2020-04-07 罗彤 Intelligent labeling method, device and platform based on active learning of weak supervision technology
CN111125312A (en) * 2019-12-24 2020-05-08 深圳视界信息技术有限公司 Text labeling method and system
CN111143577B (en) 2019-12-27 2023-06-16 北京百度网讯科技有限公司 Data labeling method, device and system
CN111476034B (en) * 2020-04-07 2023-05-12 同方赛威讯信息技术有限公司 Legal document information extraction method and system based on combination of rules and models
CN111859854A (en) * 2020-06-11 2020-10-30 第四范式(北京)技术有限公司 Data annotation method, device and equipment and computer readable storage medium
CN111859872A (en) * 2020-07-07 2020-10-30 中国建设银行股份有限公司 Text labeling method and device
CN112269877A (en) * 2020-10-27 2021-01-26 维沃移动通信有限公司 Data labeling method and device
CN112365159A (en) * 2020-11-11 2021-02-12 福建亿榕信息技术有限公司 Deep neural network-based backup cadre recommendation method and system
CN112508376A (en) * 2020-11-30 2021-03-16 中国科学院深圳先进技术研究院 Index system construction method
CN112307175B (en) * 2020-12-02 2021-11-02 龙马智芯(珠海横琴)科技有限公司 Text processing method, text processing device, server and computer readable storage medium
CN112632284A (en) * 2020-12-30 2021-04-09 上海明略人工智能(集团)有限公司 Information extraction method and system for unlabeled text data set
CN112395395B (en) * 2021-01-19 2021-05-28 平安国际智慧城市科技股份有限公司 Text keyword extraction method, device, equipment and storage medium
CN112862458A (en) * 2021-03-02 2021-05-28 岭东核电有限公司 Nuclear power test procedure supervision method and device, computer equipment and storage medium
CN113536783A (en) * 2021-07-14 2021-10-22 福建亿榕信息技术有限公司 Model-based new word discovery method
CN115511668B (en) * 2022-10-12 2023-09-08 金华智扬信息技术有限公司 Case supervision method, device, equipment and medium based on artificial intelligence
CN118095251B (en) * 2024-04-23 2024-06-18 北京国际大数据交易有限公司 Offline text data evaluation method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106997344A (en) * 2017-03-31 2017-08-01 成都数联铭品科技有限公司 Keyword abstraction system
CN108197098A (en) * 2017-11-22 2018-06-22 阿里巴巴集团控股有限公司 A kind of generation of keyword combined strategy and keyword expansion method, apparatus and equipment
CN108268447A (en) * 2018-01-22 2018-07-10 河海大学 A kind of mask method of Tibetan language name entity
CN108595460A (en) * 2018-01-05 2018-09-28 中译语通科技股份有限公司 Multichannel evaluating method and system, the computer program of keyword Automatic
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN108960338A (en) * 2018-07-18 2018-12-07 苏州科技大学 The automatic sentence mask method of image based on attention-feedback mechanism
CN109710728A (en) * 2018-11-26 2019-05-03 西南电子技术研究所(中国电子科技集团公司第十研究所) News topic automatic discovering method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180196870A1 (en) * 2017-01-12 2018-07-12 Microsoft Technology Licensing, Llc Systems and methods for a smart search of an electronic document

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Unified Model for Cross-Domain and Semi-Supervised Named Entity Recognition in Chinese Social Media,Jinseok Nam Semi-Supervised Neural Networks for Nested Named Entity Recognition;Hangfeng He等;《AAAI》;20170204;3216-3222 *
Semi-supervised sequence tagging with bidirectional language models;Matthew E. Peters等;《arXiv》;20170429;1-10 *
国外知识抽取系统研究;刘晓娟等;《情报科学》;20090715;第27卷(第07期);1110-1113 *
教学视频的文本语义镜头分割和标注;王敏等;《数据采集与处理》;20161115;第31卷(第06期);1171-1177 *
面向 3D CT 影像处理的无监督推荐标注算法;冯浩哲等;《计算机辅助设计与图形学学报》;20190215;第31卷(第02期);183-189 *

Also Published As

Publication number Publication date
CN110298033A (en) 2019-10-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant