CN116702786A - Chinese professional term extraction method and system integrating rules and statistical features - Google Patents

Info

Publication number
CN116702786A
Authority
CN
China
Prior art keywords
term
terms
professional
candidate
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310973797.2A
Other languages
Chinese (zh)
Other versions
CN116702786B (en)
Inventor
孙宇清
李成
龚斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202310973797.2A
Publication of CN116702786A
Application granted
Publication of CN116702786B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

A Chinese professional term extraction method and system integrating rules and statistical features, belonging to the technical field of natural language processing, comprising the following stages. In the professional term discovery stage, a word segmentation tool that is in general use in the field of natural language processing, and that comprises a word-frequency statistical dictionary and a probabilistic algorithm, is adopted. In the term screening stage, statistics-based term extraction indexes and extraction techniques are provided, such as word frequency, number of entries, pointwise mutual information, vocabulary degree of freedom and information quantity difference ratio, so that the professional terms in the word segmentation result are separated from general words or concepts. In the evaluation and optimization stage, an evaluation index and an optimization technique based on the difference in the information quantity of a professional term between professional text and general text are provided, and candidate results are evaluated and optimized from the two angles of language rules and context.

Description

Chinese professional term extraction method and system integrating rules and statistical features
Technical Field
The invention discloses a method and a system for extracting Chinese professional terms by integrating rules and statistical features, belonging to the technical field of natural language processing.
Background
Professional term extraction refers to processing an input professional text and automatically extracting the professional terms it contains; it is basic work in the field of natural language processing. Professional texts generally contain a large number of professional terms or concepts. These terms are the result of technological progress and socioeconomic development, are specialized and normative, and concisely and accurately define the related theories, technologies and methods of a field. Correctly delimiting professional terms is valuable in applications such as keyword extraction, professional dictionary generation, professional text understanding and knowledge discovery. For example, a task that extracts a structured knowledge graph from professional text must first identify the professional terms and entities, and then infer the relationships between concept terms and entities through a relation extraction model. Professional terms often act as keywords in the syntactic structure and carry more information than general words, yet they are frequently segmented incorrectly during Chinese professional text segmentation and term recognition. There is therefore an urgent need to solve the problem of extracting professional terms from text.
In English, a professional term is an independent word or a combination of several words. By contrast, the Chinese professional term extraction problem is essentially to identify, in running text, continuous character sequences that carry a domain-specific meaning, which is a difficult problem in the field of natural language processing. Judging from the experimental results in the technical document "Wu H, Ma B, Liu W, et al. Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning [J]. 2022.", the best F1 value of the latest models is only about 30% to 46%. The practical difficulty stems from the way Chinese vocabulary is formed: a Chinese professional term is composed of several characters or words, each of which has its own meaning. Identifying professional terms in text therefore requires delimiting the span of each term according to the text semantics and separating the meaning of the term from the conventional semantics of its constituent words.
The related prior work on the professional term extraction problem mainly includes the following:
Chinese patent document CN108287825A discloses a term identification and extraction method and system, comprising: performing multiple rounds of recognition and extraction on terms; identifying multi-word combination terms; matching translations; and performing term extraction. The scheme first screens the text word segmentation results by part of speech and then extracts professional terms based on vocabulary co-occurrence. In essence it extracts professional terms based on statistical probability alone; this single angle cannot fully describe the internal linguistic features of professional terms, and the extraction process depends on a professional term library.
Chinese patent document CN114528835A discloses a semi-supervised professional term extraction method, medium and device based on interval discrimination, belonging to the field of natural language processing. In that method, interval features, including semantic features, part-of-speech features and length features, are constructed according to the characteristics of professional terms and used to discriminate them; compared with the traditional sequence labeling method, this effectively handles nesting among terms. The semi-supervised extraction process built for the professional term extraction task also alleviates, to some extent, the difficulty of labeling professional terms and the high cost of constructing data sets. The semi-supervised mechanism achieves a good extraction effect with a small number of training samples, and the feature construction method designed for professional terms makes the extraction result more accurate.
Chinese patent document CN113343683A discloses a Chinese new word discovery method integrating an autoencoder and adversarial training, comprising the following steps: 1) extracting sentence-level semantic information with a text reconstruction autoencoder in an unsupervised pre-training manner; 2) adding prior syntactic knowledge and fusing it with the character vectors to form character-syntax concatenated vectors, so as to improve the accuracy of segmenting ambiguous words; 3) performing adversarial training on the character-syntax concatenated vectors: mixed data from the source domain and the target domain are fed into a shared layer, the adversarial architecture is used to generate domain-independent feature vectors and extract domain-independent information, and the commonality between domains is used to alleviate the shortage of labeled data in the professional domain; 4) labeling the character sequence obtained after the adversarial training in step 3) with a conditional random field network layer, so as to discover new words and output the result. The method makes full use of the abundant labeled corpora of the general domain and of prior syntactic knowledge to assist word segmentation.
Chinese patent document CN114912449A discloses a technical feature keyword extraction method and system based on code description text, belonging to the technical field of natural language processing. That invention comprehensively considers information related to the technical features of code, such as semantics, syntax and vocabulary specificity. It uses a fusion analysis of vocabulary knowledge and sentence syntax knowledge, combining co-occurring vocabulary and dependency relations to construct a semantic association graph; uses the pre-trained model BERT as a text encoder to extract abstract semantic information from the text; and calculates vocabulary weights with a random walk algorithm, so as to capture long-distance semantic dependencies among words and take into account both the importance and the specificity of keywords.
Chinese patent document CN103309852A relates to a domain-specific compound word discovery method based on statistics and rules. The method comprises: using a word segmentation system to perform word segmentation and part-of-speech tagging; traversing the segmentation result and filtering it with stop words and word-formation rules; traversing to generate a directed graph of atomic words; using depth-first traversal to enumerate the possible compound words under the joint constraint of statistical indexes and word-formation rules; generating a compound word candidate set for manual screening; and importing the compound words into a dictionary file for later use.
The methods proposed in the above Chinese patent documents all depend on a high-quality professional term library of a certain scale, but real applications often lack such a library, so the models are applicable only to professional fields that have labeled data. These methods also ignore the fact that professional terms are determined relative to general concepts, and they lack a term extraction scheme from this angle. In addition, the extraction results lack evaluation criteria.
The term extraction schemes proposed in Chinese patent documents CN108287825A, CN114528835A and CN103309852A lack quantitative evaluation criteria, so the extraction results cannot be analyzed quantitatively and scientifically. The method proposed in CN103309852A is not automated: it requires manual participation during operation and cannot process large amounts of data quickly. In general, existing methods measure professional terms at a relatively single level and lack diversified, optimized extraction.
Disclosure of Invention
Aiming at the defects of the prior art, the invention discloses a Chinese professional term extraction method integrating rules and statistical characteristics.
The invention also discloses a Chinese professional term extraction system for realizing the method.
Summary of The Invention
The invention discloses a Chinese professional term extraction method integrating rules and statistical features, comprising: professional term discovery, professional term screening, and evaluation and optimization.
In the professional term discovery stage, a word segmentation tool that is in general use in the natural language processing field, and that comprises a word-frequency statistical dictionary and a probabilistic algorithm, is adopted. If a candidate professional term is recorded in the segmentation dictionary and is segmented out as a whole, it proceeds directly to the subsequent screening stage. If a professional term is not recorded in the dictionary and is therefore not segmented out directly by the segmenter, candidate professional terms are extracted by splicing subwords based on subword part-of-speech collocation rules, pointwise mutual information and the vocabulary degree of freedom.
In the term screening stage, the invention provides statistics-based term extraction indexes and extraction techniques, such as word frequency, number of entries, pointwise mutual information, vocabulary degree of freedom and information quantity difference ratio, so that the professional terms in the word segmentation result are separated from general words or concepts.
In the evaluation and optimization stage, the invention provides an evaluation index and an optimization technique based on the difference in the information quantity of a professional term between professional text and general text, and the candidate results are evaluated and optimized from the two angles of language rules and context.
Technical term interpretation:
1. Professional text: in the invention, refers to Chinese text that belongs to a specific professional field and contains a large number of professional terms and concepts.
2. Professional term: in the invention, refers to the concepts and terms that are defined in a professional field to describe its related problems, theories, techniques, methods and the like, and that take rich and varied forms and specific combinations.
3. Number of entries: in the invention, refers to the number of pages that are returned by an online internet query using a specific search engine and that completely mention the query word.
4. Vocabulary degree of freedom: in the invention, refers to how flexibly a word combines with its context words in the given professional text, that is, the probability distribution of its adjacent words; it is divided into a left degree of freedom and a right degree of freedom.
5. Statistical features: in the invention, refers to term features summarized for a given professional text on the basis of probability theory and statistics, such as word frequency, number of entries, pointwise mutual information and vocabulary degree of freedom.
6. Contextual features: in the invention, refers to the combined features of the meaning a professional term shows in use, the meaning of its context, and the language rules it follows.
7. Relative specificity: in the invention, refers to the relative difference in language rules between professional terms and general concepts.
8. Feature necessity: in the invention, refers to a comprehensive extraction index obtained by adjusting the influence of each screening index through changed weights or designed rules.
9. General text: in the invention, refers to Chinese text that does not belong to a specific professional field and does not contain a large number of professional terms and concepts.
10. T-SNE: a technique in the computer field for reducing high-dimensional vectors to two dimensions; like BERT, it is common general knowledge in the field, and the skilled person can learn the algorithm by looking it up.
The detailed technical scheme of the invention is as follows:
A Chinese professional term extraction method integrating rules and statistical features, characterized by comprising the following steps:
S1: performing word segmentation on the professional text with a word segmenter to output a word segmentation result;
the word segmenter used in the invention is one commonly used in the field. Owing to its design concept and algorithm, such a segmenter does not take professional terms into account when segmenting. Based on its dictionary, the segmenter scans the input text from front to back to build a directed acyclic graph of the possible segmentations of each sentence. Because the text is segmented according to the dictionary only, professional terms that are not in the dictionary are split into several sub-concepts at this step. Dynamic programming is then used, from back to front, to find the maximum-probability segmentation path based on word frequency. Because the segmenter's dictionary is obtained from statistics over a general corpus, some professional terms occur relatively rarely, have a low segmentation probability, and are easily segmented incorrectly;
in addition, after segmentation, the word segmentation result comprises class A professional terms, class B professional terms and non-professional terms:
the class A professional terms are professional terms that are recorded in the segmenter dictionary and can be segmented out directly by the segmenter, so the task of extracting them is converted into distinguishing professional terms from general concepts in the segmentation result. A large-scale general corpus is designed and compared with the professional text, the statistical differences of the vocabulary between the two corpora are calculated, and candidate professional terms are judged according to those differences;
the class B professional terms are professional terms that are not recorded in the segmenter dictionary; owing to the limitations of the segmenter's design concept and its statistics-based algorithm, such terms are split into several sub-concepts during segmentation;
the non-professional terms are the other common concepts or words in the segmentation result that are neither class A nor class B professional terms and do not form terms;
for example, segmenting the professional text "concept extraction is an important task of natural language processing" yields [ "concept extraction", "is", "natural language", "processing", "important task" ]. Here "concept extraction" is a class A professional term, which the segmenter segments out as a whole; "natural language processing" is a class B professional term, which the segmenter does not segment out as a whole because it is split into two parts, "natural language" and "processing"; the remainder are non-professional terms, that is, noise data that needs to be filtered;
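The dictionary-based segmentation described in S1 can be illustrated with a minimal sketch. It is not the segmenter used in the invention: the function names `build_dag` and `segment`, the toy dictionary `TOY_FREQ` and its counts are all hypothetical, and single characters missing from the dictionary are assumed to have a count of 1. The sketch builds the directed acyclic graph from front to back and then picks the maximum-probability path from back to front.

```python
import math

# Hypothetical toy word-frequency dictionary (counts are made up).
# "自然语言处理" is deliberately absent, so it behaves like a class B term.
TOY_FREQ = {"自然": 50, "语言": 60, "自然语言": 20, "处理": 80,
            "是": 500, "重要": 90, "任务": 70}


def build_dag(text, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i, len(text)) if text[i:j + 1] in freq]
        dag[i] = ends or [i]  # fall back to the single character
    return dag


def segment(text, freq):
    """Return the maximum-probability segmentation found by dynamic programming."""
    log_total = math.log(sum(freq.values()))
    n = len(text)
    dag = build_dag(text, freq)
    # route[i] = (best log-probability of text[i:], end index of the first word)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):  # back to front
        route[i] = max(
            (math.log(freq.get(text[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:  # follow the best path front to back
        j = route[i][1]
        words.append(text[i:j + 1])
        i = j + 1
    return words


print(segment("自然语言处理是重要任务", TOY_FREQ))
```

With this toy dictionary the term "自然语言处理" comes out as the two sub-concepts "自然语言" and "处理", which mirrors the class B situation described above.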
S2: inputting the word segmentation result to a term discovery module to obtain candidate technical terms:
in the extraction process, the professional terms are classified and a different discovery strategy is adopted for each class, because the word-level grammar that forms a professional term, the function reflected by the sentence component it serves as in use, and the domain difficulty reflected by its pattern of occurrence across the whole text all add to the challenge of the task;
a class A professional term is directly regarded as a candidate professional term;
for class B professional terms, adjacent subwords are spliced and the splicing result is taken as a candidate professional term. The specific number of splices is determined according to Zipf's law, which is not content to be protected by the invention. Because splicing introduces noise, the noise must be filtered by the term screening module according to the designed professional term extraction indexes;
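The splicing of adjacent subwords can be sketched as follows. This is only an illustration: the function name `splice_candidates` and the parameter `max_parts` are hypothetical, and the fixed upper bound stands in for the splice count that the text derives from Zipf's law, which is not specified here.

```python
def splice_candidates(tokens, max_parts=2):
    """Join every run of 2..max_parts adjacent tokens into one candidate string.

    max_parts is an assumed stand-in for the splice count chosen via Zipf's law.
    """
    candidates = []
    for n in range(2, max_parts + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append("".join(tokens[i:i + n]))
    return candidates


tokens = ["自然语言", "处理", "是", "重要", "任务"]
print(splice_candidates(tokens))
```

The output contains the wanted candidate "自然语言处理" together with noise such as "处理是", which is why the screening indexes of the next step are needed.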
S3: inputting the candidate professional terms into a term screening module and screening them based on extraction indexes to obtain the final professional terms, wherein the extraction indexes comprise extraction indexes based on statistical design, extraction indexes based on linguistic design and extraction indexes based on cognitive science design;
The extraction index based on statistical design refers to the following. Thresholds are set for the word frequency and the number of entries respectively, and a candidate professional term that falls within the threshold range is regarded as a word, that is, the candidate statistically conforms to the rules of word formation in linguistics. The thresholds are set according to experimental results and experience. Pointwise mutual information and the vocabulary degree of freedom are then used to further judge whether the word constitutes a professional term: in addition to being compared against their own thresholds, the pointwise mutual information value and the vocabulary degree of freedom are later used as weight terms when the comprehensive score of a professional term is calculated. The word frequency is the number of times a candidate professional term appears in the input professional text. The number of entries is the number of pages that are returned by an online internet query using a specific search engine and that completely mention the query word; it can be obtained with a web crawler. Thresholds are set on the word frequency and the number of entries, and the candidate professional terms that satisfy them are retained;
based on the above description, the technical advantages of this technical feature are as follows: candidate professional terms that do not form words are screened out from the angle of statistical regularity, because candidates that are not words necessarily come from noise produced by incorrect segmentation or by splicing adjacent subwords. The word frequency measures the statistical characteristics of a candidate in the professional text, and the number of entries measures its statistical characteristics in online queries. Combining the two makes it possible to judge statistically whether a candidate conforms to the rules of word formation in linguistics, and thus to make a preliminary judgment of whether it is reasonable as a professional term.
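A minimal sketch of the word-frequency and entry-count screening follows. The function name `filter_by_frequency`, the threshold values and the `entry_counts` mapping are hypothetical; in practice the entry counts would come from a search engine query, which is replaced here by a hand-written dictionary, and the thresholds are assumed to be simple lower bounds.

```python
def filter_by_frequency(candidates, text, entry_counts,
                        min_freq=2, min_entries=1000):
    """Keep candidates whose in-text frequency and search entry count
    both reach the (assumed) minimum thresholds."""
    kept = []
    for cand in candidates:
        freq = text.count(cand)  # occurrences in the professional text
        entries = entry_counts.get(cand, 0)  # pages mentioning the candidate
        if freq >= min_freq and entries >= min_entries:
            kept.append(cand)
    return kept


text = "自然语言处理很重要。自然语言处理是任务。处理是关键。"
entry_counts = {"自然语言处理": 5_000_000, "处理是": 300}  # made-up numbers
print(filter_by_frequency(["自然语言处理", "处理是"], text, entry_counts))
```

Both candidates occur twice in the toy text, so the word frequency alone does not separate them; the entry count is what removes the spliced noise "处理是".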
The pointwise mutual information is used to measure the correlation between the subwords of a candidate professional term. In probability theory, pointwise mutual information measures the correlation between two random variables: from the perspective of information entropy, it measures how much determining one random variable reduces the uncertainty of the other, and the greater the effect, the stronger the correlation. For the professional term extraction problem, pointwise mutual information measures the correlation between the subwords into which a professional term has been split, and this correlation is used to judge how likely the combination of two subwords is to be a professional term: the higher the correlation, the more likely the two subwords are to occur together, and the more likely their combination is a professional term;
the pointwise mutual information between the subwords of a candidate professional term is calculated as follows:

$$\mathrm{PMI}(w_1, w_2) = \log \frac{p(w)}{p(w_1)\,p(w_2)} \qquad (1)$$

$$p(w) = \frac{\mathrm{count}(w)}{N} \qquad (2)$$

In formula (1) and formula (2), $w_1$ and $w_2$ respectively represent the subwords constituting the candidate professional term; $w$ represents the candidate professional term; $p(w)$ represents the frequency with which the candidate term $w$ occurs in the input professional text; $p(w_1)$ represents the frequency with which the subword $w_1$ occurs in the input professional text; $p(w_2)$ represents the frequency with which the subword $w_2$ occurs in the input professional text;
$p(\cdot)$ is calculated as shown in formula (2), wherein $\mathrm{count}(w)$ represents the number of occurrences of the candidate professional term in the input professional text, and $N$ represents the length of the professional text;
the technical advantages of this technical feature are as follows: the subwords that form a professional term are treated statistically as random variables, and the likelihood that they form a professional term is measured by the correlation between the two subwords. Because the correlation between two words that form noise is weak, their pointwise mutual information value is relatively low. Candidates whose pointwise mutual information falls below a threshold, which is set through experiments and experience, are filtered out: if the correlation between the subwords is weak, the word formed by them is considered not to be a professional term in the text;
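The pointwise mutual information of formula (1), with the frequency estimate of formula (2), can be sketched as below. The function names `probability` and `pmi` are hypothetical, the text length is taken to be the number of characters, and a base-2 logarithm is assumed because the base is not stated in the text.

```python
import math


def probability(word, text):
    """Formula (2): occurrences of the word divided by the text length."""
    return text.count(word) / len(text)


def pmi(left, right, text):
    """Formula (1) for the candidate formed by joining two subwords.

    Uses log base 2 (an assumption) and returns -inf when the joined
    candidate never occurs in the text.
    """
    p_joint = probability(left + right, text)
    if p_joint == 0:
        return float("-inf")
    return math.log2(p_joint / (probability(left, text) * probability(right, text)))


text = "自然语言处理和自然语言处理"
print(pmi("自然语言", "处理", text))  # subwords that always occur together
print(pmi("处理", "自然", text))      # a pair that never occurs adjacently
```

A candidate would then be dropped when its value falls below an empirically chosen threshold, as the paragraph above describes.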
the vocabulary degree of freedom measures how flexibly a candidate professional term is used; specifically, it is measured by the degree of disorder of the words that appear on the left and right sides of the term. Because noise usually arises by accident, or appears in a relatively fixed context, the vocabulary degree of freedom of noise is generally low, so professional terms and noise can be distinguished by it. The vocabulary degree of freedom comprises the left vocabulary degree of freedom and the right vocabulary degree of freedom;
The left vocabulary degree of freedom is calculated as follows:

$$H_L(w) = -\sum_{i} \frac{n_{L,i}}{N_L} \log \frac{n_{L,i}}{N_L} \qquad (3)$$

In formula (3), $w$ represents the candidate professional concept, $N_L$ represents the total number of all words appearing on the left of $w$, and $n_{L,i}$ represents the number of times the $i$-th kind of word appears on the left of $w$; the sum runs over the distinct kinds of words;
for example, regarding the total number of all words appearing on the left of a word: suppose the word "lifetime" appears four times in the article, in "one lifetime", "this lifetime", "next lifetime" and "this lifetime". The words on its left are "one", "this", "next" and "this", so the total number of words is 4, while the number of kinds of words is 3: "one", "this" and "next";
the right vocabulary degree of freedom is calculated as follows:

$$H_R(w) = -\sum_{i} \frac{n_{R,i}}{N_R} \log \frac{n_{R,i}}{N_R} \qquad (4)$$

In formula (4), $w$ represents the candidate professional concept, $N_R$ represents the total number of all words appearing on the right of $w$, and $n_{R,i}$ represents the number of times the $i$-th kind of word appears on the right of $w$;
the results of formulas (3) and (4) have the same properties as the entropy that describes the inherent degree of disorder in physics. Calculating the entropy of the words that appear on the left and right sides of a word follows the common intuition that rarer words carry more information and more frequent words carry less, so the measure is reasonable, and it is therefore used to describe how flexibly the word is used.
The technical advantages of this technical feature are as follows: the flexibility of a word is described by measuring the entropy of the words that appear on the left and right sides of a candidate professional term, where rarer neighboring words carry more information and more frequent neighboring words carry less. A threshold is then set according to experience and experimental results, and candidates whose left and right vocabulary degree of freedom values fall below the threshold are rejected. The left and right vocabulary degrees of freedom are also used later as two weight terms when the comprehensive score of a professional term is calculated, which denoises the candidate professional terms generated by splicing;
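The left and right vocabulary degrees of freedom can be sketched as the entropy of the neighboring words. The function names `neighbor_entropy` and `degrees_of_freedom` are hypothetical, the input is assumed to be an already segmented token list, and a base-2 logarithm is assumed because the base is not stated in the text.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Entropy (base 2, an assumption) of the distribution of neighbor words."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def degrees_of_freedom(term, tokens):
    """Return (left, right) entropy of the words adjacent to each occurrence of term."""
    left = [tokens[i - 1] for i, t in enumerate(tokens) if t == term and i > 0]
    right = [tokens[i + 1] for i, t in enumerate(tokens)
             if t == term and i < len(tokens) - 1]
    return neighbor_entropy(left), neighbor_entropy(right)


# Left neighbors from the "lifetime" example: 4 words in total, 3 distinct kinds.
print(neighbor_entropy(["one", "this", "next", "this"]))  # 1.5
```

For the "lifetime" example the probabilities are 1/2, 1/4 and 1/4, giving an entropy of 1.5 bits; a word that always has the same neighbor would score 0 and be treated as inflexible.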
the extraction index based on linguistic design refers to: candidate professional terms are screened through part-of-speech rules, and because the generation of the professional terms accords with specific linguistic rules, the part-of-speech of the sub-words forming the professional terms accords with fixed part-of-speech collocation rules, so that candidate professional terms can be screened through part-of-speech rules, vocabularies which do not accord with part-of-speech forming rules of the professional terms are removed, and partial part-of-speech rules in statistical language modeling and automatic Chinese text proofreading technology are adopted as shown in the following table 1:
Table 1. Part-of-speech collocation rules within professional terms
In Table 1, letters represent parts of speech: n represents a noun; v represents a verb; vn represents a nominal verb; vi represents an intransitive verb; a represents an adjective; l represents an idiom; b represents a distinguishing word; ng represents a noun morpheme; an represents a nominal adjective; m represents a numeral. The letter meanings follow the general linguistic definitions;
For noise reduction, binary professional terms are subject to an additional constraint: at least one of the two subwords composing the term must have passed the screening for class A professional terms, that is, the screening method described herein for class A concepts. The screening methods do not apply uniformly to the two classes of concepts. For example, a class A concept has no subwords, so the point mutual information described by formula (1) cannot be used for it, but other screening strategies without this limitation can be. Because the class A results are obtained before the class B professional terms are screened, the subwords can be required to be professional;
The technical advantages of this technical feature are as follows: from the linguistic perspective, the existing rules of professional-term formation and the parts of speech of a candidate's subwords are used to judge whether the candidate conforms to the subword part-of-speech collocation rules of professional terms, thereby incorporating linguistic expert knowledge to assist the model in extracting professional terms;
The extraction index based on cognitive science design comprises relative specificity and feature necessity. Based on the theory that human knowledge and understanding of professional terms is built on general concepts, a large-scale general corpus is constructed and compared with the professional text corpus from which terms are to be extracted. The statistical differences of the candidate professional terms between the two corpora are analyzed to judge the professionalism of the candidates:
Relative specificity is used to judge how specific a candidate professional term is compared with its behaviour in a general corpus. For example, with the People's Daily used as the general corpus, the word frequency, point mutual information value, and occurrence probability of each candidate in the general corpus are calculated, and whether the candidate is a professional term is judged from how it behaves in the general corpus;
Specifically, for word frequency, a first threshold is set on the general text. Since a professional term should not occur many times in general text, candidates whose number of occurrences in the general text is less than or equal to the first threshold are selected. The first threshold is obtained through experimental analysis and experience; the recommended value is ten;
For the point mutual information value, since the subwords of a professional term are more strongly correlated in the input professional text than in the general corpus, candidates whose point mutual information value in the professional text is greater than that in the general text are selected;
For the occurrence probability, a second threshold is set. Since a professional term occurs with higher probability in professional text than in general text, candidates for which the ratio of the occurrence probability in the professional text to that in the general text is greater than the second threshold are selected. The second threshold can be obtained through experimental analysis and experience; the recommended value is four hundred;
The technical advantages of this technical feature are as follows: from the perspective of cognitive science, based on the theory that human knowledge and understanding of professional terms is built on general concepts, the People's Daily corpus is selected as the large-scale general corpus, and the behaviour of the candidate professional terms in the two corpora is compared to judge their professionalism;
Feature necessity is used to integrate the extraction indexes: a comprehensive professional-term score and accompanying rules are designed according to the importance of each extraction index. The comprehensive score formula is as follows:
In formula (5), CWS is the comprehensive professional-term score; the remaining symbols represent, respectively, the word frequency, the point mutual information value, and the left and right vocabulary degrees of freedom;
The rules specifically require that a candidate's point mutual information value and comprehensive score both rank within a fixed number of top places among all candidates, preferably the top one thousand; this value is obtained through experimental analysis and experience. An automatic intelligent data labeling model is finally obtained;
The technical advantages of this technical feature are as follows: the weights of the indexes that most influence the judgment of professionalism are increased, and the ranking constraint is added, so that the most prominent candidates can be selected even when thresholds are set low. This hard requirement strengthens the constraint on professional terms and filters out noise that earlier steps missed.
In summary, the invention is divided into term discovery and term screening:
In the term discovery part, a statistics-based word segmentation tool is adopted, which comprises a dictionary with word frequency statistics and a statistics-based word segmentation algorithm. Two cases are distinguished by the number of words in the dictionary: a dictionary containing more than one hundred thousand words is called the large dictionary case, and a dictionary containing fewer is called the small dictionary case. In the large dictionary case, some commonly used professional terms are recorded in the segmenter's dictionary, so the statistics-based segmentation algorithm can separate them as whole words; these are marked as class A. Other, rarer professional terms are not in the dictionary and are difficult for the segmenter to separate directly; these are marked as class B. According to the naming principles of Chinese vocabulary, class B professional terms often consist of several general words, such as "genetic engineering", "refrigerant pumping device", and "liquid refrigerant injection joint".
In the term screening part, for class B professional terms the invention splices the segmentation results of the segmenter to recover professional terms that were wrongly split. Extraction indexes are then designed for class A and class B professional terms in turn. From the statistical perspective, professional terms are extracted based on word frequency, number of entries, point mutual information, and vocabulary degree of freedom. From the linguistic perspective, the parts of speech of the subwords composing a professional term are considered, and terms are further extracted according to the subword part-of-speech composition rules. From the perspective of cognitive science, based on the theory that human knowledge and understanding of professional terms is built on general concepts, a large-scale general corpus is compared with the professional text corpus from which terms are to be extracted, and terms are extracted according to two comparative indexes: relative specificity and domain-dependent feature necessity.
Preferably, according to the present invention, the method further includes S4: evaluating the extracted professional terms using quantitative evaluation criteria. The criteria include a criterion based on statistical characteristics and a criterion designed on basic rules of cognitive science, which can be applied separately or together;
The evaluation criterion based on statistical characteristics rests on the assumption that a professional term occurs more frequently in professional text than in general text. The statistical characteristic of a word in a specific sample is computed as:
$f(w, d) = \dfrac{\mathrm{count}(w, d)}{|d|}$ (6)
In formula (6), $w$ represents a word; $d$ represents a professional document; $\mathrm{count}(w, d)$ represents the number of occurrences of the word $w$ in the document; $|d|$ represents the total length of the document. Specific evaluation criteria are designed from formula (6). From the perspective of cognitive science, part of a term's professionalism is reflected in regularities of language use: the more a word is used in professional text, the more likely it is a professional term; the more a concept is used in general text, the more likely it is a general concept;
The evaluation criteria designed on basic rules of cognitive science are:
$f(t, D_p) > f(t, D_g)$ (7)
$f(c, D_g) > f(t, D_g)$ (8)
In formulas (7) and (8), $t$ represents a professional term extracted by the automatic intelligent data labeling model; $c$ represents a general concept obtained by random sampling; $D_p$ represents the professional text; $D_g$ represents the general text;
Formula (7) is based on the assumption that professional terms are more common in professional text than in general text, so if the extraction result satisfies formula (7), the extracted professional terms are shown to be professional;
Formula (8) is based on the assumption that a general concept occurs more frequently in general text than a professional term does; it verifies whether the randomly sampled general concepts are indeed general.
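The two checks above can be illustrated with a minimal sketch. It assumes the statistic is the number of occurrences of a word divided by the document length, as the symbol definitions suggest; the function names and the toy token lists are hypothetical and are not taken from the patent.

```python
from collections import Counter


def relative_frequency(word, tokens):
    """Occurrences of `word` divided by the document length (assumed form of the statistic)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


def term_is_specific(term, professional_tokens, general_tokens):
    """Check in the spirit of formula (7): the term is more frequent in professional text."""
    return relative_frequency(term, professional_tokens) > relative_frequency(term, general_tokens)


def concept_is_general(concept, term, general_tokens):
    """Check in the spirit of formula (8): the sampled concept beats the term in general text."""
    return relative_frequency(concept, general_tokens) > relative_frequency(term, general_tokens)


# Hypothetical toy corpora, already segmented into tokens.
professional = ["entropy", "of", "the", "term", "entropy", "is", "high"]
general = ["the", "weather", "is", "nice", "and", "the", "entropy", "talk", "is", "late"]

specific = term_is_specific("entropy", professional, general)
general_ok = concept_is_general("the", "entropy", general)
```

In practice the checks would be run over every extracted term and every sampled general concept, and the proportion of passes reported.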
According to a preferred embodiment of the present invention, the S4 evaluation criteria further include a criterion based on contextual features: words are represented by embedded word vectors, and the vectors are reduced in dimension and visualized.
Preferably, according to the invention, the dimension-reduction visualization of the embedded word vectors is performed as follows:
The contextual features are the sum of the semantic content and the regularities of language use expressed by the context in which a professional term is used. Specifically, word embeddings of the extracted professional terms and of the randomly sampled general concepts are obtained from a Chinese BERT (Bidirectional Encoder Representations from Transformers) model, that is, each word is represented by an embedded word vector. The vectors are then reduced in dimension and visualized with t-SNE, and the result is analyzed. As shown in FIG. 1, the two groups can be told apart after visualization, which indicates that the algorithm proposed by the invention is effective.
The technical advantages of this technical feature are as follows: first, based on the inherent characteristics of professional terms, the extracted terms are evaluated comprehensively from two different angles, statistical characteristics and contextual features; second, since professional terms are defined relative to general concepts, randomly sampled general concepts are used as comparison data in both evaluation methods.
The invention designs its evaluation indexes from the necessary conditions for being a professional term, in particular from the two aspects of language regularity and context. Quantitative and visual indexes are designed around two points: a professional term occurs more frequently in professional text than in general text, and the contexts of professional terms and general concepts differ markedly. The extraction results are evaluated against these indexes.
The Chinese professional term extraction system for implementing the method comprises: a term discovery module for implementing the term discovery and a term screening module for implementing the term screening.
Preferably, according to the invention, the system further comprises an evaluation module for evaluating the extracted terms using a quantitative evaluation criterion.
The invention has the beneficial effects that:
1. The invention combines statistical rules with expert rules and extracts professional terms with a deterministic algorithm. Prior-art methods need a partial professional-term dictionary or rely on a single rule for screening. By contrast, the invention fuses several kinds of statistical feature information, such as word frequency, number of entries, point mutual information value, and vocabulary degree of freedom, with expert knowledge such as part-of-speech rules, relative specificity, and feature necessity. It can better delimit the feature value ranges that distinguish professional terms from general concepts, and it reflects the word-formation rules of professional terms, the regularities of language used to describe professional knowledge, and the statistical regularities used in feature selection.
2. The invention designs extraction techniques for the two different classes of terms. It not only addresses the practical limitations of mainstream word segmentation tools, giving it a wider range of application, but also adapts to the continual emergence of new terms and concepts that are not in any dictionary. The proposed technical scheme handles the data characteristics of real environments, and the proposed rules accord with Zipf's law.
3. The invention provides multi-angle quantitative evaluation rules and a professional-term extraction optimization technique, which measure the professionalism of terms from the two angles of statistical characteristics and contextual features.
Drawings
FIG. 1 is a visualization of the contextual differences between the extracted professional terms and general concepts;
in FIG. 1, dots represent professional terms (labelled "specific content" in English in the figure); crosses "×" represent general concepts (labelled "general concept" in English in the figure); the horizontal and vertical axes have no special physical meaning and only represent the two dimensions after dimension reduction; the coordinates of the dots and crosses give the positions of the professional terms and general concepts after dimension reduction;
FIG. 2 is an overall block diagram of the term of art extraction system of the present invention;
fig. 3 is a flow chart of the design of the term extraction system according to the present invention.
Detailed Description
The present invention will be described in detail with reference to examples and drawings, but is not limited thereto.
Example 1
As shown in FIG. 2 and FIG. 3, a Chinese professional term extraction method integrating rules and statistical features comprises:
S1: performing word segmentation on the professional text with a word segmenter and outputting the segmentation result;
The word segmenter adopted in the invention is one commonly used in the field. Such a segmenter does not take professional terms into account when segmenting, which follows from its design and the algorithm it uses. Working from front to back over the input text, the segmenter builds a directed acyclic graph of the possible segmentations of each sentence from its dictionary. Since the text is segmented only according to the dictionary, professional terms that are not in the dictionary are split into several sub-concepts at this step. Dynamic programming is then used, from back to front, to find the most probable segmentation path based on word frequency. Because the segmenter's dictionary is obtained from statistics over a general corpus, some professional terms occur relatively rarely, have a low probability of being kept as a word, and are easily segmented incorrectly;
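The segmentation scheme just described (a dictionary-driven directed acyclic graph built front to back, then a back-to-front dynamic program over word frequencies) can be sketched as follows. This is a toy illustration and not the segmenter used by the invention; the dictionary, its frequencies, and the fallback for unknown characters are assumptions made for the example.

```python
import math


def build_dag(sentence, freq):
    """For each start index, list the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in freq]
        dag[i] = ends or [i]  # assumption: unknown characters become single-character words
    return dag


def segment(sentence, freq):
    """Pick the most probable path through the DAG, scanning from back to front."""
    n = len(sentence)
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    route = {n: (0.0, 0)}  # route[i] = (best log-probability from i to the end, end index)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


# Hypothetical dictionary: "自然语言处理" itself is not an entry.
toy_freq = {"自然": 50, "语言": 60, "自然语言": 30, "处理": 80}
result = segment("自然语言处理", toy_freq)
```

Because the full term is missing from the toy dictionary, the best path keeps "自然语言" and "处理" as separate words, which is the situation that gives rise to class B terms.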
The professional text may contain special symbols that affect segmentation accuracy, or stop words that affect the statistics. The invention therefore preprocesses the professional text; the preprocessing steps are deleting special characters and removing stop words. Note that preprocessing is required when the professional text contains many special symbols that hinder understanding of the text, and is optional when the quality of the professional text is good;
In addition, after segmentation, the segmentation result comprises class A professional terms, class B professional terms, and non-professional terms:
Class A professional terms are professional terms that are recorded in the segmenter's dictionary and can be separated directly by the segmenter. For these, the extraction task becomes distinguishing professional terms from general concepts in the segmentation result: a large-scale general corpus is compared with the professional text, the statistical differences of the words between the two corpora are calculated, and candidate professional terms are judged from these differences;
Class B professional terms are professional terms that are not recorded in the segmenter's dictionary. Owing to the limits of the segmenter's design and its statistics-based algorithm, such terms are split into several sub-concepts during segmentation;
Non-professional terms are the remaining general concepts or words in the segmentation result that are neither class A nor class B professional terms;
For example, segmenting the professional text "concept extraction is an important task of natural language processing" gives ["concept extraction", "is", "natural language", "processing", "important task"]. Here "concept extraction" is a class A professional term, which the segmenter separates as a whole; "natural language processing" is a class B professional term, which the segmenter fails to keep whole because it is split into two parts, "natural language" and "processing"; the rest are non-professional terms, that is, noise data that needs to be filtered;
s2: inputting the word segmentation result to a term discovery module to obtain candidate technical terms:
In the extraction process, professional terms are classified and a different discovery strategy is adopted for each class, because the word-level grammar of a professional term, the function of the sentence component it serves as, and the domain difficulty reflected by its pattern of occurrence across the text all add to the challenge of the task;
Class A professional terms are taken directly as candidate professional terms;
For class B professional terms, adjacent subwords are spliced and the spliced results are taken as candidate professional terms. The number of splices is determined according to Zipf's law, which is not part of what the invention seeks to protect. Splicing also introduces noise, which the term screening module must filter according to the designed extraction indexes;
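Splicing adjacent subwords amounts to generating short n-grams over the segmentation result. The sketch below is illustrative only: the source says the number of splices is chosen according to Zipf's law without giving the procedure, so `max_splices` is simply a hypothetical parameter here, as are the function name and the `sep` argument.

```python
def splice_candidates(tokens, max_splices=2, sep=""):
    """Join each run of 2..max_splices+1 adjacent tokens into one candidate term.

    `sep` is empty by default because Chinese words are written without spaces.
    """
    candidates = []
    for start in range(len(tokens)):
        for width in range(2, max_splices + 2):
            if start + width <= len(tokens):
                candidates.append(sep.join(tokens[start:start + width]))
    return candidates


# English glosses of the segmentation example, joined with a space for readability.
tokens = ["natural language", "processing", "is"]
spliced = splice_candidates(tokens, max_splices=1, sep=" ")
```

Most spliced strings, such as "processing is", are noise, which is why every candidate still has to pass the screening indexes.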
s3: inputting candidate technical terms into a term screening module, and screening the candidate technical terms based on extraction indexes to obtain final technical terms, wherein the extraction indexes comprise extraction indexes based on statistical design, extraction indexes based on linguistic design and extraction indexes based on cognitive scientific design;
In this step, different extraction strategies are used for class A and class B professional terms, as shown in FIG. 2.
For class A professional terms, the invention employs the term discovery strategy in the lower left corner of FIG. 2. First, the general text and the professional text are each segmented by the segmenter. Then, for the words in the professional-text segmentation result, the corresponding statistical features are computed over the general-text segmentation result. Both are then input into the term screening module.
For class B professional terms, the invention employs the term discovery strategy in the lower right corner of FIG. 2. The segmenter's results for the professional text are reused, and the number of splices is determined according to Zipf's law to form candidate professional terms, which are then input to the term screening module. Note that the term screening module used for class B professional terms already contains the relevant information from the general text, so features that need the general text, such as relative specificity, can be calculated without difficulty.
The extraction index based on statistical design refers to: thresholds are set for word frequency and number of entries, and candidate professional terms within the threshold range are regarded as words, that is, the candidates statistically conform to the linguistic rules of word formation. The thresholds are obtained from experimental results and experience. Point mutual information and vocabulary degree of freedom are then used to further judge whether a word is a professional term: besides having their own thresholds, the point mutual information value and the vocabulary degrees of freedom serve as weight terms in the subsequent calculation of the comprehensive score. Word frequency is the number of times a candidate professional term appears in the input professional text. Number of entries is the number of pages, returned by a specific search engine in an online query, that mention the query word in full; it can be obtained with a web crawler. Thresholds are set on word frequency and number of entries, and candidates that satisfy them are selected;
Point mutual information is used to measure the correlation between the subwords of a candidate professional term. In probability theory, point mutual information measures the correlation of two random variables: from the viewpoint of information entropy, it measures how much determining one random variable reduces the uncertainty of the other, and the greater this effect, the stronger the correlation. For professional term extraction, point mutual information measures the correlation between the subwords into which a term was split, and this correlation is used to judge how likely the combination of two subwords is to be a professional term: the higher the correlation, the more likely the two subwords occur together, and the more likely their combination is a professional term;
The point mutual information between the subwords of a candidate professional term is calculated as:
$\mathrm{PMI}(w) = \log \dfrac{p(w)}{p(w_1)\,p(w_2)}$ (1)
$p(w) = \dfrac{n(w)}{N}$ (2)
In formulas (1) and (2), $w_1$ and $w_2$ respectively represent the subwords composing the candidate professional term; $w$ represents the candidate professional term; $p(w)$ represents the frequency of occurrence of the candidate $w$ in the input professional text; $p(w_1)$ represents the frequency of occurrence of subword $w_1$ in the input professional text; $p(w_2)$ represents the frequency of occurrence of subword $w_2$ in the input professional text;
$p(w)$ is calculated as shown in formula (2), where $n(w)$ represents the number of occurrences of the candidate professional term in the input professional text and $N$ represents the length of the professional text;
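As a worked illustration, the sketch below computes point mutual information from raw counts. It assumes the standard definition, the logarithm of p(w) divided by the product of the subword probabilities, with each probability taken as an occurrence count divided by the text length; the natural logarithm and the function names are choices made for this example, since the source does not state a base.

```python
import math


def occurrence_probability(count, text_length):
    """Number of occurrences divided by the length of the professional text."""
    return count / text_length


def pointwise_mutual_information(count_term, count_sub1, count_sub2, text_length):
    """log( p(term) / (p(sub1) * p(sub2)) ), assuming natural logarithm."""
    p_term = occurrence_probability(count_term, text_length)
    p_sub1 = occurrence_probability(count_sub1, text_length)
    p_sub2 = occurrence_probability(count_sub2, text_length)
    return math.log(p_term / (p_sub1 * p_sub2))


# Hypothetical counts: the pair co-occurs ten times more often than chance predicts.
strongly_bound = pointwise_mutual_information(10, 20, 50, 1000)
# Hypothetical counts: the pair co-occurs exactly as often as chance predicts.
independent = pointwise_mutual_information(1, 10, 100, 1000)
```

A value well above zero means the two subwords appear together far more often than their individual frequencies would suggest, which supports treating the spliced string as one term.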
The vocabulary degree of freedom is used to measure how flexibly a candidate professional term is applied, and is measured by the degree of disorder of the words appearing on its left and right sides. Noise usually arises by chance or in a relatively fixed context, so its vocabulary degree of freedom is generally low, and professional terms can be distinguished from noise by this measure. It comprises the left vocabulary degree of freedom and the right vocabulary degree of freedom;
The left vocabulary degree of freedom is calculated as:
$H_L(w) = -\sum_{i} p_i \log p_i$ (3)
In formula (3), $p_i = n_i / N$, where $w$ represents the candidate professional concept, $N$ represents the total number of all words appearing on the left of $w$, and $n_i$ represents the number of times the $i$-th kind of word appears on the left of $w$;
For example, suppose the word "lifetime" appears four times in an article, in "one lifetime", "this lifetime", "next lifetime", and "this lifetime". The words appearing on its left are then "one", "this", "next", "this", so the total number of words is 4 and the number of kinds of words is 3: "one", "this", "next";
The right vocabulary degree of freedom is calculated as:
$H_R(w) = -\sum_{i} p_i \log p_i$ (4)
In formula (4), $p_i = n_i / N$, where $w$ represents the candidate professional concept, $N$ represents the total number of all words appearing on the right of $w$, and $n_i$ represents the number of times the $i$-th kind of word appears on the right of $w$;
The result obtained from formula (4) has the same property as entropy, which in physics describes inherent disorder. The entropy of the words appearing on the left and right sides of a term is calculated: a word that appears rarely carries a large amount of information, and a word that appears often carries little. This agrees with common understanding, so the measure is reasonable and describes how flexibly a term is applied.
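The left and right vocabulary degrees of freedom can be sketched as the entropy of the neighbouring words. The code below assumes the entropy form suggested by the symbol definitions, with each proportion being a neighbour's count divided by the total number of neighbours, and uses the natural logarithm; the function names and token lists are hypothetical.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Entropy of the distribution of neighbouring words (0 for an empty or fixed context)."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def degrees_of_freedom(tokens, term):
    """Return (left, right) vocabulary degrees of freedom of `term` in a token list."""
    left = [tokens[i - 1] for i, t in enumerate(tokens) if t == term and i > 0]
    right = [tokens[i + 1] for i, t in enumerate(tokens) if t == term and i + 1 < len(tokens)]
    return neighbor_entropy(left), neighbor_entropy(right)


# The "lifetime" example from the text: left neighbours are one, this, next, this.
lifetime_left = neighbor_entropy(["one", "this", "next", "this"])

# A term that always appears in the same context has zero freedom on both sides.
fixed_left, fixed_right = degrees_of_freedom(["in", "vitro", "x", "in", "vitro", "x"], "vitro")
```

For the "lifetime" example the proportions are 1/4, 1/2, and 1/4, so the entropy is 1.5 times the logarithm of 2; a candidate whose neighbours never vary scores zero and is treated as likely noise.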
The extraction index based on linguistic design refers to: screening candidate professional terms by part-of-speech rules. Because professional terms are formed according to specific linguistic rules, the parts of speech of the subwords composing a professional term follow fixed collocation rules. Candidates can therefore be screened by part of speech, and words that do not follow the part-of-speech composition rules of professional terms are removed. Some of the part-of-speech rules given in "Statistical Language Modeling and Automatic Chinese Text Proofreading Technology" are adopted, as shown in Table 1 below:
Table 1. Part-of-speech collocation rules within professional terms
In Table 1, letters represent parts of speech: n represents a noun; v represents a verb; vn represents a nominal verb; vi represents an intransitive verb; a represents an adjective; l represents an idiom; b represents a distinguishing word; ng represents a noun morpheme; an represents a nominal adjective; m represents a numeral. The letter meanings follow the general linguistic definitions;
For noise reduction, binary professional terms are subject to an additional constraint: at least one of the two subwords composing the term must have passed the screening for class A professional terms, that is, the screening method described herein for class A concepts. The screening methods do not apply uniformly to the two classes of concepts. For example, a class A concept has no subwords, so the point mutual information described by formula (1) cannot be used for it, but other screening strategies without this limitation can be. Because the class A results are obtained before the class B professional terms are screened, the subwords can be required to be professional;
The extraction index based on cognitive science design comprises relative specificity and feature necessity. Based on the theory that human knowledge and understanding of professional terms is built on general concepts, a large-scale general corpus is constructed and compared with the professional text corpus from which terms are to be extracted. The statistical differences of the candidate professional terms between the two corpora are analyzed to judge the professionalism of the candidates:
Relative specificity is used to judge how specific a candidate professional term is compared with its behaviour in a general corpus. For example, with the People's Daily used as the general corpus, the word frequency, point mutual information value, and occurrence probability of each candidate in the general corpus are calculated, and whether the candidate is a professional term is judged from how it behaves in the general corpus;
Specifically, for word frequency, a first threshold is set on the general text. Since a professional term should not occur many times in general text, candidates whose number of occurrences in the general text is less than or equal to the first threshold are selected. The first threshold is obtained through experimental analysis and experience; the recommended value is ten;
For the point mutual information value, since the subwords of a professional term are more strongly correlated in the input professional text than in the general corpus, candidates whose point mutual information value in the professional text is greater than that in the general text are selected;
For the occurrence probability, a second threshold is set. Since a professional term occurs with higher probability in professional text than in general text, candidates for which the ratio of the occurrence probability in the professional text to that in the general text is greater than the second threshold are selected. The second threshold can be obtained through experimental analysis and experience; the recommended value is four hundred;
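The three relative-specificity checks can be combined into one filter, sketched below. Requiring all three conditions together is an assumption of this example, as is the treatment of a candidate that never occurs in the general corpus; the function and parameter names are hypothetical, while the default thresholds of ten and four hundred are the recommended values stated above.

```python
def passes_relative_specificity(
    general_count,
    pmi_professional,
    pmi_general,
    p_professional,
    p_general,
    max_general_count=10,         # first threshold (recommended value: ten)
    min_probability_ratio=400.0,  # second threshold (recommended value: four hundred)
):
    """Return True when a candidate looks specific to the professional corpus."""
    # Word frequency: the candidate must be rare in the general corpus.
    if general_count > max_general_count:
        return False
    # Point mutual information: subwords must bind more tightly in professional text.
    if pmi_professional <= pmi_general:
        return False
    # Occurrence probability: assumption, a candidate absent from the general
    # corpus has an unbounded ratio and is accepted.
    if p_general == 0:
        return True
    return p_professional / p_general > min_probability_ratio


# Hypothetical statistics for three candidates.
kept = passes_relative_specificity(3, 5.0, 1.0, 0.004, 0.000005)
too_common = passes_relative_specificity(50, 5.0, 1.0, 0.004, 0.000005)
weak_ratio = passes_relative_specificity(3, 5.0, 1.0, 0.0005, 0.000005)
```

Keeping the thresholds as parameters reflects the statement that they come from experimental analysis and experience and may need retuning for a different corpus.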
Feature necessity is used to integrate the various extraction indices: a comprehensive professional-term score and accompanying rules are designed according to the relative importance of the indices. The comprehensive score formula is as follows:
in equation (5), CWS is the comprehensive professional-term score; Freq denotes the word frequency; PMI denotes the pointwise mutual information value; and the remaining two terms denote the left and right lexical degrees of freedom, respectively;
the rule is that both the PMI value and the comprehensive score of a candidate term must rank within a fixed number of top positions among all candidate terms, preferably the top one thousand; this value is obtained through experimental analysis and experience. An automatic intelligent data-labeling model is finally obtained.
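The joint ranking rule can be sketched in Python as follows. This is a hedged illustration, not the patented formula: the score is assumed to be a plain weighted sum of the four indices, and the weights, function names and feature layout are hypothetical, since the actual coefficients of equation (5) are not reproduced in this text.

```python
def comprehensive_score(freq, pmi, left_dof, right_dof, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four indices; the weights are placeholders."""
    w_freq, w_pmi, w_left, w_right = weights
    return w_freq * freq + w_pmi * pmi + w_left * left_dof + w_right * right_dof


def joint_top_k(features, k=1000, weights=(1.0, 1.0, 1.0, 1.0)):
    """Return terms that are in the top k by PMI and also in the top k by score.

    features maps term -> (freq, pmi, left_dof, right_dof).
    """
    by_pmi = sorted(features, key=lambda t: features[t][1], reverse=True)[:k]
    by_score = sorted(
        features,
        key=lambda t: comprehensive_score(*features[t], weights=weights),
        reverse=True,
    )[:k]
    top_score = set(by_score)
    # Preserve PMI order among the terms that satisfy both rankings.
    return [t for t in by_pmi if t in top_score]
```

Requiring both rankings at once means a candidate with a high frequency but weak sub-word cohesion, or the reverse, is rejected.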
In summary, the invention is divided into two parts, term discovery and term screening:
in the term discovery part, a statistics-based word segmentation tool is adopted, comprising a dictionary with word-frequency statistics and a statistics-based segmentation algorithm. Two cases are distinguished according to the number of words in the dictionary: a dictionary containing more than one hundred thousand words is called the large-dictionary case, and a dictionary containing fewer than one hundred thousand words is called the small-dictionary case. In the large-dictionary case, some commonly used professional terms are already recorded in the segmenter's dictionary, so the statistics-based segmentation algorithm can separate them as complete words; these are labeled class A. Other, rarer professional terms are not contained in the dictionary and are difficult for the segmenter to separate directly; these are labeled class B. Following the naming conventions of Chinese vocabulary, class B professional terms often consist of several general words, for example "genetic engineering", "refrigerant pumping device" and "liquid refrigerant injection joint".
In the term screening part, for class B professional terms, the invention splices the segmenter's output to recover professional terms that were incorrectly split. From the statistical perspective, extraction indices based on word frequency, word count, mutual information and lexical degree of freedom are then designed for class A and class B professional terms in turn. From the linguistic perspective, the part-of-speech composition of the sub-words that make up a professional term is considered, and terms are further extracted according to sub-word part-of-speech composition rules. From the cognitive-science perspective, on the premise that humans come to know and understand professional terms on the basis of general concepts, a large-scale general corpus is compared with the professional text corpus from which terms are to be extracted, and terms are extracted according to two comparative indices: relative specificity and domain-dependent feature necessity.
EXAMPLE 2,
The method for extracting Chinese professional terms integrating rules and statistical features as in embodiment 1 further comprises S4: evaluating the extracted professional terms using quantitative evaluation criteria. The criteria comprise criteria based on statistical features and criteria designed from basic rules of cognitive science, which can be applied separately or together;
The criterion based on statistical features rests on the assumption that a professional term occurs more frequently in professional text than in general text, and uses the relative frequency of a word in a given sample as the statistic:
$P(w \mid d) = \dfrac{\mathrm{count}(w)}{|d|}$
In this formula, $w$ denotes a word; $d$ denotes a professional document; $\mathrm{count}(w)$ denotes the number of occurrences of the word $w$; and $|d|$ denotes the total length of the document. Specific evaluation criteria are designed from this formula. From the cognitive-science perspective, part of a term's professional character is reflected in regularities of language use: the more a word is used in professional text, the more likely it is a professional term; the more a concept is used in general text, the more likely it is a general concept;
the evaluation criteria designed from basic rules of cognitive science comprise two formulas:
in the two formulas, the symbols denote, respectively: the professional terms extracted by the automatic intelligent data-labeling model; a general concept obtained by random sampling; the professional text; and the general text;
the first formula is based on the assumption that professional terms are more common in professional text than in general text; if the extraction result satisfies it, the extracted professional terms are shown to have specificity;
the second formula is based on the assumption that the occurrence probability of a general concept in general text is greater than that of a professional term in general text; it verifies whether the randomly sampled general concepts are indeed general.
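The statistical evaluation can be sketched in a few lines of Python. The sketch assumes that the relative frequency of a word is its number of non-overlapping occurrences divided by the text length in characters, and that a criterion is summarized as the share of words satisfying the comparison; the function names and the character-based length are assumptions made for illustration only.

```python
def relative_frequency(word, text):
    """Occurrences of word divided by the text length in characters."""
    if not text:
        return 0.0
    return text.count(word) / len(text)


def share_more_common(words, text_a, text_b):
    """Fraction of words whose relative frequency is higher in text_a than in text_b."""
    if not words:
        return 0.0
    hits = sum(
        relative_frequency(w, text_a) > relative_frequency(w, text_b) for w in words
    )
    return hits / len(words)
```

Calling `share_more_common(terms, professional_text, general_text)` gives the share of extracted terms that are more common in the professional text; swapping the two text arguments and passing sampled general concepts gives the corresponding share for the generality check.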
The evaluation criteria in S4 further include a criterion based on context features: words are represented by embedded word vectors, and the vectors are visualized after dimensionality reduction. As shown in fig. 2, the evaluation criteria thus comprise two parts: evaluation based on statistical features and evaluation based on context features.
The dimensionality-reduction visualization of the embedded word vectors is performed as follows:
context features are the semantic connotation and language-regularity characteristics expressed by the context in which a professional term is used. Specifically, word embeddings are computed with a Chinese BERT (Bidirectional Encoder Representations from Transformers) model for both the extracted professional terms and the randomly sampled general concepts, so that each word is represented by an embedded word vector; the vectors are then reduced in dimension and visualized with t-SNE, and the result is analyzed. As shown in fig. 1, the two groups can be distinguished after visualization, which indicates that the algorithm proposed by the invention is effective.
The invention designs evaluation indices for the extraction result from the necessary conditions that characterize a professional term, in particular from the two aspects of language regularity and context. The extraction result is evaluated with quantitative and visual indices built on two observations: a professional term occurs more frequently in professional text than in general text, and the contexts of professional terms differ markedly from those of general concepts.
In the method for extracting Chinese professional terms integrating rules and statistical features according to embodiments 1 and 2, professional terms are extracted for class A and class B as follows:
a general dictionary and a word segmenter are selected; the large general corpus and the professional text are each segmented, and the word frequency of every word in the segmentation results is counted. Whether a word in the segmentation result is a professional term that the segmenter has separated correctly is then judged comprehensively from the ratio of its word frequency in the professional text to its word frequency in the large corpus, together with the indices designed in this part. In addition, to balance the word-frequency differences caused by the different lengths of the large corpus and the professional text, the word frequencies are scaled in proportion to the word counts of the two texts. Through these steps and the threshold design, the professional terms that the segmenter can separate directly are screened out.
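The length scaling described above can be sketched as follows. This is a minimal illustration under assumptions: text length is measured in tokens, the general-corpus counts are rescaled to the length of the professional text, and the function name is hypothetical.

```python
from collections import Counter


def scaled_frequency_ratio(prof_tokens, gen_tokens):
    """Map each professional-text word to count_prof / (count_gen * scale).

    scale = len(prof_tokens) / len(gen_tokens) rescales general-corpus counts
    to the size of the professional text (an assumed normalization).
    """
    if not gen_tokens:
        raise ValueError("the general corpus must not be empty")
    prof_counts = Counter(prof_tokens)
    gen_counts = Counter(gen_tokens)
    scale = len(prof_tokens) / len(gen_tokens)
    ratios = {}
    for term, n_prof in prof_counts.items():
        n_gen_scaled = gen_counts.get(term, 0) * scale
        # Words unseen in the general corpus get an infinite ratio.
        ratios[term] = n_prof / n_gen_scaled if n_gen_scaled > 0 else float("inf")
    return ratios
```

Without the scaling factor, a general corpus that is many times longer than the professional text would make almost every word look more frequent in the general corpus.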
To extract the professional terms in the text more completely, terms that the segmenter has split into several sub-words must also be considered. The segmentation results are therefore spliced according to Zipf's law, each spliced string is treated as a new word, and professional terms are then screened from these new words. Word frequency, number of words, pointwise mutual information, lexical degree of freedom, part-of-speech rules, relative specificity and feature necessity are each measured, and whether the spliced word is a professional term is judged comprehensively.
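Splicing adjacent segmentation results and scoring the spliced pairs by pointwise mutual information can be sketched as below. The sketch makes several assumptions that are not stated in the source: only two-token splices are formed, frequencies are counts divided by the number of tokens, and the logarithm is taken to base 2. The function names are hypothetical.

```python
import math
from collections import Counter


def splice(pair):
    """Join two adjacent sub-words into one candidate word."""
    return "".join(pair)


def bigram_pmi(tokens):
    """Return {(x, y): PMI} for every adjacent token pair in the segmented text."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), n_xy in bigrams.items():
        p_xy = n_xy / n          # frequency of the spliced candidate
        p_x = unigrams[x] / n    # frequency of the left sub-word
        p_y = unigrams[y] / n    # frequency of the right sub-word
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

A pair whose sub-words almost always occur together receives a higher score than a pair formed by chance adjacency, which is what makes the measure useful for separating split terms from noise.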
In the method for extracting Chinese professional terms integrating rules and statistical features as in embodiments 1 and 2, the threshold design comprises the following two steps.
1) The thresholds set in this embodiment result from jointly considering several extraction criteria. Taking pointwise mutual information as an example, a low-threshold extraction criterion is set first, the many words that satisfy it are ranked, and the threshold value is then chosen from the ranking. Note that the part-of-speech constraint on sub-words applies throughout the process: the part-of-speech combination of a candidate's sub-words must conform to the combination rules for professional-term sub-words.
2) After the thresholds for word frequency, number of terms, mutual information, lexical degree of freedom and part-of-speech rules have been obtained, several feature indices are selected and compared with the corresponding index values of the same words in the large-scale general corpus. This yields the relative specificity of the professional terms, according to which a further portion of noise is filtered out. In addition, the above extraction indices are combined, and joint extraction rules beyond the individual thresholds are designed, forming the feature-necessity extraction scheme. Specifically, the invention takes as the extraction result those concepts whose PMI value and weighted sum of several indices both rank in the top one thousand among all candidate professional terms.
Using the methods described in embodiments 1 and 2, with the economic law as the professional text and the People's Daily corpus as the general text, professional terms are extracted as follows:
a. The composition of a professional term imposes a fixed constraint on the parts of speech of its sub-words. The invention therefore first analyzes the parts of speech of the sub-words of every spliced concept and judges whether their combination conforms to the part-of-speech combination rules for professional-term sub-words. All non-conforming spliced words are filtered out.
b. The thresholds are designed according to the results of multiple previous experiments, as follows:
Table 2 Threshold settings
The comprehensive weighted score is designed as follows:
where Freq denotes the word frequency, PMI denotes the pointwise mutual information, and the remaining two terms denote the left and right lexical degrees of freedom, respectively;
c. In addition to the individual thresholds above, a joint constraint over several indices must be satisfied. Specifically, experimental analysis over a variety of extraction criteria showed that the extracted professional terms are most accurate when both the CWS and the PMI value are required to rank in the top one thousand.
d. To reflect the specificity of professional terms relative to general concepts, a general data set is selected for comparison. Specifically, based on the general rule that a professional term should not appear in large numbers in general text, the invention sets a rule that retains only candidate professional terms whose word frequency in the large general corpus is less than or equal to 10.
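The part-of-speech filter of step a can be pictured as a lookup against a set of allowed tag sequences. The tag names and the allowed patterns below (noun plus noun, verb plus noun, adjective plus noun) are purely illustrative assumptions; the actual combination rules are not listed in this text.

```python
# Hypothetical allowed part-of-speech sequences, for illustration only.
ALLOWED_PATTERNS = {("n", "n"), ("v", "n"), ("a", "n"), ("n", "n", "n")}


def conforms_to_pos_rules(tags, allowed=ALLOWED_PATTERNS):
    """True if the sub-word tag sequence is one of the allowed patterns."""
    return tuple(tags) in allowed


def filter_spliced_words(tagged_candidates, allowed=ALLOWED_PATTERNS):
    """Keep spliced words whose sub-word tags match an allowed pattern.

    tagged_candidates is a list of (sub_words, tags) pairs.
    """
    return [
        "".join(sub_words)
        for sub_words, tags in tagged_candidates
        if conforms_to_pos_rules(tags, allowed)
    ]
```

Because the filter runs before any statistical index is computed, it removes spliced strings such as particle plus noun combinations cheaply.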
The following results were finally obtained. Judging from a sample of the extraction results, the invention can extract professional concepts to a certain extent, which supports the effectiveness of the method.
The invention then evaluates the extracted professional terms:
(i) The extraction result is evaluated from the perspective of statistical features. The extracted terms are required to be more common in professional text than in general text, and randomly selected general concepts are required to be more common in general text. The experiment is repeated ten times, with general concepts randomly sampled each time, and the mean of the ten results is taken as the final result.
Specifically, the three formulas of the evaluation criteria are used for the calculation, with the following results. Of the extracted professional terms, 67.98% occur with higher probability in the professional text than in the general text. A value that is either too high or too low would be undesirable, because part of the extracted professional terms, accounting for 50% of the total, can be separated directly by the segmenter; such professional terms are more widely known and frequently used. Of the randomly sampled general concepts, 76.74% are more common in general text, which shows that the randomly sampled general concepts have a certain generality.
(ii) The extraction result is evaluated from the perspective of context. The sentences in which the extracted terms appear are taken as the context of the extraction result, and the sentences of the large corpus that match the randomly sampled general concepts are taken as the context of the general concepts. BERT is then used to obtain word-embedding representations of the two kinds of context, which are reduced in dimension and visualized with t-SNE; the result on a two-dimensional plane is shown in fig. 1. The figure shows that the contexts of the professional terms extracted by the method differ clearly from the contexts of general concepts, which supports the effectiveness of the method to a certain extent.
The figure shows a clear boundary between the word vectors of the professional terms and those of the general concepts; that is, the context of the extraction result differs from the context in which general concepts appear.
Using the above method with the economic law as the input sample, the effectiveness of the extracted professional terms is further verified with a suitable open relation extraction method.
Based on the above method, 178 professional terms are obtained. The concept relation categories of the professional sample are then designed; with reference to WordNet, four relation types are defined: composition, attribute, constraint and correlation.
The composition relation describes inclusion between two concepts, for example "regulations include administrative regulations and local regulations". The attribute relation describes a characteristic of a concept. The constraint relation describes a relatively unbalanced state between two concepts, in which one concept may restrict the other. The correlation relation describes a relatively balanced state between two concepts.
Relation extraction is then performed on the selected professional text, based on the extracted professional terms, using the model of Zhao J, Gui T, Zhang Q, et al. A Relation-Oriented Clustering Method for Open Relation Extraction [J]. arXiv preprint arXiv:2109.07205, 2021. Its main idea is to obtain the relations between terms by clustering. After adaptation, the model consists mainly of the following three parts.
The data set is encoded with a pre-trained BERT model, mapping the target data to a high-dimensional vector h.
The high-dimensional vector h is mapped to a low-dimensional vector by a nonlinear mapping function g, and clustering is then performed on the low-dimensional vectors. Because the distance from the low-dimensional vector to the cluster center is used as the loss function, a nonlinear decoder d is designed to map the low-dimensional vector back to the original high-dimensional vector space, and the distance between the decoded vector and h is added to the loss. This prevents the mapping g from collapsing onto the cluster center.
The clustering results are classified by a classifier.
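The role of the decoder term in the second part can be made concrete with a small numeric sketch. It assumes squared Euclidean distances and a single weighting factor `recon_weight`; both are illustrative assumptions, and the function names are hypothetical rather than taken from the cited model.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clustering_loss(h, z, center, decoded, recon_weight=1.0):
    """Clustering term plus reconstruction term.

    h       : original high-dimensional vector
    z       : low-dimensional vector g(h)
    center  : cluster center assigned to z
    decoded : d(z), the vector mapped back to the high-dimensional space
    """
    cluster_term = squared_distance(z, center)
    reconstruction_term = squared_distance(decoded, h)
    return cluster_term + recon_weight * reconstruction_term
```

If g mapped every input onto the cluster center, the first term would vanish, but the decoder could no longer recover distinct inputs, so the second term would stay large; this is the sense in which the added distance discourages collapse.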
Finally, the extracted relation results are obtained. A knowledge graph is then constructed with the professional terms as nodes and the relations between them as edges.
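A knowledge graph of this kind can be held in a plain adjacency structure. The sketch below assumes that the extracted relations arrive as (head term, relation type, tail term) triples; the triple layout and the function name are assumptions made for illustration.

```python
def build_knowledge_graph(triples):
    """Build an adjacency map: term -> list of (relation, neighboring term)."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        # Make sure terms that only appear as tails are still nodes.
        graph.setdefault(tail, [])
    return graph


triples = [
    ("regulations", "composition", "administrative regulations"),
    ("regulations", "composition", "local regulations"),
]
graph = build_knowledge_graph(triples)
```

The example triples restate the composition relation given in the text, "regulations include administrative regulations and local regulations".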
The invention designs automatic analysis rules based on linguistic knowledge and extracts the terms in professional text by combining them with the lexical statistics of a corpus. The corpus-based evaluation indices and extraction techniques, namely word frequency, number of entries, pointwise mutual information and lexical degree of freedom, reflect the statistical information of professional terms. The sentence function of professional terms and the part-of-speech rules of their constituent sub-words reflect the linguistic knowledge embodied in naming conventions for new words. The relative-specificity and domain-dependence indices and the related extraction techniques reflect a law of cognitive science, namely that humans define professional terms on the basis of general concepts: extraction relies on the relative difference in information between a large-scale general corpus and the text corpus to be processed, a professional term occurs more frequently in professional text than in general text, and the contexts of professional terms differ markedly from those of general concepts. These principles, quantitative indices and extraction techniques reflect the inherent regularities and social attributes of language, and have important theoretical significance and practical value.
EXAMPLE 3,
A Chinese professional term extraction system for implementing the method of embodiments 1 and 2, comprising: a term discovery module for implementing the term discovery and a term screening module for implementing the term screening.
EXAMPLE 4,
The system of embodiment 3, further comprising an evaluation module that evaluates the extracted professional terms using quantitative evaluation criteria.

Claims (6)

1. A Chinese professional term extraction method integrating rules and statistical features, characterized by comprising the following steps:
S1: performing word segmentation on the professional text with a word segmenter to output a segmentation result;
after segmentation by the word segmenter, the segmentation result comprises class A professional terms, class B professional terms and non-professional terms:
the class A professional terms are professional terms recorded in the segmenter's dictionary;
the class B professional terms are professional terms that are not recorded in the segmenter's dictionary and are split into several sub-concepts during segmentation;
the non-professional terms are the other general concepts, or words that do not form terms, in the segmentation result apart from the class A and class B professional terms;
S2: inputting the segmentation result into a term discovery module to obtain candidate professional terms:
the class A professional terms are taken directly as candidate professional terms;
adjacent sub-words of the class B professional terms are spliced, and the spliced results are taken as candidate professional terms;
S3: inputting the candidate professional terms into a term screening module and screening them based on extraction indices to obtain the final professional terms, wherein the extraction indices comprise extraction indices based on statistical design, extraction indices based on linguistic design and extraction indices based on cognitive-science design.
2. The method for extracting Chinese professional terms integrating rules and statistical features according to claim 1, wherein the extraction indices based on statistical design are as follows: thresholds are set for word frequency and number of entries respectively, and candidate professional terms within the threshold ranges are regarded as words; pointwise mutual information and lexical degree of freedom are then used to further judge whether a word constitutes a professional term;
the pointwise mutual information between the sub-words of a candidate professional term is calculated as follows:
$\mathrm{PMI}(x, y) = \log \dfrac{p(w)}{p(x)\,p(y)}$
in which $x$ and $y$ denote the sub-words constituting the candidate professional term; $w$ denotes the candidate professional term; $p(w)$ denotes the frequency with which the candidate professional term $w$ occurs in the input professional text; and $p(x)$ and $p(y)$ denote the frequencies with which the sub-words $x$ and $y$ occur in the input professional text;
each frequency is calculated as $p(w) = \mathrm{count}(w) / N$, where $\mathrm{count}(w)$ denotes the number of occurrences of the candidate professional term in the input professional text and $N$ denotes the length of the professional text;
terms and noise are distinguished by the lexical degree of freedom, which comprises the left lexical degree of freedom and the right lexical degree of freedom;
the left lexical degree of freedom is calculated as follows:
in the formula, the symbols denote, respectively: the candidate professional concept; the total number of all words appearing on its left; and the number of times a given word appears on its left;
the right lexical degree of freedom is calculated as follows:
in the formula, the symbols denote, respectively: the candidate professional concept; the total number of all words appearing on its right; and the number of times the i-th kind of word appears on its right;
the extraction indices based on linguistic design are as follows: candidate professional terms are screened by part-of-speech rules; for a two-word term, at least one of the two sub-words composing it must have passed the screening for class A professional terms;
the extraction indices based on cognitive-science design are as follows: a large-scale general corpus is prepared and compared with the professional text corpus from which professional terms are to be extracted, the statistical differences of the candidate professional terms between the two corpora are analyzed, and the professional character of the candidate professional terms is judged accordingly:
relative specificity is used to judge how specific a candidate professional term is compared with the general corpus;
specifically, for word frequency, a first threshold is set on the general text, and candidate professional terms whose number of occurrences is less than or equal to the first threshold are retained;
for the pointwise mutual information value, candidate professional terms whose pointwise mutual information value is greater than their pointwise mutual information value in the general text are retained;
for the occurrence probability, a second threshold is set, and candidate professional terms for which the ratio of the occurrence probability in the professional text to the occurrence probability in the general text is greater than the second threshold are retained;
feature necessity is used to integrate the various extraction indices, and a comprehensive professional-term score and accompanying rules are designed according to the relative importance of the indices, the comprehensive score formula being as follows:
in the formula, CWS is the comprehensive professional-term score, and the remaining symbols denote, respectively, the word frequency, the pointwise mutual information value, and the left and right lexical degrees of freedom;
the rule is that both the pointwise mutual information value and the comprehensive score of a candidate term must rank within a fixed number of top positions among all candidate terms; an automatic intelligent data-labeling model is finally obtained.
3. The method for extracting Chinese professional terms integrating rules and statistical features according to claim 1, further comprising S4: evaluating the extracted professional terms using quantitative evaluation criteria, the evaluation criteria comprising: criteria based on statistical features and criteria designed from basic rules of cognitive science;
the criterion based on statistical features rests on the assumption that a professional term occurs more frequently in professional text than in general text, and uses the relative frequency of a word in a given sample as the statistic:
$P(w \mid d) = \dfrac{\mathrm{count}(w)}{|d|}$
in which $w$ denotes a word; $d$ denotes a professional document; $\mathrm{count}(w)$ denotes the number of occurrences of the word $w$; and $|d|$ denotes the total length of the document;
the evaluation criteria designed from basic rules of cognitive science comprise two formulas:
in the two formulas, the symbols denote, respectively: the professional terms extracted by the automatic intelligent data-labeling model; a general concept obtained by random sampling; the professional text; and the general text.
4. The method for extracting Chinese professional terms integrating rules and statistical features according to claim 3, wherein the evaluation criteria in S4 further comprise: a criterion based on context features, namely representing words by embedded word vectors and performing dimensionality-reduction visualization on the embedded word vectors.
5. The method for extracting Chinese professional terms integrating rules and statistical features according to claim 4, wherein the dimensionality-reduction visualization of the embedded word vectors comprises:
computing word embeddings with a Chinese BERT model for the extracted professional terms and for the general concepts obtained by random sampling, so that words are represented by embedded word vectors, and then performing dimensionality-reduction visualization on the embedded word vectors with t-SNE.
6. A Chinese professional term extraction system for implementing the method for extracting Chinese professional terms integrating rules and statistical features according to any one of claims 1 to 5, comprising: a term discovery module for implementing the term discovery and a term screening module for implementing the term screening; the system further comprises an evaluation module that evaluates the extracted professional terms using quantitative evaluation criteria.
CN202310973797.2A 2023-08-04 2023-08-04 Chinese professional term extraction method and system integrating rules and statistical features Active CN116702786B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310973797.2A CN116702786B (en) 2023-08-04 2023-08-04 Chinese professional term extraction method and system integrating rules and statistical features

Publications (2)

Publication Number Publication Date
CN116702786A true CN116702786A (en) 2023-09-05
CN116702786B CN116702786B (en) 2023-11-17

Family

ID=87831450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310973797.2A Active CN116702786B (en) 2023-08-04 2023-08-04 Chinese professional term extraction method and system integrating rules and statistical features

Country Status (1)

Country Link
CN (1) CN116702786B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0619968A (en) * 1991-09-13 1994-01-28 Oki Electric Ind Co Ltd Automatic extraction device for technical term
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103309852A (en) * 2013-06-14 2013-09-18 瑞达信息安全产业股份有限公司 Method for discovering compound words in specific field based on statistics and rules
JP2016164724A (en) * 2015-03-06 2016-09-08 株式会社東芝 Vocabulary knowledge acquisition device, vocabulary knowledge acquisition method, and vocabulary knowledge acquisition program
CN105426539A (en) * 2015-12-23 2016-03-23 成都电科心通捷信科技有限公司 Dictionary-based Lucene Chinese word segmentation method
CN108038106A (en) * 2017-12-22 2018-05-15 北京工业大学 Fine-grained domain term self-learning method based on contextual semantics
CN108287825A (en) * 2018-01-05 2018-07-17 中译语通科技股份有限公司 Term recognition and extraction method and system
US20210256049A1 (en) * 2020-02-17 2021-08-19 International Business Machines Corporation Descriptor Uniqueness for Entity Clustering
US20230111582A1 (en) * 2020-09-22 2023-04-13 Tencent Technology (Shenzhen) Company Limited Text mining method based on artificial intelligence, related apparatus and device
CN112966508A (en) * 2021-04-05 2021-06-15 集智学园(北京)科技有限公司 General automatic term extraction method
CN113191147A (en) * 2021-05-27 2021-07-30 中国人民解放军军事科学院评估论证研究中心 Unsupervised automatic term extraction method, apparatus, device and medium
CN113343683A (en) * 2021-06-18 2021-09-03 山东大学 Chinese new word discovery method and device integrating autoencoder and adversarial training
CN114154484A (en) * 2021-11-12 2022-03-08 中国长江三峡集团有限公司 Construction professional term library intelligent construction method based on mixed depth semantic mining
CN114036933A (en) * 2022-01-10 2022-02-11 湖南工商大学 Information extraction method based on legal documents
CN114528835A (en) * 2022-02-17 2022-05-24 杭州量知数据科技有限公司 Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN114912449A (en) * 2022-07-18 2022-08-16 山东大学 Technical feature keyword extraction method and system based on code description text

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WU, HUANQIN et al.: "Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning", THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE *
YUAN, YU et al.: "Supervised Learning for Robust Term Extraction", 2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP) *
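史东娜: "Research on a Domain-Specific Term Extraction Algorithm Based on Semi-Supervised Learning", China Master's Theses Full-text Database, Information Science and Technology *
樊梦佳; 段东圣; 杜翠兰; 张仰森; 佟玲玲: "Domain Term Extraction Algorithm Fusing Statistics and Rules", Application Research of Computers, no. 08 *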

Also Published As

Publication number Publication date
CN116702786B (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
Ruiz-Casado et al. Automatic extraction of semantic relationships for wordnet by means of pattern learning from wikipedia
RU2662688C1 (en) Extraction of information from semantic blocks of documents using micromodels based on an ontology
CN110502642B (en) Entity relation extraction method based on dependency syntactic analysis and rules
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
RU2679988C1 (en) Extracting information objects with the help of a classifier combination
CN112069298A (en) Human-computer interaction method, device and medium based on semantic web and intention recognition
Oudah et al. NERA 2.0: Improving coverage and performance of rule-based named entity recognition for Arabic
CN110502744B (en) Text emotion recognition method and device for historical park evaluation
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN113268569B (en) Semantic-based related word searching method and device, electronic equipment and storage medium
CN108681574A (en) Non-factoid question answering answer selection method and system based on text summarization
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
CN107315734A (en) Method and system for variant word normalization based on time window and semantics
CN113312922B (en) Improved chapter-level triple information extraction method
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN112733547A (en) Chinese question semantic understanding method by utilizing semantic dependency analysis
CN111159342A (en) Park text comment emotion scoring method based on machine learning
CN114065760B (en) Legal text class case retrieval method and system based on pre-training language model
CN113297842A (en) Text data enhancement method
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN115525763A (en) Emotion analysis method based on improved SO-PMI algorithm and fusion word vector
KR102206781B1 (en) Method of fake news evaluation based on knowledge-based inference, recording medium and apparatus for performing the method
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN114547303A (en) Text multi-feature classification method and device based on Bert-LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant