CN106815209B - Uygur agricultural technical term identification method - Google Patents

Uygur agricultural technical term identification method

Info

Publication number
CN106815209B
CN106815209B (application CN201510895066.6A)
Authority
CN
China
Prior art keywords: word, state, term, string, file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510895066.6A
Other languages
Chinese (zh)
Other versions
CN106815209A (en)
Inventor
张海军 (Zhang Haijun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201510895066.6A priority Critical patent/CN106815209B/en
Publication of CN106815209A publication Critical patent/CN106815209A/en
Application granted granted Critical
Publication of CN106815209B publication Critical patent/CN106815209B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/263 - Language identification
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Uygur agricultural technical term identification method, and relates to the technical field of computer applications. The method comprises the following steps: counting the word-string frequency and C_value of the words in a Uygur-language corpus, selecting the words whose C_value meets the C_value threshold as anchor candidate terms, and computing the statistical features of the anchor candidate terms; performing part-of-speech tagging and stem/suffix segmentation on all words in the corpus to obtain linguistic features; and integrating the statistical features and the linguistic features with a finite state automaton to construct a state transition matrix, so that agricultural technical terms are identified automatically under the control of the finite state automaton. The invention improves the accuracy of technical term identification in the Uygur agricultural domain by 4 percent and the recall by about 3 percent, filling the gap in Uygur agricultural-domain term identification.

Description

Uygur agricultural technical term identification method
Technical Field
The invention relates to the technical field of computer application, in particular to a Uyghur agricultural technical term identification method.
Background
At present there is no method for automatically identifying Uygur agricultural-domain terms. Existing methods for identifying Uygur terms in domains other than agriculture adopt rule-based methods, statistics-based methods, or a combination of the two. Because these methods do not fully exploit the linguistic knowledge features that arise from the rich morphological variation of Uygur as an agglutinative language, a large amount of annotated corpus data is required during identification, and the identification effect depends excessively on the scale and quality of that annotated data; as a result, the automatic identification of domain terms is poor and inefficient. At the same time, because domain features based on linguistic knowledge are not sufficiently applied in the existing methods for identifying terms in other domains, the domain specificity of term extraction is weak. Moreover, a unified framework that integrates statistical features and linguistic knowledge features for automatic term recognition is lacking, and the various features are used in an ad hoc manner, leading to a poor overall recognition effect.
Disclosure of Invention
The present invention is directed to providing a method for identifying Uyghur agricultural technical terms, thereby solving the aforementioned problems in the prior art.
In order to achieve the above object, the present invention provides a method for identifying Uygur agricultural technical terms, comprising the following steps:
S1, counting the word-string frequency and C_value of the words in a Uygur-language corpus, selecting the words whose C_value meets the C_value threshold as anchor candidate terms, and computing the statistical features of the anchor candidate terms;
the statistical features include: string frequency, C_value, left and right entropy, mutual information and inverse document frequency;
S2, performing part-of-speech tagging and stem/suffix segmentation on all words in the corpus to obtain linguistic features; the linguistic features include: the stem and suffix features and the part-of-speech collocation features within multi-word terms;
S3, integrating the statistical features and the linguistic features with a finite state automaton, constructing a state transition matrix, and realizing automatic identification of agricultural technical terms under the control of the finite state automaton.
Preferably, in step S1, the C_value of the anchor candidate term is calculated according to formula (1):

C_value(a) = log2(|a|) * f(a), if a is not contained in any longer candidate string;
C_value(a) = log2(|a|) * ( f(a) - (1 / P(T_a)) * Σ_{b ∈ T_a} f(b) ), otherwise,    (1)

wherein C_value(a) represents the C_value of the anchor-candidate multi-word string, a represents the multi-word string of the anchor candidate term, |a| represents the length of the multi-word string, f(a) represents the frequency of occurrence of the multi-word string in the entire corpus, T_a represents the set of multi-word strings that contain the multi-word string a as a substring, and P(T_a) represents the number of elements in the set T_a.
Preferably, in step S1, the mutual information of the anchor candidate terms is calculated according to formula (2):

MI(x, y) = log2( p(x, y) / ( p(x) * p(y) ) )    (2)

wherein x and y respectively represent two strings, MI(x, y) represents the mutual information of the string x and the string y, p(x) and p(y) represent the probabilities of the string x and the string y appearing in the corpus, and p(x, y) represents the probability that the strings x and y co-occur in the corpus as a whole.
Preferably, in step S1, the left and right entropy of the anchor candidate term is calculated as follows:
A1, using a layer-by-layer pruning method for extracting repeated patterns from a large-scale corpus, counting the frequency of the candidate word strings of a given length in the corpus, sorting them, and storing the result in a file F0;
using the same layer-by-layer pruning method, extracting from the corpus the word strings that are one word longer than those in F0, then counting their frequencies and sorting them in turn, and storing the processing result in a file F1;
after removing the first character of every string in file F1, sorting, merging and frequency counting are carried out in turn and the processing result is stored in a file F2;
after removing the last character of every string in F1, sorting, merging and frequency counting are carried out in turn and the processing result is stored in F3; the left entropy and the right entropy of the strings recorded in file F0 are then calculated through A2 and A3, respectively;
A2, reading the current record R of file F0 and the current record R' of file F2, and calculating the left entropy of the strings in file F0 as follows:
judging whether R is equal to R'; if so, entering A21; if not, entering A22;
A21, calculating the entropy contributed by the tail character of R' to the pattern R, advancing the pointer of F2 by 1, reading the new current record R', and repeating step A21 until F2 reaches the end of the file, thereby completing the calculation of the left entropy of all the strings in file F0;
A22, ending the left-entropy calculation of the current pattern R, advancing the pointer of F0 by 1, and returning to A2 to start calculating the left entropy of the new current string of file F0;
A23, reopening file F0 and opening file F3, then starting the fast calculation of the right entropy of the strings in F0:
A3, reading the current record R of file F0 and the current record R' of F3, and calculating the right entropy of the strings in file F0 as follows:
judging whether R is equal to R'; if so, entering A31; if not, entering A32;
A31, calculating the entropy contributed by the tail character of R' to the pattern R, advancing the pointer of F3 by 1, reading the new current record R', and repeating step A31 until F3 reaches the end of the file, thereby completing the calculation of the right entropy of all the strings in file F0;
A32, ending the right-entropy calculation of the current pattern R, advancing the pointer of F0 by 1, and returning to A3 to start calculating the right entropy of the new current string of file F0.
Preferably, in step S1, the threshold is a preset threshold or a dynamic threshold calculated in the identification process.
Preferably, in step S2, the part-of-speech collocation rules within multi-word terms specifically include: A+N, N+N, V+N, V+V, A+A+N, N+A+N, V+A+N, N+C+V, V+C+V, V+C+V+N, V+D+N+N, N+C+V+N, N+A+D+N, A+N+C+V+N, V+N+C+V+N, V+N+C+A+N, wherein A represents adjectives, N represents nouns, V represents verbs, C represents conjunctions, and D represents adverbs.
Preferably, step S3 is implemented according to the following steps:
b1, based on any anchor candidate term E extracted in step S1, judging whether the stem and end-of-word characteristics of the anchor candidate term conform to preset stem and end-of-word agricultural field characteristic rules, if yes, entering B2, and if not, judging the next anchor candidate term;
B2, judging whether the inverse document frequency of the anchor candidate term E meets the inverse document frequency threshold for word-type terms; if so, entering B3; if not, returning to B1;
B3, comparing the mutual information and the left and right entropy of the anchor candidate term E with the corresponding preset thresholds;
when the mutual information is smaller than the preset mutual-information threshold and the left and right entropy are larger than the preset left-right entropy threshold, the anchor candidate term E combines only loosely with the preceding and following words, and the anchor candidate term E is a word-type term;
when the relationship between the mutual information, the left and right entropy and the corresponding preset thresholds is otherwise, the anchor candidate term E combines tightly with the preceding and following words; it is then checked whether the part-of-speech sequence of the words concerned satisfies the part-of-speech collocation rules for multi-word terms; if so, the anchor candidate term E and the preceding and following words are combined into a multi-word term, and if not, the combination of the anchor candidate term E with the preceding and following words is not an agricultural technical term.
More preferably, in step B3, the expanded string (the anchor candidate term E plus the following words, or the preceding words plus the anchor candidate term E plus the following words) contains at most 5 words.
Preferably, in step S3, the state transition matrix is configured as follows:
establishing a state transition matrix with 8 states and 5 input judgment conditions;
the 8 states are:
state 1 is the anchor-candidate-term state that has passed the C_value test;
the state 2 is a transition state which is screened by the language characteristics;
state 3 is a reject state one, indicating an unacceptable word candidate string state;
state 4 is the state of the initially selected word-type term after statistical characteristic test;
state 5 is the state in which a word-type term is expanded towards a multi-word term, in which it is checked whether the expanded string conforms to a multi-word term;
state 6 is an accept state one, indicating that an anchor candidate is identified as belonging to a word-type term;
the state 7 is an accepting state two, which indicates that the character string after being expanded is recognized as a multi-word type term;
state 8 is reject state two, indicating that the expanded string is not a Uygur agricultural term;
wherein, the state 0 is additionally set as an initial state;
the 5 input judgment conditions are as follows:
condition 1, judging whether the C_value of a corpus word string is greater than or equal to the preset C_value threshold; if so, entering state 1, and if not, entering state 3;
judging whether the word stems and the word end characteristics of the character strings of the corpus accord with word stem and word end agricultural field characteristic combinations or not under the condition 2, if so, entering a state 2, and if not, entering a state 3;
condition 3, judging whether the inverse document frequency of the corpus word string is greater than or equal to the preset inverse document frequency threshold; if so, entering state 4, and if not, entering state 3;
condition 4, judging whether the left-right entropy and mutual information combination characteristics accord with preset corresponding threshold values, if so, entering a state 5, and if not, entering a state 6;
condition 5, judging whether the character string of any word type term expanded to the front word and the back word accords with the part of speech collocation rule in the multi-word term, if so, entering a state 7; if not, state 8 is entered.
The invention has the beneficial effects that:
the simple and effective Uyghur agricultural field term identification method improves the term automatic identification effect, provides technical support for the machine translation of Uyghur and Chinese and the bilingual information retrieval of Uyghur and Chinese, and provides reference and reference for the technical research of term extraction in other fields.
On the basis of the field characteristics based on rules and statistics, the finite state automaton is used for integrating the relationship among different characteristics, a characteristic-based state transition matrix is constructed, the automatic recognition of agricultural field terms under multiple characteristics is realized, and the extraction of word-type terms and multi-word-type terms can be effectively considered.
Aiming at the shortcomings of current domain term recognition research, the invention makes two main innovations. First, it proposes stem and suffix collocation rules for domain term extraction, which serve as a linguistic-rule-based domain feature and enable fast recognition of domain terms. Second, it constructs a term-recognition state transition matrix oriented to the agricultural domain, achieving the integration of term-recognition features on the basis of a finite state automaton, providing a unified framework for term recognition, and contributing to the exploration of standardized term recognition.
The evaluation indexes for detecting the term identification effect are the accuracy and the recall rate, and experiments show that the term identification effect of the method reaches the current better level. Because no method for identifying terms in the agricultural field exists at present, compared with the current best level of other fields, the accuracy rate is improved by 4 percent, the recall rate is improved by about 3 percent, and the blank of identifying terms in the agricultural field of Uygur language is filled.
Drawings
FIG. 1 is a flow chart of the Uygur agricultural technical term identification method;
FIG. 2 is a flow chart of step S3 of the Uygur agricultural technical term identification method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
The invention is mainly aimed at recognizing two types of domain terms: the first type is word-type terms, and the second type is multi-word-type terms. The technical scheme adopted by the invention applies a finite state automaton to integrate linguistic knowledge features and statistical features, constructs a state transition matrix for term recognition, and realizes domain term recognition step by step under the control of a main program controller. In the specific processing, the various features and rules are used step by step and in combination to recognize terms for a specific word. The identification process comprises the following steps:
S1, counting the word-string frequency and C_value of the words in a Uygur-language corpus, selecting the words whose C_value meets the C_value threshold as anchor candidate terms, and computing the statistical features of the anchor candidate terms; the statistical features include: string frequency, C_value, left and right entropy, mutual information and inverse document frequency;
S2, performing part-of-speech tagging and stem/suffix segmentation on all words in the corpus to obtain linguistic features; the linguistic features include: the stem and suffix features and the part-of-speech collocation features within multi-word terms;
S3, integrating the statistical features and the linguistic features with a finite state automaton, constructing a state transition matrix, and realizing automatic identification of agricultural technical terms under the control of the finite state automaton.
In more detail:
the threshold value in this application is either a predetermined threshold value or a dynamic threshold value calculated during the identification process.
1. C_value in step S1
The C_value is a measure of the termhood of a candidate. By combining the length and frequency of candidate strings with the mutual nesting relationships among them, it provides a statistical means of computing the domain relevance of a term, so that candidate domain terms can be extracted effectively.
The C_value of the anchor candidate term is calculated according to equation (1):

C_value(a) = log2(|a|) * f(a), if a is not contained in any longer candidate string;
C_value(a) = log2(|a|) * ( f(a) - (1 / P(T_a)) * Σ_{b ∈ T_a} f(b) ), otherwise,    (1)

wherein C_value(a) represents the C_value of the anchor-candidate multi-word string, a represents the multi-word string of the anchor candidate term, |a| represents the length of the multi-word string, f(a) represents the frequency of occurrence of the multi-word string in the entire corpus, T_a represents the set of multi-word strings that contain the multi-word string a as a substring, and P(T_a) represents the number of elements in the set T_a.
A great deal of existing research shows that the C_value is an effective index for detecting the termhood of candidate words: it handles the measurement problems of term length and nested terms well and has been used in many studies for term extraction and filtering. This index is applied here to extract the candidate terms, and the specific thresholds differ across domains, document collections and corpus sizes. The C_value calculation is the first step of the whole term extraction: it yields the candidate terms that serve as anchors for the entire extraction process, on the basis of which the relevant statistics are then gathered and computed. The statistics to be calculated include the mutual information, the left and right entropy, and the inverse document frequency of the words to the left and right of the anchor word. These statistics are mainly used to delimit word-type terms among the candidates and for the statistical detection of multi-word terms.
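As an illustration of how the anchor candidates can be scored, the following Python sketch computes the C_value of each candidate word string from its corpus frequency and its nesting relations. It is only a sketch under assumptions: the dictionary shape and the fall-back to the non-nested case are choices made here, since the patent text gives only the variable definitions of formula (1).

```python
import math
from collections import defaultdict

def c_value_scores(candidates):
    """C_value for candidate multi-word strings.

    `candidates` maps a candidate (tuple of words) to its corpus frequency f(a).
    For each candidate a, T_a is the set of longer candidates containing a as a
    contiguous substring; P(T_a) is its size.
    """
    cands = list(candidates)
    nested_in = defaultdict(list)          # a -> longer candidates that contain a
    for a in cands:
        for b in cands:
            if len(b) > len(a) and any(b[i:i + len(a)] == a
                                       for i in range(len(b) - len(a) + 1)):
                nested_in[a].append(b)

    scores = {}
    for a, f_a in candidates.items():
        T_a = nested_in[a]
        if T_a:                             # nested case of formula (1)
            f_a = f_a - sum(candidates[b] for b in T_a) / len(T_a)
        scores[a] = math.log2(len(a)) * f_a
    return scores
```

A candidate whose score meets the C_value threshold then becomes an anchor candidate term for the later steps.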
2. Left-right entropy in step S1
The left entropy and the right entropy measure how flexibly a string collocates with its context: the more flexibly it combines with surrounding words, the more likely it is to function as a unit, so the strength of its internal cohesion is measured through the flexibility of its external usage. In the present invention they are mainly used to measure the binding strength of the word combinations in multi-word terms, i.e. the probability that a combination appears as a whole.
The left and right entropy of the anchor candidate term is calculated as follows:
A1, using a layer-by-layer pruning method for extracting repeated patterns from a large-scale corpus, counting the frequency of the candidate word strings of a given length in the corpus, sorting them, and storing the result in a file F0;
using the same layer-by-layer pruning method, extracting from the corpus the word strings that are one word longer than those in F0, then counting their frequencies and sorting them in turn, and storing the processing result in a file F1;
after removing the first character of every string in file F1, sorting, merging and frequency counting are carried out in turn and the processing result is stored in a file F2;
after removing the last character of every string in F1, sorting, merging and frequency counting are carried out in turn and the processing result is stored in F3; the left entropy and the right entropy of the strings recorded in file F0 are then calculated through A2 and A3, respectively;
A2, reading the current record R of file F0 and the current record R' of file F2, and calculating the left entropy of the strings in file F0 as follows:
judging whether R is equal to R'; if so, entering A21; if not, entering A22;
A21, calculating the entropy contributed by the tail character of R' to the pattern R, advancing the pointer of F2 by 1, reading the new current record R', and repeating step A21 until F2 reaches the end of the file, thereby completing the calculation of the left entropy of all the strings in file F0;
A22, ending the left-entropy calculation of the current pattern R, advancing the pointer of F0 by 1, and returning to A2 to start calculating the left entropy of the new current string of file F0;
A23, reopening file F0 and opening file F3, then starting the fast calculation of the right entropy of the strings in F0:
A3, reading the current record R of file F0 and the current record R' of F3, and calculating the right entropy of the strings in file F0 as follows:
judging whether R is equal to R'; if so, entering A31; if not, entering A32;
A31, calculating the entropy contributed by the tail character of R' to the pattern R, advancing the pointer of F3 by 1, reading the new current record R', and repeating step A31 until F3 reaches the end of the file, thereby completing the calculation of the right entropy of all the strings in file F0;
A32, ending the right-entropy calculation of the current pattern R, advancing the pointer of F0 by 1, and returning to A3 to start calculating the right entropy of the new current string of file F0.
Computing the left and right entropy directly from their definitions is inefficient and seriously slows down term recognition. The calculation method above computes the left and right entropy of the candidate strings effectively: its running time is linear in the corpus size and independent of the number of strings to be evaluated, which greatly improves the efficiency of the left-right entropy calculation.
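For reference, a compact in-memory sketch of the quantity being computed is given below: it counts the tokens immediately to the left and right of every occurrence of a pattern and takes the Shannon entropy of each distribution. The patent's file-based variant (sorted files F0 to F3 scanned with synchronised pointers) computes the same values while staying linear in the corpus size; this sketch trades that efficiency for brevity.

```python
import math
from collections import Counter

def left_right_entropy(tokens, pattern):
    """Left and right entropy of `pattern` (a tuple of words) over a token list."""
    n = len(pattern)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == pattern:
            if i > 0:
                left[tokens[i - 1]] += 1          # token adjoining on the left
            if i + n < len(tokens):
                right[tokens[i + n]] += 1         # token adjoining on the right

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counter.values())

    return entropy(left), entropy(right)
```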
3. The mutual information in step S1
Mutual information measures the degree of association between two variables and is a measure of their correlation. In agricultural-domain term recognition it is used to detect the degree of association between two Uygur words, and it serves as another important measure for multi-word term detection.
The mutual information of the anchor candidate terms is calculated according to formula (2):

MI(x, y) = log2( p(x, y) / ( p(x) * p(y) ) )    (2)

wherein x and y respectively represent two strings, MI(x, y) represents the mutual information of the string x and the string y, p(x) and p(y) represent the probabilities of the string x and the string y appearing in the corpus, and p(x, y) represents the probability that the strings x and y co-occur in the corpus as a whole.
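Formula (2) translates directly into code; the frequencies and corpus size passed in are whatever counts were gathered in step S1.

```python
import math

def mutual_information(freq_x, freq_y, freq_xy, corpus_size):
    """MI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with probabilities
    estimated from corpus frequencies."""
    p_x = freq_x / corpus_size
    p_y = freq_y / corpus_size
    p_xy = freq_xy / corpus_size
    return math.log2(p_xy / (p_x * p_y))
```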
4. The inverse document frequency in step S1
The inverse document frequency measures the domain discriminability of a candidate term, i.e. how much the candidate term contributes to distinguishing documents: if a candidate term appears in many documents, its contribution to discriminating among documents is very small; otherwise its contribution is large. It is computed as the logarithm of the reciprocal of the candidate term's document frequency.
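A minimal sketch of this measure, assuming documents are given as lists or sets of candidate terms:

```python
import math

def inverse_document_frequency(term, documents):
    """Logarithm of the reciprocal of the term's document frequency: log(N / df)."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        return 0.0                      # unseen term contributes nothing
    return math.log(len(documents) / df)
```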
(II) String tagging, stem and suffix features, and part-of-speech collocation rules for multi-word terms
1. Tagging refers to part-of-speech tagging of the words in the corpus, i.e. labelling the grammatical category of each word, such as noun, verb or adjective.
2. Stem and suffix segmentation splits Uygur words into stems and suffixes. Uygur content words are formed by attaching suffixes to a stem: the stem carries the core meaning of the word, while the suffixes include inflectional suffixes and derivational suffixes. The suffixes must be segmented and analysed before term recognition so that the linguistic knowledge features and domain features can be applied.
The stem and suffix rule conditions are the specific suffixes that characterize domain terms, together with the stem classes they attach to, summarized from earlier research. This rule supplements the statistical features as a specific linguistic knowledge rule. That research shows that the combination patterns between stems and suffixes have a strong domain correlation, so they can be used as a linguistic knowledge feature for recognizing domain terms, as sketched below.
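The sketch below shows how such a rule table can be applied. The rule entries here are placeholders, since the patent's actual stem-class and suffix combinations come from its earlier corpus study and are not listed in the text.

```python
# Placeholder stem-class / suffix-class pairs standing in for the patent's
# agricultural-domain stem and suffix feature rules (hypothetical values).
AGRI_STEM_SUFFIX_RULES = {
    ("noun_stem", "derivational_suffix_1"),
    ("verb_stem", "nominalizing_suffix_2"),
}

def stem_suffix_rule_ok(stem_class, suffix_class):
    """True if the (stem class, suffix class) pair is a known domain feature."""
    return (stem_class, suffix_class) in AGRI_STEM_SUFFIX_RULES
```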
3. The part-of-speech collocation rules are the part-of-speech collocation relationships among the words of multi-word domain terms, summarized from a large body of earlier research. These collocation relationships filter terms at the level of linguistic rules; they are simple in structure and give high accuracy and efficiency. Since terms generally consist only of content words, a specific content-word combination is treated as a part-of-speech collocation sequence, for example noun + noun or adjective + noun.
A multi-word term must satisfy the part-of-speech collocation relationships among its words in addition to the statistical features. In the present application, the part-of-speech collocation rules within multi-word terms specifically include: A+N, N+N, V+N, V+V, A+A+N, N+A+N, V+A+N, N+C+V, V+C+V, V+C+V+N, V+D+N+N, N+C+V+N, N+A+D+N, A+N+C+V+N, V+N+C+V+N, V+N+C+A+N, wherein A represents adjectives, N represents nouns, V represents verbs, C represents conjunctions, and D represents adverbs.
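The collocation check itself reduces to a set-membership test over the patterns listed above:

```python
# Part-of-speech collocation patterns listed above for multi-word terms
# (A = adjective, N = noun, V = verb, C = conjunction, D = adverb).
MULTIWORD_POS_PATTERNS = {
    ("A", "N"), ("N", "N"), ("V", "N"), ("V", "V"),
    ("A", "A", "N"), ("N", "A", "N"), ("V", "A", "N"),
    ("N", "C", "V"), ("V", "C", "V"),
    ("V", "C", "V", "N"), ("V", "D", "N", "N"), ("N", "C", "V", "N"),
    ("N", "A", "D", "N"),
    ("A", "N", "C", "V", "N"), ("V", "N", "C", "V", "N"), ("V", "N", "C", "A", "N"),
}

def matches_pos_pattern(pos_tags):
    """True if the POS sequence of a candidate multi-word string is an allowed pattern."""
    return tuple(pos_tags) in MULTIWORD_POS_PATTERNS
```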
Step S3 is specifically implemented as follows:
b1, based on any anchor candidate term E extracted in step S1, judging whether the stem and end-of-word characteristics of the anchor candidate term conform to preset stem and end-of-word agricultural field characteristic rules, if yes, entering B2, and if not, judging the next anchor candidate term;
B2, judging whether the inverse document frequency of the anchor candidate term E meets the inverse document frequency threshold for word-type terms; if so, entering B3; if not, returning to B1;
B3, comparing the mutual information and the left and right entropy of the anchor candidate term E with the corresponding preset thresholds;
when the mutual information is smaller than the preset mutual-information threshold and the left and right entropy are larger than the preset left-right entropy threshold, the anchor candidate term E combines only loosely with the preceding and following words, and the anchor candidate term E is a word-type term;
when the relationship between the mutual information, the left and right entropy and the corresponding preset thresholds is otherwise, the anchor candidate term E combines tightly with the preceding and following words; it is then checked whether the part-of-speech sequence of the words concerned satisfies the part-of-speech collocation rules for multi-word terms; if so, the anchor candidate term E and the preceding and following words are combined into a multi-word term, and if not, the combination of the anchor candidate term E with the preceding and following words is not an agricultural technical term. In step B3, the expanded string (the anchor candidate term E plus the following words, or the preceding words plus the anchor candidate term E plus the following words) contains at most 5 words.
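Steps B1 to B3 amount to a cascade of feature tests per anchor candidate. The sketch below wires those tests together; the attribute and threshold names are illustrative rather than taken from the patent, and `matches_pos_pattern` refers to the collocation check sketched earlier.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AnchorCandidate:
    stem_suffix_ok: bool                 # B1: stem/suffix domain feature rule
    idf: float                           # B2: inverse document frequency
    mi_with_neighbours: float            # B3: mutual information with adjacent words
    left_entropy: float
    right_entropy: float
    expanded_pos_tags: Tuple[str, ...]   # POS tags of candidate plus neighbours

def classify(cand, th):
    """Return 'word term', 'multi-word term' or 'not a term' for one candidate.
    `th` is a dict of the thresholds named in the patent (values assumed)."""
    if not cand.stem_suffix_ok:                       # B1
        return "not a term"
    if cand.idf < th["idf"]:                          # B2
        return "not a term"
    loosely_bound = (cand.mi_with_neighbours < th["mi"]
                     and cand.left_entropy > th["entropy"]
                     and cand.right_entropy > th["entropy"])
    if loosely_bound:                                 # B3: weak binding to neighbours
        return "word term"
    if matches_pos_pattern(cand.expanded_pos_tags):   # strong binding, check POS rule
        return "multi-word term"
    return "not a term"
```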
1. The finite state automaton here is a deterministic finite state automaton (see FIG. 2): in any current state, a given input always leads to exactly one next state. This determinism makes it straightforward for the program to decide the next state from the input it observes.
2. Referring to Table 1, the state transition matrix is the control matrix that implements the state transitions between the current state and the input features; it stores the relationships between the recognition states and the input features in tabular form.
In step S3, the state transition matrix is constructed as follows:
establishing a state transition matrix with 8 states and 5 input judgment conditions, controlling the automatic operation of the finite state automaton and implementing field term recognition based on statistics and language knowledge rules; more specifically:
the 8 states are:
state 1 is the anchor-candidate-term state that has passed the C_value test;
the state 2 is a transition state which is screened by the language characteristics;
state 3 is a reject state one, indicating an unacceptable word candidate string state;
state 4 is the state of the initially selected word-type term after statistical characteristic test;
state 5 is the state in which a word-type term is expanded towards a multi-word term, in which it is checked whether the expanded string conforms to a multi-word term;
state 6 is an accept state one, indicating that an anchor candidate is identified as belonging to a word-type term;
the state 7 is an accepting state two, which indicates that the character string after being expanded is recognized as a multi-word type term;
state 8 is reject state two, indicating that the expanded string is not a Uygur agricultural term;
wherein, the state 0 is additionally set as an initial state;
the 5 input judgment conditions are as follows:
condition 1, judging whether the C_value of a corpus word string is greater than or equal to the preset C_value threshold; if so, entering state 1, and if not, entering state 3;
judging whether the word stems and the word end characteristics of the character strings of the corpus accord with word stem and word end agricultural field characteristic combinations or not under the condition 2, if so, entering a state 2, and if not, entering a state 3;
condition 3, judging whether the inverse document frequency of the corpus word string is greater than or equal to the preset inverse document frequency threshold; if so, entering state 4, and if not, entering state 3;
condition 4, judging whether the left-right entropy and mutual information combination characteristics accord with preset corresponding threshold values, if so, entering a state 5, and if not, entering a state 6;
condition 5, judging whether the character string of any word type term expanded to the front word and the back word accords with the part of speech collocation rule in the multi-word term, if so, entering a state 7; if not, state 8 is entered.
Table 1: state transition matrix of the 8 states and the 5 input judgment conditions (the table itself is provided as an image in the original publication and is not reproduced here).
State transition matrix logic flow: state 0 is the initial state; state 1 is the candidate-term state that has passed the C_value test; state 2 is the transition state that has passed the linguistic-knowledge domain feature screening, and the strings that satisfy it are domain candidate terms; in states 0, 1 and 2, if the corresponding input judgment criterion is not met, the automaton moves to state 3, a reject state representing an unacceptable candidate word string; state 4 is the state of a candidate that has passed the statistical-feature test, is essentially confirmed as a word-type term and can serve as the basis for detecting multi-word terms; state 6 is an accept state indicating that a word-type term has been recognized; state 5 is the process of expanding a single-word term towards a multi-word term by gradually extending it to the left and right, and if the statistical criterion is met, the part-of-speech collocation rule condition is tested, the strings that satisfy it are taken as multi-word terms and enter state 7, which is also an accept state indicating that a multi-word term is accepted; state 8 is another reject state, indicating that the multi-word string cannot be accepted as a term.
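Assuming the transitions follow the textual description above (the original Table 1 is an image), the automaton can be driven by a small transition table; `check(n)` stands for evaluating input judgment condition n for the current candidate string.

```python
ACCEPT = {6, 7}     # 6: word-type term accepted, 7: multi-word term accepted
REJECT = {3, 8}     # 3: candidate rejected,      8: expanded string rejected

# state -> (condition to test, next state if satisfied, next state if not)
TRANSITIONS = {
    0: (1, 1, 3),   # condition 1: C_value threshold
    1: (2, 2, 3),   # condition 2: stem/suffix domain feature
    2: (3, 4, 3),   # condition 3: inverse document frequency threshold
    4: (4, 5, 6),   # condition 4: mutual information + left/right entropy
    5: (5, 7, 8),   # condition 5: POS collocation rule of the expanded string
}

def run_automaton(check):
    """Run the recognizer; `check(n)` returns True when condition n is met."""
    state = 0
    while state not in ACCEPT and state not in REJECT:
        cond, on_pass, on_fail = TRANSITIONS[state]
        state = on_pass if check(cond) else on_fail
    return state
```

State 6 then corresponds to accepting the anchor candidate as a word-type term, and state 7 to accepting the expanded string as a multi-word term.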
Term recognition experiments were carried out with the recognition method. In the first experiment, web pages were downloaded from the Kunlun network in July 2013, 100 web pages in the agricultural domain were extracted from them and manually annotated, and term extraction with the method of the invention achieved an accuracy of 88.2% and a recall of 77.8%. In the second experiment, 150 agricultural-domain web pages were downloaded from the Kunlun network in November 2013 and manually annotated; term extraction with the method achieved an accuracy of 88.6% and a recall of 78.1%.
By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:
the simple and effective Uyghur agricultural field term identification method improves the term automatic identification effect, provides technical support for the machine translation of Uyghur and Chinese and the bilingual information retrieval of Uyghur and Chinese, and provides reference and reference for the technical research of term extraction in other fields.
On the basis of the field characteristics based on rules and statistics, the finite state automaton is used for integrating the relationship among different characteristics, a characteristic-based state transition matrix is constructed, the automatic recognition of agricultural field terms under multiple characteristics is realized, and the extraction of word-type terms and multi-word-type terms can be effectively considered.
Aiming at the shortcomings of current domain term recognition research, the invention makes two main innovations. First, it proposes stem and suffix collocation rules for domain term extraction, which serve as a linguistic-rule-based domain feature and enable fast recognition of domain terms. Second, it constructs a term-recognition state transition matrix oriented to the agricultural domain, achieving the integration of term-recognition features on the basis of a finite state automaton, providing a unified framework for term recognition, and contributing to the exploration of standardized term recognition.
The evaluation indexes for detecting the term identification effect are the accuracy and the recall rate, and experiments show that the term identification effect of the method reaches the current better level. Because no method for identifying terms in the agricultural field exists at present, compared with the current best level of other fields, the accuracy rate is improved by 4 percent, the recall rate is improved by about 3 percent, and the blank of identifying terms in the agricultural field of Uygur language is filled.
In the invention, the knowledge features based on linguistic rules are used together with the statistical features to filter the candidate terms. Experiments show that the domain recognition rate of terms obtained with these features is 96%, a very marked effect.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims (9)

1. A Uygur agricultural technical term identification method is characterized by comprising the following steps:
S1, counting the word-string frequency and C_value of the words in a Uygur-language corpus, selecting the words whose C_value meets the C_value threshold, taking those words as anchor candidate terms, and computing the statistical features of the anchor candidate terms;
the statistical features include: string frequency, C_value, left and right entropy, mutual information and inverse document frequency;
s2, performing part-of-speech tagging and segmentation of word stems and word tails on all words in the corpus to obtain language features; the language features include: the word stem and word end characteristics and the part of speech characteristics in the multi-word terms;
and S3, integrating the statistical characteristics and the language characteristics by using a finite state automaton, constructing a state transition matrix, and realizing automatic identification of agricultural technical terms under the control of the finite state automaton.
2. The method of claim 1, wherein in step S1, the C_value of the anchor candidate term is calculated according to formula (1):

C_value(a) = log2(|a|) * f(a), if a is not contained in any longer candidate string;
C_value(a) = log2(|a|) * ( f(a) - (1 / P(T_a)) * Σ_{b ∈ T_a} f(b) ), otherwise,    (1)

wherein C_value(a) represents the C_value of the anchor-candidate multi-word string, a represents the multi-word string of the anchor candidate term, |a| represents the length of the multi-word string, f(a) represents the frequency of occurrence of the multi-word string in the entire corpus, T_a represents the set of multi-word strings that contain the multi-word string a as a substring, and P(T_a) represents the number of elements in the set T_a.
3. The method of claim 1, wherein in step S1, the mutual information of the anchor candidate terms is calculated according to formula (2):

MI(x, y) = log2( p(x, y) / ( p(x) * p(y) ) )    (2)

wherein x and y respectively represent two strings, MI(x, y) represents the mutual information of the string x and the string y, p(x) and p(y) represent the probabilities of the string x and the string y appearing in the corpus, and p(x, y) represents the probability that the strings x and y co-occur in the corpus as a whole.
4. The method of claim 1, wherein in step S1, the left-right entropy of the anchor candidate term is calculated as follows:
A1, using a layer-by-layer pruning method for extracting repeated patterns from a large-scale corpus, counting the frequency of the candidate word strings of a given length in the corpus, sorting them, and storing the result in a file F0;
using the same layer-by-layer pruning method, extracting from the corpus the word strings that are one word longer than those in F0, then counting their frequencies and sorting them in turn, and storing the processing result in a file F1;
after removing all the first characters in the file F1, sequentially carrying out sorting, merging and frequency counting processing and storing the processing result in a file F2;
after all tail characters in F1 are removed, sorting, merging and frequency counting processing are sequentially carried out, and processing results are stored in F3; then, the left entropy and the right entropy of the string recorded in the file F0 are calculated through A2 and A3 respectively;
a2, reading the current record R of the file F0, reading the current record R' of the file F2, and calculating the left entropy of the character string in the file F0 according to the following method:
judging whether R is equal to R', and if so, entering A21; if not, go to A22;
a21, calculating the entropy of the contribution of the R 'tail character to the current record R, increasing the pointer of F2 by 1, reading the current record R', and repeatedly executing the step A21 until the F2 reaches the tail of the file, thereby completing the calculation of the left entropy of all the character strings in the file F0;
a22, ending the calculation of the left entropy of the current record R, increasing the pointer F0 by 1, returning to A2 and starting to calculate the left entropy of the current string of the file F0;
a23, reopen file F0, open file F3, and begin to perform fast calculation of right entropy of the string in F0:
a3, reading the current record R of the file F0, reading the current record R' of the file F3, and calculating the right entropy of the character string in the file F0 according to the following method:
judging whether R is equal to R', and if so, entering A31; if not, go to A32;
a31, calculating the entropy of R 'tail characters on the current record R, increasing the pointer of F3 by 1, reading the current record R', and repeating the step A31 until the F3 reaches the tail of the file, thereby completing the calculation of the right entropy of all the character strings in the file F0;
a32, the calculation of the right entropy of the current record R is finished, the pointer of F0 is increased by 1, and the return A3 is started to calculate the right entropy of the current string of the file F0.
5. The method according to claim 1, wherein the threshold is a preset threshold or a dynamic threshold calculated in the identification process in step S1.
6. The method according to claim 1, wherein in step S2, the part-of-speech features within multi-word terms are specifically: A+N, N+N, V+N, V+V, A+A+N, N+A+N, V+A+N, N+C+V, V+C+V, V+C+V+N, V+D+N+N, N+C+V+N, N+A+D+N, A+N+C+V+N, V+N+C+V+N, V+N+C+A+N, wherein A represents adjectives, N represents nouns, V represents verbs, C represents conjunctions, and D represents adverbs.
7. The method according to claim 1, wherein step S3 is implemented according to the following steps:
b1, based on any anchor candidate term E extracted in step S1, judging whether the stem and end-of-word characteristics of the anchor candidate term conform to preset stem and end-of-word agricultural field characteristic rules, if yes, entering B2, and if not, judging the next anchor candidate term;
B2, judging whether the inverse document frequency of the anchor candidate term E meets the inverse document frequency threshold for word-type terms; if so, entering B3; if not, returning to B1;
B3, comparing the mutual information and the left and right entropy of the anchor candidate term E with the corresponding preset thresholds;
when the mutual information is smaller than the preset mutual-information threshold and the left and right entropy are larger than the preset left-right entropy threshold, the anchor candidate term E combines only loosely with the preceding and following words, and the anchor candidate term E is a word-type term;
when the relationship between the mutual information, the left and right entropy and the corresponding preset thresholds is otherwise, the anchor candidate term E combines tightly with the preceding and following words; it is then checked whether the part-of-speech sequence of the words concerned satisfies the part-of-speech collocation rules for multi-word terms; if so, the anchor candidate term E and the preceding and following words are combined into a multi-word term, and if not, the combination of the anchor candidate term E with the preceding and following words is not an agricultural technical term.
8. The method of claim 7, wherein in step B3, the expanded string (the anchor candidate term E plus the following words, or the preceding words plus the anchor candidate term E plus the following words) contains at most 5 words.
9. The method according to claim 1, wherein in step S3, the state transition matrix is constructed according to the following implementation:
establishing a state transition matrix with 8 states and 5 input judgment conditions;
the 8 states are:
state 1 is the anchor-candidate-term state that has passed the C_value test;
the state 2 is a transition state which is screened by the language characteristics;
state 3 is a reject state one, indicating an unacceptable word candidate string state;
state 4 is the state of the initially selected word-type term after statistical characteristic test;
state 5 is the state that any word type term is expanded to a multi-word type term, and whether the expanded word string conforms to the multi-word term is checked;
state 6 is an accept state one, indicating that an anchor candidate term is identified as a word-type term;
the state 7 is an accepting state two, which represents that the expanded word string is recognized as a multi-word type term;
state 8 is reject state two, indicating that the expanded string is not a Uygur agricultural term;
wherein, the state 0 is additionally set as an initial state;
the 5 input judgment conditions are as follows:
condition 1, judging whether the C_value of a corpus word string is greater than or equal to the preset C_value threshold; if so, entering state 1, and if not, entering state 3;
judging whether the word stems and the word end characteristics of the word strings of the corpus conform to the word stem and word end agricultural field characteristic combination or not under the condition 2, if so, entering a state 2, and if not, entering a state 3;
condition 3, judging whether the inverse document frequency of the corpus word string is greater than or equal to the preset inverse document frequency threshold; if so, entering state 4, and if not, entering state 3;
condition 4, judging whether the left-right entropy and mutual information combination characteristics accord with preset corresponding threshold values, if so, entering a state 5, and if not, entering a state 6;
condition 5, judging whether the character string of any word type term expanded to the front word and the back word accords with the part-of-speech characteristics in the multi-word term, and entering a state 7 if the character string accords with the part-of-speech characteristics; if not, state 8 is entered.
CN201510895066.6A 2015-11-30 2015-11-30 Uygur agricultural technical term identification method Active CN106815209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510895066.6A CN106815209B (en) 2015-11-30 2015-11-30 Uygur agricultural technical term identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510895066.6A CN106815209B (en) 2015-11-30 2015-11-30 Uygur agricultural technical term identification method

Publications (2)

Publication Number Publication Date
CN106815209A CN106815209A (en) 2017-06-09
CN106815209B true CN106815209B (en) 2020-03-17

Family

ID=59105782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510895066.6A Active CN106815209B (en) 2015-11-30 2015-11-30 Uygur agricultural technical term identification method

Country Status (1)

Country Link
CN (1) CN106815209B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033080B (en) * 2018-07-12 2023-03-24 上海金仕达卫宁软件科技有限公司 Medical term standardization method and system based on probability transfer matrix
CN109508365A (en) * 2018-11-01 2019-03-22 新疆大学 It is a kind of for terminology management and the analysis method of extraction
CN111968648B (en) * 2020-08-27 2021-12-24 北京字节跳动网络技术有限公司 Voice recognition method and device, readable medium and electronic equipment
CN112966508B (en) * 2021-04-05 2023-08-25 集智学园(北京)科技有限公司 Universal automatic term extraction method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271448A (en) * 2007-03-19 2008-09-24 株式会社东芝 Chinese language fundamental noun phrase recognition, its regulation generating method and apparatus
CN101751386A (en) * 2009-12-28 2010-06-23 华建机器翻译有限公司 Identification method of unknown words
JP2011085993A (en) * 2009-10-13 2011-04-28 Nippon Telegr & Teleph Corp <Ntt> Apparatus, method and program for analyzing information
CN103885931A (en) * 2012-12-19 2014-06-25 新疆信息产业有限责任公司 Method for extracting Uyghur specific terms in electric power industry based on statistic model
CN103902522A (en) * 2012-12-28 2014-07-02 新疆电力信息通信有限责任公司 Uygur language stem extracting method
CN103902525A (en) * 2012-12-28 2014-07-02 新疆电力信息通信有限责任公司 Uygur language part-of-speech tagging method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271448A (en) * 2007-03-19 2008-09-24 株式会社东芝 Chinese language fundamental noun phrase recognition, its regulation generating method and apparatus
JP2011085993A (en) * 2009-10-13 2011-04-28 Nippon Telegr & Teleph Corp <Ntt> Apparatus, method and program for analyzing information
CN101751386A (en) * 2009-12-28 2010-06-23 华建机器翻译有限公司 Identification method of unknown words
CN103885931A (en) * 2012-12-19 2014-06-25 新疆信息产业有限责任公司 Method for extracting Uyghur specific terms in electric power industry based on statistic model
CN103902522A (en) * 2012-12-28 2014-07-02 新疆电力信息通信有限责任公司 Uygur language stem extracting method
CN103902525A (en) * 2012-12-28 2014-07-02 新疆电力信息通信有限责任公司 Uygur language part-of-speech tagging method

Also Published As

Publication number Publication date
CN106815209A (en) 2017-06-09

Similar Documents

Publication Publication Date Title
CN109241530B (en) Chinese text multi-classification method based on N-gram vector and convolutional neural network
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN112214610A (en) Entity relation joint extraction method based on span and knowledge enhancement
CN107844533A (en) A kind of intelligent Answer System and analysis method
CN110543564B (en) Domain label acquisition method based on topic model
CN110688836A (en) Automatic domain dictionary construction method based on supervised learning
CN108519971B (en) Cross-language news topic similarity comparison method based on parallel corpus
CN106815209B (en) Uygur agricultural technical term identification method
CN109271524B (en) Entity linking method in knowledge base question-answering system
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN110472203B (en) Article duplicate checking and detecting method, device, equipment and storage medium
CN108363691B (en) Domain term recognition system and method for power 95598 work order
CN106776672A (en) Technology development grain figure determines method
Benzebouchi et al. Multi-classifier system for authorship verification task using word embeddings
CN112633011B (en) Research front edge identification method and device for fusing word semantics and word co-occurrence information
CN110134777A (en) Problem De-weight method, device, electronic equipment and computer readable storage medium
CN110866087B (en) Entity-oriented text emotion analysis method based on topic model
CN116304020A (en) Industrial text entity extraction method based on semantic source analysis and span characteristics
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN113535960A (en) Text classification method, device and equipment
CN113626604A (en) Webpage text classification system based on maximum interval criterion
CN117131345A (en) Multi-source data parameter evaluation method based on data deep learning calculation
CN103034657B (en) Documentation summary generates method and apparatus
CN111984790A (en) Entity relation extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant