CN106021227A - State transition and neural network-based Chinese chunk parsing method - Google Patents

State transition and neural network-based Chinese chunk parsing method

Info

Publication number
CN106021227A
Authority
CN
China
Prior art keywords
word
speech
vector
chunk
processed
Prior art date
Legal status
Granted
Application number
CN201610324281.5A
Other languages
Chinese (zh)
Other versions
CN106021227B (en
Inventor
戴新宇 (Dai Xinyu)
程川 (Cheng Chuan)
陈家骏 (Chen Jiajun)
黄书剑 (Huang Shujian)
张建兵 (Zhang Jianbing)
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201610324281.5A
Publication of CN106021227A
Application granted
Publication of CN106021227B
Legal status: Active

Classifications

    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars (G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/205 Parsing)
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking (G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/279 Recognition of textual entities)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention proposes a Chinese chunk parsing method based on state transition and a neural network. The method converts the chunk parsing task into a sequence labeling task; labels a sentence within a state-transition framework; scores the transition operations to be carried out in each state with a forward neural network during labeling; and uses the distributed representations of words and part-of-speech tags learned with a bidirectional long short-term memory neural network model as additional information features of the labeling model, thereby improving the accuracy of chunk parsing. Compared with other Chinese chunk parsing techniques, the state-transition framework allows chunk-level features to be added more flexibly, the neural network automatically learns how features combine, and the bidirectional long short-term memory model introduces useful additional information features; the combination of the three effectively improves the accuracy of chunk parsing.

Description

Chinese chunk analysis method based on state transition and neural network
Technical Field
The invention relates to a method for performing Chinese shallow syntactic analysis with a computer, and in particular to a method for automatic Chinese chunk analysis based on the combination of state transition and a neural network.
Background
Chinese syntactic analysis is a basic task in Chinese information processing, and its wide range of applications has attracted a great deal of research, driving rapid progress of the related techniques. Full syntactic analysis suffers from low accuracy and low speed because of the high complexity of the problem itself, which limits its practicality. Chunk analysis, also called shallow syntactic analysis, differs from full syntactic analysis, whose goal is the complete syntactic tree of a sentence: it aims only at identifying certain relatively simple, non-nested sentence components, such as non-nested noun phrases and verb phrases. Because the recognition targets are non-nested, non-overlapping phrase components that follow certain grammatical rules, chunk analysis has lower complexity and higher processing speed than full syntactic analysis; at the same time it can serve as a preprocessing stage for many tasks such as machine translation, full syntactic analysis and information extraction, so it has continuously received attention from researchers. With the appearance of Chinese treebanks and the data sets researchers have extracted from them for the chunking task, chunk analysis for Chinese remains an active research topic.
A common approach to modeling the chunk analysis task is to treat it as a sequence labeling task. The procedure is as follows: for the sentence to be analyzed, each word is labeled (tagged) from left to right, word by word. One labeling scheme distinguishes five kinds of tags: chunk-initial words and single-word chunks, which carry a chunk type (noun phrase, verb phrase, adjective phrase and so on), and chunk-final words, chunk-internal words and words outside any chunk, which carry no type. Once the whole sentence has been labeled in this way, the complete chunk information can be extracted from it. The present invention also treats the Chinese chunk analysis task as a sequence labeling task and adopts this five-tag scheme when modeling it.
Statistics-based methods are widely applied to the chunking task, and classical structured-learning models are commonly used for it, such as hidden Markov models, conditional random fields and dynamic-programming-based support vector machines. However, because of the assumptions built into these models, their use of chunk-level features is limited, which matters for a chunk analysis task that takes the whole sentence as its object and needs to consider more global information. A state-transition-based method is one way to relieve this limitation; it is used more often in full syntactic analysis and is both efficient and accurate. The procedure is as follows: the words of the sentence to be analyzed are read in from left to right, one at a time, and each word read in is labeled, the label types being those of the scheme above. Each labeling operation transfers the state defined over the whole sentence (a state records which words of the current sentence have been labeled, the label assigned to each labeled word, and which words are still unlabeled), and the choice of the concrete label is made by a trained scoring model. When a word is labeled, the labels of all words to its left are already determined, so the information of the labeled words, and in particular the information about the chunks already recognized to its left, can be fully exploited to guide the labeling of the current word. In order to make more use of chunk-level information features, the present invention adopts a state-transition-based approach to Chinese chunk analysis.
Neural networks are a common machine learning method and can automatically learn feature combinations from basic atomic features, unlike conventional methods, which require the user to design a large number of task-specific templates from prior knowledge such as linguistics. Neural networks have been tried extensively in Chinese information processing but, so far, not in Chinese chunk analysis. Using a neural network saves the work of manually crafting a large number of combined feature templates, since combinations of features can be learned automatically thanks to the strong expressive power of the network. On the other hand, in conventional chunking techniques the information features used when labeling a word are the words or part-of-speech tags within a fixed-size window around the current word; analysis of Chinese sentences in the treebank shows that many features useful for chunking often lie outside that window, for example punctuation information such as book-title marks, or text-pattern information such as items separated by the Chinese enumeration comma. Such information spans a wide range and is not easily incorporated into conventional chunking techniques. In order to make full use of it, the invention applies a bidirectional long short-term memory (LSTM) neural network to the word and part-of-speech sequences of the sentence, so that word and part-of-speech features at a greater distance can also be captured.
Disclosure of Invention
Purpose of the invention: the models used in existing Chinese chunking techniques cannot fully exploit chunk-level and long-distance information features and require manually crafted, complex combined feature templates. To address these shortcomings, the invention provides a method based on state transition and a neural network that relieves these limitations and improves the accuracy of Chinese chunk analysis.
To solve these technical problems, the invention discloses a Chinese chunk analysis method based on state transition and a neural network, together with an additional description of the training method for the model parameters used during analysis.
The Chinese chunk analysis method based on state transition and neural network comprises the following steps:
Step 1: a computer reads a Chinese text file containing the sentences to be analyzed; the Chinese chunk types are defined; the sentence to be analyzed is segmented into words and each word is tagged with its part of speech; and the label types that may be selected according to the current sentence state when a word is labeled are determined;
Step 2: Chinese chunk analysis is performed on the sentence to be analyzed using the method based on state transition and a neural network.
Wherein, step 1 includes the following steps:
Step 1-1: the Chinese chunk types are defined using the 12 phrase types defined on the basis of the Chinese Penn Treebank (CTB) 4.0, a treebank of Chinese corpora annotated at the University of Pennsylvania. The chunk types are chosen by the user according to the specific goal; the traditional Chinese chunking task usually comes in two variants: one recognizes only noun phrases, the other recognizes the 12 chunk types defined on the basis of CTB 4.0. Embodiment 1 takes the second approach, and the meanings of these 12 phrase types are illustrated in Table 1:
TABLE 1  Description of Chinese chunk types

Type | Meaning | Example
ADJP | adjective phrase | developing/JJ country/NN
ADVP | adverb phrase | general/AD use/VV
CLP | classifier phrase | Hong Kong dollar/M and/CC dollar/M
DNP | phrase formed with the particle 的/DEG | 的/DEG (of)
DP | determiner phrase | this/DT
DVP | phrase formed with the particle 地/DEV | equal/VA harmonious/VA 地/DEV
LCP | localizer phrase | recent years/NT coming/LC
LST | list marker | (/PU one/CD )/PU
NP | noun phrase | highway/NN project/NN
PP | preposition phrase | and/P complete machine plant/NN
QP | quantifier phrase | one/CD (measure word)/M
VP | verb phrase | permanent/AD full-on/VV

Here the tag after the slash is the part of speech of the corresponding word: "NN" in "country/NN" denotes a noun, "VV" a verb, and so on.
Step 1-2: the label types that may be selected when each word is labeled during the labeling process are determined by combining the BIOES tagging scheme with the Chinese chunk types defined in step 1-1. After the chunk analysis task has been modeled as a sequence labeling task, the tagging scheme to be adopted must be decided. In the English chunking task two schemes are generally used, BIO and BIOES; that is, each word of a sentence is labeled with a combination of a chunk type and BIO or BIOES. In the BIO scheme, B marks the beginning of a chunk, I the inside of a chunk, and O any position outside a chunk; in the BIOES scheme, B marks the beginning of a chunk, I the inside of a chunk, E the end of a chunk, O any position outside a chunk, and S a word that forms a chunk by itself. The meaning of the BIOES scheme is illustrated below with a labeled sentence. First, a sentence that has already been divided into chunks is given:
[NP Shanghai Pudong] [NP development and legal-system construction] [VP synchronization] [.]

NP indicates that the chunk is a noun phrase, VP that it is a verb phrase, and the final "." belongs to no chunk. Labeled with the BIOES scheme, the sentence takes the following form:

Shanghai_B-NP Pudong_E-NP development_B-NP and_I-NP legal-system_I-NP construction_E-NP synchronization_S-VP ._O

Note that the labels in this specification follow the BIOES scheme. Furthermore, the combination of the chunk types with BIOES is not a full cross product: only B and S are combined with every chunk type. That is, if the chunk types are type_1, type_2, …, type_k, k in total, then combining them with B and S yields B-type_1, …, B-type_k, S-type_1, …, S-type_k, 2k labels in total; adding the typeless labels I, E and O gives 2k+3 label types. With the k = 12 chunk types used here, this gives 27 label types. Labeled in this way, the example sentence becomes:

Shanghai_B-NP Pudong_E development_B-NP and_I legal-system_I construction_E synchronization_S-VP ._O
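For illustration only, the following minimal Python sketch (not part of the patent; the function and variable names are my own) shows how chunks can be recovered from a tag sequence in this reduced scheme, using the example sentence above:

    # My own illustration: recover chunks from a tag sequence in the reduced scheme,
    # where only B- and S- tags carry a chunk type and I, E, O are typeless.
    def tags_to_chunks(tags):
        """Return (chunk_type, start_index, end_index) triples, indices inclusive."""
        chunks, start, ctype = [], None, None
        for i, tag in enumerate(tags):
            if tag.startswith("S-"):                 # single-word chunk
                chunks.append((tag[2:], i, i))
            elif tag.startswith("B-"):               # chunk start carries the type
                start, ctype = i, tag[2:]
            elif tag == "E" and start is not None:   # chunk end closes the open chunk
                chunks.append((ctype, start, i))
                start, ctype = None, None
            # "I" continues an open chunk, "O" lies outside every chunk
        return chunks

    # The example sentence above (English glosses), in the reduced scheme:
    tags = ["B-NP", "E", "B-NP", "I", "I", "E", "S-VP", "O"]
    print(tags_to_chunks(tags))    # [('NP', 0, 1), ('NP', 2, 5), ('VP', 6, 6)]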
In addition, during the labeling process the generation of the candidate label types for a word is restricted by certain rules; in the invention the restrictions are as follows:

1. the first word of a sentence cannot be labeled I or E;
2. the word after a word labeled B-type_x cannot be labeled B-type_y, O or S-type_y;
3. the word after a word labeled I cannot be labeled B-type_y, O or S-type_y;
4. the word after a word labeled O cannot be labeled I or E;
5. the word after a word labeled E cannot be labeled I or E;
6. the word after a word labeled S-type_x cannot be labeled I or E.
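The label inventory and the constraints above can be made concrete with a small Python sketch; this is my own illustration under the assumption k = 12, not code from the patent:

    # My own illustration of the 2k+3 label inventory (k = 12) and the six rules above.
    CHUNK_TYPES = ["ADJP", "ADVP", "CLP", "DNP", "DP", "DVP",
                   "LCP", "LST", "NP", "PP", "QP", "VP"]

    LABELS = (["B-" + t for t in CHUNK_TYPES]
              + ["S-" + t for t in CHUNK_TYPES]
              + ["I", "E", "O"])                      # 2*12 + 3 = 27 label types

    def allowed(prev, cur):
        """True if label cur may follow label prev (prev is None for the first word)."""
        if prev is None:                              # rule 1
            return cur not in ("I", "E")
        if prev.startswith("B-") or prev == "I":      # rules 2 and 3: a chunk is still open
            return cur in ("I", "E")
        return cur not in ("I", "E")                  # rules 4, 5 and 6: no chunk is open

    assert len(LABELS) == 27
    assert allowed(None, "B-NP") and not allowed("B-NP", "O") and allowed("E", "S-VP")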
In step 1, the computer reads a natural-language text file containing the sentences to be analyzed. For Chinese chunk analysis, the required input is a sentence that has already been segmented into words and whose words have already been tagged with their parts of speech. A complete input sentence is shown, for example, in Table 2:

TABLE 2  A complete input sentence to be analyzed

Word | Part-of-speech tag
France | NR
national defense | NN
minister | NN
Léotard | NR
the 1st | NT
said | VV
, | PU
France | NR
currently | AD
studying | VV
from | P
Bosnia-Herzegovina | NR
withdraw troops | VV
的 (particle) | DEC
plan | NN
。 | PU
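The concrete file layout is not fixed by the patent; assuming, for illustration, one word and its part-of-speech tag per line with a blank line between sentences, such input could be read with the following sketch of my own (Python):

    # My own reading sketch; the one-word-and-tag-per-line, blank-line-separated
    # layout is an assumption made for illustration, not a format fixed by the patent.
    def read_sentences(path):
        sentences, current = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:                          # a blank line ends a sentence
                    if current:
                        sentences.append(current)
                        current = []
                    continue
                word, pos = line.split()              # e.g. "France NR"
                current.append((word, pos))
        if current:
            sentences.append(current)
        return sentences                              # one [(word, POS), ...] list per sentence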
Step 2: each sentence read in is chunk-analyzed using the method based on state transition and a neural network. In the state-transition-based sequence labeling method, the words of each sentence are read in from left to right, one at a time; reading a word causes one transition of the current sentence state, and a state of the sentence records which words have been labeled, the label type assigned to each labeled word, and which words are still unlabeled. If the label chosen for each word is unique, then after every word of the sentence has been labeled a complete label sequence of the sentence is obtained. The process can be described briefly as follows: suppose the sentence length is n and the initial state is s_1; labeling the t-th word with label mark_t transfers the sentence into state s_{t+1}, and the label sequence of the whole sentence is mark_1, mark_2, …, mark_n. This way of labeling is called greedy search in the present invention. However, the labeling accuracy obtained for the whole sentence in this way is low, so the invention instead completes the labeling of the whole sentence with a beam search.

Before the beam search is described in detail, exhaustive search is briefly introduced. Exhaustive search differs from greedy search in that, when a word is labeled during the search, not a single labeling result but a set of labeling results (i.e. a set of states) is kept. Let the state set of a sentence before its i-th word is labeled be denoted S_i; the state set before the first word is labeled is then S_1, which contains only one state, s_1^1. The candidate label types for the first word are defined by step 1-2; suppose that when each state in S_1 is extended by labeling the current word, k label types may be selected. After the states have been extended with all k labels, the resulting state set S_2 contains k states, denoted s_1^2, …, s_k^2 (ordered by score from high to low). Likewise, when the second word is labeled, every state in S_2 is extended in k ways, and the new state set contains k^2 states, denoted s_1^3, …, s_{k^2}^3. Continuing in this way, the extension on the n-th word yields the state set S_{n+1} of complete labelings of the whole sentence. If every extension operation (i.e. which label was applied) is remembered in the new state it produces, one can trace back from each state in S_{n+1} and recover a complete label sequence of the sentence; the sequence recovered from the highest-scoring state in S_{n+1} is the labeling result of the method for that sentence. With this search strategy the size of the state set grows very rapidly, which is infeasible in practice, so the invention uses beam search to prune the state set after each extension. Beam search differs from exhaustive search in that, after all states of the previous state set S_{t-1} have been extended, no matter how many states the new set contains, only the m highest-scoring states are kept (m is chosen by the user for the specific task; in general a larger m gives higher labeling accuracy at a higher cost; in embodiment 1, m = 4). This guarantees that the state set obtained after the extension operation for each word never exceeds m states. As in exhaustive search, the highest-scoring state of S_{n+1} is traced back, and the label sequence recovered from it is the labeling result for the sentence. This beam search is the strategy used in the present invention.
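A minimal Python sketch of this beam search follows; it is my own illustration, in which score_labels stands in for the neural scorer of step 2-1, allowed for the constraints of step 1-2, and the default m = 4 matches embodiment 1:

    # My own beam-search illustration: score_labels stands in for the neural scorer
    # of step 2-1 and allowed for the label constraints of step 1-2.
    def beam_search(words, score_labels, allowed, labels, m=4):
        beam = [(0.0, [])]                            # a state: (accumulated score, tags so far)
        for t in range(len(words)):
            expanded = []
            for total, tags in beam:
                prev = tags[-1] if tags else None
                scores = score_labels(words, tags, t) # one real-valued score per label
                for label, s in zip(labels, scores):
                    if allowed(prev, label):
                        expanded.append((total + s, tags + [label]))
            beam = sorted(expanded, key=lambda x: x[0], reverse=True)[:m]   # keep top m
        return beam[0][1]                             # tag sequence of the best final state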
The length of the sentence to be analyzed is denoted by n throughout step 2, and step 2 comprises the following steps:
Step 2-1: in a given state (a state records which words of the current sentence have been labeled and with which label types, and which words are still unlabeled), all label types are scored when the t-th word is processed. The given state is one in which the first t-1 words of the sentence to be analyzed have been labeled and their label types are known, words t through n are unlabeled, and the t-th word is the next word to be processed;

Step 2-2: given a state set S_t, for each state in the set, all label types are scored as in step 2-1 when the t-th word is processed. Scoring is carried out by computation: each label type is assigned a real value, called the score of that type. The candidate label types are then generated as in step 1-2, the word is labeled with each candidate label type so as to extend the state, and the m highest-scoring new states are selected according to the beam search, giving the new state set S_{t+1};

Step 2-3: steps 2-1 and 2-2 are executed for t = 1, 2, …, n, yielding the final target state set S_{n+1}. The highest-scoring state is extracted and traced back, giving the highest-scoring label sequence; at this point the type labeling of all words is complete, and the highest-scoring label sequence is converted back into the corresponding chunk analysis, which is the analysis result for the current sentence.

The state-transition operation for each word in the invention is the category-labeling operation applied to the word read in, given a current sentence state. When the t-th word is labeled, one state of the previous state set S_t is given; the set of label types that may be applied is defined by step 1-2, and the scoring of every label in that set is carried out by a forward neural network. Scoring the label types applicable to the current word in a given state with the neural network involves two steps: first, the feature information, i.e. the input of the neural network, is generated; second, all candidate categories are scored with the network. Step 2-1 specifically comprises the following steps:
step 2-1-1, generating a feature vector, wherein the feature vector comprises a basic information feature vector and an additional information feature vector;
and 2-1-2, calculating the feature vector input generated in the step 2-1-1 by using a forward neural network to obtain the scores of all candidate labeling types.
It is first noted that in information processing a feature can be represented mainly in two ways: the one-hot representation and the distributed representation. A one-hot representation uses a very long vector to represent a feature; the length of the vector equals the size of the feature dictionary formed by all features, the component at the position of the feature in the dictionary is 1, and all other components are 0. A distributed representation assigns each feature a real-valued vector, whose dimensionality is set according to the needs of the task. Both representations are widely used in the field and should be well known to those skilled in the art, so they are not described further here. The representation adopted by the invention is the distributed representation: each feature is assigned a real-valued vector of a fixed dimensionality, which in embodiment 1 is 50. Generating this part of the input involves two steps, the generation of the basic information features and the generation of the additional information features. Throughout step 2-1-1 the words of the sentence to be analyzed are denoted, from left to right, w_1, w_2, …, w_n, where w_n is the n-th word and n is a natural number; the parts of speech of the words are denoted, from left to right, p_1, p_2, …, p_n, where p_n is the part of speech of the n-th word; and the feature vector corresponding to a feature x is written e(x). Step 2-1-1 comprises the following steps:

Step 2-1-1-1: the basic information feature vectors are generated. They comprise the feature vectors of the words and parts of speech inside a window around the position of the current word to be labeled, and the feature vectors of the categories of the already-labeled words inside such a window. The specific composition is as follows. The word feature vectors among the basic features are: e(w_-2) and e(w_-1), the vectors of the second and first words to the left of the current word to be processed; e(w_0), the vector of the current word to be processed; and e(w_1) and e(w_2), the vectors of the first and second words to its right;

The part-of-speech feature vectors are: e(p_-2), e(p_-1), e(p_0), e(p_1), e(p_2), the vectors of the parts of speech of the second and first words to the left of the current word to be processed, of the current word itself, and of the first and second words to its right; and e(p_-2 p_-1), e(p_-1 p_0), e(p_0 p_1), e(p_1 p_2), the vectors of the part-of-speech bigrams formed by adjacent positions in the same window;

In the chunking task, the basic features used to score the label types at each step generally comprise the words and parts of speech inside a window around the position of the current word to be labeled, and the categories of the already-labeled words inside such a window. Conventionally the current word is written w_0, the i-th word to its left w_-i and the i-th word to its right w_i; the part of speech of the current word is written p_0, of the i-th word to the left p_-i and of the i-th word to the right p_i. The category features of the labeled words differ from the former two: since all words and parts of speech of the whole sentence are known from the start of the analysis, their windows extend to both sides of the current word, but because labeling proceeds from left to right, only the label types of the words to the left of the current word are known when it is labeled, so the category window extends only to the left, and the label type of the i-th word to the left of the current word is written t_-i. The choice of i depends on the chosen window size; in embodiment 1, i = 2 (i.e. the window size is 5), and the corresponding basic features are shown in Tables 3, 4 and 5:
TABLE 3  Basic word features: w_-2, w_-1, w_0, w_1, w_2

TABLE 4  Basic part-of-speech features: p_-2, p_-1, p_0, p_1, p_2, p_-2 p_-1, p_-1 p_0, p_0 p_1, p_1 p_2

TABLE 5  Category features of the labeled words: t_-2, t_-1
It should be noted that the word- and part-of-speech-based features above are well known to those skilled in the art and widely used, so they are not described further here; see in particular the following reference: Chen W., Zhang Y., Isahara H. An empirical study of Chinese chunking. In: Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Association for Computational Linguistics, 2006: 97-104.
The category features of the labeled words have the same meaning as in conventional models such as hidden Markov models and conditional random fields, but they are used differently: in the invention this feature is handled in the same way as the word and part-of-speech features, whereas conventional models handle it with dynamic programming. In those models an increase of i brings a rapid increase in time cost, while in the state-transition-based approach of the invention the extra time cost of increasing i is small, which is a speed advantage of the state-transition framework when this feature is incorporated;
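As an illustration (my own, not the patent's code), the basic atomic features of Tables 3 to 5 with window size i = 2 could be collected as follows; the identifiers are hypothetical and the vector lookup e(.) is left out:

    # My own illustration of the basic atomic features of Tables 3-5 with window i = 2;
    # only the (name, offset, value) triples are built, the e(.) lookup is omitted.
    def basic_features(words, pos, tags, t, pad="<PAD>"):
        def w(j):                                     # word at offset j, padded at borders
            return words[t + j] if 0 <= t + j < len(words) else pad
        def p(j):                                     # part of speech at offset j
            return pos[t + j] if 0 <= t + j < len(pos) else pad
        feats  = [("w", j, w(j)) for j in (-2, -1, 0, 1, 2)]                 # Table 3
        feats += [("p", j, p(j)) for j in (-2, -1, 0, 1, 2)]                 # Table 4, unigrams
        feats += [("pp", j, p(j) + "|" + p(j + 1)) for j in (-2, -1, 0, 1)]  # Table 4, bigrams
        # Table 5: categories of already labeled words, available only to the left
        feats += [("t", j, tags[t + j] if t + j >= 0 else pad) for j in (-2, -1)]
        return feats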
Step 2-1-1-2: the additional information feature vectors are generated. They comprise the word and part-of-speech feature vectors related to the already-labeled chunks inside a window around the position of the current word to be labeled, and the word and part-of-speech feature vectors of the current position to be labeled computed with the bidirectional long short-term memory (LSTM) neural network model.
Step 2-1-1-2 comprises the following steps:
Step 2-1-1-2-1: the second and first chunks to the left of the current word to be processed are denoted c_-2 and c_-1 respectively. For a chunk c_i (i = -2, -1), its first word is written start_word(c_i), its last word end_word(c_i) and its syntactic head word head_word(c_i); the part of speech of its first word is written start_POS(c_i), of its last word end_POS(c_i) and of its head word head_POS(c_i). The word and part-of-speech feature vectors related to the labeled chunks inside the window around the current position are generated as follows. The chunk-level word feature vectors are: e(start_word(c_-2)), e(end_word(c_-2)), e(head_word(c_-2)), the vectors of the first word, last word and syntactic head word of the second chunk to the left of the current word to be processed, and e(start_word(c_-1)), e(end_word(c_-1)), e(head_word(c_-1)), the corresponding vectors for the first chunk to the left;

The chunk-level part-of-speech feature vectors are: e(start_POS(c_-2)), e(end_POS(c_-2)), e(head_POS(c_-2)), the vectors of the parts of speech of the first word, last word and syntactic head word of the second chunk to the left, and e(start_POS(c_-1)), e(end_POS(c_-1)), e(head_POS(c_-1)), the corresponding vectors for the first chunk to the left. The choice of i depends on the chosen window size; in embodiment 1, i = 2, and the corresponding chunk-level features are shown in Table 6:

TABLE 6  Chunk-level word and part-of-speech features: start_word(c_-2), end_word(c_-2), head_word(c_-2), start_word(c_-1), end_word(c_-1), head_word(c_-1); start_POS(c_-2), end_POS(c_-2), head_POS(c_-2), start_POS(c_-1), end_POS(c_-1), head_POS(c_-1)

It should be noted that the chunk-level features above are not used as in the present invention under conventional models such as conditional random fields, because those models are restricted by the Markov assumption; there they can only be used, after pruning, inside a complex dynamic-programming algorithm. See in particular: Zhou J., Qu W., Zhang F. Exploiting chunk-level features to improve phrase chunking. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012: 557-567.
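A minimal sketch (my own) of collecting the chunk-level features of Table 6 follows; the helper head_of, which picks the syntactic head of a chunk, is hypothetical:

    # My own illustration of the chunk-level features of Table 6. `chunks` holds the
    # chunks already recognized to the left of the current word as (type, start, end)
    # triples; head_of, which returns a chunk's syntactic head index, is hypothetical.
    def chunk_level_features(chunks, words, pos, head_of, pad="<PAD>"):
        feats = []
        for k in (-2, -1):                            # c_-2 and c_-1
            if len(chunks) + k < 0:                   # fewer than |k| chunks to the left
                feats += [pad] * 6
                continue
            ctype, start, end = chunks[k]
            head = head_of(ctype, start, end)
            feats += [words[start], words[end], words[head],   # start/end/head words
                      pos[start], pos[end], pos[head]]         # and their parts of speech
        return feats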
Step 2-1-1-2-2: the word and part-of-speech information feature vectors of the current position to be labeled are computed with the bidirectional long short-term memory (LSTM) neural network model. The input of the model is all words of the sentence to be analyzed together with their parts of speech; the output consists of forward word feature vectors, forward part-of-speech feature vectors, backward word feature vectors and backward part-of-speech feature vectors. First note that tanh in the formulas below is the hyperbolic tangent, a real-valued function; applied to a vector it acts on every element, yielding a target vector of the same dimensionality as the input. σ is the sigmoid function, also real-valued and applied element-wise in the same way. ⊙ is the point-wise product: two vectors of the same dimensionality are multiplied element by element, giving a result vector of that dimensionality. The four kinds of feature vectors are computed as follows:
the forward word feature vector is sequentially represented as hf(w1),hf(w2),…,hf(wn),hf(wt) (t ═ 1, …, n) represents the t-th forward word feature vector, which is calculated as follows:
f t w f = σ ( W f h w f h f ( w t - 1 ) + W f x w f e ( w t ) + W f c w f c t - 1 w f + b f w f ) ,
i t w f = σ ( W i h w f h f ( w t - 1 ) + W i x w f e ( w t ) + W i c w f c t - 1 w f + b i w f ) ,
o t w f = σ ( W o h w f h f ( w t - 1 ) + W o x w f e ( w t ) + W o c w f c t w f + b o w f ) ,
wherein, the method is a well-trained model parameter matrix (the training process is completed in a mode in an additional description of a model parameter training method in a specification), the value of each element in the matrix is a real numerical value, the group of parameters is irrelevant to t, namely all calculation units in a calculation sequence share the same group of parameters;
the intermediate calculation results in the t-th calculation unit are all real value vectors;
e(wt)、hf(wt-1)、is the input of the t-th computing unit, which is a real-valued vector, e (w) of whicht) I.e. the word wtA corresponding feature vector; h isf(wt)、Is the output of the t-th computing unit,only h is finally used as the characteristic vector of the forward word for the auxiliary calculation result of the long and short memory neural network modelf(wt-1) Since this is a serialized computational model, the output h of the t-1 st computational unitf(wt-1)、The input is the input of the t calculating unit;
etc. are all matrix multiplication operations.
The forward part-of-speech feature vectors are denoted, in order, h_f(p_1), h_f(p_2), …, h_f(p_n), where h_f(p_t) (t = 1, …, n) is the t-th forward part-of-speech feature vector, computed as follows:

f_t^{pf} = σ( W_{fh}^{pf} h_f(p_{t-1}) + W_{fx}^{pf} e(p_t) + W_{fc}^{pf} c_{t-1}^{pf} + b_f^{pf} ),

i_t^{pf} = σ( W_{ih}^{pf} h_f(p_{t-1}) + W_{ix}^{pf} e(p_t) + W_{ic}^{pf} c_{t-1}^{pf} + b_i^{pf} ),

c_t^{pf} = f_t^{pf} ⊙ c_{t-1}^{pf} + i_t^{pf} ⊙ tanh( W_{ch}^{pf} h_f(p_{t-1}) + W_{cx}^{pf} e(p_t) + b_c^{pf} ),

o_t^{pf} = σ( W_{oh}^{pf} h_f(p_{t-1}) + W_{ox}^{pf} e(p_t) + W_{oc}^{pf} c_t^{pf} + b_o^{pf} ),

h_f(p_t) = o_t^{pf} ⊙ tanh( c_t^{pf} ),

where the W^{pf} matrices and b^{pf} vectors are trained model parameters (trained as described in the additional description of the model-parameter training method in the specification); every element of them is a real value, and this group of parameters does not depend on t, i.e. all computation units of one computation sequence share the same parameters; f_t^{pf}, i_t^{pf}, o_t^{pf} are intermediate results of the t-th computation unit and are real-valued vectors; e(p_t), h_f(p_{t-1}) and c_{t-1}^{pf} are the inputs of the t-th computation unit and are real-valued vectors, e(p_t) being the feature vector of the part of speech p_t; h_f(p_t) and c_t^{pf} are the outputs of the t-th unit, but c_t^{pf} is only an auxiliary result, and only h_f(p_t) is finally used as the forward part-of-speech feature vector; the outputs h_f(p_{t-1}) and c_{t-1}^{pf} of the (t-1)-th unit are the inputs of the t-th unit; products such as W_{fh}^{pf} h_f(p_{t-1}) are matrix-vector multiplications.
The backward word feature vectors are denoted, in order, h_b(w_1), h_b(w_2), …, h_b(w_n), where h_b(w_t) (t = 1, …, n) is the t-th backward word feature vector, computed as follows:

f_t^{wb} = σ( W_{fh}^{wb} h_b(w_{t+1}) + W_{fx}^{wb} e(w_t) + W_{fc}^{wb} c_{t+1}^{wb} + b_f^{wb} ),

i_t^{wb} = σ( W_{ih}^{wb} h_b(w_{t+1}) + W_{ix}^{wb} e(w_t) + W_{ic}^{wb} c_{t+1}^{wb} + b_i^{wb} ),

c_t^{wb} = f_t^{wb} ⊙ c_{t+1}^{wb} + i_t^{wb} ⊙ tanh( W_{ch}^{wb} h_b(w_{t+1}) + W_{cx}^{wb} e(w_t) + b_c^{wb} ),

o_t^{wb} = σ( W_{oh}^{wb} h_b(w_{t+1}) + W_{ox}^{wb} e(w_t) + W_{oc}^{wb} c_t^{wb} + b_o^{wb} ),

h_b(w_t) = o_t^{wb} ⊙ tanh( c_t^{wb} ),

where the W^{wb} matrices and b^{wb} vectors are trained model parameters (trained as described in the additional description of the model-parameter training method in the specification); every element of them is a real value, and this group of parameters does not depend on t, i.e. all computation units of one computation sequence share the same parameters; f_t^{wb}, i_t^{wb}, o_t^{wb} are intermediate results of the t-th computation unit and are real-valued vectors; e(w_t), h_b(w_{t+1}) and c_{t+1}^{wb} are the inputs of the t-th computation unit and are real-valued vectors, e(w_t) being the feature vector of the word w_t; h_b(w_t) and c_t^{wb} are the outputs of the t-th unit, but c_t^{wb} is only an auxiliary result, and only h_b(w_t) is finally used as the backward word feature vector; since this sequence is computed from right to left, the outputs h_b(w_{t+1}) and c_{t+1}^{wb} of the (t+1)-th unit are the inputs of the t-th unit; products such as W_{fh}^{wb} h_b(w_{t+1}) are matrix-vector multiplications.
The backward part-of-speech feature vectors are denoted, in order, h_b(p_1), h_b(p_2), …, h_b(p_n), where h_b(p_t) (t = 1, …, n) is the t-th backward part-of-speech feature vector, computed as follows:

f_t^{pb} = σ( W_{fh}^{pb} h_b(p_{t+1}) + W_{fx}^{pb} e(p_t) + W_{fc}^{pb} c_{t+1}^{pb} + b_f^{pb} ),

i_t^{pb} = σ( W_{ih}^{pb} h_b(p_{t+1}) + W_{ix}^{pb} e(p_t) + W_{ic}^{pb} c_{t+1}^{pb} + b_i^{pb} ),

c_t^{pb} = f_t^{pb} ⊙ c_{t+1}^{pb} + i_t^{pb} ⊙ tanh( W_{ch}^{pb} h_b(p_{t+1}) + W_{cx}^{pb} e(p_t) + b_c^{pb} ),

o_t^{pb} = σ( W_{oh}^{pb} h_b(p_{t+1}) + W_{ox}^{pb} e(p_t) + W_{oc}^{pb} c_t^{pb} + b_o^{pb} ),

h_b(p_t) = o_t^{pb} ⊙ tanh( c_t^{pb} ),

where the W^{pb} matrices and b^{pb} vectors are trained model parameters (trained as described in the additional description of the model-parameter training method in the specification); every element of them is a real value, and this group of parameters does not depend on t, i.e. all computation units of one computation sequence share the same parameters; f_t^{pb}, i_t^{pb}, o_t^{pb} are intermediate results of the t-th computation unit and are real-valued vectors; e(p_t), h_b(p_{t+1}) and c_{t+1}^{pb} are the inputs of the t-th computation unit and are real-valued vectors, e(p_t) being the feature vector of the part of speech p_t; h_b(p_t) and c_t^{pb} are the outputs of the t-th unit, but c_t^{pb} is only an auxiliary result, and only h_b(p_t) is finally used as the backward part-of-speech feature vector; since this sequence is computed from right to left, the outputs h_b(p_{t+1}) and c_{t+1}^{pb} of the (t+1)-th unit are the inputs of the t-th unit; products such as W_{fh}^{pb} h_b(p_{t+1}) are matrix-vector multiplications.
In order to make full use of pattern information from word strings and part-of-speech strings farther away from the current word to be labeled, the invention computes the word and part-of-speech information features of the current position with a bidirectional long short-term memory model. The computation has a forward and a backward pass: the forward pass runs from left to right and the backward pass from right to left, and the two are computed in the same way, so only the forward pass is explained in detail here. Let the sentence length be n; the words of the sentence are denoted, from left to right, w_1, w_2, …, w_n, with feature vectors e(w_1), e(w_2), …, e(w_n), and the parts of speech are denoted p_1, p_2, …, p_n, with feature vectors e(p_1), e(p_2), …, e(p_n). The computed forward word feature vectors are denoted h_f(w_1), h_f(w_2), …, h_f(w_n) and the forward part-of-speech feature vectors h_f(p_1), h_f(p_2), …, h_f(p_n). All of these are trained real-valued vectors whose dimensionalities are set by the user; in embodiment 1 the dimensionality of e(w_t) and e(p_t) is 50 and that of h_f(w_t) and h_f(p_t) is 25.
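For illustration, a minimal NumPy sketch (my own, following the gate equations above) of one forward pass; the parameter dictionary P and the dimensionalities are assumptions, and the backward pass and the part-of-speech sequences would use the same recurrence with their own parameter groups:

    # My own NumPy sketch of one direction of the LSTM, following the gate equations above.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_forward(embeddings, P, hidden=25):
        """embeddings: (n, d) array of e(w_1)..e(w_n); returns (n, hidden) array of h_f."""
        h = np.zeros(hidden)
        c = np.zeros(hidden)
        outputs = []
        for x in embeddings:                                                  # left to right
            f = sigmoid(P["Wfh"] @ h + P["Wfx"] @ x + P["Wfc"] @ c + P["bf"]) # forget gate
            i = sigmoid(P["Wih"] @ h + P["Wix"] @ x + P["Wic"] @ c + P["bi"]) # input gate
            c = f * c + i * np.tanh(P["Wch"] @ h + P["Wcx"] @ x + P["bc"])    # cell state c_t
            o = sigmoid(P["Woh"] @ h + P["Wox"] @ x + P["Woc"] @ c + P["bo"]) # output gate
            h = o * np.tanh(c)                                                # h_f(w_t)
            outputs.append(h)
        return np.stack(outputs)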
In step 2-1-2 the scores of all label types are computed with the forward neural network. After step 2-1-1, a real-valued vector is obtained by concatenating the vectors of all features generated in step 2-1-1; its dimensionality is the sum of the dimensionalities of all those feature vectors. This vector is the input of the forward neural network, whose computation proceeds according to the following formulas:

h = σ( W_1 x + b_1 ),

o = W_2 h,

where W_1, b_1 and W_2 are trained model parameters whose elements are real values; x is the input vector obtained by concatenating all feature vectors of step 2-1-1, its dimensionality is the sum of their dimensionalities, and every element of it is a real value; h is the hidden-layer vector of the network, an intermediate result whose dimensionality is fixed in advance (300 in embodiment 1); o is the output, a real-valued vector whose dimensionality equals the number of label types that may be selected when a word is labeled during the labeling process defined in step 1-2, and whose g-th component is the score for labeling the current step with type g; W_1 x and W_2 h are matrix-vector multiplications.
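A minimal NumPy sketch (my own) of this scoring computation, assuming the feature vectors of step 2-1-1 are already available:

    # My own sketch of the scoring network: the concatenation of all feature vectors
    # from step 2-1-1 is mapped to one real-valued score per candidate label type.
    import numpy as np

    def score_label_types(feature_vectors, W1, b1, W2):
        x = np.concatenate(feature_vectors)            # input x, dimension = sum of parts
        h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))       # hidden layer, h = sigma(W1 x + b1)
        return W2 @ h                                  # output o, one score per label type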
Step 2-2 comprises the following steps:
and 2-2-1, giving each state in the previous state set, and scoring all the label types according to the mode in the step 2-1. Assumed state SxScore of (S) is score (S)x) Type of labelkScore of score (type)k) If all the label types are expanded, K new target states are obtained after the expansion and are expressed asK is the total number of all the labeled types, and the corresponding score of the kth state is calculated according to the following formula
s c o r e ( S i k t + 1 ) = s c o r e ( S i t ) + s c o r e ( type k ) ,
Wherein K is 1-K, and all the scores are real numerical values. Determining candidate marking type according to the mode in step 1-2, and setting state according to the candidate marking typeExpansion is performed assuming a set of states StIf there are c (i) candidate label types determined by the state in step 1-2, c (i) new states are obtained after the state is expanded and are expressed as
Step 2-2-2, assume state set StHaving z states, where z is a natural number, assembling the states into a set StWherein all the states are expanded in the mode of the step 2-2-1, and all the expanded states are
Step 2-2-3, extracting m states with highest scores from all the expanded states obtained in the step 2-2-2 in a column search mode to form a new state set
Advantageous effects: compared with the widely used methods based on the Markov assumption, the state-transition-based method used by this Chinese chunk analysis method can incorporate chunk-level features more flexibly; the neural network model used to score the candidate transition types of each state automatically learns how to combine features; and the bidirectional long short-term memory neural network model introduces useful additional information features. The combination of the three improves the accuracy of Chinese chunk analysis.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
Fig. 1 is a schematic diagram of a long-short memory neural network computing unit.
FIG. 2 is a schematic diagram of a network structure of a forward long-short memory neural network computation sequence.
Fig. 3 is a schematic diagram of a forward neural network structure.
Fig. 4 is a flow chart of the present invention.
Detailed Description
The invention provides a Chinese chunk analysis method based on state transition and a neural network. When each word of a sentence is labeled with a chunk type, the relevant information features are constructed from the information already available, a neural network is used to score all candidate categories, and the state-transition operation is then executed. In existing Chinese chunking techniques, the model assumptions prevent long-distance features from being used sufficiently, and complicated feature templates have to be designed by hand.
As shown in Fig. 4, the invention discloses a Chinese chunking analysis method based on state transition and a neural network, which can flexibly add chunk-level features, automatically learn how to combine features by using a neural network model, and introduce useful additional information features by using a bidirectional long short-term memory neural network model, thereby improving the accuracy of Chinese chunk analysis.
The complete Chinese chunk analysis process based on state transition and neural network comprises the following steps:
Step 1: a computer reads a Chinese text file containing the sentences to be analyzed; the Chinese chunk types are defined; the sentence to be analyzed is segmented into words and each word is tagged with its part of speech; and the label types that may be selected according to the current sentence state when a word is labeled are determined;
and 2, performing block analysis on each read sentence by using a state transition and neural network-based method.
The method for defining the Chinese chunk type and the annotation type comprises the following steps:
Step 1-1: the chunk types to be analyzed are defined. The chunk types are chosen by the user according to the specific goal; the traditional Chinese chunking task usually comes in two variants: one recognizes only noun phrases, the other recognizes the 12 chunk types defined on the basis of version 4.0 of the Chinese Penn Treebank (CTB);
Step 1-2: the label types that may be selected when each word is labeled during the labeling process are determined. Each word in the sentence is labeled with a combination of a chunk type and BIO or BIOES.
First, suppose the length of the sentence to be processed is n. A state of the sentence is defined that records which words of the current sentence have been labeled, the label type assigned to each labeled word, and which words are still unlabeled; the state set of the sentence before its i-th word is labeled is denoted S_i, and a state in it is denoted s_j^i. The beam size of the beam search is set to m. The analysis of the sentence comprises the following steps:
Step 3-1: in a given state, all label types are scored when the t-th word is processed;

Step 3-2: given a state set S_t, when the t-th word is processed each state in the set is labeled with every candidate label type so as to extend the state, and the m highest-scoring new states are selected by beam search, giving the new state set S_{t+1};

Step 3-3: steps 3-1 and 3-2 are executed iteratively for t = 1, 2, …, n, yielding the final target state set S_{n+1}; the highest-scoring state is extracted and traced back to obtain the label sequence of the whole sentence.
When the t-th word is processed, the invention is given one state of the previous state set S_t; the set of label types that may be applied is defined by step 1-2, and each label in that set is scored by a forward neural network. Scoring the label types applicable to the current word in a given state involves two steps: first, the feature information, i.e. the input of the neural network, is generated; second, all candidate categories are scored with the network. Step 3-1 specifically comprises the following steps:
step 3-1-1, generating a forward neural network input;
and 3-1-2, as shown in fig. 3, calculating the feature vector input generated in the step 3-1-1 by using a forward neural network to obtain the scores of all candidate label types.
The generation of the forward neural network input comprises two steps, namely the generation of the basic information features and the generation of the additional information features. Step 3-1-1 comprises the following steps:

Step 3-1-1-1: the basic information features are generated. They comprise the word and part-of-speech features inside a window around the position of the current word to be labeled, and the category features of the already-labeled words inside such a window. The word features are e(w_-2), e(w_-1), e(w_0), e(w_1), e(w_2), i.e. the feature vectors of the second and first words to the left of the current word to be processed, of the current word itself, and of the first and second words to its right. The part-of-speech features are e(p_-2), e(p_-1), e(p_0), e(p_1), e(p_2), e(p_-2 p_-1), e(p_-1 p_0), e(p_0 p_1), e(p_1 p_2), e(p_-2 p_-1 p_0), e(p_-1 p_0 p_1), e(p_0 p_1 p_2), i.e. the feature vectors of the parts of speech of the second and first words to the left, of the current word, and of the first and second words to the right, together with the part-of-speech bigrams and trigrams formed over the same window. All of these feature vectors are trained real-valued vectors.

Step 3-1-1-2: the additional information features are generated, in the following two steps:

Step 3-1-1-2-1: the word and part-of-speech features of the already-labeled chunks inside a window around the position of the current word to be labeled are generated. The chunk-level word features are e(start_word(c_-2)), e(end_word(c_-2)), e(head_word(c_-2)), e(start_word(c_-1)), e(end_word(c_-1)), e(head_word(c_-1)), i.e. the first word, last word and syntactic head word of the second chunk to the left of the current word to be processed and of the first chunk to its left. The chunk-level part-of-speech features are e(start_POS(c_-2)), e(end_POS(c_-2)), e(head_POS(c_-2)), e(start_POS(c_-1)), e(end_POS(c_-1)), e(head_POS(c_-1)), i.e. the parts of speech of the first word, last word and syntactic head word of those two chunks. All of these feature vectors are trained real-valued vectors;

Step 3-1-1-2-2: the word and part-of-speech information features of the current position to be labeled are computed with the bidirectional long short-term memory (LSTM) neural network model. The input of this step is all words of the sentence, denoted from left to right w_1, w_2, …, w_n, and the parts of speech of all words, denoted from left to right p_1, p_2, …, p_n. The output consists of the forward word feature vectors h_f(w_1), h_f(w_2), …, h_f(w_n); the forward part-of-speech feature vectors h_f(p_1), h_f(p_2), …, h_f(p_n); the backward word feature vectors h_b(w_1), h_b(w_2), …, h_b(w_n); and the backward part-of-speech feature vectors h_b(p_1), h_b(p_2), …, h_b(p_n). Since the backward pass differs from the forward pass only in its direction and the computation is otherwise identical, only the forward computation is described in detail here. Every h_f(x), where x is w_t or p_t (t = 1, 2, …, n), is computed in exactly the same way, only with different inputs and parameters (abbreviated h_f below), according to the following formulas:
f_t = \sigma(W_{fh} h_{t-1} + W_{fx} x_t + W_{fc} c_{t-1} + b_f),
i_t = \sigma(W_{ih} h_{t-1} + W_{ix} x_t + W_{ic} c_{t-1} + b_i),
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{ch} h_{t-1} + W_{cx} x_t + b_c),
o_t = \sigma(W_{oh} h_{t-1} + W_{ox} x_t + W_{oc} c_t + b_o),
h_t = o_t \odot \tanh(c_t),
where W_{fh}, W_{fx}, W_{fc}, b_f, W_{ih}, W_{ix}, W_{ic}, b_i, W_{ch}, W_{cx}, b_c, W_{oh}, W_{ox}, W_{oc}, b_o are trained model parameter matrices (training maximizes the likelihood of the correct label sequences in the training data set under the analysis method of the invention); the value of each element in these matrices is a real value. This group of parameters is independent of t, i.e., all calculation units in one calculation sequence share the same group of parameters; because the invention uses forward and backward calculation sequences for both words and parts of speech, there are 4 such groups of parameters in total. f_t, i_t, o_t are intermediate results in the t-th calculation unit and are all real-valued vectors. h_{t-1}, c_{t-1}, x_t are the inputs of the t-th calculation unit and are real-valued vectors, where x_t is e(w_t) or e(p_t). c_t and h_t are the outputs of the t-th calculation unit, but c_t is only an auxiliary result of the long short-term memory model; only h_t is finally used as the word or part-of-speech feature vector, i.e., h_t is the target feature vector h_f(w_t) or h_f(p_t). Note that, since this is a sequential computation model, the outputs h_{t-1} and c_{t-1} of the (t-1)-th calculation unit are exactly the inputs of the t-th calculation unit. tanh is the hyperbolic tangent, a real-valued function; applied to a vector it means that the operation is applied to every element, giving a target vector of the same dimension as the input vector. \sigma is the sigmoid function, likewise a real-valued element-wise function. \odot is the dot (element-wise) product, i.e., two vectors of the same dimension are multiplied bit by bit to give a result vector of the same dimension. W_{fh} h_{t-1}, W_{fx} x_t and so on are matrix multiplication operations.
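To make the computation concrete, here is a minimal numpy sketch of one forward long short-term memory pass implementing the formulas above. The parameter shapes, the 50-dimensional inputs, the hidden size of 25 and the random initialization are illustrative assumptions (trained parameters would be used in practice); the backward pass is identical except that the sequence is traversed from right to left with its own parameter group.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(xs, p):
    """Forward (left-to-right) LSTM pass of step 3-1-1-2-2.
    xs is the list of input vectors e(w_t) or e(p_t); the function returns the
    feature vectors h_f(x_1), ..., h_f(x_n) computed by the formulas above."""
    hidden = p["bf"].shape[0]
    h = np.zeros(hidden)                                                   # h_0
    c = np.zeros(hidden)                                                   # c_0
    hs = []
    for x in xs:                                                           # t-th calculation unit
        f = sigmoid(p["Wfh"] @ h + p["Wfx"] @ x + p["Wfc"] @ c + p["bf"])  # forget gate f_t
        i = sigmoid(p["Wih"] @ h + p["Wix"] @ x + p["Wic"] @ c + p["bi"])  # input gate i_t
        c = f * c + i * np.tanh(p["Wch"] @ h + p["Wcx"] @ x + p["bc"])     # cell state c_t
        o = sigmoid(p["Woh"] @ h + p["Wox"] @ x + p["Woc"] @ c + p["bo"])  # output gate o_t
        h = o * np.tanh(c)                                                 # feature vector h_t
        hs.append(h)
    return hs

# Toy usage with random (untrained) parameters: 50-dimensional inputs, 25-dimensional outputs.
rng = np.random.default_rng(0)
dim, hidden = 50, 25
p = {name: rng.uniform(-0.1, 0.1, (hidden, dim if name.endswith("x") else hidden))
     for name in ["Wfh", "Wfx", "Wfc", "Wih", "Wix", "Wic", "Wch", "Wcx", "Woh", "Wox", "Woc"]}
p.update({name: np.zeros(hidden) for name in ["bf", "bi", "bc", "bo"]})
hs = lstm_forward([rng.uniform(-0.1, 0.1, dim) for _ in range(7)], p)
print(len(hs), hs[0].shape)   # 7 feature vectors of dimension 25
```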
Step 3-1-2: feeding the feature vectors generated in step 3-1-1 into a forward neural network to obtain the scores of all label types. After step 3-1-1 is finished, a real-valued vector is obtained by concatenating the vectors corresponding to all features of step 3-1-1; its dimension is the sum of the dimensions of all the feature vectors. This vector is used as the input of the forward neural network, whose computation proceeds according to the following formulas:
h = \sigma(W_1 x + b),
o = W_2 h,
where W_1, b, W_2 are trained model parameter matrices, each element of which is a real value; x is the input vector, each element of which is a real value; o is the computed output, a real-valued vector whose dimension equals the number of label types that can be selected when labeling each word in the labeling process defined in step 1-2, its i-th value being the score for labeling the current step as category i; W_1 x and W_2 h are matrix multiplication operations.
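The following small sketch shows this scoring network; the hidden-layer size of 300 is an assumption made only for the example, while the input dimension of 1400 and the 27 label types follow embodiment 1 below.

```python
import numpy as np

def score_labels(x, W1, b1, W2):
    """Feed-forward scoring network of step 3-1-2: x is the concatenation of all
    feature vectors from step 3-1-1; the output has one real-valued score per label type."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(W1 @ x + b1)   # hidden layer, h = sigma(W1 x + b)
    return W2 @ h              # o = W2 h

# Toy usage: a 1400-dimensional input scored over 27 label types.
rng = np.random.default_rng(0)
x  = rng.uniform(-0.1, 0.1, 1400)
W1 = rng.uniform(-0.1, 0.1, (300, 1400))   # hidden size 300 is an assumption
b1 = np.zeros(300)
W2 = rng.uniform(-0.1, 0.1, (27, 300))
print(score_labels(x, W1, b1, W2).shape)   # (27,)
```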
Step 3-2: given a state set S^t when processing the t-th word, label the word according to each candidate label type for every state in the set so as to expand the states, then select the m new states with the highest scores in a beam-search manner, obtaining a new state set S^{t+1}. The step proceeds as follows:
Step 3-2-1: for each state in the previous state set, score all label types in the manner of step 3-1. Assume state S_x has score score(S_x) and label type_k has score score(type_k). If all types are expanded, K new target states are obtained (K is the total number of label types), denoted S_{i_1}^{t+1}, …, S_{i_K}^{t+1}; the corresponding scores are calculated according to the following formula:
score(S_{i_k}^{t+1}) = score(S_i^t) + score(type_k),
where all scores are real values. The candidate label types are then determined according to the constraint rules of step 1-2, and the state is expanded according to those label types. Suppose that for a state S_i^t in the state set S^t there are c(i) candidate label types determined by the constraint rules of step 1-2; then expanding state S_i^t yields c(i) new states, denoted S_{i_1}^{t+1}, …, S_{i_{c(i)}}^{t+1}.
Step 3-2-2: expand all states in the set S^t (assume there are m states) in the manner of step 3-2-1, obtaining all the expanded states.
Step 3-2-3: take the m highest-scoring states among all the states obtained in step 3-2-2 to form the new state set S^{t+1}.
Step 3-3: execute steps 3-1 and 3-2 for t = 1, 2, …, n to obtain the final target state set S^{n+1}, take out the state with the highest score, and backtrack from it to obtain the labeling sequence of the whole sentence, from which the chunk analysis result of the sentence is obtained.
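The beam search of steps 3-2 and 3-3 can be sketched as follows. Here `score_fn` and `candidates_fn` stand in for the neural scoring of step 3-1 and the constraint rules of step 1-2, the state representation `(score, labels)` and the dummy functions are assumptions for the sketch, and the beam size 4 follows embodiment 1.

```python
import heapq

def beam_chunk_labeling(words, score_fn, candidates_fn, beam_size=4):
    """Sketch of the beam search of steps 3-2 and 3-3.
    A state is (score, labels_so_far); score_fn(state, t) returns a dict
    {label_type: score} and candidates_fn(state, t) filters the allowed types."""
    states = [(0.0, [])]                          # initial state set S^1
    for t in range(len(words)):                   # process the t-th word
        expanded = []
        for score, labels in states:              # expand every state in S^t
            type_scores = score_fn((score, labels), t)
            for label in candidates_fn((score, labels), t):
                expanded.append((score + type_scores[label], labels + [label]))
        # keep the beam_size highest-scoring new states -> S^{t+1}
        states = heapq.nlargest(beam_size, expanded, key=lambda s: s[0])
    return max(states, key=lambda s: s[0])        # highest-scoring final state

# Toy usage with a dummy scorer and a dummy constraint that forbids I and E on the first word.
dummy_score = lambda state, t: {"B-NP": 0.5, "I": 0.2, "E": 0.1, "O": 1.0, "S-NP": 0.4}
dummy_cands = lambda state, t: ["B-NP", "O", "S-NP"] if not state[1] else ["B-NP", "I", "E", "O", "S-NP"]
print(beam_chunk_labeling(["Shanghai", "Pudong", "development"], dummy_score, dummy_cands))
```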
The model parameter training method used in the analysis process of the invention is additionally described as follows:
As can be seen from step 2 of the analysis process, the parameters used in the analysis process of the invention comprise the following components (hereinafter referred to as the model parameter set):
1. The feature vectors corresponding to the features, denoted e(*), where * stands for the basic word and part-of-speech features of step 2-1-1-1 and the chunk-level word and part-of-speech features of step 2-1-1-2-1; that is, all words and parts of speech appearing in the training corpus, together with the combinations of adjacent words and of adjacent parts of speech used as features, each correspond to one feature vector;
2. The neural network parameters used to compute the forward word sequence in step 2-1-1-2-2;
3. The neural network parameters used to compute the backward word sequence in step 2-1-1-2-2;
4. The neural network parameters used to compute the forward part-of-speech sequence in step 2-1-1-2-2;
5. The neural network parameters used to compute the backward part-of-speech sequence in step 2-1-1-2-2;
6. The forward neural network parameters W_1, W_2 used in step 2-1-2.
The training process is implemented iteratively by maximizing the likelihood of the correct label sequences in the training data set. Before training begins, the parameters in the model parameter set are initialized randomly; for example, in embodiments 1 and 2 the values are sampled from the uniform distribution between -0.1 and 0.1. The labeled data set (of size D), dataset = {sent_1, sent_2, …, sent_D}, is then used to train the parameters. First, a training objective, also called a loss function, is defined over the whole data set; it is a function of all parameters in the model parameter set and is denoted L(dataset). The loss for a single sentence sent_r is denoted loss(sent_r). The two are defined and computed as follows:
When the t-th word of a sentence is processed in the manner of step 2 of the analysis process, consider any state in the previous state set, denoted S_i^t as in step 2-2. From the process of step 2-1 it can be seen that the score score(type_k) obtained for the k-th label type in this state is in fact a composite function of all parameters in sets 2-5 (denoted Θ) and of the feature vectors from set 1 that are extracted in this state in steps 2-1-1-1 and 2-1-1-2-1. Denote all feature vectors extracted in state S_i^t when processing the t-th word collectively as E(S_i^t, t). Since the score of the whole sentence is to be expressed below, the score obtained for the k-th label type in state S_i^t when processing the t-th word is, for convenience, written score(S_i^t, t, type_k). Then:
score(S_i^t, t, type_k) = F(\Theta, E(S_i^t, t)),
where F is the composite function formed by composing the four long short-term memory neural networks and the forward neural network according to the process described in step 2-1, and Θ denotes all parameters in sets 2-5 of the model parameter set.
From step 2 it can also be seen that, after a sentence has been processed through steps 2-3, the resulting state set is S^{n+1}. The score of each state S_i^{n+1} in it is a composite function of all parameters in sets 2-5 of the model parameter set (denoted Θ) and of all the feature vectors from set 1 extracted in steps 2-1-1-1 and 2-1-1-2-1 while each word was processed along the whole path extending from the initial state to state S_i^{n+1}. For each state S_i^{n+1} in S^{n+1}, suppose that on this path the selected label-type sequence is type_{i_1}, type_{i_2}, …, type_{i_n} and the sequence of states passed through is S_{i_0}^1, S_{i_1}^2, …, S_{i_n}^{n+1} (where S_{i_0}^1 is the initial state and S_{i_n}^{n+1} is S_i^{n+1}). The score of state S_i^{n+1} is then:
score(S_i^{n+1}) = \sum_{j=1}^{n} score(S_{i_{j-1}}^{j}, j, type_{i_j}),
Since the training sentences are all labeled data, i.e., the correct labeling sequence is known, let S_{gold}^{n+1} denote the state in the state set S^{n+1} that corresponds to the correct labeling sequence. The loss function for the sentence is defined as:
loss(sent_r) = -\frac{e^{score(S_{gold}^{n+1})}}{\sum_{l=1}^{m} e^{score(S_l^{n+1})}},
where e^x denotes the exponential function and e is the base of the natural logarithm.
The loss function for the entire training data set is defined as:
L(dataset; \Theta, E) = \sum_{l=1}^{D} loss(sent_l),
where Θ and E indicate that the loss function is a function of the parameters in the model parameter set.
The objective of the whole training process is to minimize the above loss function. Various methods for minimizing it and obtaining the parameters are well known to practitioners in the art; in the embodiments, for example, stochastic gradient descent is used to optimize the loss function.
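The per-sentence loss can be sketched as follows, assuming the reconstruction given above, i.e., the negative of the softmax-normalized probability assigned to the gold-sequence state among the m final beam states; the function name and the max-subtraction for numerical stability are choices made only for this sketch.

```python
import numpy as np

def sentence_loss(beam_scores, gold_index):
    """loss(sent_r) for one sentence: minus the softmax-normalized probability of
    the state corresponding to the correct labeling sequence among the final beam states."""
    scores = np.asarray(beam_scores, dtype=float)
    exp = np.exp(scores - scores.max())          # subtracting the max leaves the ratio unchanged
    return -exp[gold_index] / exp.sum()

# Toy usage: four final beam states, the first of which is the gold sequence.
print(sentence_loss([24.6169, 20.2407, 19.7653, 19.6299], gold_index=0))
```

Summing this quantity over all sentences gives L(dataset), which is then minimized, for example by stochastic gradient descent as stated above.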
Example 1
First, the model parameters in this embodiment are trained, in the manner described above in the additional description of the model parameter training method, on 9978 sentences from 728 files of CTB (the Chinese Penn Treebank) version 4.0 (file numbers from chtb_001.fid to chtb_899.ptb; note that the numbers are not consecutive, which is why there are only 728 files).
This embodiment performs a complete Chinese chunk analysis of a sentence using the state transition and neural network-based Chinese chunk analysis method of the invention, as follows:
Step 1-1: define the Chinese chunk types; 12 types are defined on the basis of the Chinese Penn Treebank CTB 4.0: ADJP, ADVP, CLP, DNP, DP, DVP, LCP, LST, NP, PP, QP, VP, whose specific meanings are given in step 1-1 of the specification;
Step 1-2: determine the label types that may be selected when labeling each word during the labeling process, using the BIOES scheme. The finally determined label types are the following 27: B-ADJP, B-ADVP, B-CLP, B-DNP, B-DP, B-DVP, B-LCP, B-LST, B-NP, B-PP, B-QP, B-VP, S-ADJP, I, O, E, S-ADVP, S-CLP, S-DNP, S-DP, S-DVP, S-LCP, S-LST, S-NP, S-PP, S-QP, S-VP;
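A small sketch of how these 27 label types arise from the 12 chunk types under the BIOES scheme (B- and S- combined with each chunk type; I, O and E kept untyped, as listed above); the constant names are assumptions for the sketch.

```python
# Generate the 27 labeling types of step 1-2 from the 12 chunk types of step 1-1.
CHUNK_TYPES = ["ADJP", "ADVP", "CLP", "DNP", "DP", "DVP",
               "LCP", "LST", "NP", "PP", "QP", "VP"]
LABELS = (["B-" + t for t in CHUNK_TYPES]      # chunk-initial labels
          + ["S-" + t for t in CHUNK_TYPES]    # single-word chunk labels
          + ["I", "O", "E"])                   # inside, outside, end
print(len(LABELS))   # 27
```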
Step 2-1: the computer reads a natural language text file containing the sentence to be analyzed. For convenience of explanation, the sentence "Shanghai/NR Pudong/NR development/NN and/CC legal/NN construction/NN synchronization/VV" is read in;
Step 3: initially the state set is S^1, which contains one state S_1^1; this state is the initial sentence with no word labeled yet. The following steps are then executed;
Step 3-1: processing the 1st word "Shanghai", and executing the following steps:
step 3-1-1, generating the input of the forward network, and executing the following steps:
Step 3-1-1-1: generate the basic information features. Since this is the first word, there are no words to its left, so a padding word, assumed to be "word_start", and a padding part of speech, assumed to be "POS_start", are added on its left as usual. The corresponding word features are w_{-2} = "word_start", w_{-1} = "word_start", w_0 = "Shanghai", w_1 = "Pudong", w_2 = "development"; the part-of-speech features are p_{-2} = "POS_start", p_{-1} = "POS_start", p_0 = "NR", p_1 = "NR", p_2 = "NN", p_{-2}p_{-1} = "POS_start POS_start", p_{-1}p_0 = "POS_start NR", p_0p_1 = "NR NR", p_1p_2 = "NR NN", and so on. The vector representations corresponding to these features are then looked up; in this embodiment the dimensions of these feature vectors are all set to 50, and they are all real-valued vectors, e.g., the first 5 elements of e(w_0) are -0.0999, 0.0599, 0.0669, -0.0786 and 0.0527;
Step 3-1-1-2: generate the additional information features, as follows:
Step 3-1-1-2-1: generate the chunk-related word and part-of-speech feature vectors. Since no chunks have been analyzed before this word, padding items are likewise used: start_word(c_{-2}) = "start_chunk_word_NULL", end_word(c_{-2}) = "end_chunk_word_NULL", head_word(c_{-2}) = "head_chunk_word_NULL", start_word(c_{-1}) = "start_chunk_word_NULL", end_word(c_{-1}) = "end_chunk_word_NULL", head_word(c_{-1}) = "head_chunk_word_NULL", start_POS(c_{-2}) = "start_chunk_POS_NULL", end_POS(c_{-2}) = "end_chunk_POS_NULL", head_POS(c_{-2}) = "head_chunk_POS_NULL", start_POS(c_{-1}) = "start_chunk_POS_NULL", end_POS(c_{-1}) = "end_chunk_POS_NULL", head_POS(c_{-1}) = "head_chunk_POS_NULL". The vector representations corresponding to these features are then looked up; in this embodiment the dimensions of these feature vectors are all set to 50, and they are all real-valued vectors;
Step 3-1-1-2-2: as shown in Fig. 1 and Fig. 2, generate the word and part-of-speech information feature vectors of the current position to be labeled, computed with the bidirectional long short-term memory neural network model. For the word feature vectors the input is the vector representation of each word in the sentence, and for the part-of-speech feature vectors the input is the vector representation of each part of speech in the sentence; these vector representations are the same as those of the same word or part of speech in step 3-1-1-1, e.g., e(w_0) (w_0 = "Shanghai") still has the values -0.0999, 0.0599, 0.0669, -0.0786, 0.0527. The parameters of the long short-term memory models are all real values; for example, the first 5 values in the first row of the matrix W_{fh} used for computing the forward word vectors are 0.13637, 0.11527, -0.06217, -0.19870, 0.03157. The feature vectors h_f and h_b corresponding to each word and part of speech are then computed; they are all real-valued vectors, and in this embodiment the dimensions of h_f and h_b are both set to 25.
Step 3-1-2: concatenate all vectors obtained in step 3-1-1 into one real-valued vector, which in this example has 14 × 50 + 12 × 50 + 4 × 25 = 1400 dimensions, and then obtain the scores of all 27 label types: 0.7898 (B-ADJP), 0.4961 (B-ADVP), -0.1281 (B-CLP), -0.0817 (B-DNP), 0.5265 (B-DP), -0.0789 (B-DVP), 0.4362 (B-LCP), -0.2250 (B-LST), 2.9887 (B-NP), -0.0726 (B-PP), 0.1320 (B-QP), 0.4636 (B-VP), 1.6294 (E), 1.8871 (I), -0.3904 (O), 0.6985 (S-ADJP), -0.1703 (S-ADVP), -0.3287 (S-CLP), 0.1734 (S-DNP), 0.5694 (S-DP), 0.0990 (S-DVP), 0.0902 (S-LCP), -1.0364 (S-LST), 2.0767 (S-NP), -0.0179 (S-PP), -0.0606 (S-QP), 0.0941 (S-VP);
Step 3-2-1: the currently given state set is S^1, which contains only one state S_1^1 with score(S_1^1) = 0. According to constraint rule 1 of step 1-2 in the specification, the label types I and E obtained in step 3-1-2 are removed (score(I) = 1.8871 and score(E) = 1.6294). The state S_1^1 is then expanded according to each remaining label type and the score of the corresponding target state is calculated; since score(S_1^1) = 0, score(S_{1_k}^2) = score(type_k), for example the state expanded with B-NP has score 2.9887;
Step 3-2-2: every state in the set S^1 is expanded in the manner of step 3-2-1. Since S^1 contains only S_1^1, 27 - 2 = 25 new states are obtained;
Step 3-2-3: the 4 highest-scoring states among the 25 new states are selected to form the new state set S^2, which contains the following four states:
1. score 2.9887, representing "Shanghai/NR_B-NP Pudong/NR development/NN and/CC legal/NN construction/NN synchronization/VV";
2. score 2.0767, representing "Shanghai/NR_S-NP Pudong/NR development/NN and/CC legal/NN construction/NN synchronization/VV";
3. score 0.7898, representing "Shanghai/NR_S-ADJP Pudong/NR development/NN and/CC legal/NN construction/NN synchronization/VV";
4. score 0.6985, representing "Shanghai/NR_B-QP Pudong/NR development/NN and/CC legal/NN construction/NN synchronization/VV".
Step 3-3: the remaining words are processed in the manner of steps 3-1 and 3-2, yielding the final target state set S^8, which contains the following four states:
1. score 24.6169, representing "Shanghai/NR_B-NP Pudong/NR_E development/NN_B-NP and/CC_I legal/NN_I construction/NN_E synchronization/VV_S-VP";
2. score 20.2407, representing "Shanghai/NR_B-NP Pudong/NR_E development/NN_B-NP and/CC_I legal/NN_E construction/NN_S-VP synchronization/VV_S-VP";
3. score 19.7653, representing "Shanghai/NR_B-NP Pudong/NR_E development/NN_B-NP and/CC_I legal/NN_I construction/NN_E synchronization/VV_B-VP";
4. score 19.6299, representing "Shanghai/NR_B-NP Pudong/NR_E development/NN_B-NP and/CC_I legal/NN_I construction/NN_E synchronization/VV_O".
The state with the highest score is taken out and backtracking is performed, giving the labeling sequence of the whole sentence: "Shanghai/NR_B-NP Pudong/NR_E development/NN_B-NP and/CC_I legal/NN_I construction/NN_E synchronization/VV_S-VP".
the analysis result of the corresponding blocks is [ NP Shanghai Pudong ] [ NP development and legal construction ] [ VP synchronization ].
Example 2
The algorithm used by the invention is implemented in C++. The machine used in the experiments of this embodiment is an Intel(R) Core(TM) i7-5930K processor with a 3.50 GHz main frequency and 64 GB of memory. First, the model parameters in this embodiment are trained, in the manner described above in the additional description of the model parameter training method, on 9978 sentences from 728 files of CTB (the Chinese Penn Treebank) version 4.0 (file numbers from chtb_001.fid to chtb_899.ptb; note that the numbers are not consecutive, which is why there are only 728 files). The experimental test performs chunk analysis on 5290 sentences from 110 files (file numbers from chtb_900.fid to chtb_1078.ptb; note that the numbers are not consecutive, which is why there are only 110 files). The experimental results are shown in Table 7:
Table 7. Experimental results
Here, MBL (memory-based learning), TBL (transformation-based learning), CRF (conditional random field) and SVM (support vector machine) are four conventional machine learning algorithms commonly used for this task. It should be noted that evaluating on this data set is a common way of evaluating Chinese chunk analysis methods. It can be seen that the method of the invention achieves a higher F1-score on the data set, demonstrating its effectiveness.
The calculation of the F1-score is described here. Since the test set is labeled data, the correct labeling result is known. Suppose that, for the whole data set, the set of all correct chunks is S(gold), with size count(gold); after every sentence in the data set has undergone chunk analysis in the manner of embodiment 1, all chunks in the analysis results form the prediction set S(predict), with size count(predict); the set of chunks that appear in both S(gold) and S(predict) is S(correct), with size count(correct). Denoting the prediction precision by precision and the prediction recall by recall, the values are calculated as follows:
precision = \frac{count(correct)}{count(predict)},
recall = \frac{count(correct)}{count(gold)},
F1\text{-}score = \frac{2 \times precision \times recall}{precision + recall}.
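These three formulas can be sketched directly in code; representing each chunk as a (type, start index, end index) triple is an assumption made only for this sketch.

```python
def chunk_prf(gold_chunks, predicted_chunks):
    """Precision, recall and F1-score over chunk sets, as defined above.
    Chunks are compared as exact items (type plus span), so sets are used."""
    gold, pred = set(gold_chunks), set(predicted_chunks)
    correct = len(gold & pred)                     # count(correct)
    precision = correct / len(pred) if pred else 0.0
    recall    = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy usage: chunks represented as (type, start index, end index) triples.
gold = [("NP", 0, 1), ("NP", 2, 5), ("VP", 6, 6)]
pred = [("NP", 0, 1), ("NP", 2, 4), ("VP", 6, 6)]
print(chunk_prf(gold, pred))   # (0.666..., 0.666..., 0.666...)
```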

Claims (9)

1. A Chinese chunk analysis method based on state transition and a neural network is characterized by comprising the following steps:
step 1, a computer reads a Chinese text file containing a sentence to be analyzed, defines the Chinese chunk types, performs word segmentation on the sentence to be analyzed, labels the part of speech of each word, and determines the label types that can be selected according to the current sentence state when labeling each word in the labeling process;
and 2, performing Chinese chunk analysis on the sentence to be analyzed by using a state transition and neural network-based method.
2. The method of claim 1, wherein step 1 comprises the steps of:
step 1-1, defining Chinese chunk types according to 12 phrase types defined in table 1;
TABLE 1
Type    Meaning
ADJP    Adjective phrase
ADVP    Adverb phrase
CLP     Classifier phrase
DNP     Attributive phrase (formed with 的)
DP      Determiner phrase
DVP     Adverbial phrase (formed with 地)
LCP     Localizer (orientation) phrase
LST     List marker phrase
NP      Noun phrase
PP      Preposition phrase
QP      Quantifier phrase
VP      Verb phrase
And step 1-2, determining the label types that can be selected when labeling each word to be labeled in the labeling process, by combining the BIOES labeling scheme with the Chinese chunk types defined in step 1-1.
3. The method of claim 2, wherein in step 2 the Chinese chunk analysis is treated as a sequence labeling task, and the label types are generated by combining the Chinese chunk types defined in step 1-1 with the BIOES labeling scheme used in step 1-2.
4. A method according to claim 3, characterized in that the length of the sentence to be analyzed is denoted by n throughout step 2, step 2 comprising the steps of:
step 2-1, in a given state, scoring all the label types when processing the t-th word, wherein the given state means that the first t-1 words of the sentence to be analyzed have been labeled and their label types are known, the t-th to n-th words are unlabeled, and the t-th word is the next word to be processed;
step 2-2, given a state set S^t, when the t-th word is processed, scoring all the label types for each state in the state set in the manner of step 2-1, wherein after this scoring is completed each label type is assigned a real value called the score of that type; then generating the candidate label types in the manner of step 1-2, labeling the word according to each candidate label type so as to expand the state, and selecting the m new states with the highest scores in a beam-search manner, obtaining a new state set S^{t+1};
Step 2-3, for t = 1, 2, …, n, iteratively executing steps 2-1 and 2-2 to obtain the final target state set S^{n+1}; taking out the state with the highest score and backtracking from that state to obtain the highest-scoring labeling sequence, at which point all words have been labeled; and restoring this labeling sequence to the corresponding chunk analysis result, which is the analysis result of the current sentence.
5. The method of claim 4, wherein step 2-1 comprises the steps of:
step 2-1-1, generating a feature vector, wherein the feature vector comprises a basic information feature vector and an additional information feature vector;
and 2-1-2, calculating the feature vectors generated in the step 2-1-1 by using a forward neural network to obtain scores of all candidate labeling types.
6. The method according to claim 5, wherein throughout step 2-1-1 all words in the sentence to be analyzed are denoted from left to right as w_1, w_2, …, w_n, where w_n is the n-th word of the sentence to be analyzed and n is a natural number; the parts of speech corresponding to all words in the sentence to be analyzed are denoted from left to right as p_1, p_2, …, p_n, where p_n is the part of speech of the n-th word; the feature vector corresponding to a feature is denoted e(*); and step 2-1-1 comprises the following steps:
step 2-1-1-1, generating the basic information feature vectors, which comprise the feature vectors corresponding to the word and part-of-speech features within a certain window around the position of the current word to be labeled, and the feature vectors corresponding to the category features of the labeled words within that window; specifically, the word feature vectors in the basic information features comprise: the feature vector e(w_{-2}) of the second word counted leftwards from the current word to be processed, the feature vector e(w_{-1}) of the first word counted leftwards from the current word to be processed, the feature vector e(w_0) of the current word to be processed, the feature vector e(w_1) of the first word counted rightwards from the current word to be processed, and the feature vector e(w_2) of the second word counted rightwards from the current word to be processed;
The part-of-speech feature vectors comprise: the feature vector e(p_{-2}) of the part of speech of the second word counted leftwards from the current word to be processed, the feature vector e(p_{-1}) of the part of speech of the first word counted leftwards, the feature vector e(p_0) of the part of speech of the current word to be processed, the feature vector e(p_1) of the part of speech of the first word counted rightwards, the feature vector e(p_2) of the part of speech of the second word counted rightwards, the feature vector e(p_{-2}p_{-1}) of the part-of-speech combination of the second and first words counted leftwards, the feature vector e(p_{-1}p_0) of the part-of-speech combination of the first word counted leftwards and the current word, the feature vector e(p_0p_1) of the part-of-speech combination of the current word and the first word counted rightwards, and the feature vector e(p_1p_2) of the part-of-speech combination of the first and second words counted rightwards from the current word to be processed;
Step 2-1-1-2, generating an additional information feature vector: the additional information characteristic vector comprises a word characteristic vector and a part-of-speech characteristic vector which are related to the marked chunks in a certain window by taking the position of the current word to be marked as a reference, and the word characteristic vector and the part-of-speech characteristic vector of the current position to be marked, which are calculated by using a bidirectional long and short memory neural network model.
7. The method of claim 6, wherein step 2-1-1-2 comprises the steps of:
step 2-1-1-2-1, the second and first chunks counted leftwards from the current word to be processed are denoted c_{-2} and c_{-1} respectively; the first word of chunk c_i is denoted start_word(c_i), its last word end_word(c_i), and its grammatical head word head_word(c_i), with i = -2, -1; the part of speech of the first word of chunk c_i is denoted start_POS(c_i), that of the last word end_POS(c_i), and that of the grammatical head word head_POS(c_i); generating the word feature vectors and part-of-speech feature vectors related to the labeled chunks within a certain window around the position of the current word to be labeled:
The chunk-level word feature vectors comprise: the feature vector e(start_word(c_{-2})) of the first word of the second chunk counted leftwards from the current word to be processed, the feature vector e(end_word(c_{-2})) of the last word of that chunk, the feature vector e(head_word(c_{-2})) of its grammatical head word, the feature vector e(start_word(c_{-1})) of the first word of the first chunk counted leftwards from the current word to be processed, the feature vector e(end_word(c_{-1})) of the last word of that chunk, and the feature vector e(head_word(c_{-1})) of its grammatical head word;
The chunk-level part-of-speech feature vectors comprise: the feature vector e(start_POS(c_{-2})) of the part of speech of the first word of the second chunk counted leftwards from the current word to be processed, the feature vector e(end_POS(c_{-2})) of the part of speech of the last word of that chunk, the feature vector e(head_POS(c_{-2})) of the part of speech of its grammatical head word, the feature vector e(start_POS(c_{-1})) of the part of speech of the first word of the first chunk counted leftwards from the current word to be processed, the feature vector e(end_POS(c_{-1})) of the part of speech of the last word of that chunk, and the feature vector e(head_POS(c_{-1})) of the part of speech of its grammatical head word;
Step 2-1-1-2-2, computing, with the bidirectional long short-term memory neural network model, the word and part-of-speech information feature vectors of the current position to be labeled: the inputs of the bidirectional long short-term memory neural network model are all words of the sentence to be analyzed and the parts of speech corresponding to all those words; the outputs are a forward word feature vector, a forward part-of-speech feature vector, a backward word feature vector and a backward part-of-speech feature vector. In the following formulas, tanh is the hyperbolic tangent, a real-valued function; applied to a vector it means that the operation is applied to every element of the vector, giving a target vector of the same dimension as the input vector; \sigma is the sigmoid function, likewise a real-valued element-wise function; \odot is the dot (element-wise) product, i.e., two vectors of the same dimension are multiplied bit by bit to give a result vector of the same dimension. The four kinds of feature vectors are calculated as follows:
The forward word feature vectors are denoted in turn h_f(w_1), h_f(w_2), …, h_f(w_n), where h_f(w_t) is the t-th forward word feature vector, calculated according to the following formulas:
f_t^{wf} = \sigma(W_{fh}^{wf} h_f(w_{t-1}) + W_{fx}^{wf} e(w_t) + W_{fc}^{wf} c_{t-1}^{wf} + b_f^{wf}),
i_t^{wf} = \sigma(W_{ih}^{wf} h_f(w_{t-1}) + W_{ix}^{wf} e(w_t) + W_{ic}^{wf} c_{t-1}^{wf} + b_i^{wf}),
o_t^{wf} = \sigma(W_{oh}^{wf} h_f(w_{t-1}) + W_{ox}^{wf} e(w_t) + W_{oc}^{wf} c_t^{wf} + b_o^{wf}),
where W_{fh}^{wf}, W_{fx}^{wf}, W_{fc}^{wf}, b_f^{wf}, W_{ih}^{wf}, W_{ix}^{wf}, W_{ic}^{wf}, b_i^{wf}, W_{oh}^{wf}, W_{ox}^{wf}, W_{oc}^{wf}, b_o^{wf} are trained model parameter matrices, each element of which is a real value; this group of parameters is independent of t, i.e., all calculation units in one calculation sequence share the same group of parameters;
f_t^{wf}, i_t^{wf}, o_t^{wf} are intermediate results in the t-th calculation unit and are all real-valued vectors;
e(w_t), h_f(w_{t-1}) and c_{t-1}^{wf} are the inputs of the t-th calculation unit and are real-valued vectors, where e(w_t) is the feature vector corresponding to the word w_t; h_f(w_t) and c_t^{wf} are the outputs of the t-th calculation unit, but c_t^{wf} is only an auxiliary result of the long short-term memory model, and only h_f(w_t) is finally used as the forward word feature vector; since this is a sequential computation model, the outputs h_f(w_{t-1}) and c_{t-1}^{wf} of the (t-1)-th calculation unit are the inputs of the t-th calculation unit;
The forward part-of-speech feature vectors are denoted in turn h_f(p_1), h_f(p_2), …, h_f(p_n), where h_f(p_t) is the t-th forward part-of-speech feature vector, calculated according to the following formulas:
f_t^{pf} = \sigma(W_{fh}^{pf} h_f(p_{t-1}) + W_{fx}^{pf} e(p_t) + W_{fc}^{pf} c_{t-1}^{pf} + b_f^{pf}),
i_t^{pf} = \sigma(W_{ih}^{pf} h_f(p_{t-1}) + W_{ix}^{pf} e(p_t) + W_{ic}^{pf} c_{t-1}^{pf} + b_i^{pf}),
o_t^{pf} = \sigma(W_{oh}^{pf} h_f(p_{t-1}) + W_{ox}^{pf} e(p_t) + W_{oc}^{pf} c_t^{pf} + b_o^{pf}),
where W_{fh}^{pf}, W_{fx}^{pf}, W_{fc}^{pf}, b_f^{pf}, W_{ih}^{pf}, W_{ix}^{pf}, W_{ic}^{pf}, b_i^{pf}, W_{oh}^{pf}, W_{ox}^{pf}, W_{oc}^{pf}, b_o^{pf} are trained model parameter matrices, each element of which is a real value; this group of parameters is independent of t, i.e., all calculation units in one calculation sequence share the same group of parameters;
f_t^{pf}, i_t^{pf}, o_t^{pf} are intermediate results in the t-th calculation unit and are all real-valued vectors;
e(p_t), h_f(p_{t-1}) and c_{t-1}^{pf} are the inputs of the t-th calculation unit and are real-valued vectors, where e(p_t) is the feature vector corresponding to the part of speech p_t; h_f(p_t) and c_t^{pf} are the outputs of the t-th calculation unit, but c_t^{pf} is only an auxiliary result of the long short-term memory model, and only h_f(p_t) is finally used as the forward part-of-speech feature vector; since this is a sequential computation model, the outputs h_f(p_{t-1}) and c_{t-1}^{pf} of the (t-1)-th calculation unit are the inputs of the t-th calculation unit;
The backward word feature vectors are denoted in turn h_b(w_1), h_b(w_2), …, h_b(w_n), where h_b(w_t) is the t-th backward word feature vector, calculated according to the following formulas:
f_t^{wb} = \sigma(W_{fh}^{wb} h_b(w_{t+1}) + W_{fx}^{wb} e(w_t) + W_{fc}^{wb} c_{t+1}^{wb} + b_f^{wb}),
i_t^{wb} = \sigma(W_{ih}^{wb} h_b(w_{t+1}) + W_{ix}^{wb} e(w_t) + W_{ic}^{wb} c_{t+1}^{wb} + b_i^{wb}),
o_t^{wb} = \sigma(W_{oh}^{wb} h_b(w_{t+1}) + W_{ox}^{wb} e(w_t) + W_{oc}^{wb} c_t^{wb} + b_o^{wb}),
where W_{fh}^{wb}, W_{fx}^{wb}, W_{fc}^{wb}, b_f^{wb}, W_{ih}^{wb}, W_{ix}^{wb}, W_{ic}^{wb}, b_i^{wb}, W_{oh}^{wb}, W_{ox}^{wb}, W_{oc}^{wb}, b_o^{wb} are trained model parameter matrices, each element of which is a real value; this group of parameters is independent of t, i.e., all calculation units in one calculation sequence share the same group of parameters;
f_t^{wb}, i_t^{wb}, o_t^{wb} are intermediate results in the t-th calculation unit and are all real-valued vectors; e(w_t), h_b(w_{t+1}) and c_{t+1}^{wb} are the inputs of the t-th calculation unit and are real-valued vectors, where e(w_t) is the feature vector corresponding to the word w_t; h_b(w_t) and c_t^{wb} are the outputs of the t-th calculation unit, but c_t^{wb} is only an auxiliary result of the long short-term memory model, and only h_b(w_t) is finally used as the backward word feature vector; since this is a sequential computation model, the outputs h_b(w_{t+1}) and c_{t+1}^{wb} of the (t+1)-th calculation unit are the inputs of the t-th calculation unit;
The backward part-of-speech feature vectors are denoted in turn h_b(p_1), h_b(p_2), …, h_b(p_n), where h_b(p_t) is the t-th backward part-of-speech feature vector, calculated according to the following formulas:
f_t^{pb} = \sigma(W_{fh}^{pb} h_b(p_{t+1}) + W_{fx}^{pb} e(p_t) + W_{fc}^{pb} c_{t+1}^{pb} + b_f^{pb}),
i_t^{pb} = \sigma(W_{ih}^{pb} h_b(p_{t+1}) + W_{ix}^{pb} e(p_t) + W_{ic}^{pb} c_{t+1}^{pb} + b_i^{pb}),
o_t^{pb} = \sigma(W_{oh}^{pb} h_b(p_{t+1}) + W_{ox}^{pb} e(p_t) + W_{oc}^{pb} c_t^{pb} + b_o^{pb}),
where W_{fh}^{pb}, W_{fx}^{pb}, W_{fc}^{pb}, b_f^{pb}, W_{ih}^{pb}, W_{ix}^{pb}, W_{ic}^{pb}, b_i^{pb}, W_{oh}^{pb}, W_{ox}^{pb}, W_{oc}^{pb}, b_o^{pb} are trained model parameter matrices, each element of which is a real value; this group of parameters is independent of t, i.e., all calculation units in one calculation sequence share the same group of parameters;
f_t^{pb}, i_t^{pb}, o_t^{pb} are intermediate results in the t-th calculation unit and are all real-valued vectors;
e(p_t), h_b(p_{t+1}) and c_{t+1}^{pb} are the inputs of the t-th calculation unit and are real-valued vectors, where e(p_t) is the feature vector corresponding to the part of speech p_t; h_b(p_t) and c_t^{pb} are the outputs of the t-th calculation unit, but c_t^{pb} is only an auxiliary result of the long short-term memory model, and only h_b(p_t) is finally used as the backward part-of-speech feature vector; since this is a sequential computation model, the outputs h_b(p_{t+1}) and c_{t+1}^{pb} of the (t+1)-th calculation unit are the inputs of the t-th calculation unit.
8. The method of claim 7, wherein the forward neural network is used in step 2-1-2 to calculate the scores of all the labeled types, and the calculation process of the whole forward neural network is performed according to the following formula:
h = \sigma(W_1 x + b_1),
o = W_2 h,
where W_1, b_1, W_2 are trained model parameter matrices, each element of which is a real value; x is the input vector, formed by concatenating all the feature vectors obtained in step 2-1-1, its dimension being the sum of the dimensions of all the feature vectors generated in step 2-1-1, and each of its elements being a real value; h is the hidden-layer vector of the neural network, an intermediate result; o is the computed output, a real-valued vector whose dimension equals the number of label types that can be selected when labeling each word in the labeling process defined in step 1-2, its g-th value being the score, a real value, for labeling the current step as type g; W_1 x and W_2 h are matrix multiplication operations.
9. The method of claim 8, wherein step 2-2 comprises the steps of:
step 2-2-1, for each state in the previous state set, scoring all the label types in the manner of step 2-1; assume state S_x has score score(S_x) and label type_k has score score(type_k); if all the label types are expanded, K new states are obtained after the expansion, denoted S_{i_1}^{t+1}, …, S_{i_K}^{t+1}, where K is the total number of label types; the score of the k-th new state is calculated according to the following formula:
score(S_{i_k}^{t+1}) = score(S_i^t) + score(type_k),
where k = 1, …, K and all scores are real values; the candidate label types are then determined in the manner of step 1-2, and the states are expanded according to the candidate label types: for a state S_i^t in the state set S^t, if there are c(i) candidate label types determined for that state in the manner of step 1-2, then c(i) new states are obtained after expanding the state, denoted S_{i_1}^{t+1}, …, S_{i_{c(i)}}^{t+1};
Step 2-2-2, assume the state set S^t has z states, where z is a natural number; expand all the states in the set S^t in the manner of step 2-2-1 to obtain all the expanded states;
Step 2-2-3, extracting, in a beam-search manner, the m states with the highest scores from all the expanded states obtained in step 2-2-2 to form a new state set S^{t+1}.
CN201610324281.5A 2016-05-16 2016-05-16 A kind of Chinese Chunk analysis method based on state transfer and neural network Active CN106021227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610324281.5A CN106021227B (en) 2016-05-16 2016-05-16 A kind of Chinese Chunk analysis method based on state transfer and neural network

Publications (2)

Publication Number Publication Date
CN106021227A true CN106021227A (en) 2016-10-12
CN106021227B CN106021227B (en) 2018-08-21

Family

ID=57097925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610324281.5A Active CN106021227B (en) 2016-05-16 2016-05-16 A kind of Chinese Chunk analysis method based on state transfer and neural network

Country Status (1)

Country Link
CN (1) CN106021227B (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103546623A (en) * 2012-07-12 2014-01-29 百度在线网络技术(北京)有限公司 Method, device and equipment for sending voice information and text description information thereof

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHRIS ALBERTI ET AL: ""Improved Transition-Based Parsing and Tagging with Neural Networks"", 《PROCEEDINGS OF THE 2015 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING》 *
DAVIDWEISS ET AL: ""Structured Training for Neural Network Transition-Based Parsing"", 《PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING》 *
HAO ZHOU ET AL: ""A Neural Probabilistic Structured-Prediction Model for Transition-Based Dependency Parsing"", 《PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING 》 *
YING LIU ET AL: ""Improving Chinese text Chunking"s precision using Transformnation-based Learning"", 《2005 YOUTH PROJECT OF ASIA RESEARCH CENTER》 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106547737A (en) * 2016-10-25 2017-03-29 复旦大学 Based on the sequence labelling method in the natural language processing of deep learning
CN106547737B (en) * 2016-10-25 2020-05-12 复旦大学 Sequence labeling method in natural language processing based on deep learning
CN109923557A (en) * 2016-11-03 2019-06-21 易享信息技术有限公司 Use continuous regularization training joint multitask neural network model
CN109923557B (en) * 2016-11-03 2024-03-19 硕动力公司 Training joint multitasking neural network model using continuous regularization
US11797825B2 (en) 2016-11-03 2023-10-24 Salesforce, Inc. Training a joint many-task neural network model using successive regularization
US11783164B2 (en) 2016-11-03 2023-10-10 Salesforce.Com, Inc. Joint many-task neural network model for multiple natural language processing (NLP) tasks
US11010554B2 (en) 2016-11-08 2021-05-18 Beijing Gridsum Technology Co., Ltd. Method and device for identifying specific text information
WO2018086519A1 (en) * 2016-11-08 2018-05-17 北京国双科技有限公司 Method and device for identifying specific text information
CN106776869A (en) * 2016-11-28 2017-05-31 北京百度网讯科技有限公司 Chess game optimization method, device and search engine based on neutral net
CN106776869B (en) * 2016-11-28 2020-04-07 北京百度网讯科技有限公司 Search optimization method and device based on neural network and search engine
CN107247700A (en) * 2017-04-27 2017-10-13 北京捷通华声科技股份有限公司 A kind of method and device for adding text marking
CN107168955A (en) * 2017-05-23 2017-09-15 南京大学 Word insertion and the Chinese word cutting method of neutral net using word-based context
CN107168955B (en) * 2017-05-23 2019-06-04 南京大学 Utilize the Chinese word cutting method of the word insertion and neural network of word-based context
CN107632981B (en) * 2017-09-06 2020-11-03 沈阳雅译网络技术有限公司 Neural machine translation method introducing source language chunk information coding
CN107632981A (en) * 2017-09-06 2018-01-26 沈阳雅译网络技术有限公司 A kind of neural machine translation method of introducing source language chunk information coding
CN107992479A (en) * 2017-12-25 2018-05-04 北京牡丹电子集团有限责任公司数字电视技术中心 Word rank Chinese Text Chunking method based on transfer method
CN108363695A (en) * 2018-02-23 2018-08-03 西南交通大学 A kind of user comment attribute extraction method based on bidirectional dependency syntax tree characterization
CN108363695B (en) * 2018-02-23 2020-04-24 西南交通大学 User comment attribute extraction method based on bidirectional dependency syntax tree representation
CN108446355B (en) * 2018-03-12 2022-05-20 深圳证券信息有限公司 Investment and financing event element extraction method, device and equipment
CN108446355A (en) * 2018-03-12 2018-08-24 深圳证券信息有限公司 Investment and financing event argument abstracting method, device and equipment
CN109086274A (en) * 2018-08-23 2018-12-25 电子科技大学 English social media short text time expression recognition method based on restricted model
CN112052646A (en) * 2020-08-27 2020-12-08 安徽聚戎科技信息咨询有限公司 Text data labeling method
CN112052646B (en) * 2020-08-27 2024-03-29 安徽聚戎科技信息咨询有限公司 Text data labeling method
CN112651241A (en) * 2021-01-08 2021-04-13 昆明理工大学 Chinese parallel structure automatic identification method based on semi-supervised learning
CN116227497A (en) * 2022-11-29 2023-06-06 广东外语外贸大学 Sentence structure analysis method and device based on deep neural network
CN116227497B (en) * 2022-11-29 2023-09-26 广东外语外贸大学 Sentence structure analysis method and device based on deep neural network

Also Published As

Publication number Publication date
CN106021227B (en) 2018-08-21

Similar Documents

Publication Publication Date Title
CN106021227B (en) A kind of Chinese Chunk analysis method based on state transfer and neural network
Gupta et al. MMQA: A multi-domain multi-lingual question-answering framework for English and Hindi
CN106502994B (en) method and device for extracting keywords of text
Song et al. Named entity recognition based on conditional random fields
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN114911892A (en) Interaction layer neural network for search, retrieval and ranking
CN109086269B (en) Semantic bilingual recognition method based on semantic resource word representation and collocation relationship
CN111274829A (en) Sequence labeling method using cross-language information
CN113220864B (en) Intelligent question-answering data processing system
CN110717341A (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
de Sousa Neto et al. Htr-flor++ a handwritten text recognition system based on a pipeline of optical and language models
Li et al. DUTIR at the CCKS-2019 Task1: Improving Chinese clinical named entity recognition using stroke ELMo and transfer learning
CN114254645A (en) Artificial intelligence auxiliary writing system
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
Lo et al. Cool English: A grammatical error correction system based on large learner corpora
Yazar et al. Low-resource neural machine translation: A systematic literature review
Whittaker et al. TREC 2005 Question Answering Experiments at Tokyo Institute of Technology.
Abdolahi et al. Sentence matrix normalization using most likely n-grams vector
Tolegen et al. Voted-perceptron approach for Kazakh morphological disambiguation
Srinivasagan et al. An automated system for tamil named entity recognition using hybrid approach
CN114510569A (en) Chemical emergency news classification method based on Chinesebert model and attention mechanism
Mazitov et al. Named entity recognition in Russian using Multi-Task LSTM-CRF
Sarkar et al. Bengali noun phrase chunking based on conditional random fields
Seo et al. Performance Comparison of Passage Retrieval Models according to Korean Language Tokenization Methods
Shams et al. Lexical intent recognition in urdu queries using deep neural networks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant