CN103678318B - Multi-word unit extraction method and equipment and artificial neural network training method and equipment - Google Patents
- Publication number
- CN103678318B CN103678318B CN201210320806.XA CN201210320806A CN103678318B CN 103678318 B CN103678318 B CN 103678318B CN 201210320806 A CN201210320806 A CN 201210320806A CN 103678318 B CN103678318 B CN 103678318B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
This application discloses a multi-word unit extraction method and device, and an artificial neural network training method and device. The extraction method includes: for each of a plurality of word-segment blocks obtained by segmenting a sentence, obtaining linguistic features of the words in the block as feature quantities; inputting the feature quantities into an artificial neural network as parameters; using the artificial neural network to calculate a first probability that a word in each block is part of a multi-word unit and a second probability that it is not, and judging from the first and second probabilities whether the word is part of a multi-word unit; and extracting two or more adjacent words judged to be parts of a multi-word unit to form the multi-word unit. The method further obtains the judgment result for the block preceding and adjacent to the current block as feedback information, and uses that feedback information as an additional feature quantity for the words in the current block.
Description
Technical field
The present invention relates generally to the field of natural language processing, and more particularly to a method and device for extracting multi-word units from a sentence, and to a method and device for training an artificial neural network used to extract multi-word units from a sentence.
Background art
Classical natural language processing systems usually assume that each word is a semantic unit, but this assumption does not cover multi-word units. A multi-word unit crosses word boundaries and therefore has its own interpretation. Identifying and extracting multi-word units is a principal concern of the field of multi-word unit processing and is also regarded as a bottleneck for further research. Multi-word units are commonplace in natural language processing, yet there is no precisely defined concept of them. Typically, a multi-word unit is a combination of two or more word units that appear together with relatively high probability and that carries a complete meaning. Because multi-word units are such a common phenomenon, their identification and extraction are important. However, without sufficient knowledge of word collocations, and with the combination information dispersed among the individual word segments, it is extremely difficult to recombine independent segments according to their original senses into an independent semantic unit and thereby recover the original complete meaning, especially when processing languages such as Chinese that place no delimiters between words.
The identification and extraction of multi-word units can be widely applied to machine translation, efficient syntactic analysis, improved information retrieval, word sense disambiguation, and similar tasks. Methods currently in wide use for identifying and extracting multi-word units include classification methods, the local maxima method (Local Maxima), and conditional random fields (Conditional Random Fields). Feature values used in the process include the mutual information between word segments, the t-score, entropy, and co-occurrence frequency. Identification and extraction also involve the use of word segmentation tools, morphological annotation tools, part-of-speech taggers, stop-word lists, and the like.
Prior-art methods for identifying and extracting multi-word units generally follow this process: perform word segmentation and/or part-of-speech tagging on the target sentence; compute the corresponding feature values, such as frequency, segment co-occurrence rate, and mutual information, from the segmentation and/or tagging results; and use a dedicated algorithm or model to screen candidate multi-word units according to the computed feature values, thereby obtaining more accurate multi-word units. However, prior-art methods cannot guarantee the accuracy of the segmentation and/or tagging of the target sentence, and thus often introduce erroneous information, causing the training data to contain mutually contradictory information, or causing the feature values in actual application to deviate from the real situation.
A multi-word unit is a concept distinct from a phrase or a chunk, so the methods for identifying and extracting multi-word units differ from those for phrases or chunks. Specifically, some prepositional phrases do not carry complete meaning, so identifying and extracting multi-word units with phrase-oriented methods does not give good results. Chunks, in turn, are defined at the syntactic level, so identifying and extracting them requires the syntactic and part-of-speech information of their constituents but imposes no strict requirement of semantic completeness; applying chunk identification and extraction methods to multi-word units is therefore also infeasible.
Accordingly, it is desirable to provide a method and device for extracting multi-word units from a sentence that can improve the accuracy and efficiency of multi-word unit identification and extraction.
Summary of the invention
A brief overview of the present invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be appreciated that this overview is not an exhaustive summary of the invention. It is not intended to identify key or essential parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description discussed later.
The present invention applies an artificial neural network to the identification and extraction of multi-word units. An artificial neural network is an algorithmic model that performs distributed, parallel information processing by imitating the behavior of animal neural networks. Relying on the complexity of the system, it processes information by adjusting the interconnections among a large number of internal nodes. An artificial neural network comprises a large number of nodes and the connections among them. Each node represents a specific output function, and each connection between nodes carries a weighted value, called a weight, which serves as the memory of the network. The output of an artificial neural network varies with its connection pattern, weights, and output functions.
According to an embodiment of the invention, there is provided a method of extracting multi-word units from a sentence, including: for each of a plurality of word-segment blocks obtained by segmenting the sentence, obtaining one or more linguistic features of the words in the block as feature quantities; inputting the feature quantities into an artificial neural network as its parameters; using the artificial neural network to calculate a first probability that a word in each block is part of a multi-word unit and a second probability that it is not, and judging from the first and second probabilities whether the word is part of a multi-word unit; and extracting two or more adjacent words judged to be parts of a multi-word unit to form the multi-word unit. The method further includes obtaining the judgment result for the block preceding and adjacent to the current block as feedback information, and using that feedback information as an additional feature quantity for the words in the current block.
The method of extracting multi-word units from a sentence may further include: successively combining N adjacent words in the sentence into N-tuples to form the word-segment blocks, where N is a natural number greater than or equal to 2.
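For illustration only, this sliding-window grouping of adjacent words into N-tuples can be sketched as follows; the function name and the English word glosses are illustrative assumptions, not part of the disclosure:

```python
def make_blocks(words, n=2):
    """Successively combine every N adjacent words into an N-tuple,
    each tuple serving as one word-segment block."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# English glosses of the segmented example sentence used later in the text.
words = ["initial", "use", "draw", "thing", "of", "step"]
print(make_blocks(words, n=2))
# [('initial', 'use'), ('use', 'draw'), ('draw', 'thing'), ('thing', 'of'), ('of', 'step')]
```

With n=2 each word (except the first and last) appears in two overlapping blocks, which matches the block-by-block processing order described above.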
According to the method for the multi word unit in said extracted statement, also include: the morphology of the participle in N tuple is replaced with
Corresponding part of speech, to obtain the extensive N tuple being mixed with morphology with part of speech;And the morphology according to the participle in extensive N tuple
Feature and part of speech feature, obtain the extraction that the participle in extensive N tuple is a part for multi word unit from the fault-tolerant template of part of speech
Probability is as part of speech fault tolerance information, and part of speech fault tolerance information also serves as the characteristic quantity of participle in N tuple.
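One possible reading of this generalization-and-lookup step, sketched with invented POS tags and template contents (the tagger, the template values, and all names are illustrative assumptions):

```python
# Hypothetical POS tags and template contents -- illustrative only,
# not data from the patent.
pos_of = {"initial": "adj", "use": "verb", "draw": "noun", "thing": "noun"}

def generalize(ngram, keep):
    """Replace each word not in `keep` with its POS tag, yielding an
    N-tuple that mixes word forms and parts of speech."""
    return tuple(w if w in keep else pos_of.get(w, "unk") for w in ngram)

# The part-of-speech fault-tolerance template maps generalized N-tuples
# to the extraction probability that such a pattern is part of an MWU.
template = {("draw", "noun"): 0.8}

key = generalize(("draw", "thing"), keep={"draw"})
print(key, template.get(key, 0.0))  # ('draw', 'noun') 0.8
```

Generalizing over parts of speech lets the template match patterns even when the exact word form was never seen in training, which is the "fault tolerance" the text refers to.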
According to another embodiment of the invention, there is provided a device for extracting multi-word units from a sentence, including: a linguistic feature acquiring unit that, for each of a plurality of word-segment blocks obtained by segmenting the sentence, obtains one or more linguistic features of the words in the block as feature quantities; an input unit that inputs the feature quantities into an artificial neural network as its parameters; a judging unit that uses the artificial neural network to calculate a first probability that a word in each block is part of a multi-word unit and a second probability that it is not, and judges from the first and second probabilities whether the word is part of a multi-word unit; and an extraction unit that extracts two or more adjacent words judged to be parts of a multi-word unit to form the multi-word unit. The device further includes a feedback information acquiring unit that obtains the judgment result for the block preceding and adjacent to the current block as feedback information, and uses that feedback information as an additional feature quantity of the current block.
The device may further include: a combining unit that successively combines N adjacent words in the sentence into N-tuples to form the word-segment blocks, where N is a natural number greater than or equal to 2.
The device may further include: a generalization unit that replaces the word forms of the words in an N-tuple with their corresponding parts of speech, to obtain a generalized N-tuple mixing word forms and parts of speech; and a part-of-speech fault-tolerance information acquiring unit that, according to the word-form and part-of-speech features of the words in the generalized N-tuple, obtains from a part-of-speech fault-tolerance template the extraction probability that a word in the generalized N-tuple is part of a multi-word unit, as part-of-speech fault-tolerance information, which is also used as a feature quantity of the words in the N-tuple.
According to still another embodiment of the invention, there is provided a method of training an artificial neural network used to extract multi-word units from a sentence, including: for each of a plurality of word-segment blocks obtained by segmenting each training sentence, obtaining one or more linguistic features of the words in the block as feature quantities, the multi-word units in the training sentences having been labeled; inputting the feature quantities into the artificial neural network as its parameters; using the network to calculate a first probability that a word in each block is part of a multi-word unit and a second probability that it is not, and judging from a comparison of the first and second probabilities whether the word is part of a multi-word unit; and training the network according to the judgment results and the labels. The method further includes obtaining the judgment result for the block preceding and adjacent to the current block as feedback information, and using that feedback information as an additional feature quantity for the words in the current block.
According to the method for above-mentioned a kind of training of human artificial neural networks, also include: successively by N number of point adjacent in training statement
Phrase is combined into N tuple to form participle block, and wherein N is the natural number more than or equal to 2.
According to the method for above-mentioned a kind of training of human artificial neural networks, also include: the morphology of the participle in N tuple is replaced with
Corresponding part of speech, to obtain the extensive N tuple being mixed with morphology with part of speech;And according in the result marked and extensive N tuple
The morphology feature of participle and part of speech feature, calculate the extraction probability that the participle in extensive N tuple is a part for multi word unit
As part of speech fault tolerance information, to generate the fault-tolerant template of part of speech.
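The extraction probabilities in the template can plausibly be estimated as relative frequencies over the labeled training data. A minimal sketch under that assumption (the counting scheme and the toy observations are illustrative, not taken from the patent):

```python
from collections import defaultdict

def build_template(samples):
    """Estimate, for each generalized N-tuple pattern, the extraction
    probability = (occurrences inside a labeled MWU) / (all occurrences)."""
    hit, tot = defaultdict(int), defaultdict(int)
    for pattern, in_mwu in samples:
        tot[pattern] += 1
        if in_mwu:
            hit[pattern] += 1
    return {p: hit[p] / tot[p] for p in tot}

# Toy labeled observations (pattern, was-part-of-MWU); counts are invented.
samples = [(("draw", "noun"), True), (("draw", "noun"), True),
           (("draw", "noun"), False), (("verb", "noun"), False)]
print(build_template(samples))
```

Here the pattern ("draw", "noun") would receive probability 2/3, since it occurred inside a labeled multi-word unit in two of its three observations.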
According to a further embodiment of the invention, there is provided a device for training an artificial neural network used to extract multi-word units from a sentence, including: a linguistic feature acquisition means that, for each of a plurality of word-segment blocks obtained by segmenting each training sentence, obtains one or more linguistic features of the words in the block as feature quantities, the multi-word units in the training sentences having been labeled; an input means that inputs the feature quantities into the artificial neural network as its parameters; a judgment means that uses the network to calculate a first probability that a word in each block is part of a multi-word unit and a second probability that it is not, and judges from a comparison of the first and second probabilities whether the word is part of a multi-word unit; and a training means that trains the network according to the judgment results and the labels. The device further includes a feedback information acquisition means that obtains the judgment result for the block preceding and adjacent to the current block as feedback information, and uses that feedback information as an additional feature quantity for the words in the current block.
According to the present invention, by applying an artificial neural network with a feedback configuration to the identification and extraction of multi-word units, the accuracy and efficiency of that identification and extraction can be improved.
Brief description of the drawings
The present invention can be better understood by reference to the description given below in conjunction with the accompanying drawings, in which the same or similar reference signs denote the same or similar components throughout. The drawings, together with the detailed description below, are incorporated in and form part of this specification, and serve to further illustrate the preferred embodiments of the invention and explain its principles and advantages. In the drawings:
Fig. 1 is a schematic flowchart of a method of extracting multi-word units from a sentence according to an embodiment of the invention;
Fig. 2 is a schematic diagram of extracting multi-word units from a sentence using an artificial neural network with a feedback configuration according to an embodiment of the invention;
Fig. 3 is a schematic flowchart of a method of extracting multi-word units from a sentence using N-tuples according to an embodiment of the invention;
Fig. 4 is a schematic diagram of extracting multi-word units from a sentence using N-tuples according to an embodiment of the invention;
Fig. 5 is a schematic flowchart of a method of obtaining word-form extraction probabilities and/or part-of-speech extraction probabilities using N-tuples according to an embodiment of the invention;
Fig. 6 is a schematic flowchart of a method of performing part-of-speech fault tolerance using N-tuples according to an embodiment of the invention;
Fig. 7 is a schematic diagram of performing part-of-speech fault tolerance using N-tuples according to an embodiment of the invention;
Fig. 8 is a schematic block diagram of a device for extracting multi-word units from a sentence according to an embodiment of the invention;
Fig. 9 is a schematic block diagram of a device for extracting multi-word units from a sentence according to another embodiment of the invention;
Fig. 10 is a schematic block diagram of a device for extracting multi-word units from a sentence according to yet another embodiment of the invention;
Fig. 11 is a schematic block diagram of a device for extracting multi-word units from a sentence according to still another embodiment of the invention;
Fig. 12 is a schematic flowchart of a method of training an artificial neural network for extracting multi-word units from a sentence according to an embodiment of the invention;
Fig. 13 is a schematic flowchart of a method of training an artificial neural network for extracting multi-word units from a sentence using N-tuples according to an embodiment of the invention;
Fig. 14 is a schematic flowchart of a method of generating word-form templates and/or part-of-speech templates using N-tuples according to an embodiment of the invention;
Fig. 15 is a schematic flowchart of a method of generating a part-of-speech fault-tolerance template using N-tuples according to an embodiment of the invention;
Fig. 16 is a schematic diagram of generating a part-of-speech fault-tolerance template using N-tuples according to an embodiment of the invention;
Fig. 17 is a schematic block diagram of a device for training an artificial neural network for extracting multi-word units from a sentence according to an embodiment of the invention;
Fig. 18 is a schematic block diagram of a device for training an artificial neural network for extracting multi-word units from a sentence according to another embodiment of the invention;
Fig. 19 is a schematic block diagram of a device for training an artificial neural network for extracting multi-word units from a sentence according to yet another embodiment of the invention;
Fig. 20 is a schematic block diagram of a device for training an artificial neural network for extracting multi-word units from a sentence according to still another embodiment of the invention; and
Fig. 21 is a schematic block diagram of an information processing device that can be used to implement an embodiment of the invention.
Detailed description of the invention
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual implementation, many implementation-specific decisions may be made in order to achieve the developer's specific goals, and that these decisions may vary from one implementation to another.
It should also be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the device structures closely related to the solution of the present invention, omitting other details of little relevance to the invention.
A method of extracting multi-word units from a sentence according to an embodiment of the invention is described below with reference to Figs. 1 and 2. Fig. 1 is a schematic flowchart of the method, and Fig. 2 is a schematic diagram of extracting multi-word units from a sentence using an artificial neural network with a feedback configuration.
As shown in Fig. 1, the process starts at S100 and then proceeds to S102.
At S102, for each of a plurality of word-segment blocks obtained by segmenting the sentence, one or more linguistic features of the words in the block are obtained as feature quantities.
The sentences in the corpus are segmented, so that each sentence is split into a plurality of word-segment blocks, each of which may contain at least one word. The words in the blocks obtained by segmentation are processed one by one according to their order in the original sentence. For example, the words in a block may be processed to obtain one or more of their linguistic features, such as one or more of the following: the word's part of speech, its word form, its sequence number, or its occurrence probability. Those skilled in the art will appreciate that the linguistic features of a word are not limited to the examples listed above. Once obtained, the linguistic features of a word can be used as feature quantities in subsequent processing.
For example, the sentence "the step of initially applying primer" is segmented to obtain the result "initial / use / draw / thing / of / step"; that is, the sentence is split into the blocks { "initial", "use", "draw", "thing", "of", "step" }, each containing one word. (In the original Chinese, the word for "primer" is split into two segments, glossed here as "draw" and "thing".) The words in the resulting blocks are then processed one by one in the order "initial" → "use" → "draw" → "thing" → "of" → "step". For example, the words may be processed to obtain their respective parts of speech { "(initial) adjective", "(use) verb", "(draw) noun", "(thing) noun", "(of) preposition", "(step) noun" }. Those skilled in the art will appreciate that other linguistic features of these words can also be obtained; this is not repeated here.
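The feature gathering of step S102 might be sketched as follows, using the English glosses of the example words; the tag abbreviations, probability-free feature set, and dictionary layout are illustrative assumptions:

```python
# One single-word block per word; POS tags follow the example above.
sentence_words = ["initial", "use", "draw", "thing", "of", "step"]
pos_tags      = ["adj", "verb", "noun", "noun", "prep", "noun"]

# Collect a feature-quantity record per word: word form, POS, sequence number.
features = [
    {"form": w, "pos": p, "index": i + 1}
    for i, (w, p) in enumerate(zip(sentence_words, pos_tags))
]
print(features[0])  # {'form': 'initial', 'pos': 'adj', 'index': 1}
```

Each record corresponds to the feature quantities that are fed into the network for one word in the next step.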
After S102, the process proceeds to S104. At S104, the feature quantities are input into the artificial neural network as its parameters.
As shown in Fig. 2, each circle in the artificial neural network 205 represents one or more neurons that process the information marked in the circle. The neurons in the network 205 are grouped into three layers: an input layer 202, a hidden layer 203, and an output layer 204. The value of a neuron in a later layer is computed from the values of the neurons in the preceding layer. The black arrows in Fig. 2 indicate the direction of information flow in the network 205; the neurons of adjacent layers are fully connected, and information flows from each layer to the next. Those skilled in the art will appreciate that although Fig. 2 shows only one hidden layer 203, the hidden layer 203 may comprise two or more layers as actually needed.
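The layered, fully connected flow just described can be sketched as a minimal forward pass; the layer sizes, weights, and the choice of a sigmoid activation are illustrative assumptions:

```python
import math

def layer(values, weights, biases):
    """One fully connected layer: each neuron of the next layer is computed
    from ALL neurons of the previous layer (illustrative sizes)."""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    return [sig(sum(w * v for w, v in zip(ws, values)) + b)
            for ws, b in zip(weights, biases)]

x = [0.2, 0.7]                                        # input layer: feature quantities
h = layer(x, [[0.5, -0.3], [0.8, 0.1]], [0.0, -0.2])  # hidden layer
y = layer(h, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])  # output layer: two values
print(y)
```

The two output values play the roles of the first and second probabilities discussed below.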
As shown in Fig. 2, in the input layer 202 of the artificial neural network 205, the t feature quantities of the word currently being processed, { feature quantity 1, feature quantity 2, ..., feature quantity i, ..., feature quantity t-1, feature quantity t }, are input into the network 205 as its parameters, where i and t are natural numbers greater than or equal to 1 and 1 ≤ i ≤ t. The one or more linguistic features extracted in step S102, such as the word's part of speech, its word form, its sequence number, or its occurrence probability, can serve as these feature quantities.
Taking the sentence "the step of initially applying primer" as an example again, for the word "initial", its part of speech "adjective", its word form "initial", its sequence number "1", and its occurrence probability "0.43", for instance, can be obtained as its feature quantities, and these feature quantities of the word "initial" are input into the artificial neural network 205 as its parameters.
After S104, the process proceeds to S106. At S106, the artificial neural network is used to calculate a first probability that the word in each block is part of a multi-word unit and a second probability that it is not, and whether the word is part of a multi-word unit is judged from the first and second probabilities.
After the feature quantities are input into the artificial neural network 205 as its parameters, the network 205 determines the value of the current neuron according to the following formula:

f(x) = K((Σᵢ wᵢ × gᵢ(x)) + biasW + biasV)

where K denotes the activation function (for example, a sigmoid function may be used); wᵢ denotes the weight between the current neuron and the i-th neuron of the preceding layer, represented by a black line in Fig. 2; gᵢ(x) denotes the values of the neurons of the preceding layer connected to the current neuron by black lines; and biasW and biasV denote the bias weight and the bias value of the current neuron, respectively. Those skilled in the art will appreciate that the above activation function and the formula for determining the value of the current neuron are merely exemplary; activation functions or formulas of other forms may also be adopted.
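The neuron-value formula can be transcribed directly as follows; using the sigmoid for K is one common choice and an assumption here, since the text leaves the exact activation open:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_value(weights, inputs, bias_w, bias_v, K=sigmoid):
    """f(x) = K((sum_i w_i * g_i(x)) + biasW + biasV), where `inputs`
    holds the values g_i(x) of the connected preceding-layer neurons."""
    s = sum(w * g for w, g in zip(weights, inputs))
    return K(s + bias_w + bias_v)

# Toy weights and inputs: 0.5*1.0 + (-0.2)*2.0 + 0.1 + 0.0 = 0.2
print(neuron_value([0.5, -0.2], [1.0, 2.0], 0.1, 0.0))
```

The weighted sum here is 0.2, so the printed value is sigmoid(0.2), roughly 0.55.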
In the artificial neural network 205 shown in Fig. 2, the value of each neuron in the input layer 202 is simply the value of the corresponding feature quantity, and each black line carries one specific weight. Apart from the neurons in the input layer 202, the neurons in the hidden layer 203 and the output layer 204 each have a bias weight and a bias value.
As shown in Fig. 2, the output layer 204 of the network 205 contains two neurons: neuron 206, representing the first probability that the currently processed word is part of a multi-word unit, and neuron 207, representing the second probability that it is not. Specifically, the value of neuron 206 represents the probability, computed by the network 205, that the currently processed word is part of a multi-word unit; for example, a value of 0.9 for neuron 206 means that the network 205 has determined this probability to be 0.9. Similarly, the value of neuron 207 represents the probability, computed by the network 205, that the currently processed word is not part of a multi-word unit; for example, a value of 0.6 for neuron 207 means that the network 205 has determined that probability to be 0.6.
After the first probability represented by the value of neuron 206 and the second probability represented by the value of neuron 207 have been computed, the two can be compared, as shown at 208 in Fig. 2. If the first probability is greater than or equal to the second probability, it is judged, as shown at 210 in Fig. 2, that the currently processed participle is a part of a multi-word unit. If the first probability is less than the second probability, it is judged, as shown at 209 in Fig. 2, that the currently processed participle is not a part of a multi-word unit. For example, for the currently processed participle, if the first probability represented by the value of neuron 206 is 0.9 and the second probability represented by the value of neuron 207 is 0.6, then, because the first probability 0.9 is greater than the second probability 0.6, the currently processed participle is judged to be a part of a multi-word unit. Then, at 211 in Fig. 2, the sequence number n of the participle can be incremented by 1 to obtain the participle with sequence number n+1, so that that participle can be processed next.
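The comparison at 208-210 of Fig. 2 can be sketched as follows; this is a minimal illustration, and the function name is ours, not the patent's:

```python
def judge_participle(p_member: float, p_not_member: float) -> bool:
    """Judge whether the currently processed participle is part of a
    multi-word unit.

    p_member     -- first probability (the value of neuron 206)
    p_not_member -- second probability (the value of neuron 207)

    The participle is accepted when the first probability is greater
    than or equal to the second, as at 208-210 of Fig. 2.
    """
    return p_member >= p_not_member

# Example from the text: 0.9 vs. 0.6 -> part of a multi-word unit.
print(judge_participle(0.9, 0.6))  # True
```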
After S106, the process proceeds to S108. At S108, two or more adjacent participles that have each been judged to be a part of a multi-word unit are extracted to form a multi-word unit.
Again taking the statement rendered here as "the step of initially applying a primer" as an example, among the participles { "initially", "applying", "pri-", "-mer", "of", "step" } obtained by segmenting the statement (in the original Chinese, the word rendered "primer" is segmented into two single-character participles, shown here as "pri-" and "-mer"), suppose the participles "pri-" and "-mer" have each been judged to be a part of a multi-word unit. Because "pri-" and "-mer" are two adjacent participles, they are extracted to form the multi-word unit "primer". If more than two adjacent participles are judged to be parts of a multi-word unit, those adjacent participles are likewise extracted together to form a multi-word unit.
After S108, the process proceeds to S110. At S110, the result of the judgment for the previous participle block adjacent to the current participle block is obtained as feedback information, and the feedback information also serves as a feature quantity of the participles in the current participle block.
As illustrated in Fig. 2, suppose n and n+1 denote the sequence numbers of the processed participle blocks. After the participle block with sequence number n has been processed, the sequence number is incremented by 1 to process the next participle block (that is, the participle block with sequence number n+1). At this point, the participle block with sequence number n+1 becomes the current participle block, and the participle block with sequence number n is the previous participle block adjacent to the current participle block. Because the previous participle block with sequence number n has already been processed, the judgment result of whether each participle in it is or is not a part of a multi-word unit has already been obtained. Therefore, as shown in Fig. 2, the judgment result for the previous participle block with sequence number n can be fed back to the input layer 202 of the artificial neural network 205 as feedback information, and when the current participle block with sequence number n+1 is processed, this feedback information is also input into the artificial neural network 205 as a feature quantity of the participles in the current participle block. That is to say, the judgment result of the previous participle block with sequence number n participates in the judgment process of the current participle block with sequence number n+1.
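The feedback at S110 can be sketched as a simple recurrent loop; `process_sentence` and `network_judge` are our illustrative names standing in for the forward pass of the artificial neural network 205, not the patent's:

```python
def process_sentence(blocks, network_judge):
    """Process participle blocks in order, feeding the judgment result
    for block n back in as an extra feature of block n+1 (step S110).

    blocks        -- feature quantities of each participle block
    network_judge -- stand-in for the ANN 205 forward pass; it takes the
                     block's features plus the previous block's judgment
    """
    judgments = []
    feedback = None                    # no previous block before the first one
    for features in blocks:
        result = network_judge(features, feedback)
        judgments.append(result)
        feedback = result              # becomes a feature of the next block
    return judgments
```

This loop is what makes the network a (simple) recurrent structure: the judgment of block n is available as input when block n+1 is judged.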
Because the artificial neural network 205 has a feedback structure, that is, when judging whether a participle in the current participle block is a part of a multi-word unit it also takes into account whether the participles in the previous participle block adjacent to the current participle block are parts of a multi-word unit, the accuracy and efficiency with which the artificial neural network 205 judges whether a participle is a part of a multi-word unit can be improved to a great extent.
Finally, this process terminates at S112.
According to the method for the present embodiment, by the artificial neural network with feedback configuration being applied to the knowledge of multi word unit
Not and extract, the identification of multi word unit and the accuracy and efficiency of extraction can be improved.
A method of extracting multi-word units from a statement using N-tuples according to an embodiment of the present invention is described below with reference to Fig. 3 and Fig. 4. Fig. 3 is a schematic flowchart illustrating the method of extracting multi-word units from a statement using N-tuples according to an embodiment of the present invention, and Fig. 4 is a schematic diagram illustrating the extraction of multi-word units from a statement using N-tuples according to an embodiment of the present invention.
As shown in Fig. 3, the process starts at S300 and then proceeds to S302. At S302, every N adjacent participles in the statement are combined in turn into an N-tuple to form a participle block, where N is a natural number greater than or equal to 2.
N adjacent participles in the statement can be combined into an N-tuple to form a participle block, and subsequent processing can then be performed in units of N-tuples. For example, the current participle and its two immediate neighbours on the left and right can be combined into a triple. For a participle at the beginning of the statement, the first element of the triple is empty; for a participle at the end of the statement, the last element of the triple is empty.
Again taking the statement "the step of initially applying a primer" as an example, as shown by the dark squares in Fig. 4, the participles of the statement can be combined in turn into the triples <NULL, initially, applying>, <initially, applying, pri->, ..., <of, step, NULL>, where NULL denotes an empty element. It is easy to understand that a triple is simply one example of a participle block comprising three participles.
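The sliding combination of S302, with NULL padding at the sentence head and tail, can be sketched as follows; `make_ngrams` is our name, `None` stands in for NULL, and the token list is illustrative:

```python
def make_ngrams(participles, n=3):
    """Combine each participle with its neighbours into an N-tuple block
    (step S302).  Positions before the sentence head or after the tail
    are filled with None, standing in for NULL in the text.
    """
    pad = n // 2
    padded = [None] * pad + list(participles) + [None] * pad
    return [tuple(padded[i:i + n]) for i in range(len(participles))]

words = ["initially", "applying", "pri", "mer", "of", "step"]
for tri in make_ngrams(words):
    print(tri)
# (None, 'initially', 'applying'), ('initially', 'applying', 'pri'), ...
```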
After the N-tuples have been determined, the linguistic features of each element in an N-tuple can be obtained. For example, a part-of-speech analysis tool can be used to obtain the part of speech of each element in the N-tuple; the Stanford part-of-speech tagger is one such tool. As shown in Fig. 4, for the triple <initially, applying, pri->, it can be obtained that the part of speech of the first element "initially" is the adjective tag JJ, that of the second element "applying" is the verb tag VBG, and that of the third element "pri-" is the noun tag NN. Corresponding tools can likewise be used to obtain other linguistic features of each element in the N-tuple, which are not described again here.
After the linguistic features of each element in the N-tuple have been obtained, the obtained linguistic features can all serve as attributes of that element. For example, as shown in Fig. 4, m attributes {attribute 1, attribute 2, attribute 3, ..., attribute m} are listed for each element in the N-tuple, where m is a natural number greater than or equal to 1. The m attributes may be, for example, the part of speech of the participle, the word form of the participle, the sequence number of the participle, or the occurrence probability of the participle, but are not limited to these. For example, for the first element "initially" in the triple <initially, applying, pri->, the value of its attribute 1 may be obtained as "1", the value of attribute 2 as "2", the value of attribute 3 as "23", ..., and the value of attribute m as "false".
In units of N-tuples, the m attributes of each element in an N-tuple can be input in turn into the artificial neural network (ANN) 205 as feature quantities to judge whether that element is a part of a multi-word unit. The specific judgment process and subsequent processing are similar to those of steps S106 to S110 in Fig. 1, except that the number of participles included in a participle block differs, so their details are not repeated here. In Fig. 4, a cross indicates that the corresponding element has been judged not to be a part of a multi-word unit, and a check mark indicates that the corresponding element has been judged to be a part of a multi-word unit. Two or more consecutive check marks indicate a complete multi-word unit. As shown in Fig. 4, because the element "pri-" has a check mark, the element "-mer" also has a check mark, and the two elements are adjacent to each other, "primer" is extracted as a multi-word unit.
Finally, this process terminates at S304.
According to the method for the present embodiment, the multi word unit processing to extract in statement can be carried out in units of N tuple, from
And improve the identification of multi word unit and the accuracy and efficiency of extraction further.
A method of obtaining morphology extraction probabilities and/or part-of-speech extraction probabilities using N-tuples according to an embodiment of the present invention is described below with reference to Fig. 5. Fig. 5 is a schematic flowchart illustrating the method of obtaining morphology extraction probabilities and/or part-of-speech extraction probabilities using N-tuples according to an embodiment of the present invention.
As shown in Fig. 5, the process starts at S500 and then proceeds to S502. At step S502, according to the morphology features of the participles in the N-tuple, the morphology extraction probability that a participle in the N-tuple is a part of a multi-word unit is obtained from the morphology template, and the morphology extraction probability also serves as a feature quantity of the participles in the N-tuple.
For example, for the triple <initially, applying, pri->, the morphology features of the participles in the triple are "initially, applying, pri-". According to these morphology features, the corresponding word forms can be looked up in the morphology template to obtain the corresponding morphology extraction probability, which represents the probability that the participles "initially", "applying", or "pri-" in the triple are parts of a multi-word unit. The obtained morphology extraction probability can then also serve as a feature quantity of the participles in the triple and be input into the artificial neural network 205. If no morphology extraction probability is found, processing proceeds with a preset default probability. The morphology template stores in advance the word forms of N-tuples and their corresponding morphology extraction probabilities, each representing the probability that the participles in that N-tuple are parts of a multi-word unit. Those skilled in the art will understand that the morphology template can be preset. Alternatively, the morphology template can also be generated by training the artificial neural network 205. As a non-limiting example, how the morphology template is generated by training the artificial neural network 205 will be described in detail below.
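The lookup at S502 can be sketched as a dictionary search with a fallback; the template contents and the default value 0.5 are invented here for illustration, since the patent only speaks of a preset default probability:

```python
DEFAULT_PROB = 0.5   # placeholder; the text only says "a preset default"

# A morphology template maps an N-tuple's word forms to a pre-stored
# extraction probability (these entries are made up for illustration).
morphology_template = {
    ("initially", "applying", "pri"): 0.12,
    ("applying", "pri", "mer"): 0.83,
}

def lookup_morphology_probability(ngram, template, default=DEFAULT_PROB):
    """Look the N-tuple's word forms up in the morphology template; fall
    back to the default probability when no match is found (step S502)."""
    return template.get(tuple(ngram), default)

print(lookup_morphology_probability(("applying", "pri", "mer"),
                                    morphology_template))  # 0.83
```

The part-of-speech template of S504 works the same way, keyed on tag sequences such as ("adjective", "verb", "noun") instead of word forms.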
After S502, the process proceeds to S504. At S504, according to the part-of-speech features of the participles in the N-tuple, the part-of-speech extraction probability that a participle in the N-tuple is a part of a multi-word unit is obtained from the part-of-speech template, and the part-of-speech extraction probability also serves as a feature quantity of the participles in the N-tuple.
Similarly, for example, for the triple <initially, applying, pri->, the part-of-speech features of the participles in the triple are "adjective, verb, noun". According to these part-of-speech features, the corresponding parts of speech can be looked up in the part-of-speech template to obtain the corresponding part-of-speech extraction probability, which represents the probability that the participles "initially", "applying", or "pri-" in the triple are parts of a multi-word unit. The obtained part-of-speech extraction probability can then also serve as a feature quantity of the participles in the triple and be input into the artificial neural network 205. If no part-of-speech extraction probability is found, processing proceeds with a preset default probability. The part-of-speech template stores in advance the parts of speech of N-tuples and their corresponding part-of-speech extraction probabilities, each representing the probability that the participles in that N-tuple are parts of a multi-word unit. Those skilled in the art will understand that the part-of-speech template can be preset. Alternatively, the part-of-speech template can also be generated by training the artificial neural network 205. As a non-limiting example, how the part-of-speech template is generated by training the artificial neural network 205 will be described in detail below.
Finally, this process terminates at S506.
Those skilled in the art will appreciate that steps S502 and S504 shown in Fig. 5 can be performed sequentially or in parallel, or only one of them may be performed. According to the method of the present embodiment, morphology extraction probabilities and/or part-of-speech extraction probabilities can be obtained from the morphology template and the part-of-speech template according to the N-tuple, so that existing knowledge about multi-word units is exploited and additional feature quantities are input into the artificial neural network, further improving the accuracy and efficiency of recognizing and extracting multi-word units.
A method of performing part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention is described below with reference to Fig. 6 and Fig. 7. Fig. 6 is a schematic flowchart illustrating the method of performing part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention, and Fig. 7 is a schematic diagram illustrating part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention.
As shown in Figure 6, this process starts from S600.Then, this process proceeds to S602.
At step S602, the word forms of the participles in the N-tuple are replaced with the corresponding parts of speech to obtain generalized N-tuples in which word forms and parts of speech are mixed.
The part-of-speech fault-tolerance method using N-tuples is described below with reference to Fig. 7. As shown in Fig. 7, at 702, an N-tuple that may contain a mistagged part of speech is selected for processing. For example, consider the participles { "antigen", "release", "thing", "released", "antigen" } obtained by segmenting the statement rendered here as "the antigen-releasing substance released antigen" (in the original Chinese, the multi-word unit rendered "antigen-releasing substance" is segmented into the three participles "antigen", "release", and "thing"). The participles "antigen", "release", and "thing" can be formed into the triple <antigen, release, thing>, where the part of speech of the participle "antigen" is tagged as "noun", that of the participle "release" as "verb", and that of the participle "thing" as "noun". Suppose the triple to be processed is <antigen, release, thing>, and "antigen-releasing substance" should be a multi-word unit. However, because the part of speech of the participle "release" has been wrongly tagged as a verb, the participle "release" will not be labelled as a part of a multi-word unit during analysis, and the whole multi-word expression "antigen-releasing substance" therefore cannot be correctly recognized.
As shown in Fig. 7, at 704, the N-tuple is generalized. The generalization of an N-tuple is described below with reference to Fig. 16. As shown in Fig. 16, at 1602, the N-tuple to be generalized is determined, together with the number N of its elements. At 1604, the number x of elements to be generalized is selected; x typically starts at 1. At 1606, according to the value of x, x elements are selected from the N-tuple to be generalized, all possible combinations are listed, the word form of each selected element is replaced with its part of speech in the N-tuple, and all possible generalized N-tuples are stored. At 1608, it is judged whether x equals N; if not, x is incremented by 1 at 1610 to obtain a new value of x at 1612, and the processing at 1604, 1606, and 1608 is repeated with the new value of x until x equals N.
Again taking the participles { "antigen", "release", "thing", "released", "antigen" } obtained by segmenting the statement "the antigen-releasing substance released antigen" as an example, suppose the triple <antigen, release, thing> is to be generalized. The number N of elements in the triple is 3, so x can be 1, 2, or 3. When x is 1, the word form of one element of the triple <antigen, release, thing> is replaced with its part of speech, yielding the generalized triples <noun, release, thing>, <antigen, verb, thing>, and <antigen, release, noun>. When x is 2, the word forms of two elements of the triple are replaced with their parts of speech, yielding the generalized triples <noun, verb, thing>, <antigen, verb, noun>, and <noun, release, noun>. When x is 3, the word forms of all three elements of the triple are replaced with their parts of speech, yielding the generalized triple <noun, verb, noun>.
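The enumeration of Fig. 16 amounts to replacing every combination of x = 1..N positions with the corresponding POS tags; a sketch using `itertools.combinations` (function name ours):

```python
from itertools import combinations

def generalize(ngram, pos_tags):
    """Enumerate every way of replacing x = 1..N of the N-tuple's word
    forms with their POS tags (the loop of Fig. 16 / step S602)."""
    n = len(ngram)
    results = []
    for x in range(1, n + 1):
        for positions in combinations(range(n), x):
            results.append(tuple(
                pos_tags[i] if i in positions else ngram[i]
                for i in range(n)))
    return results

tuples = generalize(("antigen", "release", "thing"),
                    ("noun", "verb", "noun"))
print(len(tuples))  # 7 generalized triples for N = 3
```

For N = 3 this yields C(3,1) + C(3,2) + C(3,3) = 7 generalized triples, matching the seven listed in the text.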
After S602, the process proceeds to S604. At S604, according to the morphology features and part-of-speech features of the participles in the generalized N-tuples, the extraction probability that a participle in a generalized N-tuple is a part of a multi-word unit is obtained from the part-of-speech fault-tolerant template as part-of-speech fault-tolerance information, and the part-of-speech fault-tolerance information also serves as a feature quantity of the participles in the N-tuple.
All possible generalized N-tuples can be obtained through the processing of step S602 above. Then, as shown in Fig. 7, at 706, the corresponding generalized N-tuples can be looked up in the part-of-speech fault-tolerant template according to all possible generalized N-tuples, thereby obtaining the extraction probabilities corresponding to the generalized N-tuples as part-of-speech fault-tolerance information; each such extraction probability represents the probability that the participles in that generalized N-tuple are parts of a multi-word unit. The obtained part-of-speech fault-tolerance information can also serve as feature quantities of the participles in the N-tuple and be input into the artificial neural network 205, and the artificial neural network can be trained at 708 in combination with the other feature quantities, so that at 710 the artificial neural network adjusts the influence of this information on the judgment result. Therefore, as indicated at 712, when a wrong part of speech occurs in a target element, the deviation caused by the part-of-speech error can be reduced, thereby achieving part-of-speech fault tolerance.
If no extraction probability serving as part-of-speech fault-tolerance information is found, processing proceeds with a preset default probability. The part-of-speech fault-tolerant template stores in advance generalized N-tuples and their corresponding extraction probabilities, each representing the probability that the participles in that generalized N-tuple are parts of a multi-word unit. Those skilled in the art will understand that the part-of-speech fault-tolerant template can be preset. Alternatively, the part-of-speech fault-tolerant template can also be generated by training the artificial neural network 205. As a non-limiting example, how the part-of-speech fault-tolerant template is generated by training the artificial neural network 205 will be described in detail below.
Again taking the triple <antigen, release, thing> as an example, a series of generalized triples can be obtained through generalization: <noun, release, thing>, <antigen, verb, thing>, <antigen, release, noun>, <noun, verb, thing>, <antigen, verb, noun>, <noun, release, noun>, and <noun, verb, noun>. According to each of these generalized triples, the corresponding generalized triple is looked up in the part-of-speech fault-tolerant template, thereby obtaining, as part-of-speech fault-tolerance information, the extraction probability that the participles in the triple <antigen, release, thing> are parts of a multi-word unit.
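The lookup at 706/S604 can be sketched as follows; the template entries and the default probability of 0.5 are invented for illustration, since the patent only says the template is preset or learned:

```python
# A POS fault-tolerant template mapping generalized N-tuples to
# pre-stored extraction probabilities (entries invented for illustration).
fault_tolerant_template = {
    ("antigen", "verb", "thing"): 0.78,
    ("noun", "verb", "noun"): 0.64,
}

def pos_fault_tolerance_features(generalized_tuples, template, default=0.5):
    """Look each generalized N-tuple up in the fault-tolerant template and
    collect the extraction probabilities as part-of-speech fault-tolerance
    information (step S604); missing entries fall back to the preset
    default probability."""
    return [template.get(t, default) for t in generalized_tuples]

info = pos_fault_tolerance_features(
    [("antigen", "verb", "thing"), ("noun", "release", "thing"),
     ("noun", "verb", "noun")],
    fault_tolerant_template)
print(info)  # [0.78, 0.5, 0.64]
```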
Finally, this process terminates at S606.
According to the method for the present embodiment, the deviation of the eigenvalue caused by part-of-speech tagging mistake can be alleviated, even if therefore
Error message is refer to, it is also possible to correctly identify and extract the multi word unit in statement during part-of-speech tagging, thus can
To improve identification and the accuracy and efficiency of extraction of multi word unit further.
Devices for extracting multi-word units from a statement according to embodiments of the present invention are described below with reference to Fig. 8 to Fig. 11.
Fig. 8 is a schematic block diagram illustrating a device for extracting multi-word units from a statement according to an embodiment of the present invention. As shown in Fig. 8, the device 800 for extracting multi-word units from a statement includes: a linguistic feature acquiring unit 802, which, for each of the multiple participle blocks obtained by segmenting the statement, obtains one or more linguistic features of the participles in that participle block as feature quantities; an input unit 804, which inputs the feature quantities into the artificial neural network as its parameters; a judging unit 806, which uses the artificial neural network to calculate the first probability that a participle in each participle block is a part of a multi-word unit and the second probability that the participle is not a part of a multi-word unit, and judges whether the participle is a part of a multi-word unit according to the first probability and the second probability; an extraction unit 808, which extracts two or more adjacent participles judged to be parts of a multi-word unit to form a multi-word unit; and a feedback information acquiring unit 810, which obtains the result of the judgment for the previous participle block adjacent to the current participle block as feedback information, and causes the feedback information to also serve as a feature quantity of the participles in the current participle block.
It should be pointed out that the terms and expressions involved in the device-related embodiments correspond to those used in the above description of the methods according to embodiments of the present invention, and are not repeated here.
Fig. 9 is a schematic block diagram illustrating a device for extracting multi-word units from a statement according to another embodiment of the present invention. As shown in Fig. 9, the device 900 for extracting multi-word units from a statement includes the linguistic feature acquiring unit 802, the input unit 804, the judging unit 806, the extraction unit 808, the feedback information acquiring unit 810, and a combining unit 902. The linguistic feature acquiring unit 802, input unit 804, judging unit 806, extraction unit 808, and feedback information acquiring unit 810 in the device 900 are identical to those in the device 800, and their details are not repeated here. In addition, the combining unit 902 in the device 900 combines every N adjacent participles in the statement in turn into an N-tuple to form a participle block, where N is a natural number greater than or equal to 2.
Fig. 10 is a schematic block diagram illustrating a device for extracting multi-word units from a statement according to yet another embodiment of the present invention. As shown in Fig. 10, the device 1000 for extracting multi-word units from a statement includes the linguistic feature acquiring unit 802, the input unit 804, the judging unit 806, the extraction unit 808, the feedback information acquiring unit 810, the combining unit 902, a morphology extraction probability acquiring unit 1002, and a part-of-speech extraction probability acquiring unit 1004. The linguistic feature acquiring unit 802, input unit 804, judging unit 806, extraction unit 808, feedback information acquiring unit 810, and combining unit 902 in the device 1000 are identical to those in the device 900, and their details are not repeated here. In addition, the morphology extraction probability acquiring unit 1002 in the device 1000 obtains, according to the morphology features of the participles in the N-tuple, the morphology extraction probability that a participle in the N-tuple is a part of a multi-word unit from the morphology template, and causes the morphology extraction probability to also serve as a feature quantity of the participles in the N-tuple; the part-of-speech extraction probability acquiring unit 1004 obtains, according to the part-of-speech features of the participles in the N-tuple, the part-of-speech extraction probability that a participle in the N-tuple is a part of a multi-word unit from the part-of-speech template, and causes the part-of-speech extraction probability to also serve as a feature quantity of the participles in the N-tuple.
Fig. 11 is a schematic block diagram illustrating a device for extracting multi-word units from a statement according to yet another embodiment of the present invention. As shown in Fig. 11, the device 1100 for extracting multi-word units from a statement includes the linguistic feature acquiring unit 802, the input unit 804, the judging unit 806, the extraction unit 808, the feedback information acquiring unit 810, the combining unit 902, a generalization unit 1102, and a part-of-speech fault-tolerance information acquiring unit 1104. The linguistic feature acquiring unit 802, input unit 804, judging unit 806, extraction unit 808, feedback information acquiring unit 810, and combining unit 902 in the device 1100 are identical to those in the device 900, and their details are not repeated here. In addition, the generalization unit 1102 in the device 1100 replaces the word forms of the participles in the N-tuple with the corresponding parts of speech to obtain generalized N-tuples in which word forms and parts of speech are mixed; the part-of-speech fault-tolerance information acquiring unit 1104 obtains the probability that a participle in a generalized N-tuple is a part of a multi-word unit as part-of-speech fault-tolerance information, and causes the part-of-speech fault-tolerance information to also serve as a feature quantity of each participle in the N-tuple.
Each device and/or unit in Fig. 8 to Fig. 11 above may, for example, be configured to operate according to the corresponding steps of the related methods. For details, see the embodiments of the methods according to embodiments of the present application described above; they are not repeated here.
A method of training an artificial neural network for extracting multi-word units from a statement according to an embodiment of the present invention is described below with reference to Fig. 12. Fig. 12 is a schematic flowchart illustrating the method of training an artificial neural network for extracting multi-word units from a statement according to an embodiment of the present invention.
As shown in figure 12, this process starts at S1200.Then, this process proceeds to S1202.
At S1202, for each of the multiple participle blocks obtained by segmenting each training statement, one or more linguistic features of the participles in that participle block are obtained as feature quantities, where the multi-word units in the training statements have been annotated.
Except that the participle blocks processed are those obtained by segmenting each training statement, the processing of S1202 is essentially the same as that of S102 in Fig. 1, and its details are not repeated here. In addition, regarding the training statements, the multi-word units in them have been annotated in advance.
After S1202, the process proceeds to S1204. At S1204, the feature quantities are input into the artificial neural network as its parameters.
Except that the participle blocks processed are those obtained by segmenting each training statement, the processing of S1204 is essentially the same as that of S104 in Fig. 1, and its details are not repeated here.
After S1204, the process proceeds to S1206. At S1206, the artificial neural network is used to calculate the first probability that a participle in each participle block is a part of a multi-word unit and the second probability that the participle is not a part of a multi-word unit, and whether the participle is a part of a multi-word unit is judged according to the first probability and the second probability.
Except that the participle blocks processed are those obtained by segmenting each training statement, the processing of S1206 is essentially the same as that of S106 in Fig. 1, and its details are not repeated here.
After S1206, the process proceeds to S1208. At S1208, the artificial neural network is trained according to the result of the judgment and the result of the annotation.
The training process of the artificial neural network 205 is the process of solving for the weights in the artificial neural network 205. The present invention uses the BP (Back Propagation, error back-propagation) algorithm to train the artificial neural network 205. The detailed process is as follows:
a) initialize the artificial neural network 205 with randomly generated weights;
b) input the items of the training data, which carry expected values, into the artificial neural network 205 one by one, and compute the output values;
c) compare the difference between the output values and the expected values, and compute the error of each neuron in the artificial neural network 205;
d) adjust the weights to reduce the error;
e) repeat steps b)-d) until the error is less than a predetermined threshold. Those skilled in the art should understand that the predetermined threshold can be set based on empirical values or experiments.
The training of the artificial neural network 205 solves the weights layer by layer, from the output-layer neuron weights to the hidden-layer neuron weights, computing the change of each weight separately. First, the error of each output-layer neuron is computed according to the following formula:

δ_i = (t_i − o_i) · f′(net_i)

where t_i is the desired output value of the i-th neuron, o_i is the actual output value of the i-th neuron, and f′ is the derivative of the activation function. The error of a hidden-layer neuron is computed according to the following formula:

δ_i^h = f′(net_i^h) · Σ_j w_ij · δ_j

where w_ij is the weight between the j-th output-layer neuron and the i-th hidden-layer neuron, δ_j is the error of the j-th output-layer neuron, o_i^h is the actual output value of the i-th hidden-layer neuron (for a sigmoid activation, f′(net_i^h) = o_i^h · (1 − o_i^h)), and the superscript h indicates a hidden-layer neuron. The input layer merely passes its input values through as output values, so it has no error.
After the error of each neuron has been computed, the adjustment of each weight can be calculated as Δw = ρ × δ_i × n_i, where ρ is the learning rate, δ_i is the error of the i-th neuron, and n_i is the value of the current neuron feeding the weight. The new weight is the current weight plus Δw.
It will be appreciated by those skilled in the art that the above method of training the artificial neural network 205 is only exemplary, and other methods can also be used to train the artificial neural network 205.
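The BP update described above can be sketched as follows. This is a minimal sketch under assumed conditions: a single hidden layer with sigmoid activations, where the dimensions, learning rate ρ, and function names are illustrative rather than the patent's actual network configuration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, w_hidden, w_output, rho=0.5):
    # Forward pass through one hidden layer and the output layer.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    o = [sigmoid(sum(w * hi for w, hi in zip(row, h))) for row in w_output]
    # Output-layer errors: delta_j = (d_j - o_j) * f'(net_j); f' = o(1 - o) for sigmoid.
    delta_o = [(d - oj) * oj * (1 - oj) for d, oj in zip(target, o)]
    # Hidden-layer errors: delta_i = h_i(1 - h_i) * sum_j w_ij * delta_j.
    delta_h = [hi * (1 - hi) * sum(w_output[j][i] * delta_o[j] for j in range(len(o)))
               for i, hi in enumerate(h)]
    # Weight adjustment: delta_w = rho * delta_i * n_i (value of the feeding neuron).
    for j, row in enumerate(w_output):
        for i in range(len(h)):
            row[i] += rho * delta_o[j] * h[i]
    for i, row in enumerate(w_hidden):
        for k in range(len(x)):
            row[k] += rho * delta_h[i] * x[k]
    # Squared error, compared against the predetermined threshold in step e).
    return sum((d - oj) ** 2 for d, oj in zip(target, o))
```

Repeating `train_step` on the training items until the returned error falls below a threshold corresponds to steps b) through e) above.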
After S1208, this process proceeds to S1210. At S1210, the result of the judgment of the previous participle block adjacent to the current participle block is obtained as feedback information, and the feedback information also serves as a characteristic quantity of the current participle. Except that the objects processed are the multiple participle blocks obtained by segmenting each training statement, the process of S1210 is essentially identical to the process of S110 in Fig. 1, and its details are not repeated here.
Finally, this process terminates at S1212.
According to the method of the present embodiment, an artificial neural network with a feedback configuration can be obtained by training, and applying the trained artificial neural network to the identification and extraction of multi word units can improve the accuracy and efficiency of the identification and extraction of multi word units.
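The feedback configuration of S1210 can be illustrated as follows. This is a hedged sketch: the function name and the 0/1 encoding of the previous block's judgment results are assumptions for illustration, not the patent's actual characteristic-quantity format.

```python
def build_feature_vector(block_features, previous_judgment):
    """Append the judgment results for the previous participle block
    (1.0 = judged a part of a multi word unit, 0.0 = not) to the current
    block's linguistic characteristic quantities."""
    return list(block_features) + [1.0 if j else 0.0 for j in previous_judgment]

# The judgments for the previous block feed forward into the next block's input.
feedback = [False, True, True]   # previous block: last two participles in a unit
features = [0.7, 0.2, 0.9]       # illustrative linguistic characteristic quantities
vec = build_feature_vector(features, feedback)
```

In this way each participle block is judged with knowledge of the decision made on its neighbor, which is what gives the network its feedback configuration.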
A method of training an artificial neural network for extracting multi word units in a statement using N tuples according to an embodiment of the present invention is described below in conjunction with Figure 13. Figure 13 is a schematic flowchart illustrating the method of training the artificial neural network for extracting the multi word units in the statement using N tuples according to an embodiment of the present invention.
As shown in figure 13, this process starts at S1300.Then, this process proceeds to S1302.
At S1302, adjacent N participles in the training statement are successively combined into N tuples to form participle blocks, where N is a natural number greater than or equal to 2. Except that the objects processed are the multiple participle blocks obtained by segmenting each training statement, the process of S1302 is essentially identical to the process of S302 in Fig. 3, and its details are not repeated here.
Finally, this process terminates at S1304.
According to the method of the present embodiment, the artificial neural network can be trained according to existing knowledge of the N tuples, such as part of speech collocation knowledge and morphology collocation knowledge, and applying the trained artificial neural network to the extraction of the multi word units in the statement can further improve the accuracy and efficiency of the identification and extraction of multi word units.
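The N tuple combination of step S1302 can be sketched as a sliding window over the segmented statement. The function name is illustrative; the windowing behavior follows the step as described.

```python
def make_ngram_blocks(participles, n):
    """Successively combine adjacent N participles into N tuples
    (participle blocks), as in step S1302."""
    if n < 2:
        raise ValueError("N must be a natural number greater than or equal to 2")
    return [tuple(participles[i:i + n]) for i in range(len(participles) - n + 1)]
```

For example, segmenting a statement into four participles and taking N = 3 yields two overlapping triples, each of which is then fed to the network as one participle block.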
A method of generating a morphology template and/or a part of speech template using N tuples according to an embodiment of the present invention is described below in conjunction with Figure 14. Figure 14 is a schematic flowchart illustrating the method of generating the morphology template and/or the part of speech template using N tuples according to an embodiment of the present invention.
As shown in figure 14, this process starts from S1400.Then, this process proceeds to S1402.
In step S1402, according to the annotated result and the morphology features of the participles in the N tuple, the morphology extraction probability that a participle in the N tuple is annotated as a part of a multi word unit is calculated, to generate the morphology template.
For example, for the triple <initially, use, draw>, the participles "initially" and "use" are annotated as not being a part of a multi word unit, while the participle "draw" is annotated as being a part of a multi word unit, and the morphology features of the participles in this triple <initially, use, draw> are "initially, use, draw". According to the above information, the morphology extraction probability that the participle "initially", "use" or "draw" in this triple <initially, use, draw> is annotated as a part of a multi word unit can be calculated by the artificial neural network 205, and this morphology extraction probability is stored in association with the triple corresponding to the current participle, thereby generating the morphology template.
In step S1404, according to the annotated result and the part of speech features of the participles in the N tuple, the part of speech extraction probability that a participle in the N tuple is a part of a multi word unit is calculated, to generate the part of speech template.
Similarly, for example, for the triple <initially, use, draw>, the participles "initially" and "use" are annotated as not being a part of a multi word unit, while the participle "draw" is annotated as being a part of a multi word unit, and the part of speech features of the participles in this triple <initially, use, draw> are "adjective, verb, noun". According to the above information, the part of speech extraction probability that the participle "initially", "use" or "draw" in this triple <initially, use, draw> is annotated as a part of a multi word unit can be calculated by the artificial neural network 205, and this part of speech extraction probability is stored in association with the triple corresponding to the current participle, thereby generating the part of speech template.
Finally, this process terminates at S1406.
It will be appreciated by those skilled in the art that steps S1402 and S1404 shown in Figure 14 can be performed sequentially or in parallel, or only one of steps S1402 and S1404 can be performed. According to the method of the present embodiment, N tuples can be used to train the artificial neural network so as to generate the morphology template or the part of speech template, and applying the generated morphology template and part of speech template to the identification and extraction of multi word units can further improve the accuracy and efficiency of the identification and extraction of multi word units.
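The template generation of steps S1402 and S1404 can be approximated by counting over the annotated triples. This is a hedged sketch: the patent computes these extraction probabilities through the artificial neural network 205, whereas here relative frequencies stand in for them, and the data layout (a mapping from tuple key to per-position probabilities) is an illustrative assumption.

```python
def build_template(samples, use_pos=False):
    """samples: list of (words, pos_tags, labels) per N tuple, where
    labels[i] is True if the i-th participle is annotated as a part of a
    multi word unit.  Returns a template mapping the N tuple key
    (morphologies, or parts of speech when use_pos=True) to the
    per-position extraction probability."""
    stats = {}
    for words, tags, labels in samples:
        key = tuple(tags) if use_pos else tuple(words)
        if key not in stats:
            stats[key] = [[0, 0] for _ in key]  # [annotated count, total] per slot
        for i, lab in enumerate(labels):
            stats[key][i][0] += int(lab)
            stats[key][i][1] += 1
    return {k: [hit / total for hit, total in v] for k, v in stats.items()}
```

With the document's example, the triple <initially, use, draw> with labels (no, no, yes) yields an extraction probability of 1.0 only at the third position, in both the morphology template and the part of speech template keyed by "adjective, verb, noun".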
A method of generating a part of speech fault-tolerant template using N tuples according to an embodiment of the present invention is described below in conjunction with Figure 15 and Figure 16. Figure 15 is a schematic flowchart illustrating the method of generating the part of speech fault-tolerant template using N tuples according to an embodiment of the present invention. Figure 16 is a schematic diagram illustrating the generation of the part of speech fault-tolerant template using N tuples according to an embodiment of the present invention.
As shown in figure 15, this process starts from S1500.Then, this process proceeds to S1502.
In step S1502, the morphology of a participle in the N tuple is replaced with the corresponding part of speech, to obtain a generalized N tuple in which morphologies and parts of speech are mixed. Except that the objects processed are the multiple participles obtained by segmenting each training statement, the process of S1502 is essentially identical to the process of S602 in Fig. 6, and its details are not repeated here.
After S1502, this process proceeds to S1504. At S1504, according to the annotated result and the morphology features and part of speech features of the participles in the generalized N tuple, the extraction probability that a participle in the generalized N tuple is annotated as a part of a multi word unit is calculated as part of speech fault tolerance information, to generate the part of speech fault-tolerant template.
All possible generalized N tuples can be obtained by the process of the above step S1502. Then, according to the annotated result and all possible generalized N tuples, the extraction probability that a participle in each generalized N tuple is annotated as a part of a multi word unit can be calculated as the part of speech fault tolerance information.
Taking the above triple <antigen, release, thing> as an example, in which the participles "antigen", "release" and "thing" are all annotated as being a part of a multi word unit, the above triple can be generalized to obtain a series of generalized triples: <noun, release, thing>, <antigen, verb, thing>, <antigen, release, noun>, <noun, verb, thing>, <antigen, verb, noun>, <noun, release, noun>, <noun, verb, noun>. Therefore, as shown in figure 16, at 1614, according to the result of the above annotation and each of the above series of generalized triples, the extraction probability that a participle in each of the above generalized triples is annotated as a part of a multi word unit is calculated as the part of speech fault tolerance information, and this part of speech fault tolerance information is stored in association with the triple corresponding to the current participle, thereby generating the part of speech fault-tolerant template.
Since most part of speech fault-tolerant templates contain both part of speech information and morphology information, and an N tuple template contains not only the current target participle but also the participle information before and after the current participle, the impact caused by a single erroneous part of speech can be greatly weakened. When an erroneous part of speech is input into the artificial neural network, the probability, computed by the artificial neural network, that a participle in the part of speech fault-tolerant template is a part of a multi word unit can suppress the impact of the erroneous part of speech on the final judgment result.
Finally, this process terminates at S1506.
According to the method of the present embodiment, the deviation of feature values caused by part of speech tagging errors can be alleviated during the training of the artificial neural network, and the part of speech fault-tolerant template is generated. If the generated part of speech fault-tolerant template is applied to the identification and extraction of multi word units, the multi word units in the statement can be correctly identified and extracted even if erroneous information is referred to during part of speech tagging, so that the accuracy and efficiency of the identification and extraction of multi word units can be further improved.
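The generalization of step S1502 can be sketched as enumerating every mixed word/part-of-speech variant of an N tuple. The function name is illustrative; excluding the all-morphology original matches the seven variants listed for <antigen, release, thing>.

```python
from itertools import product

def generalize(words, tags):
    """Replace the morphology of participles in an N tuple with the
    corresponding parts of speech in every possible combination, yielding
    the generalized N tuples of step S1502 (the original all-morphology
    tuple itself is excluded)."""
    variants = []
    for mask in product((False, True), repeat=len(words)):
        if any(mask):  # skip the mask that replaces nothing
            variants.append(tuple(t if m else w for w, t, m in zip(words, tags, mask)))
    return variants
```

For a triple this yields 2³ − 1 = 7 generalized tuples, agreeing with the series enumerated above for <antigen, release, thing>.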
An apparatus for training an artificial neural network for extracting multi word units in a statement according to an embodiment of the present invention is described below in conjunction with Figure 17 to Figure 20.
Figure 17 is a schematic block diagram illustrating an apparatus for training the artificial neural network for extracting the multi word units in the statement according to an embodiment of the present invention. As shown in figure 17, the apparatus 1700 for training the artificial neural network for extracting the multi word units in the statement includes: a linguistic feature acquisition device 1702, which, for each participle block among multiple participle blocks obtained by segmenting each training statement, obtains one or more linguistic features of the participles in each participle block as characteristic quantities, wherein the multi word units in the training statement are annotated; an input device 1704, which inputs the characteristic quantities into the artificial neural network as parameters of the artificial neural network; a judgment device 1706, which uses the artificial neural network to calculate a first probability that a participle in each participle block is a part of a multi word unit and a second probability that the participle is not a part of a multi word unit, and judges whether the participle is a part of a multi word unit according to the comparison result of the first probability and the second probability; a training device 1708, which trains the artificial neural network according to the result of the judgment and the result of the annotation; and a feedback information acquisition device 1710, which obtains the result of the judgment of the previous participle block adjacent to the current participle block as feedback information, and also uses the feedback information as a characteristic quantity of the participles in the current participle block.
It should be pointed out that the relevant terms or statements involved in the apparatus-related embodiments correspond to the terms or statements used in the above elaboration of the embodiments of the method according to the embodiments of the present invention, and are not repeated here.
Figure 18 is a schematic block diagram illustrating an apparatus for training the artificial neural network for extracting the multi word units in the statement according to another embodiment of the present invention. As shown in figure 18, the apparatus 1800 for training the artificial neural network for extracting the multi word units in the statement includes a linguistic feature acquisition device 1702, an input device 1704, a judgment device 1706, a training device 1708, a feedback information acquisition device 1710 and a combination device 1802. The linguistic feature acquisition device 1702, the input device 1704, the judgment device 1706, the training device 1708 and the feedback information acquisition device 1710 in the apparatus 1800 are identical to those in the apparatus 1700, and their details are not repeated here. In addition, the combination device 1802 in the apparatus 1800 successively combines adjacent N participles in the training statement into N tuples to form participle blocks, where N is a natural number greater than or equal to 2.
Figure 19 is a schematic block diagram illustrating an apparatus for training the artificial neural network for extracting the multi word units in the statement according to another embodiment of the present invention. As shown in figure 19, the apparatus 1900 for training the artificial neural network for extracting the multi word units in the statement includes a linguistic feature acquisition device 1702, an input device 1704, a judgment device 1706, a training device 1708, a feedback information acquisition device 1710, a combination device 1802, a morphology template generation device 1902 and a part of speech template generation device 1904. The linguistic feature acquisition device 1702, the input device 1704, the judgment device 1706, the training device 1708, the feedback information acquisition device 1710 and the combination device 1802 in the apparatus 1900 are identical to those in the apparatus 1800, and their details are not repeated here. In addition, the apparatus 1900 includes the morphology template generation device 1902, which, according to the annotated result and the morphology features of the participles in the N tuple, calculates the morphology extraction probability that a participle in the N tuple is a part of a multi word unit, to generate the morphology template; and/or the part of speech template generation device 1904, which, according to the annotated result and the part of speech features of the participles in the N tuple, calculates the part of speech extraction probability that a participle in the N tuple is a part of a multi word unit, to generate the part of speech template.
Figure 20 is a schematic block diagram illustrating an apparatus for training the artificial neural network for extracting the multi word units in the statement according to another embodiment of the present invention. As shown in figure 20, the apparatus 2000 for training the artificial neural network for extracting the multi word units in the statement includes a linguistic feature acquisition device 1702, an input device 1704, a judgment device 1706, a training device 1708, a feedback information acquisition device 1710, a combination device 1802, a generalization device 2002 and a part of speech fault-tolerant template generation device 2004. The linguistic feature acquisition device 1702, the input device 1704, the judgment device 1706, the training device 1708, the feedback information acquisition device 1710 and the combination device 1802 in the apparatus 2000 are identical to those in the apparatus 1800, and their details are not repeated here. In addition, the apparatus 2000 includes the generalization device 2002, which replaces the morphology of the participles in the N tuple with the corresponding parts of speech, to obtain a generalized N tuple in which morphologies and parts of speech are mixed; and the part of speech fault-tolerant template generation device 2004, which, according to the annotated result and the morphology features and part of speech features of the participles in the generalized N tuple, calculates the extraction probability that a participle in the generalized N tuple is a part of a multi word unit as part of speech fault tolerance information, to generate the part of speech fault-tolerant template.
It will be appreciated by those skilled in the art that the steps in the methods of extracting the multi word units in the statement according to the various embodiments of the present invention described above, or the functional units in the apparatuses for extracting the multi word units in the statement, can be combined arbitrarily according to actual needs; that is, the process steps in one embodiment of the method of extracting the multi word units in the statement can be combined with the process steps in other embodiments of the method of extracting the multi word units in the statement, or the functional units in one embodiment of the apparatus for extracting the multi word units in the statement can be combined with the functional units in other embodiments of the apparatus for extracting the multi word units in the statement, in order to achieve the desired technical purpose. Similarly, the steps in the methods of training the artificial neural network according to the various embodiments of the present invention described above, or the functional units in the apparatuses for training the artificial neural network, can also be combined arbitrarily; that is, the process steps in one embodiment of the method of training the artificial neural network can be combined with the process steps in other embodiments of the method of training the artificial neural network, or the functional units in one embodiment of the apparatus for training the artificial neural network can be combined with the functional units in other embodiments of the apparatus for training the artificial neural network, in order to achieve the desired technical purpose.
In addition, the embodiments of the present application also propose a program product carrying machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the method of extracting the multi word units in the statement according to the embodiments of the present invention described above. Similarly, the embodiments of the present application also propose a program product carrying machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the method of training the artificial neural network according to the embodiments of the present invention described above.
In addition, the embodiments of the present application also propose a storage medium including machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the method of extracting the multi word units in the statement according to the embodiments of the present invention described above. Similarly, the embodiments of the present application also propose a storage medium including machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the method of training the artificial neural network according to the embodiments of the present invention described above.
Correspondingly, the storage medium for carrying the above program product storing the machine-readable instruction codes is also included in the disclosure of the present invention. The storage medium includes but is not limited to a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick, and the like.
The apparatus for extracting the multi word units in the statement and its component units according to the embodiments of the present invention can be configured by means of software, firmware, hardware or a combination thereof. Similarly, the apparatus for training the artificial neural network and its component units according to the embodiments of the present invention can also be configured by means of software, firmware, hardware or a combination thereof. The specific means or manners that can be used for the configuration are well known to those skilled in the art and are not repeated here. In the case of implementation by software or firmware, a program constituting the software is installed from a storage medium or a network to an information processing device with a dedicated hardware structure (for example, the general-purpose computer 2100 shown in Figure 21), and the computer, when installed with various programs, is capable of performing various functions and the like.
In figure 21, a central processing unit (CPU) 2101 performs various processes according to a program stored in a read-only memory (ROM) 2102 or a program loaded from a storage portion 2108 into a random access memory (RAM) 2103. The RAM 2103 also stores, as needed, data required when the CPU 2101 performs the various processes and the like. The CPU 2101, the ROM 2102 and the RAM 2103 are connected to each other via a bus 2104. An input/output interface 2105 is also connected to the bus 2104.
The following components are connected to the input/output interface 2105: an input portion 2106 (including a keyboard, a mouse, etc.), an output portion 2107 (including a display, such as a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker, etc.), a storage portion 2108 (including a hard disk, etc.), and a communication portion 2109 (including a network interface card such as a LAN card, a modem, etc.). The communication portion 2109 performs communication processing via a network such as the Internet. A driver 2110 can also be connected to the input/output interface 2105 as needed. A detachable medium 2111 such as a disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc. is installed on the driver 2110 as needed, so that a computer program read out therefrom is installed into the storage portion 2108 as needed.
In the case where the above series of processes is realized by software, the program constituting the software is installed from a network such as the Internet, or from a storage medium such as the detachable medium 2111.
Those skilled in the art will understand that this storage medium is not limited to the detachable medium 2111 shown in Figure 21, in which the program is stored and which is distributed separately from the apparatus to provide the program to the user. Examples of the detachable medium 2111 include a disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a mini-disc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium can be the ROM 2102, the hard disk included in the storage portion 2108, etc., in which the program is stored and which is distributed to the user together with the apparatus including it.
When the instruction codes are read and executed by a machine, the above method according to the embodiments of the present invention can be performed.
Finally, it should also be noted that the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus including a series of elements not only includes those elements, but also includes other elements not explicitly listed, or also includes elements inherent to the process, method, article or apparatus. In addition, in the absence of further restriction, an element limited by the statement "including a ..." does not exclude the existence of other identical elements in the process, method, article or apparatus including the element. Furthermore, technical features or parameters limited by the wordings "first", "second", "third" and the like do not have a specific order, priority or degree of importance because of the use of these wordings. In other words, the use of these wordings is merely intended to distinguish or identify these technical features or parameters and has no other restrictive meaning.
It is not difficult to see from the above description that the technical solutions provided by the embodiments of the present invention include but are not limited to:
Remark 1. A method of extracting multi word units in a statement, including:
for each participle block among multiple participle blocks obtained by segmenting the statement, obtaining one or more linguistic features of the participles in each participle block as characteristic quantities;
inputting said characteristic quantities into an artificial neural network as parameters of the artificial neural network;
using said artificial neural network to calculate a first probability that a participle in each participle block is a part of a multi word unit and a second probability that the participle is not a part of a multi word unit, and judging whether the participle is a part of a multi word unit according to said first probability and said second probability; and
extracting two or more adjacent participles judged to be parts of a multi word unit, to form a multi word unit,
wherein said method further includes: obtaining the result of the judgment of the previous participle block adjacent to the current participle block as feedback information, and also using said feedback information as a characteristic quantity of the participles in said current participle block.
Remark 2. The method according to Remark 1, wherein said linguistic features are one or more of the following: the part of speech of a participle, the morphology of a participle, the sequence number of a participle, or the occurrence probability of a participle.
Remark 3. The method according to any one of Remarks 1-2, further including:
successively combining adjacent N participles in said statement into N tuples to form participle blocks, where N is a natural number greater than or equal to 2.
Remark 4. The method according to Remark 3, further including:
according to the morphology features of the participles in said N tuple, obtaining from a morphology template a morphology extraction probability that a participle in said N tuple is a part of a multi word unit, and also using said morphology extraction probability as a characteristic quantity of the participles in said N tuple; and/or
according to the part of speech features of the participles in said N tuple, obtaining from a part of speech template a part of speech extraction probability that a participle in said N tuple is a part of a multi word unit, and also using said part of speech extraction probability as a characteristic quantity of the participles in said N tuple.
Remark 5. The method according to Remark 4, further including:
replacing the morphology of the participles in said N tuple with the corresponding parts of speech, to obtain a generalized N tuple in which morphologies and parts of speech are mixed; and
according to the morphology features and part of speech features of the participles in said generalized N tuple, obtaining from a part of speech fault-tolerant template an extraction probability that a participle in said generalized N tuple is a part of a multi word unit as part of speech fault tolerance information, and also using said part of speech fault tolerance information as a characteristic quantity of the participles in said N tuple.
Remarks 6, the equipment of a kind of multi word unit extracted in statement, including:
Linguistic feature acquiring unit, it is for each participle carried out by statement in multiple participle blocks that participle obtains
Block, obtains one or more linguistic feature of participle in each participle block as characteristic quantity;
Input block, described characteristic quantity is input to described artificial neural network as the parameter of artificial neural network by it
In;
Judging unit, it uses the participle in described artificial neural networks each participle block to be of multi word unit
The first probability divided and this participle are not the second probabilities of a part for multi word unit, and according to described first probability
Judge that whether this participle is a part for multi word unit with the second probability;And
Extraction unit, it extracts the participle that adjacent two or more are judged as a part for multi word unit, with shape
Become multi word unit,
Wherein, described equipment also includes: feedback information acquiring unit, and it obtains the previous participle adjacent with current participle block
The result of the judgement of block is as feedback information, and described feedback information also serves as the spy of participle in described current participle block
The amount of levying.
Remarks 7, according to the equipment described in remarks 6, wherein, described linguistic feature be following in one or more:
The part of speech of participle, the morphology of participle, participle sequence number or participle probability of occurrence.
Remarks 8, according to the equipment according to any one of remarks 6-7, also include:
Assembled unit, N number of participle adjacent in described statement is combined as N tuple to form participle block, wherein N by successively
For the natural number more than or equal to 2.
Remarks 9. The equipment according to remarks 8, further comprising:
a word-form extraction-probability acquiring unit that, according to the word-form features of the segmented words in the N-tuple, obtains from a word-form template the word-form extraction probability that the segmented words in the N-tuple are part of a multi-word unit, the word-form extraction probability also serving as a feature quantity of the segmented words in the N-tuple; and/or
a part-of-speech extraction-probability acquiring unit that, according to the part-of-speech features of the segmented words in the N-tuple, obtains from a part-of-speech template the part-of-speech extraction probability that the segmented words in the N-tuple are part of a multi-word unit, the part-of-speech extraction probability also serving as a feature quantity of the segmented words in the N-tuple.
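One plausible reading of how such templates are consulted at extraction time; the template contents, default value, and names below are invented for illustration:

```python
# Hypothetical templates: each maps an N-tuple (of word forms, or of parts
# of speech) to the extraction probability that its words are part of a
# multi-word unit.
word_form_template = {("hot", "dog"): 0.92, ("dog", "stand"): 0.15}
pos_template = {("ADJ", "NOUN"): 0.61, ("NOUN", "NOUN"): 0.48}

def extraction_probability(template, key, default=0.0):
    """Look up an N-tuple's extraction probability; unseen tuples fall back
    to a default. The returned value is appended to the feature quantities
    fed to the neural network."""
    return template.get(key, default)

features = []
features.append(extraction_probability(word_form_template, ("hot", "dog")))
features.append(extraction_probability(pos_template, ("ADJ", "NOUN")))
# features now carries both template probabilities as extra feature quantities
```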
Remarks 10. The equipment according to remarks 8, further comprising:
a generalization unit that replaces the word form of a segmented word in the N-tuple with the corresponding part of speech, to obtain a generalized N-tuple mixing word forms and parts of speech; and
a part-of-speech fault-tolerance-information acquiring unit that, according to the word-form features and part-of-speech features of the segmented words in the generalized N-tuple, obtains from a part-of-speech fault-tolerance template the extraction probability that the segmented words in the generalized N-tuple are part of a multi-word unit as part-of-speech fault-tolerance information, the fault-tolerance information also serving as a feature quantity of each segmented word in the N-tuple.
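A sketch of the generalization step. The patent does not spell out which positions get replaced, so this version enumerates every partial replacement as one plausible reading; names are illustrative:

```python
from itertools import product

def generalize(ntuple, pos_tags):
    """Generate generalized N-tuples in which each position keeps either the
    word form or its part of speech, yielding tuples that mix both.

    ntuple and pos_tags are parallel sequences. This enumerates all partial
    replacements; it is one plausible reading of the generalization unit,
    not the patent's exact procedure.
    """
    choices = [(w, p) for w, p in zip(ntuple, pos_tags)]
    # Exclude the all-word-form variant: it is just the original N-tuple.
    return [combo for combo in product(*choices) if combo != tuple(ntuple)]

variants = generalize(("hot", "dog"), ("ADJ", "NOUN"))
# e.g. ("hot", "NOUN"), ("ADJ", "dog"), ("ADJ", "NOUN")
```

Each generalized variant can then be looked up in the part-of-speech fault-tolerance template described above.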
Remarks 11. A method of training an artificial neural network for extracting multi-word units in a sentence, the method comprising:
for each of a plurality of segmentation blocks obtained by segmenting each training sentence, obtaining one or more linguistic features of the segmented words in the block as feature quantities, wherein the multi-word units in the training sentence have been annotated;
inputting the feature quantities into the artificial neural network as its parameters;
using the artificial neural network to compute the first probability that the segmented word in each block is part of a multi-word unit and the second probability that it is not, and judging from a comparison of the first and second probabilities whether the word is part of a multi-word unit; and
training the artificial neural network according to the judgment results and the annotations,
wherein the method further comprises: obtaining the judgment result for the previous segmentation block adjacent to the current block as feedback information, the feedback information also serving as a feature quantity of the segmented words in the current block.
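A dependency-free sketch of the feedback wiring during training. Here `network` and `update` are placeholders standing in for the patent's (unspecified) neural network and learning rule; all names are ours:

```python
def train_epoch(network, training_blocks, labels, update):
    """One pass over a training sentence's segmentation blocks.

    network(features) must return (p_part, p_not_part): the first and
    second probabilities. The previous block's judgment is fed back as an
    extra feature quantity, as the method above requires. `update` is any
    learning rule (e.g. backpropagation) applied against the annotation.
    """
    feedback = 0.0  # no previous block exists before the first one
    for features, label in zip(training_blocks, labels):
        x = list(features) + [feedback]          # append the feedback feature
        p_first, p_second = network(x)
        judged = 1 if p_first > p_second else 0  # compare the two probabilities
        update(network, x, label)                # train against the annotation
        feedback = float(judged)                 # carried into the next block
```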
Remarks 12. The method according to remarks 11, wherein the linguistic feature is one or more of: the part of speech of a segmented word, the word form of a segmented word, the sequence number of a segmented word, or the occurrence probability of a segmented word.
Remarks 13. The method according to remarks 11 or 12, further comprising:
combining N adjacent segmented words in the training sentence in sequence into an N-tuple to form a segmentation block, where N is a natural number greater than or equal to 2.
Remarks 14. The method according to remarks 13, further comprising:
computing, from the annotations and the word-form features of the segmented words in the N-tuple, the word-form extraction probability that the segmented words in the N-tuple are part of a multi-word unit, to generate a word-form template; and/or
computing, from the annotations and the part-of-speech features of the segmented words in the N-tuple, the part-of-speech extraction probability that the segmented words in the N-tuple are part of a multi-word unit, to generate a part-of-speech template.
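The patent does not give the estimation formula, but a natural relative-frequency estimate over the annotated corpus would generate such a template as follows (names and formula are our assumption):

```python
from collections import Counter

def build_template(ntuples, labels):
    """Estimate each N-tuple's extraction probability from annotated data:
    count(tuple annotated as part of a multi-word unit) / count(tuple).
    One plausible way to generate the word-form or part-of-speech template
    described above; feed it word-form tuples or part-of-speech tuples."""
    seen, positive = Counter(), Counter()
    for nt, is_mwu_part in zip(ntuples, labels):
        seen[nt] += 1
        if is_mwu_part:
            positive[nt] += 1
    return {nt: positive[nt] / seen[nt] for nt in seen}

template = build_template(
    [("hot", "dog"), ("hot", "dog"), ("dog", "stand")],
    [True, True, False],
)
# template[("hot", "dog")] == 1.0, template[("dog", "stand")] == 0.0
```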
Remarks 15. The method according to remarks 13, further comprising:
replacing the word form of a segmented word in the N-tuple with the corresponding part of speech, to obtain a generalized N-tuple mixing word forms and parts of speech; and
computing, from the annotations and the word-form and part-of-speech features of the segmented words in the generalized N-tuple, the extraction probability that the segmented words in the generalized N-tuple are part of a multi-word unit as part-of-speech fault-tolerance information, to generate a part-of-speech fault-tolerance template.
Remarks 16. Equipment for training an artificial neural network for extracting multi-word units in a sentence, the equipment comprising:
a linguistic-feature acquiring device that, for each of a plurality of segmentation blocks obtained by segmenting each training sentence, obtains one or more linguistic features of the segmented words in the block as feature quantities, wherein the multi-word units in the training sentence have been annotated;
an input device that inputs the feature quantities into the artificial neural network as its parameters;
a judging device that uses the artificial neural network to compute the first probability that the segmented word in each block is part of a multi-word unit and the second probability that it is not, and judges from a comparison of the first and second probabilities whether the word is part of a multi-word unit; and
a training device that trains the artificial neural network according to the judgment results and the annotations,
wherein the equipment further comprises a feedback-information acquiring device that obtains the judgment result for the previous segmentation block adjacent to the current block as feedback information, the feedback information also serving as a feature quantity of the segmented words in the current block.
Remarks 17. The equipment according to remarks 16, wherein the linguistic feature is one or more of: the part of speech of a segmented word, the word form of a segmented word, the sequence number of a segmented word, or the occurrence probability of a segmented word.
Remarks 18. The equipment according to remarks 16 or 17, further comprising:
a combining unit that combines N adjacent segmented words in the training sentence in sequence into an N-tuple to form a segmentation block, where N is a natural number greater than or equal to 2.
Remarks 19. The equipment according to remarks 18, further comprising:
a word-form template generating device that computes, from the annotations and the word-form features of the segmented words in the N-tuple, the word-form extraction probability that the segmented words in the N-tuple are part of a multi-word unit, to generate a word-form template; and/or
a part-of-speech template generating device that computes, from the annotations and the part-of-speech features of the segmented words in the N-tuple, the part-of-speech extraction probability that the segmented words in the N-tuple are part of a multi-word unit, to generate a part-of-speech template.
Remarks 20. The equipment according to remarks 18, further comprising:
a generalization device that replaces the word form of a segmented word in the N-tuple with the corresponding part of speech, to obtain a generalized N-tuple mixing word forms and parts of speech; and
a part-of-speech fault-tolerance-template generating device that computes, from the annotations and the word-form and part-of-speech features of the segmented words in the generalized N-tuple, the extraction probability that the segmented words in the generalized N-tuple are part of a multi-word unit as part-of-speech fault-tolerance information, to generate a part-of-speech fault-tolerance template.
While preferred embodiments of the present invention have been shown and described, it is contemplated that those skilled in the art may make various modifications to the invention within the spirit and scope of the appended claims.
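The extraction step the claims below describe, grouping adjacent words judged positive into multi-word units, can be sketched as follows. The per-word judgments here are supplied by hand; in the claimed method they come from comparing the network's first and second probabilities:

```python
def extract_multiword_units(words, judgments):
    """Group two or more adjacent segmented words each judged to be part of
    a multi-word unit into one unit. `judgments` is a parallel list of
    booleans (a stand-in for the neural network's per-word decisions)."""
    units, run = [], []
    for word, is_part in zip(words, judgments):
        if is_part:
            run.append(word)
        else:
            if len(run) >= 2:          # only runs of two or more form a unit
                units.append(tuple(run))
            run = []
    if len(run) >= 2:
        units.append(tuple(run))
    return units

units = extract_multiword_units(
    ["the", "hot", "dog", "stand", "opened"],
    [False, True, True, True, False],
)
# → [("hot", "dog", "stand")]
```

A lone positive word forms no unit, matching the claim's "two or more adjacent" condition.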
Claims (10)
1. A method of extracting multi-word units in a sentence, comprising:
for each of a plurality of segmentation blocks obtained by segmenting the sentence, obtaining one or more linguistic features of the segmented words in the block as feature quantities;
inputting the feature quantities into an artificial neural network as its parameters;
using the artificial neural network to compute the first probability that the segmented word in each block is part of a multi-word unit and the second probability that it is not, and judging from the first and second probabilities whether the word is part of a multi-word unit; and
extracting two or more adjacent segmented words each judged to be part of a multi-word unit, to form a multi-word unit,
wherein the method further comprises: obtaining the judgment result for the previous segmentation block adjacent to the current block as feedback information, the feedback information also serving as a feature quantity of the segmented words in the current block.
2. The method according to claim 1, further comprising:
combining N adjacent segmented words in the sentence in sequence into an N-tuple to form the segmentation block, where N is a natural number greater than or equal to 2.
3. The method according to claim 2, further comprising:
replacing the word form of a segmented word in the N-tuple with the corresponding part of speech, to obtain a generalized N-tuple mixing word forms and parts of speech; and
obtaining, according to the word-form features and part-of-speech features of the segmented words in the generalized N-tuple, from a part-of-speech fault-tolerance template the extraction probability that the segmented words in the generalized N-tuple are part of a multi-word unit as part-of-speech fault-tolerance information, the fault-tolerance information also serving as a feature quantity of the segmented words in the N-tuple.
4. Equipment for extracting multi-word units in a sentence, comprising:
a linguistic-feature acquiring unit that, for each of a plurality of segmentation blocks obtained by segmenting the sentence, obtains one or more linguistic features of the segmented words in the block as feature quantities;
an input unit that inputs the feature quantities into an artificial neural network as its parameters;
a judging unit that uses the artificial neural network to compute the first probability that the segmented word in each block is part of a multi-word unit and the second probability that it is not, and judges from the first and second probabilities whether the word is part of a multi-word unit; and
an extraction unit that extracts two or more adjacent segmented words each judged to be part of a multi-word unit, to form a multi-word unit,
wherein the equipment further comprises a feedback-information acquiring unit that obtains the judgment result for the previous segmentation block adjacent to the current block as feedback information, the feedback information also serving as a feature quantity of the segmented words in the current block.
5. The equipment according to claim 4, further comprising:
a combining unit that combines N adjacent segmented words in the sentence in sequence into an N-tuple to form the segmentation block, where N is a natural number greater than or equal to 2.
6. The equipment according to claim 5, further comprising:
a generalization unit that replaces the word form of a segmented word in the N-tuple with the corresponding part of speech, to obtain a generalized N-tuple mixing word forms and parts of speech; and
a part-of-speech fault-tolerance-information acquiring unit that, according to the word-form features and part-of-speech features of the segmented words in the generalized N-tuple, obtains from a part-of-speech fault-tolerance template the extraction probability that the segmented words in the generalized N-tuple are part of a multi-word unit as part-of-speech fault-tolerance information, the fault-tolerance information also serving as a feature quantity of the segmented words in the N-tuple.
7. A method of training an artificial neural network for extracting multi-word units in a sentence, the method comprising:
for each of a plurality of segmentation blocks obtained by segmenting each training sentence, obtaining one or more linguistic features of the segmented words in the block as feature quantities, wherein the multi-word units in the training sentence have been annotated;
inputting the feature quantities into the artificial neural network as its parameters;
using the artificial neural network to compute the first probability that the segmented word in each block is part of a multi-word unit and the second probability that it is not, and judging from a comparison of the first and second probabilities whether the word is part of a multi-word unit; and
training the artificial neural network according to the judgment results and the annotations,
wherein the method further comprises: obtaining the judgment result for the previous segmentation block adjacent to the current block as feedback information, the feedback information also serving as a feature quantity of the segmented words in the current block.
8. The method according to claim 7, further comprising:
combining N adjacent segmented words in the training sentence in sequence into an N-tuple to form the segmentation block, where N is a natural number greater than or equal to 2.
9. The method according to claim 8, further comprising:
replacing the word form of a segmented word in the N-tuple with the corresponding part of speech, to obtain a generalized N-tuple mixing word forms and parts of speech; and
computing, from the annotations and the word-form and part-of-speech features of the segmented words in the generalized N-tuple, the extraction probability that the segmented words in the generalized N-tuple are part of a multi-word unit as part-of-speech fault-tolerance information, to generate a part-of-speech fault-tolerance template.
10. Equipment for training an artificial neural network for extracting multi-word units in a sentence, the equipment comprising:
a linguistic-feature acquiring device that, for each of a plurality of segmentation blocks obtained by segmenting each training sentence, obtains one or more linguistic features of the segmented words in the block as feature quantities, wherein the multi-word units in the training sentence have been annotated;
an input device that inputs the feature quantities into the artificial neural network as its parameters;
a judging device that uses the artificial neural network to compute the first probability that the segmented word in each block is part of a multi-word unit and the second probability that it is not, and judges from a comparison of the first and second probabilities whether the word is part of a multi-word unit; and
a training device that trains the artificial neural network according to the judgment results and the annotations,
wherein the equipment further comprises a feedback-information acquiring device that obtains the judgment result for the previous segmentation block adjacent to the current block as feedback information, the feedback information also serving as a feature quantity of the current segmentation block.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210320806.XA CN103678318B (en) | 2012-08-31 | 2012-08-31 | Multi-word unit extraction method and equipment and artificial neural network training method and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103678318A CN103678318A (en) | 2014-03-26 |
CN103678318B true CN103678318B (en) | 2016-12-21 |
Family
ID=50315921
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210320806.XA Expired - Fee Related CN103678318B (en) | 2012-08-31 | 2012-08-31 | Multi-word unit extraction method and equipment and artificial neural network training method and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103678318B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105404632B (en) * | 2014-09-15 | 2020-07-31 | 深港产学研基地 | System and method for carrying out serialized annotation on biomedical text based on deep neural network |
CN107301454B (en) * | 2016-04-15 | 2021-01-22 | 中科寒武纪科技股份有限公司 | Artificial neural network reverse training device and method supporting discrete data representation |
CN107977352A (en) * | 2016-10-21 | 2018-05-01 | 富士通株式会社 | Information processor and method |
CN107273356B (en) | 2017-06-14 | 2020-08-11 | 北京百度网讯科技有限公司 | Artificial intelligence based word segmentation method, device, server and storage medium |
CN109829162B (en) * | 2019-01-30 | 2022-04-08 | 新华三大数据技术有限公司 | Text word segmentation method and device |
CN110532551A (en) * | 2019-08-15 | 2019-12-03 | 苏州朗动网络科技有限公司 | Method, equipment and the storage medium that text key word automatically extracts |
CN111291195B (en) * | 2020-01-21 | 2021-08-10 | 腾讯科技(深圳)有限公司 | Data processing method, device, terminal and readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101093504A (en) * | 2006-03-24 | 2007-12-26 | 国际商业机器公司 | System for extracting new compound word |
CN101187921A (en) * | 2007-12-20 | 2008-05-28 | 腾讯科技(深圳)有限公司 | Chinese compound words extraction method and system |
CN101354712A (en) * | 2008-09-05 | 2009-01-28 | 北京大学 | System and method for automatically extracting Chinese technical terms |
Non-Patent Citations (4)
Title |
---|
A study on multi-word extraction from Chinese documents;Wen Zhang等;《Advanced Web and Network Technologies, and Applications》;20080428;42-53 * |
Improving word representations via global context and multiple word prototypes;Eric H. Huang等;《Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics》;20120714;873-882 * |
Optimization of a neural-network-based Chinese word segmentation model; He Jia et al.; Journal of Chengdu University of Information Technology; 20061231; 812-815 *
Research on Chinese word segmentation fusing neural networks and matching; Li Hua; Mind and Computation; 20100630; 117-127 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103678318B (en) | Multi-word unit extraction method and equipment and artificial neural network training method and equipment | |
CN110222163B (en) | Intelligent question-answering method and system integrating CNN and bidirectional LSTM | |
US7873584B2 (en) | Method and system for classifying users of a computer network | |
CN108446271B (en) | Text emotion analysis method of convolutional neural network based on Chinese character component characteristics | |
CN108573047A (en) | A kind of training method and device of Module of Automatic Chinese Documents Classification | |
CN107025284A (en) | The recognition methods of network comment text emotion tendency and convolutional neural networks model | |
CN108363816A (en) | Open entity relation extraction method based on sentence justice structural model | |
CN107590134A (en) | Text sentiment classification method, storage medium and computer | |
CN110222178A (en) | Text sentiment classification method, device, electronic equipment and readable storage medium storing program for executing | |
CN111797898B (en) | Online comment automatic reply method based on deep semantic matching | |
CN108170848B (en) | Chinese mobile intelligent customer service-oriented conversation scene classification method | |
Le et al. | Text classification: Naïve bayes classifier with sentiment Lexicon | |
CN105022754A (en) | Social network based object classification method and apparatus | |
CN111460157B (en) | Cyclic convolution multitask learning method for multi-field text classification | |
CN109783794A (en) | File classification method and device | |
CN106997341A (en) | A kind of innovation scheme matching process, device, server and system | |
CN112966525B (en) | Law field event extraction method based on pre-training model and convolutional neural network algorithm | |
CN113254643B (en) | Text classification method and device, electronic equipment and text classification program | |
CN108647191A (en) | It is a kind of based on have supervision emotion text and term vector sentiment dictionary construction method | |
CN113033610B (en) | Multi-mode fusion sensitive information classification detection method | |
CN107357785A (en) | Theme feature word abstracting method and system, feeling polarities determination methods and system | |
CN112215629B (en) | Multi-target advertisement generating system and method based on construction countermeasure sample | |
CN113326374B (en) | Short text emotion classification method and system based on feature enhancement | |
CN110728144A (en) | Extraction type document automatic summarization method based on context semantic perception | |
CN111368524A (en) | Microblog viewpoint sentence recognition method based on self-attention bidirectional GRU and SVM |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C14 | Grant of patent or utility model | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20161221; Termination date: 20180831 |