CN103678318B - Multi-word unit extraction method and equipment and artificial neural network training method and equipment - Google Patents


Info

Publication number: CN103678318B
Application number: CN201210320806.XA
Authority: CN (China)
Prior art keywords: participle, word unit, speech, block, multi word
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN103678318A
Inventors: 付亦雯, 葛乃晟, 郑仲光, 孟遥, 于浩
Current assignee: Fujitsu Ltd
Original assignee: Fujitsu Ltd
Application filed by Fujitsu Ltd
Priority to CN201210320806.XA
Publication of CN103678318A
Application granted
Publication of CN103678318B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

This application discloses a multi-word unit extraction method and device, and an artificial neural network training method and device. The method of extracting multi-word units includes: for each of multiple segment blocks obtained by segmenting a sentence, obtaining linguistic features of the word segments in the block as feature values; inputting the feature values into an artificial neural network as parameters; using the artificial neural network to compute a first probability that a segment in each block is part of a multi-word unit and a second probability that it is not, and judging from the first and second probabilities whether the segment is part of a multi-word unit; extracting two or more adjacent segments judged to be part of a multi-word unit to form the multi-word unit; and obtaining the judgment result of the segment block preceding the current block as feedback information, which also serves as a feature value of the segments in the current block.

Description

Multi-word unit extraction method and equipment and artificial neural network training method and equipment
Technical field
The present invention relates generally to the field of natural language processing, and in particular to a method and device for extracting multi-word units from sentences, and to a method and device for training an artificial neural network used to extract multi-word units from sentences.
Background art
Classical natural language processing systems usually assume that each word is a semantic unit, but this does not cover the case of multi-word units. A multi-word unit crosses word boundaries and therefore has its own manner of interpretation. Identifying and extracting multi-word units is a principal concern of the multi-word unit processing field and is also considered a bottleneck for further research. Multi-word units are commonplace in natural language processing, yet there is no precisely defined concept of them. Typically, a multi-word unit is a combination of two or more words that appear together with relatively high probability and that carries a complete meaning. Multi-word units are a very common phenomenon in natural language processing, so their identification and extraction are highly important. Because sufficient word-collocation knowledge is lacking, and because the information of a word combination is dispersed among its individual segments, it is extremely difficult to recombine separate segments according to their original meanings into an independent semantic unit and thereby recover the original complete meaning, especially for languages such as Chinese in which words are not separated by delimiters.
The identification and extraction of multi-word units can be widely applied to machine translation, efficient syntactic analysis, improved information retrieval, word sense disambiguation, and so on. Methods currently in wide use for identifying and extracting multi-word units include classification methods, the local maxima method, and conditional random fields (CRFs). Feature values used in identifying and extracting multi-word units include mutual information between segments, t-scores, entropy, and co-occurrence frequency. Identification and extraction also involve the use of word segmentation tools, morphological annotation tools, part-of-speech tagging tools, stop-word lists, and the like.
Prior-art methods for identifying and extracting multi-word units generally use the following process: perform word segmentation and/or part-of-speech tagging on the target sentence; compute corresponding feature values, such as frequency, segment co-occurrence rate, and mutual information, from the segmentation and/or tagging results; and use a dedicated algorithm or model to filter candidate multi-word units according to the computed feature values, thereby obtaining more accurate multi-word units. However, prior-art methods cannot guarantee the accuracy of the segmentation and/or part-of-speech tagging of the target sentence, and thus often introduce errors, causing the training data itself to contain contradictory information, or causing the feature values in actual applications to deviate from the real situation.
A multi-word unit is a concept distinct from a phrase or a chunk, so methods for identifying and extracting multi-word units differ from those for phrases or chunks. Specifically, some prepositional phrases do not have complete meanings, so applying phrase identification and extraction methods to multi-word units does not yield good results. Moreover, chunks are defined at the syntactic level, so identifying and extracting chunks requires considering the syntactic and part-of-speech information of their constituents but imposes no strict requirement of semantic completeness; applying chunk identification and extraction methods to multi-word units is therefore also infeasible.
Accordingly, it is desirable to provide a method and device for extracting multi-word units from a sentence that can improve the accuracy and efficiency of identifying and extracting multi-word units.
Summary of the invention
A brief overview of the present invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be appreciated that this overview is not an exhaustive summary of the invention. It is not intended to identify key or important parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description that follows.
The present invention applies an artificial neural network to the identification and extraction of multi-word units. An artificial neural network is an algorithmic model that simulates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. Relying on the complexity of the system, it processes information by adjusting the interconnections among a large number of internal nodes. An artificial neural network consists of a large number of nodes and the connections among them. Each node represents a specific output function; each connection between nodes carries a weighted value, called a weight, which serves as the memory of the artificial neural network. The output of the network varies with its connection pattern, weights, and output functions.
According to an embodiment of the invention, there is provided a method of extracting multi-word units from a sentence, including: for each of multiple segment blocks obtained by segmenting the sentence, obtaining one or more linguistic features of the segments in the block as feature values; inputting the feature values into an artificial neural network as parameters of the network; using the artificial neural network to compute a first probability that a segment in each block is part of a multi-word unit and a second probability that it is not, and judging from the first and second probabilities whether the segment is part of a multi-word unit; and extracting two or more adjacent segments judged to be part of a multi-word unit to form the multi-word unit. The method further includes: obtaining the judgment result of the segment block preceding the current block as feedback information, which also serves as a feature value of the segments in the current block.
According to the method for the multi word unit in said extracted statement, also include: successively by N number of participle group adjacent in statement Being combined into N tuple to form participle block, wherein N is the natural number more than or equal to 2.
According to the method for the multi word unit in said extracted statement, also include: the morphology of the participle in N tuple is replaced with Corresponding part of speech, to obtain the extensive N tuple being mixed with morphology with part of speech;And the morphology according to the participle in extensive N tuple Feature and part of speech feature, obtain the extraction that the participle in extensive N tuple is a part for multi word unit from the fault-tolerant template of part of speech Probability is as part of speech fault tolerance information, and part of speech fault tolerance information also serves as the characteristic quantity of participle in N tuple.
According to another embodiment of the invention, there is provided a device for extracting multi-word units from a sentence, including: a linguistic feature acquiring unit that, for each of multiple segment blocks obtained by segmenting the sentence, obtains one or more linguistic features of the segments in the block as feature values; an input unit that inputs the feature values into an artificial neural network as parameters of the network; a judging unit that uses the artificial neural network to compute a first probability that a segment in each block is part of a multi-word unit and a second probability that it is not, and judges from the first and second probabilities whether the segment is part of a multi-word unit; and an extraction unit that extracts two or more adjacent segments judged to be part of a multi-word unit to form the multi-word unit. The device further includes a feedback information acquiring unit that obtains the judgment result of the segment block preceding the current block as feedback information, which also serves as a feature value of the current block.
According to the above device for extracting multi-word units from a sentence, it further includes a combining unit that successively combines N adjacent segments in the sentence into N-tuples to form segment blocks, where N is a natural number greater than or equal to 2.
According to the above device, it further includes: a generalizing unit that replaces the word forms of the segments in an N-tuple with the corresponding parts of speech to obtain a generalized N-tuple mixing word forms and parts of speech; and a part-of-speech fault-tolerance information acquiring unit that, according to the word-form and part-of-speech features of the segments in the generalized N-tuple, obtains from a part-of-speech fault-tolerant template the extraction probability that a segment in the generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerance information, which also serves as a feature value of the segments in the N-tuple.
According to still another embodiment of the invention, there is provided a method of training an artificial neural network, the artificial neural network being used to extract multi-word units from sentences, the method including: for each of multiple segment blocks obtained by segmenting each training sentence, obtaining one or more linguistic features of the segments in the block as feature values, where the multi-word units in the training sentences are annotated; inputting the feature values into the artificial neural network as parameters of the network; using the artificial neural network to compute a first probability that a segment in each block is part of a multi-word unit and a second probability that it is not, and judging from the comparison of the first and second probabilities whether the segment is part of a multi-word unit; and training the artificial neural network according to the judgment results and the annotation results. The method further includes: obtaining the judgment result of the segment block preceding the current block as feedback information, which also serves as a feature value of the segments in the current block.
According to the method for above-mentioned a kind of training of human artificial neural networks, also include: successively by N number of point adjacent in training statement Phrase is combined into N tuple to form participle block, and wherein N is the natural number more than or equal to 2.
According to the method for above-mentioned a kind of training of human artificial neural networks, also include: the morphology of the participle in N tuple is replaced with Corresponding part of speech, to obtain the extensive N tuple being mixed with morphology with part of speech;And according in the result marked and extensive N tuple The morphology feature of participle and part of speech feature, calculate the extraction probability that the participle in extensive N tuple is a part for multi word unit As part of speech fault tolerance information, to generate the fault-tolerant template of part of speech.
According to a further embodiment of the invention, there is provided a device for training an artificial neural network, the artificial neural network being used to extract multi-word units from sentences, the device including: a linguistic feature acquiring means that, for each of multiple segment blocks obtained by segmenting each training sentence, obtains one or more linguistic features of the segments in the block as feature values, where the multi-word units in the training sentences are annotated; an input means that inputs the feature values into the artificial neural network as parameters of the network; a judging means that uses the artificial neural network to compute a first probability that a segment in each block is part of a multi-word unit and a second probability that it is not, and judges from the comparison of the first and second probabilities whether the segment is part of a multi-word unit; and a training means that trains the artificial neural network according to the judgment results and the annotation results. The device further includes a feedback information acquiring means that obtains the judgment result of the segment block preceding the current block as feedback information, which also serves as a feature value of the segments in the current block.
According to the present invention, by applying an artificial neural network with a feedback structure to the identification and extraction of multi-word units, the accuracy and efficiency of identifying and extracting multi-word units can be improved.
Brief description of the drawings
The present invention may be better understood by referring to the description given below in conjunction with the accompanying drawings, in which the same or similar reference signs are used throughout to denote the same or similar parts. The drawings, together with the following detailed description, are included in and form part of this specification, and serve to further illustrate preferred embodiments of the invention and to explain its principles and advantages. In the drawings:
Fig. 1 is a schematic flowchart illustrating a method of extracting multi-word units from a sentence according to an embodiment of the invention;
Fig. 2 is a schematic diagram illustrating the use of an artificial neural network with a feedback structure to extract multi-word units from a sentence according to an embodiment of the invention;
Fig. 3 is a schematic flowchart illustrating a method of extracting multi-word units from a sentence using N-tuples according to an embodiment of the invention;
Fig. 4 is a schematic diagram illustrating the extraction of multi-word units from a sentence using N-tuples according to an embodiment of the invention;
Fig. 5 is a schematic flowchart illustrating a method of using N-tuples to obtain word-form extraction probabilities and/or part-of-speech extraction probabilities according to an embodiment of the invention;
Fig. 6 is a schematic flowchart illustrating a method of performing part-of-speech fault tolerance using N-tuples according to an embodiment of the invention;
Fig. 7 is a schematic diagram illustrating part-of-speech fault tolerance using N-tuples according to an embodiment of the invention;
Fig. 8 is a schematic block diagram illustrating a device for extracting multi-word units from a sentence according to an embodiment of the invention;
Fig. 9 is a schematic block diagram illustrating a device for extracting multi-word units from a sentence according to another embodiment of the invention;
Fig. 10 is a schematic block diagram illustrating a device for extracting multi-word units from a sentence according to yet another embodiment of the invention;
Fig. 11 is a schematic block diagram illustrating a device for extracting multi-word units from a sentence according to still another embodiment of the invention;
Fig. 12 is a schematic flowchart illustrating a method of training an artificial neural network for extracting multi-word units from a sentence according to an embodiment of the invention;
Fig. 13 is a schematic flowchart illustrating a method of using N-tuples to train an artificial neural network for extracting multi-word units from a sentence according to an embodiment of the invention;
Fig. 14 is a schematic flowchart illustrating a method of generating word-form templates and/or part-of-speech templates using N-tuples according to an embodiment of the invention;
Fig. 15 is a schematic flowchart illustrating a method of generating a part-of-speech fault-tolerant template using N-tuples according to an embodiment of the invention;
Fig. 16 is a schematic diagram illustrating the generation of a part-of-speech fault-tolerant template using N-tuples according to an embodiment of the invention;
Fig. 17 is a schematic block diagram illustrating a device for training an artificial neural network for extracting multi-word units from a sentence according to an embodiment of the invention;
Fig. 18 is a schematic block diagram illustrating a device for training an artificial neural network for extracting multi-word units from a sentence according to another embodiment of the invention;
Fig. 19 is a schematic block diagram illustrating a device for training an artificial neural network for extracting multi-word units from a sentence according to yet another embodiment of the invention;
Fig. 20 is a schematic block diagram illustrating a device for training an artificial neural network for extracting multi-word units from a sentence according to still another embodiment of the invention; and
Fig. 21 is a schematic block diagram illustrating an information processing device that can be used to implement an embodiment of the invention.
Detailed description of the invention
Exemplary embodiments of the present invention are described below in conjunction with the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual implementation, many implementation-specific decisions must be made in order to achieve the developer's specific goals, and these decisions may vary from one implementation to another.
It should also be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the device structures closely related to the solution of the present invention, and other details of little relevance to the invention are omitted.
A method of extracting multi-word units from a sentence according to an embodiment of the invention is described below in conjunction with Fig. 1 and Fig. 2. Fig. 1 is a schematic flowchart illustrating the method, and Fig. 2 is a schematic diagram illustrating the use of an artificial neural network with a feedback structure to extract multi-word units from a sentence according to an embodiment of the invention.
As shown in Fig. 1, the process starts at S100 and then proceeds to S102.
At S102, for each of multiple segment blocks obtained by segmenting the sentence, one or more linguistic features of the segments in the block are obtained as feature values.
The sentences in the corpus are segmented, so that each sentence is split into multiple segment blocks, where a block may contain at least one segment. The blocks obtained by splitting are processed one by one according to the word order of the original sentence. For example, each segment in a block may be processed to obtain one or more of its linguistic features, such as one or more of the following: the part of speech of the segment, its word form, its sequence number, or its occurrence probability. Those skilled in the art will appreciate that the linguistic features of a segment are not limited to the examples listed above. After the linguistic features of the segments are obtained, they can be used as feature values in subsequent processing.
For example, the sentence 「初始使用引物的步骤」 ("the step of initially applying a primer") is segmented into 「初始/使用/引/物/的/步骤」; that is, the sentence is split into the segment blocks {「初始」, 「使用」, 「引」, 「物」, 「的」, 「步骤」}, where each block contains one segment (note that the word 「引物」, "primer", has here been split into the two segments 「引」 and 「物」). The segments in the blocks are then processed one by one in the order 「初始」→「使用」→「引」→「物」→「的」→「步骤」. For example, the segments may be processed to obtain their respective parts of speech {「初始」 adjective, 「使用」 verb, 「引」 noun, 「物」 noun, 「的」 preposition, 「步骤」 noun}. Those skilled in the art will appreciate that other linguistic features of these segments can also be obtained; details are omitted here.
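Gathering the features named at S102 for this example can be sketched as below; the occurrence probabilities are made-up placeholders, and the function name is an assumption rather than the patent's implementation.

```python
def segment_features(segments, pos_tags, occurrence_prob):
    """Collect per-segment linguistic features: word form, part of speech,
    1-based sequence number, and occurrence probability."""
    return [
        {"form": seg, "pos": pos, "seq": i + 1,
         "p_occ": occurrence_prob.get(seg, 0.0)}
        for i, (seg, pos) in enumerate(zip(segments, pos_tags))
    ]

feats = segment_features(
    ["初始", "使用", "引", "物", "的", "步骤"],
    ["adjective", "verb", "noun", "noun", "preposition", "noun"],
    {"初始": 0.43},  # hypothetical occurrence probability
)
```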
After S102, the process proceeds to S104. At S104, the feature values are input into the artificial neural network as parameters of the network.
As shown in Fig. 2, each circle in the artificial neural network 205 represents one or more neurons used to process the information labeled in the circle. The neurons in the artificial neural network 205 are organized into three layers: an input layer 202, a hidden layer 203, and an output layer 204. The value of a neuron in a later layer is computed from the values of the neurons in the preceding layer. The black arrows in Fig. 2 represent the flow of information in the artificial neural network 205; the neurons of adjacent layers are fully connected, and information flows from the preceding layer to the later layer. Although only one hidden layer 203 is shown in Fig. 2, those skilled in the art will appreciate that the hidden layer 203 may include two or more layers according to actual needs.
As shown in Fig. 2, in the input layer 202 of the artificial neural network 205, the t feature values {feature value 1, feature value 2, ..., feature value i, ..., feature value t-1, feature value t} of the segment currently being processed are input into the artificial neural network 205 as its parameters, where i and t are natural numbers greater than or equal to 1, and 1 ≤ i ≤ t. The one or more linguistic features of the segment extracted in step S102 above may be used as these feature values, for example the part of speech of the segment, its word form, its sequence number, or its occurrence probability.
Again taking the sentence 「初始使用引物的步骤」 as an example, for the segment 「初始」 one can obtain, for example, its part of speech ("adjective"), its word form 「初始」, its sequence number "1", and its occurrence probability "0.43" as the feature values of the segment 「初始」, and input these feature values into the artificial neural network 205 as its parameters.
After S104, the process proceeds to S106. At S106, the artificial neural network is used to compute a first probability that a segment in each block is part of a multi-word unit and a second probability that it is not, and whether the segment is part of a multi-word unit is judged from the first and second probabilities.
After the feature values are input into the artificial neural network 205 as its parameters, the network 205 determines the value of the current neuron according to the following formula:
f(x) = K((∑_i w_i × g_i(x)) + biasW + biasV)
where K denotes the activation function; for example, a sigmoid function may be used as the activation function. w_i denotes the weight between the current neuron and the i-th neuron of the preceding layer, represented by a black line in Fig. 2. g_i(x) denotes the value of the i-th neuron of the preceding layer connected to the current neuron by a black line. biasW and biasV denote the bias weight and the bias value of the current neuron, respectively. Those skilled in the art will appreciate that the above activation function and the formula for determining the value of the current neuron are merely exemplary; an activation function of another form, or a formula of another form, may also be used to determine the value of the current neuron.
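The neuron-value formula f(x) = K((∑_i w_i × g_i(x)) + biasW + biasV) can be implemented directly. The sigmoid below is one possible choice of activation K, used here as an assumption since the exact form is left open.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_value(weights, prev_values, bias_w, bias_v, activation=sigmoid):
    """f(x) = K((sum_i w_i * g_i(x)) + biasW + biasV)."""
    weighted_sum = sum(w * g for w, g in zip(weights, prev_values))
    return activation(weighted_sum + bias_w + bias_v)

# The weighted sum and the two biases cancel here, so the activation sees 0.
v = neuron_value([0.5, -0.25], [1.0, 2.0], bias_w=0.1, bias_v=-0.1)  # sigmoid(0) = 0.5
```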
In the artificial neural network 205 shown in Fig. 2, the values of the neurons in the input layer 202 are simply the feature values themselves, and each black line carries a specific weight. Apart from the neurons in the input layer 202, the neurons in the hidden layer 203 and the output layer 204 each have a bias weight and a bias value.
As shown in Fig. 2, the output layer 204 of the artificial neural network 205 includes two neurons: a neuron 206 representing the first probability that the currently processed segment is part of a multi-word unit, and a neuron 207 representing the second probability that the currently processed segment is not part of a multi-word unit. Specifically, the value of neuron 206 represents the likelihood or probability, computed by the artificial neural network 205, that the currently processed segment is part of a multi-word unit. For example, if the value of neuron 206 is 0.9, the network 205 has determined by computation that the probability that the currently processed segment is part of a multi-word unit is 0.9. Similarly, the value of neuron 207 represents the likelihood or probability, computed by the network 205, that the currently processed segment is not part of a multi-word unit. For example, if the value of neuron 207 is 0.6, the network 205 has determined by computation that the probability that the currently processed segment is not part of a multi-word unit is 0.6.
After the first possibility represented by the value of the neuron 206 and the second possibility represented by the value of the neuron 207 have been computed, the first possibility and the second possibility may be compared, as shown at 208 in Fig. 2. If the first possibility is greater than or equal to the second possibility, the currently processed word segment is judged to be part of a multi-word unit, as shown at 210 in Fig. 2. If the first possibility is less than the second possibility, the currently processed word segment is judged not to be part of a multi-word unit, as shown at 209 in Fig. 2. For example, for the currently processed word segment, if the first possibility represented by the value of the neuron 206 is 0.9 and the second possibility represented by the value of the neuron 207 is 0.6, then, since the first possibility 0.9 is greater than the second possibility 0.6, the currently processed word segment is judged to be part of a multi-word unit. Then, at 211 in Fig. 2, the sequence number n of the word segment may be incremented by 1 to obtain the word segment with sequence number n+1, so that the word segment with sequence number n+1 can be processed.
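The comparison at 208-210 reduces to a simple decision rule, sketched below for illustration:

```python
def judge_segment(first_possibility, second_possibility):
    # Steps 208/209/210 in Fig. 2: a word segment is judged to be part
    # of a multi-word unit when the first possibility (part of a
    # multi-word unit) is greater than or equal to the second
    # possibility (not part of a multi-word unit)
    return first_possibility >= second_possibility
```

With the example values from the text, `judge_segment(0.9, 0.6)` yields a positive judgment.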
After S106, the process proceeds to S108. At S108, two or more adjacent word segments each judged to be part of a multi-word unit are extracted to form a multi-word unit.
Again taking the sentence "the step of initially applying the primer" as an example: among the word segments {"initial", "apply", "draw", "thing", "of", "step"} in the word-segment blocks obtained by segmenting the sentence, suppose that the segment "draw" and the segment "thing" are judged to be parts of a multi-word unit. Since "draw" and "thing" are two adjacent segments, the segments "draw" and "thing" are extracted to form the multi-word unit "primer". If more than two adjacent segments are judged to be parts of a multi-word unit, those more than two adjacent segments are likewise extracted to form a multi-word unit.
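The extraction at S108 amounts to collecting runs of two or more adjacent positively judged segments. A sketch, using hypothetical segments "pri" and "mer" joined without spaces (as would be done for Chinese text):

```python
def extract_multiword_units(segments, flags):
    """Step S108: collect runs of two or more adjacent word segments
    that were judged to be part of a multi-word unit."""
    units, run = [], []
    for segment, is_part in zip(segments, flags):
        if is_part:
            run.append(segment)
        else:
            if len(run) >= 2:          # only runs of length >= 2 form a unit
                units.append("".join(run))
            run = []
    if len(run) >= 2:                  # flush a run ending at the sentence tail
        units.append("".join(run))
    return units
```

For example, `extract_multiword_units(["initial", "pri", "mer", "step"], [False, True, True, False])` returns `["primer"]`.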
After S108, the process proceeds to S110. At S110, the result of the judgment on the previous word-segment block adjacent to the current word-segment block is obtained as feedback information, and the feedback information also serves as a feature quantity of the word segments in the current word-segment block.
As shown in Fig. 2, suppose that n, n+1, and so on denote the sequence numbers of the word-segment blocks being processed. After the block with sequence number n has been processed, the sequence number is incremented by 1 so that the next word-segment block (i.e., the block with sequence number n+1) is processed. At this point, the block with sequence number n+1 becomes the current word-segment block, and the block with sequence number n is the previous word-segment block adjacent to the current one. Because the previous block with sequence number n has already been processed, the judgment results as to whether each word segment in that block is or is not part of a multi-word unit have already been obtained. Therefore, as shown in Fig. 2, the judgment results for the previous block with sequence number n can be fed back to the input layer 202 of the artificial neural network 205 as feedback information, and, when the current block with sequence number n+1 is processed, this feedback information is also input into the artificial neural network 205 as a feature quantity of the word segments in the current block with sequence number n+1. That is to say, the judgment results for the previous block with sequence number n participate in the judgment process for the current block with sequence number n+1.
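The feedback mechanism just described can be sketched as appending the previous block's judgments to the current block's feature vector. The encoding of a judgment as 1.0/0.0 is an assumption made for illustration:

```python
def features_with_feedback(current_features, previous_judgments):
    # The judgments on the previous word-segment block (True = part of a
    # multi-word unit, False = not) are fed back to the input layer as
    # additional feature quantities for the current block
    feedback = [1.0 if judged else 0.0 for judged in previous_judgments]
    return current_features + feedback
```

In this way the decision on block n participates, as input, in the decision on block n+1.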
Since the artificial neural network 205 has a feedback structure, i.e., when judging whether a word segment in the current word-segment block is part of a multi-word unit, the artificial neural network 205 also takes into account whether the word segments in the previous block adjacent to the current block are parts of a multi-word unit, the accuracy and efficiency with which the artificial neural network 205 judges whether a word segment is part of a multi-word unit can be improved to a great extent.
Finally, this process terminates at S112.
According to the method for the present embodiment, by the artificial neural network with feedback configuration being applied to the knowledge of multi word unit Not and extract, the identification of multi word unit and the accuracy and efficiency of extraction can be improved.
A method of extracting multi-word units in a sentence using N-tuples according to an embodiment of the present invention is described below in conjunction with Fig. 3 and Fig. 4. Fig. 3 is a schematic flowchart illustrating the method of extracting multi-word units in a sentence using N-tuples according to an embodiment of the present invention, and Fig. 4 is a schematic diagram illustrating the extraction of multi-word units in a sentence using N-tuples according to an embodiment of the present invention.
As shown in Fig. 3, the process starts at S300. Then, the process proceeds to S302.
At S302, N adjacent word segments in the sentence are successively combined into N-tuples to form word-segment blocks, where N is a natural number greater than or equal to 2.
N adjacent word segments in the sentence can be combined into an N-tuple to form a word-segment block, and the subsequent processing is then carried out in units of N-tuples. For example, the current word segment and the two word segments adjacent to it on the left and right can be combined into a triple. For a word segment at the beginning of the sentence, the first element of the triple is empty; for a word segment at the end of the sentence, the last element of the triple is empty.
Again taking the sentence "the step of initially applying the primer" as an example, as shown by the dark squares in Fig. 4, the word segments "initial" and "apply" in the above sentence are successively combined into the triple <NULL, initial, apply>, the segments "initial", "apply", and "draw" are combined into the triple <initial, apply, draw>, ..., and the segments "of" and "step" are combined into the triple <of, step, NULL>, where NULL denotes an empty element. It is easy to understand that the triple here is merely one example of a word-segment block containing three word segments.
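The triple formation above, with NULL padding at the sentence boundaries, can be sketched as a sliding window. This assumes an odd, centered window (as in the triple example); `None` stands in for NULL:

```python
def make_ngrams(segments, n=3):
    """Step S302: slide a window of size n over the word segments;
    positions before the sentence head or past the sentence tail are
    filled with None (NULL)."""
    pad = (n - 1) // 2
    padded = [None] * pad + segments + [None] * pad
    return [tuple(padded[i:i + n]) for i in range(len(segments))]
```

For example, `make_ngrams(["initial", "apply", "draw"])` yields `[(None, "initial", "apply"), ("initial", "apply", "draw"), ("apply", "draw", None)]`.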
After the N-tuples have been determined, the linguistic features of each element in an N-tuple can be obtained. For example, a part-of-speech analysis tool can be used to obtain the part of speech of each element in the N-tuple; for instance, the Stanford part-of-speech tagger may be used. As shown in Fig. 4, for the triple <initial, apply, draw>, it can be obtained that the part of speech of the first element "initial" is the adjective JJ, the part of speech of the second element "apply" is the verb VBG, and the part of speech of the third element "draw" is the noun NN. Alternatively, corresponding tools may also be used to obtain other linguistic features of each element in the N-tuple, which are not described again here.
After the linguistic features of each element in the N-tuple have been obtained, all of the obtained linguistic features of each element can be used as the attributes of that element. For example, as shown in Fig. 4, for each element in the N-tuple, a total of m attributes {attribute 1, attribute 2, attribute 3, ..., attribute m} are listed, where m is a natural number greater than or equal to 1. The above m attributes may be, for example, the part of speech of the word segment, the word form of the word segment, the sequence number of the word segment, or the occurrence probability of the word segment, but are not limited thereto. For example, for the first element "initial" in the triple <initial, apply, draw>, it may be obtained that the value of its attribute 1 is "1", the value of attribute 2 is "2", the value of attribute 3 is "23", ..., and the value of attribute m is "false".
In units of N-tuples, the m attributes of each element in the N-tuple can be successively input into the artificial neural network (ANN) 205 as feature quantities for computation, so as to judge whether that element is part of a multi-word unit. The specific judgment process and the subsequent processing are similar to the processing of steps S106 to S110 in Fig. 1, except that the number of word segments contained in a word-segment block differs, so the details are not repeated here. A cross in Fig. 4 indicates that the corresponding element is judged not to be part of a multi-word unit, and a check mark indicates that the corresponding element is judged to be part of a multi-word unit. Two or more consecutive check marks indicate one complete multi-word unit. As shown in Fig. 4, because the element "draw" corresponds to a check mark, the element "thing" also corresponds to a check mark, and the elements "draw" and "thing" are adjacent to each other, "primer" is extracted as a multi-word unit.
Finally, this process terminates at S304.
According to the method for the present embodiment, the multi word unit processing to extract in statement can be carried out in units of N tuple, from And improve the identification of multi word unit and the accuracy and efficiency of extraction further.
A method of obtaining a word-form extraction probability and/or a part-of-speech extraction probability using N-tuples according to an embodiment of the present invention is described below in conjunction with Fig. 5. Fig. 5 is a schematic flowchart illustrating the method of obtaining a word-form extraction probability and/or a part-of-speech extraction probability using N-tuples according to an embodiment of the present invention.
As shown in Fig. 5, the process starts at S500. Then, the process proceeds to S502.
At step S502, according to the word-form features of the word segments in the N-tuple, the word-form extraction probability that a word segment in the N-tuple is part of a multi-word unit is obtained from a word-form template, and the word-form extraction probability also serves as a feature quantity of the word segments in the N-tuple.
For example, for the triple <initial, apply, draw>, the word-form feature of the word segments in this triple is "initial, apply, draw". According to this word-form feature, the corresponding word-form sequence can be looked up in the word-form template, so as to obtain the word-form extraction probability corresponding to it; this probability represents the probability that the word segments "initial", "apply", or "draw" in the triple <initial, apply, draw> are part of a multi-word unit. The obtained word-form extraction probability can then also be input into the artificial neural network 205 as a feature quantity of the word segments in the triple <initial, apply, draw>. If no word-form extraction probability is found, processing is carried out according to a preset default probability. The word-form template stores in advance the word forms of N-tuples and their corresponding word-form extraction probabilities, each of which represents the probability that the word segments in the corresponding N-tuple are part of a multi-word unit. Those skilled in the art will understand that the word-form template may be preset. Alternatively, the word-form template may also be generated by training the artificial neural network 205. As a non-limiting example, how the word-form template is generated by training the artificial neural network 205 will be described in detail below.
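The template lookup with a fallback default can be sketched as a dictionary lookup. The value 0.5 stands in for the preset default probability and is an assumption for illustration:

```python
DEFAULT_PROBABILITY = 0.5  # placeholder for the preset default probability

def extraction_probability(ngram_features, template):
    # The template maps a feature sequence (word forms or parts of
    # speech of the N-tuple) to the stored probability that its word
    # segments are part of a multi-word unit; when the sequence is
    # absent, the preset default probability is used instead
    return template.get(tuple(ngram_features), DEFAULT_PROBABILITY)
```

The same lookup shape applies to both the word-form template and the part-of-speech template described at S504.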
After S502, the process proceeds to S504. At S504, according to the part-of-speech features of the word segments in the N-tuple, the part-of-speech extraction probability that a word segment in the N-tuple is part of a multi-word unit is obtained from a part-of-speech template, and the part-of-speech extraction probability also serves as a feature quantity of the word segments in the N-tuple.
Similarly, for example, for the triple <initial, apply, draw>, the part-of-speech feature of the word segments in this triple is "adjective, verb, noun". According to this part-of-speech feature "adjective, verb, noun", the corresponding part-of-speech sequence can be looked up in the part-of-speech template, so as to obtain the corresponding part-of-speech extraction probability; this probability represents the probability that the word segments "initial", "apply", or "draw" in the triple <initial, apply, draw> are part of a multi-word unit. The obtained part-of-speech extraction probability can then also be input into the artificial neural network 205 as a feature quantity of the word segments in the triple <initial, apply, draw>. If no part-of-speech extraction probability is found, processing is carried out according to a preset default probability. The part-of-speech template stores in advance the parts of speech of N-tuples and their corresponding part-of-speech extraction probabilities, each of which represents the probability that the word segments in the corresponding N-tuple are part of a multi-word unit. Those skilled in the art will understand that the part-of-speech template may be preset. Alternatively, the part-of-speech template may also be generated by training the artificial neural network 205. As a non-limiting example, how the part-of-speech template is generated by training the artificial neural network 205 will be described in detail below.
Finally, this process terminates at S506.
Those skilled in the art will appreciate that steps S502 and S504 shown in Fig. 5 may be performed sequentially or in parallel, or only one of steps S502 and S504 may be performed. According to the method of the present embodiment, the word-form extraction probability and/or the part-of-speech extraction probability can be obtained from the word-form template and the part-of-speech template according to the N-tuple, so that existing knowledge about multi-word units is utilized and the feature quantities input into the artificial neural network are increased, thereby further improving the accuracy and efficiency of recognizing and extracting multi-word units.
A method of performing part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention is described below in conjunction with Fig. 6 and Fig. 7. Fig. 6 is a schematic flowchart illustrating the method of performing part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention, and Fig. 7 is a schematic diagram illustrating part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention.
As shown in Figure 6, this process starts from S600.Then, this process proceeds to S602.
At step S602, the word forms of the word segments in the N-tuple are replaced with the corresponding parts of speech, so as to obtain generalized N-tuples in which word forms and parts of speech are mixed.
The method of performing part-of-speech fault tolerance using N-tuples according to an embodiment of the present invention is described below in conjunction with Fig. 7. As shown in Fig. 7, at 702, an N-tuple that may contain an erroneous part of speech is selected for processing. For example, among the word segments {"antigen", "release", "thing", "release", "antigen"} obtained by segmenting the sentence "the antigen releasing device released antigen", the segments "antigen", "release", and "thing" can be formed into a triple <antigen, release, thing>, where the part of speech of the segment "antigen" is tagged as "noun", the part of speech of the segment "release" is tagged as "verb", and the part of speech of the segment "thing" is tagged as "noun". Suppose the triple to be processed is <antigen, release, thing>, and "antigen releasing device" should be one multi-word unit; however, because the part of speech of the segment "release" in it has been erroneously tagged as a verb, the segment "release" would not be labeled as part of a multi-word unit during analysis, and the whole multi-word expression "antigen releasing device" therefore could not be correctly recognized.
As shown in Fig. 7, N-tuple generalization is performed at 704. The generalization process for an N-tuple is described below in conjunction with Fig. 16. As shown in Fig. 16, at 1602, the N-tuple to be generalized is determined, and the number N of elements in this N-tuple is determined. At 1604, the number x of elements to be generalized is selected; x typically starts from 1, and, according to the value of x, any x word segments are generalized into their parts of speech. At 1606, x elements are selected from the N-tuple to be generalized according to the value of x, all possible combinations are listed, each selected element's word form is replaced with its part of speech and put back into the N-tuple, and all possible generalized N-tuples are stored. At 1608, it is judged whether x is equal to N; if not, x is incremented by 1 at 1610 to obtain a new value of x at 1612. Then, the processing at 1604, 1606, and 1608 is repeated according to the new value of x until x is equal to N.
Again taking the word segments {"antigen", "release", "thing", "release", "antigen"} obtained by segmenting the sentence "the antigen releasing device released antigen" as an example, suppose the triple <antigen, release, thing> is to be generalized; then the number N of elements in this triple is 3, and x can be 1, 2, or 3. When x is 1, the word form of one element in the triple <antigen, release, thing> is replaced with its part of speech, so that the following generalized triples can be obtained: <noun, release, thing>, <antigen, verb, thing>, <antigen, release, noun>. When x is 2, the word forms of two elements in the triple <antigen, release, thing> are replaced with their parts of speech, so that the following generalized triples can be obtained: <noun, verb, thing>, <antigen, verb, noun>, <noun, release, noun>. When x is 3, the word forms of all three elements in the triple <antigen, release, thing> are replaced with their parts of speech, so that the following generalized triple can be obtained: <noun, verb, noun>.
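The enumeration of all generalized N-tuples described above can be sketched with combinations over element positions. This is an illustrative sketch of the loop over x = 1..N, not the patent's implementation:

```python
from itertools import combinations

def generalize(ngram, pos_tags):
    """Enumerate every generalized N-tuple obtained by replacing the
    word forms of x elements (x = 1..N) with their part-of-speech tags."""
    n = len(ngram)
    results = []
    for x in range(1, n + 1):                       # 1604/1608: x from 1 up to N
        for positions in combinations(range(n), x):  # 1606: all combinations of x elements
            results.append(tuple(
                pos_tags[i] if i in positions else ngram[i]
                for i in range(n)))
    return results
```

For the triple ("antigen", "release", "thing") with tags ("noun", "verb", "noun"), this yields the seven generalized triples listed above (C(3,1) + C(3,2) + C(3,3) = 7).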
After S602, the process proceeds to S604. At S604, according to the word-form features and part-of-speech features of the word segments in the generalized N-tuples, the extraction probability that a word segment in the generalized N-tuple is part of a multi-word unit is obtained from a part-of-speech fault-tolerance template as part-of-speech fault-tolerance information, and the part-of-speech fault-tolerance information also serves as a feature quantity of the word segments in the N-tuple.
All possible generalized N-tuples can be obtained through the processing of step S602 described above. Then, as shown in Fig. 7, at 706, the corresponding generalized N-tuples can be looked up in the part-of-speech fault-tolerance template according to all possible generalized N-tuples, so as to obtain the extraction probabilities corresponding to the generalized N-tuples as part-of-speech fault-tolerance information; each extraction probability represents the probability that the word segments in the corresponding generalized N-tuple are part of a multi-word unit. The obtained part-of-speech fault-tolerance information can also be input into the artificial neural network 205 as a feature quantity of the word segments in the N-tuple and combined with the other feature quantities for training in the artificial neural network at 708, so that at 710 the artificial neural network strengthens its influence on the judgment result. Therefore, as described at 712, when an erroneous part of speech occurs in a target element, the deviation caused by the part-of-speech error can be reduced, thereby achieving part-of-speech fault tolerance.
If no extraction probability serving as part-of-speech fault-tolerance information is found, processing is carried out according to a preset default probability. The part-of-speech fault-tolerance template stores in advance generalized N-tuples and their corresponding extraction probabilities, each of which represents the probability that the word segments in the corresponding generalized N-tuple are part of a multi-word unit. Those skilled in the art will understand that the part-of-speech fault-tolerance template may be preset. Alternatively, the part-of-speech fault-tolerance template may also be generated by training the artificial neural network 205. As a non-limiting example, how the part-of-speech fault-tolerance template is generated by training the artificial neural network 205 will be described in detail below.
Again taking the above triple <antigen, release, thing> as an example, a series of generalized triples can be obtained by generalization: <noun, release, thing>, <antigen, verb, thing>, <antigen, release, noun>, <noun, verb, thing>, <antigen, verb, noun>, <noun, release, noun>, <noun, verb, noun>. According to each of the above series of generalized triples, the corresponding generalized triple is looked up in the part-of-speech fault-tolerance template, so as to obtain, as part-of-speech fault-tolerance information, the extraction probability that the word segments in the triple <antigen, release, thing> are part of a multi-word unit.
Finally, this process terminates at S606.
According to the method for the present embodiment, the deviation of the eigenvalue caused by part-of-speech tagging mistake can be alleviated, even if therefore Error message is refer to, it is also possible to correctly identify and extract the multi word unit in statement during part-of-speech tagging, thus can To improve identification and the accuracy and efficiency of extraction of multi word unit further.
Devices for extracting multi-word units in a sentence according to embodiments of the present invention are described below in conjunction with Fig. 8 to Fig. 11.
Fig. 8 is a schematic block diagram illustrating a device for extracting multi-word units in a sentence according to an embodiment of the present invention. As shown in Fig. 8, the device 800 for extracting multi-word units in a sentence includes: a linguistic feature acquiring unit 802, which, for each of the multiple word-segment blocks obtained by segmenting the sentence, obtains one or more linguistic features of the word segments in that block as feature quantities; an input unit 804, which inputs the feature quantities into an artificial neural network as parameters of the artificial neural network; a judging unit 806, which uses the artificial neural network to compute, for each word segment in each word-segment block, the first possibility that the word segment is part of a multi-word unit and the second possibility that the word segment is not part of a multi-word unit, and judges whether the word segment is part of a multi-word unit according to the first possibility and the second possibility; an extraction unit 808, which extracts two or more adjacent word segments each judged to be part of a multi-word unit, so as to form a multi-word unit; and a feedback information acquiring unit 810, which obtains the result of the judgment on the previous word-segment block adjacent to the current word-segment block as feedback information, and also uses the feedback information as a feature quantity of the word segments in the current word-segment block.
It should be pointed out that the terms and expressions used in the embodiments relating to the devices correspond to the terms and expressions used in the foregoing descriptions of the embodiments of the methods according to the embodiments of the present invention, and are therefore not repeated here.
Fig. 9 is a schematic block diagram illustrating a device for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Fig. 9, the device 900 for extracting multi-word units in a sentence includes a linguistic feature acquiring unit 802, an input unit 804, a judging unit 806, an extraction unit 808, a feedback information acquiring unit 810, and a combining unit 902. The linguistic feature acquiring unit 802, input unit 804, judging unit 806, extraction unit 808, and feedback information acquiring unit 810 in the device 900 are identical to the corresponding units in the device 800 for extracting multi-word units in a sentence, and their details are not repeated here. In addition, the combining unit 902 in the device 900 successively combines N adjacent word segments in the sentence into N-tuples to form word-segment blocks, where N is a natural number greater than or equal to 2.
Fig. 10 is a schematic block diagram illustrating a device for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Fig. 10, the device 1000 for extracting multi-word units in a sentence includes a linguistic feature acquiring unit 802, an input unit 804, a judging unit 806, an extraction unit 808, a feedback information acquiring unit 810, a combining unit 902, a word-form extraction probability acquiring unit 1002, and a part-of-speech extraction probability acquiring unit 1004. The linguistic feature acquiring unit 802, input unit 804, judging unit 806, extraction unit 808, feedback information acquiring unit 810, and combining unit 902 in the device 1000 are identical to the corresponding units in the device 900 for extracting multi-word units in a sentence, and their details are not repeated here. In addition, the word-form extraction probability acquiring unit 1002 in the device 1000, according to the word-form features of the word segments in the N-tuple, obtains from the word-form template the word-form extraction probability that a word segment in the N-tuple is part of a multi-word unit, and also uses the word-form extraction probability as a feature quantity of the word segments in the N-tuple; the part-of-speech extraction probability acquiring unit 1004, according to the part-of-speech features of the word segments in the N-tuple, obtains from the part-of-speech template the part-of-speech extraction probability that a word segment in the N-tuple is part of a multi-word unit, and also uses the part-of-speech extraction probability as a feature quantity of the word segments in the N-tuple.
Fig. 11 is a schematic block diagram illustrating a device for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Fig. 11, the device 1100 for extracting multi-word units in a sentence includes a linguistic feature acquiring unit 802, an input unit 804, a judging unit 806, an extraction unit 808, a feedback information acquiring unit 810, a combining unit 902, a generalization unit 1102, and a part-of-speech fault-tolerance information acquiring unit 1104. The linguistic feature acquiring unit 802, input unit 804, judging unit 806, extraction unit 808, feedback information acquiring unit 810, and combining unit 902 in the device 1100 are identical to the corresponding units in the device 900 for extracting multi-word units in a sentence, and their details are not repeated here. In addition, the generalization unit 1102 in the device 1100 replaces the word forms of the word segments in the N-tuple with the corresponding parts of speech, so as to obtain generalized N-tuples in which word forms and parts of speech are mixed; the part-of-speech fault-tolerance information acquiring unit 1104 obtains the probability that a word segment in a generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerance information, and also uses the part-of-speech fault-tolerance information as a feature quantity of each word segment in the N-tuple.
Each of the devices and/or units in Figs. 8 to 11 above may, for example, be configured to operate in the manner of the corresponding steps of the related methods. For details, refer to the embodiments described above for the methods according to the embodiments of the present application, which are not repeated here.
A method of training an artificial neural network for extracting multi-word units in a sentence according to an embodiment of the present invention is described below in conjunction with Fig. 12. Fig. 12 is a schematic flowchart illustrating the method of training an artificial neural network for extracting multi-word units in a sentence according to an embodiment of the present invention.
As shown in figure 12, this process starts at S1200.Then, this process proceeds to S1202.
At S1202, for each of the multiple word-segment blocks obtained by segmenting each training sentence, one or more linguistic features of the word segments in that block are obtained as feature quantities, wherein the multi-word units in the training sentences have been annotated.
Except that what is processed are the multiple word-segment blocks obtained by segmenting each training sentence, the processing of S1202 is substantially the same as the processing of S102 in Fig. 1, and its details are not repeated here. In addition, in each training sentence, the multi-word units therein have been annotated.
After S1202, the process proceeds to S1204. At S1204, the feature quantities are input into the artificial neural network as parameters of the artificial neural network.
Except that what is processed are the multiple word-segment blocks obtained by segmenting each training sentence, the processing of S1204 is substantially the same as the processing of S104 in Fig. 1, and its details are not repeated here.
After S1204, the process proceeds to S1206. At S1206, the artificial neural network is used to compute, for each word segment in each word-segment block, the first possibility that the word segment is part of a multi-word unit and the second possibility that the word segment is not part of a multi-word unit, and whether the word segment is part of a multi-word unit is judged according to the first possibility and the second possibility.
Except that what is processed are the multiple word-segment blocks obtained by segmenting each training sentence, the processing of S1206 is substantially the same as the processing of S106 in Fig. 1, and its details are not repeated here.
After S1206, the process proceeds to S1208. At S1208, the artificial neural network is trained according to the result of the judgment and the result of the annotation.
The training process of the artificial neural network 205 is the process of solving for the weights in the artificial neural network 205. The present invention uses the BP (Back Propagation, error back-propagation) algorithm to train the artificial neural network 205. The detailed process is as follows:
a) initialize the artificial neural network 205 and select randomly generated weights;
b) input the items of the training data with expected values into the artificial neural network 205 one by one, and compute the output values;
c) compare the difference between the output values and the expected values, and compute the error of each neuron in the artificial neural network 205;
d) adjust the weights to reduce the errors;
e) repeat steps b)-d) until the error is less than a predetermined threshold. Those skilled in the art will understand that the above predetermined threshold may be set based on empirical values or according to experiments.
The artificial neural network 205 is trained by solving for the weights layer by layer, from the output-layer neuron weights back to the hidden-layer neuron weights, calculating the change of each weight in turn. First, the error of each output-layer neuron is solved according to the following equation: δ_i = (d_i − o_i)·f′(net_i), where d_i is the desired output value of the i-th neuron, o_i is the real output value of the i-th neuron, and f′ is the derivative of the activation function. The error of each hidden-layer neuron is then calculated according to the following equation: δ_i^h = f′(net_i^h)·Σ_j w_ij·δ_j, where w_ij is the weight between the j-th output-layer neuron and the i-th hidden-layer neuron, δ_j is the error of the j-th output-layer neuron, o_i^h is the real output value of the i-th hidden-layer neuron (at which the derivative is evaluated), and the superscript h indicates that the neuron belongs to the hidden layer. The input layer merely passes its input values through as output values, so it has no error.
After the error of each neuron has been calculated, the adjustment amplitude of each weight can be calculated as Δw = ρ × δ_i × n_i, where ρ is the learning rate, δ_i is the error of the i-th neuron, and n_i is the input value carried into that neuron over the current connection. The new weight is the present weight plus Δw.
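The update rule above can be sketched as a minimal BP training step. This is an illustrative sketch, not the patent's implementation: it assumes a sigmoid activation (so f′(net) = o·(1 − o)), a single hidden layer, no bias terms, and all function and variable names (`train_step`, `init_weights`, `rho`) are ours.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_weights(rows, cols):
    # step a): initialize with randomly generated weights
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def train_step(x, target, w_ih, w_oh, rho=0.1):
    # forward pass: input -> hidden -> output
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_ih]
    o = [sigmoid(sum(w * hi for w, hi in zip(row, h))) for row in w_oh]
    # output-layer errors: delta_i = (d_i - o_i) * f'(net_i), with
    # f'(net) = o * (1 - o) for the sigmoid
    delta_o = [(d - oi) * oi * (1 - oi) for d, oi in zip(target, o)]
    # hidden-layer errors: delta_i^h = f'(net_i^h) * sum_j w_ij * delta_j
    delta_h = [hi * (1 - hi) * sum(w_oh[j][i] * delta_o[j] for j in range(len(o)))
               for i, hi in enumerate(h)]
    # weight adjustment: delta_w = rho * delta * n (n = input on the connection)
    for j, row in enumerate(w_oh):
        for i in range(len(row)):
            row[i] += rho * delta_o[j] * h[i]
    for i, row in enumerate(w_ih):
        for k in range(len(row)):
            row[k] += rho * delta_h[i] * x[k]
    return o  # output computed before this step's weight update
```

Repeating `train_step` over the training items until the error drops below a threshold corresponds to steps b)-e) above.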
Those skilled in the art will appreciate that the above method of training the artificial neural network 205 is merely exemplary; other methods may also be used to train the artificial neural network 205.
After S1208, the process proceeds to S1210. At S1210, the result of the judgment on the previous participle block adjacent to the current participle block is obtained as feedback information, and the feedback information is also used as a feature quantity of the current participle block.
The processing of S1210 is substantially the same as that of S110 in Fig. 1, except that the participle blocks processed here are the plurality of participle blocks obtained by segmenting each training sentence; the details are not repeated here.
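The feedback configuration of S1210 can be sketched as follows: the judgment made on the previous adjacent participle block is appended to the current block's feature vector before the network judges it. `judge` stands in for the trained network, and all names here are illustrative assumptions, not from the patent.

```python
def judge_with_feedback(blocks, judge):
    # blocks: a list of feature vectors, one per participle block, in order.
    # judge:  a callable standing in for the trained artificial neural network;
    #         it receives the feature vector extended with the feedback value.
    prev = 0.0  # no preceding participle block exists before the first one
    results = []
    for features in blocks:
        r = judge(features + [prev])  # feedback appended as an extra feature
        results.append(r)
        prev = r  # this judgment becomes the next block's feedback information
    return results
```

The choice of 0.0 as the initial feedback value for the first block is our assumption; the patent does not specify how the first block is handled.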
Finally, the process ends at S1212.
According to the method of the present embodiment, an artificial neural network with a feedback configuration can be obtained by training. Applying the trained artificial neural network to the recognition and extraction of multi-word units can improve the accuracy and efficiency of the recognition and extraction of multi-word units.
A method of training, with N-tuples, an artificial neural network for extracting multi-word units in a sentence according to an embodiment of the present invention is described below in conjunction with Figure 13. Figure 13 is a schematic flowchart illustrating the method of training, with N-tuples, the artificial neural network for extracting multi-word units in a sentence according to an embodiment of the present invention.
As shown in Figure 13, the process starts at S1300. Then, the process proceeds to S1302.
At S1302, adjacent N participles in the training sentence are successively combined into N-tuples to form participle blocks, where N is a natural number greater than or equal to 2.
The processing of S1302 is substantially the same as that of S302 in Fig. 3, except that the participle blocks processed here are the plurality of participle blocks obtained by segmenting each training sentence; the details are not repeated here.
Finally, the process ends at S1304.
According to the method of the present embodiment, the artificial neural network can be trained with existing knowledge of the N-tuples, such as part-of-speech combination knowledge and morphology combination knowledge. Applying the trained artificial neural network to extracting multi-word units in a sentence can further improve the accuracy and efficiency of the recognition and extraction of multi-word units.
A method of generating a morphology template and/or a part-of-speech template with N-tuples according to an embodiment of the present invention is described below in conjunction with Figure 14. Figure 14 is a schematic flowchart illustrating the method of generating a morphology template and/or a part-of-speech template with N-tuples according to an embodiment of the present invention.
As shown in Figure 14, the process starts at S1400. Then, the process proceeds to S1402.
In step S1402, according to the annotation results and the morphology features of the participles in the N-tuples, the morphology extraction probability that a participle in an N-tuple is annotated as part of a multi-word unit is calculated, so as to generate a morphology template.
For example, in the triple <initially, use, draw>, the participles "initially" and "use" are annotated as not being part of a multi-word unit, the participle "draw" is annotated as being part of a multi-word unit, and the morphology features of the participles in the triple <initially, use, draw> are "initially, use, draw". According to this information, the artificial neural network 205 can calculate the morphology extraction probability that the participle "initially", "use" or "draw" in the triple <initially, use, draw> is annotated as part of a multi-word unit, and store the morphology extraction probability in association with the triple corresponding to the current participle, thereby generating the morphology template.
In step S1404, according to the annotation results and the part-of-speech features of the participles in the N-tuples, the part-of-speech extraction probability that a participle in an N-tuple is part of a multi-word unit is calculated, so as to generate a part-of-speech template.
Similarly, for example, in the triple <initially, use, draw>, the participles "initially" and "use" are annotated as not being part of a multi-word unit, the participle "draw" is annotated as being part of a multi-word unit, and the part-of-speech features of the participles in the triple <initially, use, draw> are "adjective, verb, noun". According to this information, the artificial neural network 205 can calculate the part-of-speech extraction probability that the participle "initially", "use" or "draw" in the triple <initially, use, draw> is annotated as part of a multi-word unit, and store the part-of-speech extraction probability in association with the triple corresponding to the current participle, thereby generating the part-of-speech template.
Finally, the process ends at S1406.
Those skilled in the art will appreciate that steps S1402 and S1404 shown in Figure 14 may be performed sequentially or in parallel, or only one of them may be performed. According to the method of the present embodiment, the artificial neural network can be trained with N-tuples so as to generate a morphology template or a part-of-speech template. Applying the generated morphology template and part-of-speech template to the recognition and extraction of multi-word units can further improve the accuracy and efficiency of the recognition and extraction of multi-word units.
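One plain reading of the morphology-template construction in S1402 is a frequency estimate: for each observed triple and position, the extraction probability is the fraction of occurrences in which that participle was annotated as part of a multi-word unit. The sketch below follows that reading; the function name, data layout, and the frequency-ratio estimate itself are our assumptions, since the patent attributes the calculation to the artificial neural network 205 without spelling it out.

```python
from collections import defaultdict

def build_morphology_template(annotated_triples):
    # annotated_triples: list of triples [(w1, l1), (w2, l2), (w3, l3)],
    # where each label is True when that participle is annotated as part
    # of a multi-word unit.
    counts = defaultdict(lambda: [0, 0])  # (words, position) -> [hits, total]
    for triple in annotated_triples:
        words = tuple(w for w, _ in triple)
        for pos, (_, in_mwu) in enumerate(triple):
            counts[(words, pos)][0] += 1 if in_mwu else 0
            counts[(words, pos)][1] += 1
    # the template maps each (triple, position) to its extraction probability
    return {key: hits / total for key, (hits, total) in counts.items()}
```

The part-of-speech template of S1404 would be built the same way with part-of-speech features in place of word forms.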
A method of generating a part-of-speech fault-tolerant template with N-tuples according to an embodiment of the present invention is described below in conjunction with Figures 15 and 16. Figure 15 is a schematic flowchart illustrating the method of generating a part-of-speech fault-tolerant template with N-tuples according to an embodiment of the present invention. Figure 16 is a schematic diagram illustrating the generation of a part-of-speech fault-tolerant template with N-tuples according to an embodiment of the present invention.
As shown in Figure 15, the process starts at S1500. Then, the process proceeds to S1502.
In step S1502, the morphology of the participles in the N-tuples is replaced with the corresponding parts of speech, to obtain generalized N-tuples in which morphology and parts of speech are mixed.
The processing of S1502 is substantially the same as that of S602 in Fig. 6, except that the participles processed here are the plurality of participles obtained by segmenting each training sentence; the details are not repeated here.
After S1502, the process proceeds to S1504. At S1504, according to the annotation results and the morphology and part-of-speech features of the participles in the generalized N-tuples, the extraction probability that a participle in a generalized N-tuple is annotated as part of a multi-word unit is calculated as part-of-speech fault-tolerance information, so as to generate the part-of-speech fault-tolerant template.
All possible generalized N-tuples can be obtained through the processing of step S1502. Then, according to the annotation results and all the possible generalized N-tuples, the extraction probability that each participle in a generalized N-tuple is annotated as part of a multi-word unit is calculated as the part-of-speech fault-tolerance information.
Taking the triple <antigen, release, thing> as an example, in which the participles "antigen", "release" and "thing" are all annotated as being part of a multi-word unit, the triple can be generalized into a series of generalized triples: <noun, release, thing>, <antigen, verb, thing>, <antigen, release, noun>, <noun, verb, thing>, <antigen, verb, noun>, <noun, release, noun> and <noun, verb, noun>. Therefore, as shown in Figure 16, at 1614, according to the above annotation results and each of the above generalized triples, the extraction probability that each participle in the generalized triples is annotated as part of a multi-word unit is calculated as part-of-speech fault-tolerance information, and the part-of-speech fault-tolerance information is stored in association with the triple corresponding to the current participle, thereby generating the part-of-speech fault-tolerant template.
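The generalization of S1502 replaces every non-empty subset of the participles in an N-tuple with their parts of speech, which for a triple yields exactly the seven variants listed for <antigen, release, thing>. A sketch (naming is ours, not the patent's):

```python
from itertools import combinations

def generalize_tuple(words, tags):
    # For each non-empty subset of positions, replace those participles'
    # word forms with their parts of speech, producing every generalized
    # N-tuple in which morphology and parts of speech are mixed.
    results = []
    n = len(words)
    for k in range(1, n + 1):
        for idxs in combinations(range(n), k):
            results.append(tuple(tags[i] if i in idxs else words[i]
                                 for i in range(n)))
    return results
```

For an N-tuple this yields 2^N − 1 generalized variants (7 for a triple), matching the enumeration above.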
Since most of the part-of-speech fault-tolerant template contains both part-of-speech information and morphology information, and an N-tuple template contains not only the current target participle but also the participles before and after it, the influence of a single erroneous part of speech can be greatly weakened. When an erroneous part of speech is input into the artificial neural network, the probability, calculated by the artificial neural network, that the participle in the part-of-speech fault-tolerant template is part of a multi-word unit can suppress the influence of the erroneous part of speech on the final judgment.
Finally, the process ends at S1506.
According to the method of the present embodiment, the deviation of the feature values caused by part-of-speech tagging errors can be alleviated while training the artificial neural network, and the part-of-speech fault-tolerant template is generated. If the generated part-of-speech fault-tolerant template is applied to the recognition and extraction of multi-word units, the multi-word units in a sentence can be correctly recognized and extracted even if erroneous information is introduced during part-of-speech tagging, so that the accuracy and efficiency of the recognition and extraction of multi-word units can be further improved.
Apparatuses for training an artificial neural network for extracting multi-word units in a sentence according to embodiments of the present invention are described below in conjunction with Figures 17 to 20.
Figure 17 is a schematic block diagram illustrating an apparatus for training an artificial neural network for extracting multi-word units in a sentence according to an embodiment of the present invention. As shown in Figure 17, the apparatus 1700 for training an artificial neural network for extracting multi-word units in a sentence includes: a linguistic feature acquisition device 1702, which, for each of a plurality of participle blocks obtained by segmenting each training sentence, obtains one or more linguistic features of the participles in the participle block as feature quantities, wherein the multi-word units in the training sentences have been annotated; an input device 1704, which inputs the feature quantities into the artificial neural network as parameters of the artificial neural network; a judgment device 1706, which uses the artificial neural network to calculate a first probability that a participle in each participle block is part of a multi-word unit and a second probability that the participle is not part of a multi-word unit, and judges whether the participle is part of a multi-word unit according to the comparison of the first probability and the second probability; a training device 1708, which trains the artificial neural network according to the result of the judgment and the result of the annotation; and a feedback information acquisition device 1710, which obtains the result of the judgment on the previous participle block adjacent to the current participle block as feedback information, and also uses the feedback information as a feature quantity of the participles in the current participle block.
It should be noted that the terms and expressions used in the embodiments related to the apparatuses correspond to those used in the foregoing descriptions of the method embodiments of the present invention, and are not repeated here.
Figure 18 is a schematic block diagram illustrating an apparatus for training an artificial neural network for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Figure 18, the apparatus 1800 for training an artificial neural network for extracting multi-word units in a sentence includes a linguistic feature acquisition device 1702, an input device 1704, a judgment device 1706, a training device 1708, a feedback information acquisition device 1710 and a combination device 1802. The linguistic feature acquisition device 1702, input device 1704, judgment device 1706, training device 1708 and feedback information acquisition device 1710 in the apparatus 1800 are identical to the corresponding devices in the apparatus 1700, and their details are not repeated here. In addition, the combination device 1802 in the apparatus 1800 successively combines adjacent N participles in the training sentence into N-tuples to form participle blocks, where N is a natural number greater than or equal to 2.
Figure 19 is a schematic block diagram illustrating an apparatus for training an artificial neural network for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Figure 19, the apparatus 1900 for training an artificial neural network for extracting multi-word units in a sentence includes a linguistic feature acquisition device 1702, an input device 1704, a judgment device 1706, a training device 1708, a feedback information acquisition device 1710, a combination device 1802, a morphology template generation device 1902 and a part-of-speech template generation device 1904. The devices 1702 to 1710 and 1802 in the apparatus 1900 are identical to the corresponding devices in the apparatus 1800, and their details are not repeated here. In addition, the morphology template generation device 1902 in the apparatus 1900 calculates, according to the annotation results and the morphology features of the participles in the N-tuples, the morphology extraction probability that a participle in an N-tuple is part of a multi-word unit, so as to generate a morphology template; and/or the part-of-speech template generation device 1904 calculates, according to the annotation results and the part-of-speech features of the participles in the N-tuples, the part-of-speech extraction probability that a participle in an N-tuple is part of a multi-word unit, so as to generate a part-of-speech template.
Figure 20 is a schematic block diagram illustrating an apparatus for training an artificial neural network for extracting multi-word units in a sentence according to another embodiment of the present invention. As shown in Figure 20, the apparatus 2000 for training an artificial neural network for extracting multi-word units in a sentence includes a linguistic feature acquisition device 1702, an input device 1704, a judgment device 1706, a training device 1708, a feedback information acquisition device 1710, a combination device 1802, a generalization device 2002 and a part-of-speech fault-tolerant template generation device 2004. The devices 1702 to 1710 and 1802 in the apparatus 2000 are identical to the corresponding devices in the apparatus 1800, and their details are not repeated here. In addition, the generalization device 2002 in the apparatus 2000 replaces the morphology of the participles in the N-tuples with the corresponding parts of speech, to obtain generalized N-tuples in which morphology and parts of speech are mixed; and the part-of-speech fault-tolerant template generation device 2004 calculates, according to the annotation results and the morphology and part-of-speech features of the participles in the generalized N-tuples, the extraction probability that a participle in a generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerance information, so as to generate a part-of-speech fault-tolerant template.
Those skilled in the art will appreciate that the steps in the methods of extracting multi-word units in a sentence, and the functional units in the apparatuses for extracting multi-word units in a sentence, according to the various embodiments of the present invention described above, may be combined arbitrarily according to actual needs. That is, the processing steps in one method embodiment for extracting multi-word units in a sentence may be combined with the processing steps in another such method embodiment, and the functional units in one apparatus embodiment for extracting multi-word units in a sentence may be combined with the functional units in another such apparatus embodiment, so as to achieve the desired technical purpose. Similarly, the steps in the methods of training an artificial neural network, and the functional units in the apparatuses for training an artificial neural network, according to the embodiments of the present invention described above, may likewise be combined arbitrarily: the processing steps of one such method embodiment may be combined with those of another, and the functional units of one such apparatus embodiment may be combined with those of another, so as to achieve the desired technical purpose.
In addition, the embodiments of the present application also provide a program product carrying machine-executable instructions. When the instructions are executed on an information processing device, they cause the information processing device to perform the method of extracting multi-word units in a sentence according to the embodiments of the present invention described above. Similarly, the embodiments of the present application also provide a program product carrying machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the method of training an artificial neural network according to the embodiments of the present invention described above.
In addition, the embodiments of the present application also provide a storage medium including machine-readable program code. When the program code is executed on an information processing device, it causes the information processing device to perform the method of extracting multi-word units in a sentence according to the embodiments of the present invention described above. Similarly, the embodiments of the present application also provide a storage medium including machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the method of training an artificial neural network according to the embodiments of the present invention described above.
Accordingly, storage media for carrying the above program products storing machine-readable instruction code are also included in the disclosure of the present invention. The storage media include, but are not limited to, floppy disks, optical discs, magneto-optical discs, memory cards, memory sticks and the like.
The apparatus for extracting multi-word units in a sentence and its component units according to the embodiments of the present invention may be configured by means of software, firmware, hardware or a combination thereof. Similarly, the apparatus for training an artificial neural network and its component units according to the embodiments of the present invention may also be configured by means of software, firmware, hardware or a combination thereof. The specific means or manners of configuration are well known to those skilled in the art and are not repeated here. In the case of implementation by software or firmware, a program constituting the software is installed, from a storage medium or a network, onto an information processing device having a dedicated hardware structure (for example, the general-purpose computer 2100 shown in Figure 21); the computer can perform various functions when various programs are installed on it.
In Figure 21, a central processing unit (CPU) 2101 performs various processing according to a program stored in a read-only memory (ROM) 2102 or a program loaded from a storage section 2108 into a random access memory (RAM) 2103. The RAM 2103 also stores, as needed, data required when the CPU 2101 performs various processing. The CPU 2101, the ROM 2102 and the RAM 2103 are connected to one another via a bus 2104. An input/output interface 2105 is also connected to the bus 2104.
The following components are connected to the input/output interface 2105: an input section 2106 (including a keyboard, a mouse and the like), an output section 2107 (including a display, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker and the like), the storage section 2108 (including a hard disk and the like), and a communication section 2109 (including a network interface card such as a LAN card, a modem and the like). The communication section 2109 performs communication processing via a network such as the Internet. A drive 2110 may also be connected to the input/output interface 2105 as needed. A removable medium 2111, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 2110 as needed, so that a computer program read therefrom is installed into the storage section 2108 as needed.
In the case where the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 2111.
Those skilled in the art will understand that the storage medium is not limited to the removable medium 2111 shown in Figure 21, which stores the program and is distributed separately from the device to provide the program to the user. Examples of the removable medium 2111 include magnetic disks (including floppy disks (registered trademark)), optical discs (including compact disc read-only memories (CD-ROM) and digital versatile discs (DVD)), magneto-optical discs (including mini discs (MD) (registered trademark)) and semiconductor memories. Alternatively, the storage medium may be the ROM 2102, a hard disk included in the storage section 2108 or the like, in which the program is stored and which is distributed to the user together with the device including it.
When the instruction code is read and executed by a machine, the methods according to the embodiments of the present invention described above can be performed.
Finally, it should also be noted that the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article or device. In addition, in the absence of further limitation, an element defined by the expression "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element. Furthermore, technical features or parameters qualified by the words "first", "second", "third" and so on do not thereby have any particular order, priority or degree of importance; these words are used merely to distinguish or identify the technical features or parameters and carry no other limiting meaning.
As is apparent from the above description, the technical solutions provided by the embodiments of the present invention include, but are not limited to, the following.
Remark 1. A method of extracting multi-word units in a sentence, comprising:
for each of a plurality of participle blocks obtained by segmenting a sentence, obtaining one or more linguistic features of the participles in the participle block as feature quantities;
inputting the feature quantities into an artificial neural network as parameters of the artificial neural network;
using the artificial neural network to calculate a first probability that a participle in each participle block is part of a multi-word unit and a second probability that the participle is not part of a multi-word unit, and judging whether the participle is part of a multi-word unit according to the first probability and the second probability; and
extracting two or more adjacent participles judged to be parts of a multi-word unit, to form the multi-word unit,
wherein the method further comprises: obtaining the result of the judgment on the previous participle block adjacent to the current participle block as feedback information, and also using the feedback information as a feature quantity of the participles in the current participle block.
Remark 2. The method according to remark 1, wherein the linguistic features are one or more of the following: the part of speech of a participle, the morphology of a participle, the sequence number of a participle, or the occurrence probability of a participle.
Remark 3. The method according to any one of remarks 1-2, further comprising:
successively combining adjacent N participles in the sentence into N-tuples to form the participle blocks, where N is a natural number greater than or equal to 2.
Remark 4. The method according to remark 3, further comprising:
according to the morphology features of the participles in the N-tuple, obtaining from a morphology template the morphology extraction probability that a participle in the N-tuple is part of a multi-word unit, and also using the morphology extraction probability as a feature quantity of the participle in the N-tuple; and/or
according to the part-of-speech features of the participles in the N-tuple, obtaining from a part-of-speech template the part-of-speech extraction probability that a participle in the N-tuple is part of a multi-word unit, and also using the part-of-speech extraction probability as a feature quantity of the participle in the N-tuple.
Remark 5. The method according to remark 4, further comprising:
replacing the morphology of the participles in the N-tuple with the corresponding parts of speech, to obtain a generalized N-tuple in which morphology and parts of speech are mixed; and
according to the morphology features and part-of-speech features of the participles in the generalized N-tuple, obtaining from a part-of-speech fault-tolerant template the extraction probability that a participle in the generalized N-tuple is part of a multi-word unit as part-of-speech fault-tolerance information, and also using the part-of-speech fault-tolerance information as a feature quantity of the participles in the N-tuple.
Remarks 6, the equipment of a kind of multi word unit extracted in statement, including:
Linguistic feature acquiring unit, it is for each participle carried out by statement in multiple participle blocks that participle obtains Block, obtains one or more linguistic feature of participle in each participle block as characteristic quantity;
Input block, described characteristic quantity is input to described artificial neural network as the parameter of artificial neural network by it In;
Judging unit, it uses the participle in described artificial neural networks each participle block to be of multi word unit The first probability divided and this participle are not the second probabilities of a part for multi word unit, and according to described first probability Judge that whether this participle is a part for multi word unit with the second probability;And
Extraction unit, it extracts the participle that adjacent two or more are judged as a part for multi word unit, with shape Become multi word unit,
Wherein, described equipment also includes: feedback information acquiring unit, and it obtains the previous participle adjacent with current participle block The result of the judgement of block is as feedback information, and described feedback information also serves as the spy of participle in described current participle block The amount of levying.
Remarks 7, according to the equipment described in remarks 6, wherein, described linguistic feature be following in one or more: The part of speech of participle, the morphology of participle, participle sequence number or participle probability of occurrence.
Remarks 8, according to the equipment according to any one of remarks 6-7, also include:
Assembled unit, N number of participle adjacent in described statement is combined as N tuple to form participle block, wherein N by successively For the natural number more than or equal to 2.
Remark 9. The apparatus according to Remark 8, further comprising:
a word-form extraction-probability acquiring unit that, according to the morphological features of the word segments in the N-tuple, obtains from a word-form template the extraction probability that the segments in the N-tuple are part of a multi-word unit, the word-form extraction probability also serving as a feature quantity of the word segments in the N-tuple; and/or
a part-of-speech extraction-probability acquiring unit that, according to the part-of-speech features of the word segments in the N-tuple, obtains from a part-of-speech template the extraction probability that the segments in the N-tuple are part of a multi-word unit, the part-of-speech extraction probability also serving as a feature quantity of the word segments in the N-tuple.
Remark 10. The apparatus according to Remark 8, further comprising:
a generalization unit that replaces the word forms of the word segments in the N-tuple with the corresponding parts of speech, to obtain generalized N-tuples in which word forms and parts of speech are mixed; and
a part-of-speech fault-tolerance-information acquiring unit that, according to the morphological features and part-of-speech features of the word segments in the generalized N-tuple, obtains from a part-of-speech fault-tolerant template the extraction probability that the segments in the generalized N-tuple are part of a multi-word unit, as part-of-speech fault-tolerance information, the fault-tolerance information also serving as a feature quantity of each word segment in the N-tuple.
Remark 11. A method of training an artificial neural network for extracting multi-word units from a sentence, the method comprising:
for each of a plurality of segment blocks obtained by segmenting each training sentence, acquiring one or more linguistic features of the word segments in the block as feature quantities, wherein the multi-word units in the training sentences have been annotated;
inputting the feature quantities into the artificial neural network as its parameters;
using the artificial neural network to compute, for the word segments in each segment block, a first probability that a segment is part of a multi-word unit and a second probability that it is not, and judging from a comparison of the first and second probabilities whether the segment is part of a multi-word unit; and
training the artificial neural network according to the judgement results and the annotations,
wherein the method further comprises: acquiring the judgement result of the segment block preceding and adjacent to the current segment block as feedback information, the feedback information also serving as a feature quantity of the word segments in the current segment block.
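The training procedure above compares the network's judgement with the annotation and adjusts the network accordingly. As a stand-in for the unspecified network and learning rule, the sketch below trains a single-layer perceptron on (feature-vector, label) pairs; the interface is an assumption for illustration, not the patent's actual training algorithm.

```python
def train_perceptron(samples, epochs=10, lr=0.1):
    """Minimal judge-compare-update loop: score the features, judge,
    and update the weights only when the judgement disagrees with the
    annotated label (1 = part of a multi-word unit, 0 = not)."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for features, label in samples:
            score = sum(wi * xi for wi, xi in zip(w, features)) + b
            pred = 1 if score > 0 else 0
            if pred != label:  # mismatch between judgement and annotation
                delta = lr * (label - pred)
                w = [wi + delta * xi for wi, xi in zip(w, features)]
                b += delta
    return w, b
```

After a few epochs on separable data the learned weights reproduce the annotated labels.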
Remark 12. The method according to Remark 11, wherein the linguistic features are one or more of: the part of speech of a word segment, the word form of a word segment, the segment's sequence number, or the segment's occurrence probability.
Remark 13. The method according to Remark 11 or 12, further comprising:
sequentially combining N adjacent word segments of the training sentence into N-tuples to form the segment blocks, where N is a natural number greater than or equal to 2.
Remark 14. The method according to Remark 13, further comprising:
calculating, according to the annotations and the morphological features of the word segments in the N-tuple, the word-form extraction probability that the segments in the N-tuple are part of a multi-word unit, to generate a word-form template; and/or
calculating, according to the annotations and the part-of-speech features of the word segments in the N-tuple, the part-of-speech extraction probability that the segments in the N-tuple are part of a multi-word unit, to generate a part-of-speech template.
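Both template-generation steps above can be read as estimating a pattern-to-probability table from the annotated corpus: for each word-form or part-of-speech pattern, the fraction of its occurrences that fall inside an annotated multi-word unit. A sketch, with an assumed (pattern, label) input format:

```python
from collections import defaultdict

def build_template(annotated_ngrams):
    """Estimate, for each pattern, the extraction probability
    count(pattern is part of an MWU) / count(pattern)."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pattern, is_mwu_part in annotated_ngrams:
        totals[pattern] += 1
        if is_mwu_part:
            positives[pattern] += 1
    return {p: positives[p] / totals[p] for p in totals}
```

The same routine serves for word-form patterns, part-of-speech patterns, or the mixed generalized patterns, since only the keys differ.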
Remark 15. The method according to Remark 13, further comprising:
replacing the word forms of the word segments in the N-tuple with the corresponding parts of speech, to obtain generalized N-tuples in which word forms and parts of speech are mixed; and
calculating, according to the annotations and the morphological and part-of-speech features of the word segments in the generalized N-tuple, the extraction probability that the segments in the generalized N-tuple are part of a multi-word unit, as part-of-speech fault-tolerance information, to generate a part-of-speech fault-tolerant template.
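The generalization step above replaces word forms with their parts of speech, yielding N-tuples that mix the two. One way to sketch it is to enumerate every mixed variant; the patent may use only particular replacement schemes, so this exhaustive enumeration is an assumption for illustration.

```python
from itertools import product

def generalize(ngram_words, ngram_pos):
    """Produce every variant of an N-tuple in which each position keeps
    either its word form or is replaced by its part-of-speech tag."""
    choices = [(w, p) for w, p in zip(ngram_words, ngram_pos)]
    return [tuple(c) for c in product(*choices)]
```

For a bigram this yields four patterns: the pure word-form tuple, the pure part-of-speech tuple, and the two mixed tuples that make the template fault-tolerant to unseen words.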
Remark 16. An apparatus for training an artificial neural network for extracting multi-word units from a sentence, the apparatus comprising:
a linguistic-feature acquisition device that, for each of a plurality of segment blocks obtained by segmenting each training sentence, acquires one or more linguistic features of the word segments in the block as feature quantities, wherein the multi-word units in the training sentences have been annotated;
an input device that inputs the feature quantities into the artificial neural network as its parameters;
a judgment device that uses the artificial neural network to compute, for the word segments in each segment block, a first probability that a segment is part of a multi-word unit and a second probability that it is not, and judges from a comparison of the first and second probabilities whether the segment is part of a multi-word unit; and
a training device that trains the artificial neural network according to the judgement results and the annotations,
wherein the apparatus further comprises a feedback-information acquisition device that acquires the judgement result of the segment block preceding and adjacent to the current segment block as feedback information, the feedback information also serving as a feature quantity of the word segments in the current segment block.
Remark 17. The apparatus according to Remark 16, wherein the linguistic features are one or more of: the part of speech of a word segment, the word form of a word segment, the segment's sequence number, or the segment's occurrence probability.
Remark 18. The apparatus according to Remark 16 or 17, further comprising:
a combination device that sequentially combines N adjacent word segments of the training sentence into N-tuples to form the segment blocks, where N is a natural number greater than or equal to 2.
Remark 19. The apparatus according to Remark 18, further comprising:
a word-form template generation device that calculates, according to the annotations and the morphological features of the word segments in the N-tuple, the word-form extraction probability that the segments in the N-tuple are part of a multi-word unit, to generate a word-form template; and/or
a part-of-speech template generation device that calculates, according to the annotations and the part-of-speech features of the word segments in the N-tuple, the part-of-speech extraction probability that the segments in the N-tuple are part of a multi-word unit, to generate a part-of-speech template.
Remark 20. The apparatus according to Remark 18, further comprising:
a generalization device that replaces the word forms of the word segments in the N-tuple with the corresponding parts of speech, to obtain generalized N-tuples in which word forms and parts of speech are mixed; and
a part-of-speech fault-tolerant template generation device that calculates, according to the annotations and the morphological and part-of-speech features of the word segments in the generalized N-tuple, the extraction probability that the segments in the generalized N-tuple are part of a multi-word unit, as part-of-speech fault-tolerance information, to generate the part-of-speech fault-tolerant template.
While preferred embodiments of the present invention have been shown and described, it is contemplated that those skilled in the art may make various modifications to the present invention within the spirit and scope of the appended claims.

Claims (10)

1. A method of extracting multi-word units from a sentence, comprising:
for each of a plurality of segment blocks obtained by segmenting the sentence, acquiring one or more linguistic features of the word segments in the block as feature quantities;
inputting the feature quantities into an artificial neural network as its parameters;
using the artificial neural network to compute, for the word segments in each segment block, a first probability that a segment is part of a multi-word unit and a second probability that it is not, and judging from the first and second probabilities whether the segment is part of a multi-word unit; and
extracting two or more adjacent word segments judged to be part of a multi-word unit, to form a multi-word unit,
wherein the method further comprises: acquiring the judgement result of the segment block preceding and adjacent to the current segment block as feedback information, the feedback information also serving as a feature quantity of the word segments in the current segment block.
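The final step of claim 1 joins runs of adjacent, positively judged segments into multi-word units. A minimal sketch, assuming one boolean judgement per segment:

```python
def extract_mwus(segments, judgements):
    """Join each run of two or more adjacent segments judged to be part
    of a multi-word unit; isolated positives are discarded."""
    mwus, run = [], []
    for seg, is_part in zip(segments, judgements):
        if is_part:
            run.append(seg)
        else:
            if len(run) >= 2:
                mwus.append(" ".join(run))
            run = []
    if len(run) >= 2:  # flush a run that ends the sentence
        mwus.append(" ".join(run))
    return mwus
```

Note that a single positively judged segment does not form a multi-word unit on its own, matching the "two or more adjacent" condition of the claim.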
2. The method according to claim 1, further comprising:
sequentially combining N adjacent word segments of the sentence into N-tuples to form the segment blocks, where N is a natural number greater than or equal to 2.
3. The method according to claim 2, further comprising:
replacing the word forms of the word segments in the N-tuple with the corresponding parts of speech, to obtain generalized N-tuples in which word forms and parts of speech are mixed; and
obtaining, from a part-of-speech fault-tolerant template and according to the morphological features and part-of-speech features of the word segments in the generalized N-tuple, the extraction probability that the segments in the generalized N-tuple are part of a multi-word unit, as part-of-speech fault-tolerance information, the fault-tolerance information also serving as a feature quantity of the word segments in the N-tuple.
4. An apparatus for extracting multi-word units from a sentence, comprising:
a linguistic-feature acquiring unit that, for each of a plurality of segment blocks obtained by segmenting the sentence, acquires one or more linguistic features of the word segments in the block as feature quantities;
an input unit that inputs the feature quantities into an artificial neural network as its parameters;
a judging unit that uses the artificial neural network to compute, for the word segments in each segment block, a first probability that a segment is part of a multi-word unit and a second probability that it is not, and judges from the first and second probabilities whether the segment is part of a multi-word unit; and
an extraction unit that extracts two or more adjacent word segments judged to be part of a multi-word unit, to form a multi-word unit,
wherein the apparatus further comprises a feedback-information acquiring unit that acquires the judgement result of the segment block preceding and adjacent to the current segment block as feedback information, the feedback information also serving as a feature quantity of the word segments in the current segment block.
5. The apparatus according to claim 4, further comprising:
a combining unit that sequentially combines N adjacent word segments of the sentence into N-tuples to form the segment blocks, where N is a natural number greater than or equal to 2.
6. The apparatus according to claim 5, further comprising:
a generalization unit that replaces the word forms of the word segments in the N-tuple with the corresponding parts of speech, to obtain generalized N-tuples in which word forms and parts of speech are mixed; and
a part-of-speech fault-tolerance-information acquiring unit that, according to the morphological features and part-of-speech features of the word segments in the generalized N-tuple, obtains from a part-of-speech fault-tolerant template the extraction probability that the segments in the generalized N-tuple are part of a multi-word unit, as part-of-speech fault-tolerance information, the fault-tolerance information also serving as a feature quantity of the word segments in the N-tuple.
7. A method of training an artificial neural network for extracting multi-word units from a sentence, the method comprising:
for each of a plurality of segment blocks obtained by segmenting each training sentence, acquiring one or more linguistic features of the word segments in the block as feature quantities, wherein the multi-word units in the training sentences have been annotated;
inputting the feature quantities into the artificial neural network as its parameters;
using the artificial neural network to compute, for the word segments in each segment block, a first probability that a segment is part of a multi-word unit and a second probability that it is not, and judging from a comparison of the first and second probabilities whether the segment is part of a multi-word unit; and
training the artificial neural network according to the judgement results and the annotations,
wherein the method further comprises: acquiring the judgement result of the segment block preceding and adjacent to the current segment block as feedback information, the feedback information also serving as a feature quantity of the word segments in the current segment block.
8. The method according to claim 7, further comprising:
sequentially combining N adjacent word segments of the training sentence into N-tuples to form the segment blocks, where N is a natural number greater than or equal to 2.
9. The method according to claim 8, further comprising:
replacing the word forms of the word segments in the N-tuple with the corresponding parts of speech, to obtain generalized N-tuples in which word forms and parts of speech are mixed; and
calculating, according to the annotations and the morphological and part-of-speech features of the word segments in the generalized N-tuple, the extraction probability that the segments in the generalized N-tuple are part of a multi-word unit, as part-of-speech fault-tolerance information, to generate a part-of-speech fault-tolerant template.
10. An apparatus for training an artificial neural network for extracting multi-word units from a sentence, the apparatus comprising:
a linguistic-feature acquisition device that, for each of a plurality of segment blocks obtained by segmenting each training sentence, acquires one or more linguistic features of the word segments in the block as feature quantities, wherein the multi-word units in the training sentences have been annotated;
an input device that inputs the feature quantities into the artificial neural network as its parameters;
a judgment device that uses the artificial neural network to compute, for the word segments in each segment block, a first probability that a segment is part of a multi-word unit and a second probability that it is not, and judges from a comparison of the first and second probabilities whether the segment is part of a multi-word unit; and
a training device that trains the artificial neural network according to the judgement results and the annotations,
wherein the apparatus further comprises a feedback-information acquisition device that acquires the judgement result of the segment block preceding and adjacent to the current segment block as feedback information, the feedback information also serving as a feature quantity of the word segments in the current segment block.
CN201210320806.XA 2012-08-31 2012-08-31 Multi-word unit extraction method and equipment and artificial neural network training method and equipment Expired - Fee Related CN103678318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210320806.XA CN103678318B (en) 2012-08-31 2012-08-31 Multi-word unit extraction method and equipment and artificial neural network training method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210320806.XA CN103678318B (en) 2012-08-31 2012-08-31 Multi-word unit extraction method and equipment and artificial neural network training method and equipment

Publications (2)

Publication Number Publication Date
CN103678318A CN103678318A (en) 2014-03-26
CN103678318B true CN103678318B (en) 2016-12-21

Family

ID=50315921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210320806.XA Expired - Fee Related CN103678318B (en) 2012-08-31 2012-08-31 Multi-word unit extraction method and equipment and artificial neural network training method and equipment

Country Status (1)

Country Link
CN (1) CN103678318B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105404632B (en) * 2014-09-15 2020-07-31 深港产学研基地 System and method for carrying out serialized annotation on biomedical text based on deep neural network
CN107301454B (en) * 2016-04-15 2021-01-22 中科寒武纪科技股份有限公司 Artificial neural network reverse training device and method supporting discrete data representation
CN107977352A (en) * 2016-10-21 2018-05-01 富士通株式会社 Information processor and method
CN107273356B (en) 2017-06-14 2020-08-11 北京百度网讯科技有限公司 Artificial intelligence based word segmentation method, device, server and storage medium
CN109829162B (en) * 2019-01-30 2022-04-08 新华三大数据技术有限公司 Text word segmentation method and device
CN110532551A (en) * 2019-08-15 2019-12-03 苏州朗动网络科技有限公司 Method, equipment and the storage medium that text key word automatically extracts
CN111291195B (en) * 2020-01-21 2021-08-10 腾讯科技(深圳)有限公司 Data processing method, device, terminal and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093504A (en) * 2006-03-24 2007-12-26 国际商业机器公司 System for extracting new compound word
CN101187921A (en) * 2007-12-20 2008-05-28 腾讯科技(深圳)有限公司 Chinese compound words extraction method and system
CN101354712A (en) * 2008-09-05 2009-01-28 北京大学 System and method for automatically extracting Chinese technical terms

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093504A (en) * 2006-03-24 2007-12-26 国际商业机器公司 System for extracting new compound word
CN101187921A (en) * 2007-12-20 2008-05-28 腾讯科技(深圳)有限公司 Chinese compound words extraction method and system
CN101354712A (en) * 2008-09-05 2009-01-28 北京大学 System and method for automatically extracting Chinese technical terms

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A study on multi-word extraction from Chinese documents;Wen Zhang等;《Advanced Web and Network Technologies, and Applications》;20080428;42-53 *
Improving word representations via global context and multiple word prototypes;Eric H. Huang等;《Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics》;20120714;873-882 *
Optimization of a neural-network-based Chinese word segmentation model (基于神经网络汉语分词模型的优化); He Jia et al.; Journal of Chengdu University of Information Technology (成都信息工程学院学报); 20061231; 812-815 *
Research on Chinese word segmentation fusing neural networks and matching (神经网络和匹配融合的中文分词研究); Li Hua; Mind and Computation (心智与计算); 20100630; 117-127 *

Also Published As

Publication number Publication date
CN103678318A (en) 2014-03-26

Similar Documents

Publication Publication Date Title
CN103678318B (en) Multi-word unit extraction method and equipment and artificial neural network training method and equipment
CN110222163B (en) Intelligent question-answering method and system integrating CNN and bidirectional LSTM
US7873584B2 (en) Method and system for classifying users of a computer network
CN108446271B (en) Text emotion analysis method of convolutional neural network based on Chinese character component characteristics
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
CN107025284A (en) The recognition methods of network comment text emotion tendency and convolutional neural networks model
CN108363816A (en) Open entity relation extraction method based on sentence justice structural model
CN107590134A (en) Text sentiment classification method, storage medium and computer
CN110222178A (en) Text sentiment classification method, device, electronic equipment and readable storage medium storing program for executing
CN111797898B (en) Online comment automatic reply method based on deep semantic matching
CN108170848B (en) Chinese mobile intelligent customer service-oriented conversation scene classification method
Le et al. Text classification: Naïve bayes classifier with sentiment Lexicon
CN105022754A (en) Social network based object classification method and apparatus
CN111460157B (en) Cyclic convolution multitask learning method for multi-field text classification
CN109783794A (en) File classification method and device
CN106997341A (en) A kind of innovation scheme matching process, device, server and system
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
CN108647191A (en) It is a kind of based on have supervision emotion text and term vector sentiment dictionary construction method
CN113033610B (en) Multi-mode fusion sensitive information classification detection method
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN112215629B (en) Multi-target advertisement generating system and method based on construction countermeasure sample
CN113326374B (en) Short text emotion classification method and system based on feature enhancement
CN110728144A (en) Extraction type document automatic summarization method based on context semantic perception
CN111368524A (en) Microblog viewpoint sentence recognition method based on self-attention bidirectional GRU and SVM

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161221

Termination date: 20180831