Detailed Description
The disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant content and do not limit the disclosure. It should also be noted that, for ease of description, the drawings show only the parts relevant to the disclosure.
It should be noted that, provided there is no conflict, the embodiments of the disclosure and the features of those embodiments may be combined with one another. The disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.
Natural language understanding (NLU) covers many areas, including sentence detection, word segmentation, part-of-speech tagging, syntactic analysis, text classification/clustering, text angle, information extraction/automatic summarization, machine translation, automatic question answering, and text generation.
A conditional random field (CRF) is a model of the conditional probability distribution of one set of output random variables given another set of input random variables.
According to one embodiment of the disclosure, a word segmentation method is provided. As shown in Fig. 1, the word segmentation method 10 may include: step S11, performing sequence labeling on the characters of an input sentence; step S12, segmenting the sentence into words based on the result of the sequence labeling; and step S13, adjusting the segmentation result according to one or more domain term dictionaries and taking the adjusted segmentation result as the final segmentation result.
In step S11, sequence labeling is performed on the input sentence, which consists of characters. The labeling may be carried out with, for example, a CRF, a hidden Markov model (HMM), or a maximum entropy Markov model (MEMM). In the disclosure, a CRF is preferably used. Through this step, the word segmentation problem is converted into a sequence labeling problem.
In the disclosure, optionally, a label is assigned to each character. A suitable tagging scheme may be used for the sequence labeling, for example IO, BIO, BMEWO, or BMEWO+. BMEWO is used as the example below. For instance, labeling the text "张小明去内蒙古出差" ("Zhang Xiaoming goes to Inner Mongolia on a business trip") under the BMEWO scheme gives: 张/B-PER, 小/M-PER, 明/E-PER, 去/O, 内/B-LOC, 蒙/M-LOC, 古/E-LOC, 出/O, 差/O. Here B-PER (B-PERSON) marks the beginning of a person name, M-PER (M-PERSON) the middle of a person name, E-PER (E-PERSON) the end of a person name, and O (OTHER) any other character; B-LOC (B-LOCATION) marks the beginning of a place name, M-LOC (M-LOCATION) the middle of a place name, and E-LOC (E-LOCATION) the end of a place name. Other tagging schemes may also be used for the labeling in the disclosure.
In step S12, the sentence is segmented based on the result of the sequence labeling. When every character has been labeled, adjacent characters that carry the same label in the output of, for example, the CRF may be treated as one word; the positional prefixes of the tagging scheme (such as B, M, and E in BMEWO) are ignored for this comparison. In the example above, the three characters "内", "蒙", and "古" (together "Inner Mongolia") are all LOCATION once the prefixes are ignored, so they may be treated as a single word. In this way, word segmentation is achieved by a machine learning method such as a CRF.
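The grouping rule of step S12 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the tagged sequence is written out by hand instead of coming from a trained CRF, the Chinese characters are reconstructed from the example sentence, and the helper names `label_type` and `segment_by_labels` are hypothetical.

```python
from itertools import groupby

# Hand-written BMEWO tags for the example sentence (stand-in for CRF output).
tagged = [
    ("张", "B-PER"), ("小", "M-PER"), ("明", "E-PER"),
    ("去", "O"),
    ("内", "B-LOC"), ("蒙", "M-LOC"), ("古", "E-LOC"),
    ("出", "O"), ("差", "O"),
]

def label_type(tag):
    """Drop the positional prefix: 'B-LOC' -> 'LOC', 'O' -> 'O'."""
    return tag.split("-", 1)[1] if "-" in tag else tag

def segment_by_labels(tagged_chars):
    """Merge each run of adjacent characters with the same label type into one word."""
    words = []
    for _, run in groupby(tagged_chars, key=lambda pair: label_type(pair[1])):
        words.append("".join(ch for ch, _ in run))
    return words

print(segment_by_labels(tagged))  # ['张小明', '去', '内蒙古', '出差']
```

Because the prefixes are ignored, two adjacent entities of the same type would be merged into one word under this rule; a stricter variant could start a new word at every B prefix.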
In step S13, the segmentation result is adjusted according to one or more domain term dictionaries, and the adjusted result is taken as the final segmentation result. The domain term dictionaries may be selected according to actual needs, and there may be one or more of them; the segmentation result of step S12 is adjusted by string matching against these dictionaries. The string matching may use forward maximum matching. For example, a word or character produced in step S12 is compared, in forward order, with the words in the domain term dictionary. If the dictionary contains it, the match is recorded. The candidate is then extended forward by one character and compared again; if the dictionary contains the extended word, that match is recorded as well. This repeats until the dictionary contains no corresponding word, at which point the comparison ends. Once the comparison is complete, the longest recorded match may be selected and output as the adjusted segmentation result.
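The record-and-extend matching described above can be sketched as follows. Two points are assumptions of this sketch and are not stated in the text: the stopping condition "no corresponding word" is read as "no dictionary word begins with the current candidate", which is checked against a precomputed prefix set, and the dictionary entries are made up for illustration. The names `build_prefixes` and `record_matches` are hypothetical.

```python
def build_prefixes(dictionary):
    """Collect every prefix of every dictionary word (assumed stopping criterion)."""
    return {word[:i] for word in dictionary for i in range(1, len(word) + 1)}

def record_matches(text, start, dictionary, prefixes):
    """Extend the candidate one character at a time and record each dictionary hit."""
    recorded = []
    end = start + 1
    while end <= len(text) and text[start:end] in prefixes:
        candidate = text[start:end]
        if candidate in dictionary:
            recorded.append(candidate)
        end += 1
    return recorded

# Hypothetical place name dictionary.
place_dict = {"内蒙", "内蒙古"}
prefixes = build_prefixes(place_dict)

sentence = "张小明去内蒙古出差"
recorded = record_matches(sentence, 4, place_dict, prefixes)  # start at "内"
longest = max(recorded, key=len) if recorded else None
print(recorded, longest)  # ['内蒙', '内蒙古'] 内蒙古
```

Keeping the full list of recorded matches, and not only the longest one, is what allows a later step to choose among them by weight or confidence.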
According to another embodiment of the disclosure, as shown in Fig. 2, a word segmentation method 20 is also provided, including: step S21, performing sequence labeling on the characters of an input sentence; step S22, segmenting the sentence based on the result of the sequence labeling; step S23, adjusting the segmentation result according to one or more domain term dictionaries and taking the adjusted result as the final segmentation result; and step S24, constructing the one or more domain term dictionaries. Note that step S24 may be executed before steps S21 and S23; that is, the domain term dictionaries may be constructed in advance, so the step numbers do not indicate the order of execution.
Steps S21 and S22 may perform the same processing as steps S11 and S12, respectively.
In step S24, one or more domain term dictionaries are constructed. In the music domain, for example, domain term dictionaries for singers, albums, songs, and the like may be needed. The words of a domain term dictionary may be obtained by existing techniques. When a dictionary is constructed, a weight, a confidence level, or a similar value may be assigned to each word to indicate its importance. Factors that may be considered when assigning the weight or confidence include the word's popularity, its relationships, and its frequency of occurrence. As to popularity, a word with a higher popularity value may be given a correspondingly higher weight or confidence. As to relationships, the relationships between entities in a knowledge graph may be considered: if another entity that is related to a word (entity) in the domain term dictionary is of high importance, the word is, by virtue of that relationship, also regarded as important, and its weight or confidence may be set higher. As to frequency of occurrence, if a word in the domain term dictionary appears often in articles of that domain, its weight or confidence may likewise be set to a higher value.
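One possible way to turn these three factors into a single weight is sketched below. The text does not specify any formula, so everything here is an assumption made for illustration: the linear combination, the coefficients, the normalization of each factor to the range 0 to 1, and the names `TermStats`, `term_weight`, and `build_domain_dictionary`. The dictionary entries are placeholders.

```python
from dataclasses import dataclass

@dataclass
class TermStats:
    popularity: float          # assumed normalized popularity value in [0, 1]
    related_importance: float  # assumed importance of related knowledge-graph entities in [0, 1]
    frequency: float           # assumed normalized in-domain frequency of occurrence in [0, 1]

def term_weight(stats, coeffs=(0.4, 0.3, 0.3)):
    """Combine the three factors linearly; the coefficients are illustrative only."""
    a, b, c = coeffs
    return a * stats.popularity + b * stats.related_importance + c * stats.frequency

def build_domain_dictionary(raw_terms):
    """Map each word to its weight, giving a weighted domain term dictionary."""
    return {word: term_weight(stats) for word, stats in raw_terms.items()}

# Placeholder entries for a music-domain dictionary.
music_dict = build_domain_dictionary({
    "singer_a": TermStats(popularity=0.9, related_importance=0.8, frequency=0.7),
    "song_b": TermStats(popularity=0.1, related_importance=0.2, frequency=0.1),
})
print(music_dict)
```

Whatever formula is chosen, the property that matters for the later adjustment step is monotonicity: a word that is more popular, better connected, or more frequent should not receive a lower weight.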
In step S23, the segmentation result of step S22 may be adjusted according to the one or more domain term dictionaries constructed in step S24, by string matching against those dictionaries. The string matching may use forward maximum matching, during which every matched word is recorded in the manner described above. A suitable word is then selected from the recorded words according to their weights or confidence levels. For example, when the weights or confidences of the recorded words differ greatly, the match with the highest weight or confidence may be selected; when they differ little, the longest match may be selected. In this way, when the matched words conflict, for example when the segmentation is ambiguous, the segmentation result can be adjusted using the weights or confidences of the words so as to resolve the ambiguity as far as possible.
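The selection rule of step S23 can be sketched as follows. The text does not say how large a difference counts as "great", so the threshold `gap_threshold` and its default value are assumptions, as are the function name `select_match` and the example weights.

```python
def select_match(recorded, weights, gap_threshold=0.3):
    """Pick one word from the matches recorded at the same start position.

    The longest match is the default; a shorter match wins only if its weight
    exceeds that of the longest match by more than gap_threshold (assumed value).
    """
    longest = max(recorded, key=len)
    best = max(recorded, key=lambda word: weights[word])
    if weights[best] - weights[longest] > gap_threshold:
        return best
    return longest

recorded = ["内蒙", "内蒙古"]

# Weights differ greatly: the highest-weight match is chosen.
print(select_match(recorded, {"内蒙": 0.9, "内蒙古": 0.2}))   # 内蒙

# Weights are close: the longest match is chosen.
print(select_match(recorded, {"内蒙": 0.5, "内蒙古": 0.45}))  # 内蒙古
```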
The method of the disclosure is further explained below with a simple example.
For example, "南京市长江大桥" (Nanjing Yangtze River Bridge) is taken as the input to a CRF, whose processing yields the segmentation result "南京/市长/江大桥" (Nanjing / mayor / Jiang Daqiao). String matching is then performed against the domain term dictionaries "city dictionary" and "place name dictionary". Because the two dictionaries contain the words "南京市" (Nanjing City) and "长江大桥" (Yangtze River Bridge), the segmentation result after adjustment by the domain term dictionaries becomes "南京市/长江大桥", which is output as the final segmentation result. Based on the foregoing description, those skilled in the art will understand that the weight or confidence of each word may also be taken into account during the dictionary-based adjustment to optimize the segmentation result.
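The adjustment in this example can be reproduced with the short sketch below. It is one possible reading, not the disclosed implementation: the CRF output is hard-coded, the dictionary contents are limited to the two words named in the example, matching runs over the rejoined character string (which is what lets the result cross the CRF token boundaries), and the fallback for positions with no dictionary match (keep the rest of the CRF token) is an assumption. The function name `adjust_segmentation` is hypothetical.

```python
def adjust_segmentation(crf_tokens, domain_dicts):
    """Re-segment CRF output by forward maximum matching against domain dictionaries."""
    text = "".join(crf_tokens)
    vocab = set().union(*domain_dicts)
    max_len = max((len(word) for word in vocab), default=0)

    # For each character position, the end offset of the CRF token covering it.
    token_end = []
    pos = 0
    for token in crf_tokens:
        pos += len(token)
        token_end.extend([pos] * len(token))

    result, i = [], 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(min(len(text), i + max_len), i, -1):
            if text[i:end] in vocab:
                match = text[i:end]
                break
        if match is None:
            # Assumed fallback: keep the remainder of the current CRF token.
            match = text[i:token_end[i]]
        result.append(match)
        i += len(match)
    return result

city_dict = {"南京市"}       # "city dictionary" (contents assumed)
place_dict = {"长江大桥"}    # "place name dictionary" (contents assumed)

crf_output = ["南京", "市长", "江大桥"]
print(adjust_segmentation(crf_output, [city_dict, place_dict]))  # ['南京市', '长江大桥']
```

With empty dictionaries the function returns the CRF tokens unchanged, so the dictionary step only ever overrides the machine-learned segmentation where a domain term is actually found.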
According to a further embodiment of the disclosure, a word segmentation device is provided. As shown in Fig. 3, the word segmentation device 30 may include a sequence labeling module 31, a label-based segmentation module 32, and a segmentation adjustment module 33.
The sequence labeling module 31 performs sequence labeling on the input sentence, which consists of characters. The labeling may be carried out with, for example, a CRF, a hidden Markov model (HMM), or a maximum entropy Markov model (MEMM). In the disclosure, a CRF is preferably used. Through this module, the word segmentation problem is converted into a sequence labeling problem. In the disclosure, optionally, a label is assigned to each character, and a suitable tagging scheme such as IO, BIO, BMEWO, or BMEWO+ may be used for the sequence labeling.
The label-based segmentation module 32 segments the sentence based on the result of the sequence labeling. When every character has been labeled, adjacent characters that carry the same label in the output of, for example, the CRF may be treated as one word; the positional prefixes of the tagging scheme, such as those of BMEWO, are ignored here. Through this module, word segmentation is achieved by a machine learning method such as a CRF.
The segmentation adjustment module 33 adjusts the segmentation result according to one or more domain term dictionaries and takes the adjusted result as the final segmentation result. In this module, the domain term dictionaries may be selected according to actual needs, and there may be one or more of them; the segmentation result of the label-based segmentation module 32 is adjusted by string matching against these dictionaries. The string matching may use forward maximum matching. For example, a word or character produced by the label-based segmentation module 32 is compared, in forward order, with the words in the domain term dictionary. If the dictionary contains it, the match is recorded. The candidate is then extended forward by one character and compared again; if the dictionary contains the extended word, that match is recorded as well. This repeats until the dictionary contains no corresponding word, at which point the comparison ends. Once the comparison is complete, the longest recorded match may be selected and output as the adjusted segmentation result.
According to yet another embodiment of the disclosure, as shown in Fig. 4, a word segmentation device 40 is also provided, including a sequence labeling module 41, a label-based segmentation module 42, a segmentation adjustment module 43, and a dictionary construction module 44. Note that the processing of the dictionary construction module 44 may be executed before the processing of the sequence labeling module 41 and the label-based segmentation module 42; that is, the one or more domain term dictionaries may be constructed in advance.
The sequence labeling module 41 and the label-based segmentation module 42 may perform the same processing as the sequence labeling module 31 and the label-based segmentation module 32, respectively.
The dictionary construction module 44 constructs one or more domain term dictionaries. In the music domain, for example, domain term dictionaries for singers, albums, songs, and the like may be needed. The words of a domain term dictionary may be obtained by existing techniques. When a dictionary is constructed, a weight, a confidence level, or a similar value may be assigned to each word to indicate its importance. Factors that may be considered when assigning the weight or confidence include the word's popularity, its relationships, and its frequency of occurrence. As to popularity, a word with a higher popularity value may be given a correspondingly higher weight or confidence. As to relationships, the relationships between entities in a knowledge graph may be considered: if another entity that is related to a word (entity) in the domain term dictionary is of high importance, the word is, by virtue of that relationship, also regarded as important, and its weight or confidence may be set higher. As to frequency of occurrence, if a word in the domain term dictionary appears often in articles of that domain, its weight or confidence may likewise be set to a higher value.
The segmentation adjustment module 43 may adjust the segmentation result of the label-based segmentation module 42 according to the one or more domain term dictionaries constructed by the dictionary construction module 44, by string matching against those dictionaries. The string matching may use forward maximum matching, during which every matched word is recorded in the manner described above. A suitable word is then selected from the recorded words according to their weights or confidence levels. For example, when the weights or confidences of the recorded words differ greatly, the match with the highest weight or confidence may be selected; when they differ little, the longest match may be selected. In this way, when the matched words conflict, for example when the segmentation is ambiguous, the segmentation result can be adjusted using the weights or confidences of the words so as to resolve the ambiguity as far as possible.
The disclosure addresses the shortcomings of existing word segmentation algorithms in light of practical domain requirements by combining a machine learning algorithm with domain-specific dictionaries. This improves segmentation accuracy on the one hand and domain adaptability in real application scenarios on the other. By combining the traditional segmentation method based on string matching with a machine learning method, and taking the actual application scenario into account, the disclosure both compensates for the poor recognition of domain terms by machine learning methods and resolves the difficulty that pure string matching has in recognizing out-of-vocabulary (unregistered) words. Segmentation accuracy is thereby effectively improved, and the segmentation result is better suited to the target domain.
The disclosure also provides an electronic device. As shown in Fig. 5, the device includes a communication interface 1000, a memory 2000, and a processor 3000. The communication interface 1000 communicates with external devices to exchange data. The memory 2000 stores a computer program that can run on the processor 3000. The processor 3000 implements the methods of the embodiments above when it executes the computer program. There may be one or more memories 2000 and one or more processors 3000.
The memory 2000 may include high-speed RAM and may further include non-volatile memory, for example at least one magnetic disk storage device.
If the communication interface 1000, the memory 2000, and the processor 3000 are implemented independently, they may be connected to one another by a bus and communicate with one another over it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented in the figure by a single thick line, which does not mean that there is only one bus or only one type of bus.
Optionally, in a specific implementation, if the communication interface 1000, the memory 2000, and the processor 3000 are integrated on a single chip, they may communicate with one another through internal interfaces.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the disclosure includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order, depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the disclosure belong. The processor executes the methods and processing described above. For example, the method embodiments of the disclosure may be implemented as a software program that is tangibly embodied in a machine-readable medium, such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via the memory and/or the communication interface. When the software program is loaded into the memory and executed by the processor, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the above methods in any other suitable manner (for example, by means of firmware).
The logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them).
For the purposes of this specification, a "readable storage medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and can then be stored in a memory.
It should be understood that each part of the disclosure may be implemented in hardware, software, or a combination of the two. In the embodiments above, multiple steps or methods may be implemented in software that is stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the methods of the embodiments above may be carried out by hardware under the instruction of a program. The program may be stored in a readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination of them.
In addition, the functional units of the embodiments of the disclosure may be integrated into one processing module, may each exist physically on their own, or may be integrated two or more to a module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, a reference to "one embodiment/mode", "some embodiments/modes", "an example", "a specific example", "some examples", or the like means that a specific feature, structure, material, or characteristic described in connection with that embodiment/mode or example is included in at least one embodiment/mode or example of the present application. In this specification, such expressions do not necessarily refer to the same embodiment/mode or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments/modes or examples. Furthermore, provided they do not conflict with one another, those skilled in the art may combine the different embodiments/modes or examples described in this specification and the features of those different embodiments/modes or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality of" means at least two, for example two or three, unless specifically defined otherwise.
Those skilled in the art will understand that the embodiments above serve only to illustrate the disclosure clearly and do not limit its scope. Those skilled in the art may make other variations or modifications on the basis of the above disclosure, and such variations or modifications remain within the scope of the disclosure.