CN110008475A

CN110008475A - Word segmentation processing method, device, equipment and storage medium

Info

Publication number: CN110008475A
Application number: CN201910284804.1A
Authority: CN
Inventors: 孟振南
Original assignee: Chumen Wenwen Information Technology Co Ltd
Current assignee: Volkswagen China Investment Co Ltd; Mobvoi Innovation Technology Co Ltd
Priority date: 2019-04-10
Filing date: 2019-04-10
Publication date: 2019-07-12

Abstract

The present disclosure provides a word segmentation processing method, comprising: performing sequence labeling on the characters in an input sentence; segmenting the sentence into words based on the sequence labeling result; and adjusting the segmentation result according to one or more domain word dictionaries, the adjusted segmentation result being taken as the final segmentation result. The present disclosure further provides a word segmentation processing device, an electronic device, and a readable storage medium.

Description

Word segmentation processing method, device, equipment and storage medium

Technical field

The present disclosure relates to a word segmentation processing method, a word segmentation processing device, an electronic device, and a readable storage medium.

Background art

In the field of natural language understanding, existing word segmentation methods fall broadly into two categories: methods based on string matching, and methods based on statistics and machine learning.

String-matching methods essentially scan the character string against a dictionary. Common variants include forward maximum matching, reverse maximum matching, and bidirectional matching. One example is MMSEG (A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm). Algorithms of this kind are fast, have low time complexity, and are simple to implement, but they handle unregistered words (words that do not appear in the dictionary) poorly.

Methods based on statistics and machine learning recognize unregistered words comparatively well and can reach high segmentation accuracy for the field in which they are used, but they are more complex to implement and usually require a large amount of preparatory work.

Summary of the invention

To solve at least one of the above technical problems, the present disclosure provides a word segmentation processing method, a word segmentation processing device, an electronic device, and a readable storage medium.

According to one aspect of the present disclosure, a word segmentation processing method comprises: performing sequence labeling on the characters in an input sentence; segmenting the sentence into words based on the sequence labeling result; and adjusting the segmentation result according to one or more domain word dictionaries, the adjusted segmentation result being taken as the final segmentation result.

According to at least one embodiment of the present disclosure, when segmenting based on the sequence labeling result, adjacent characters with the same label are taken as one word.

According to at least one embodiment of the present disclosure, when the segmentation result is adjusted according to the one or more domain word dictionaries, the segmentation obtained from the sequence labeling result is adjusted by forward maximum matching against the words present in the one or more domain word dictionaries.

According to at least one embodiment of the present disclosure, the method further comprises constructing the one or more domain word dictionaries, wherein, when the dictionaries are constructed, a weight or confidence in the domain is determined for each word in the one or more domain word dictionaries.

According to at least one embodiment of the present disclosure, when the segmentation result is adjusted according to the one or more domain word dictionaries, it is adjusted according to the weights or confidences of the words present in those dictionaries.

According to at least one embodiment of the present disclosure, sequence labeling is performed on the characters in the input sentence by a conditional random field.

According to at least one embodiment of the present disclosure, when the weight or confidence in the domain is determined for each word in the one or more domain word dictionaries, it is determined according to the popularity, relationships, and/or frequency of occurrence of the word.

According to another aspect of the present disclosure, a word segmentation processing device comprises: a sequence labeling module that performs sequence labeling on the characters in an input sentence; a label-based segmentation module that segments the sentence into words based on the sequence labeling result; and a segmentation adjustment module that adjusts the segmentation result according to one or more domain word dictionaries, the adjusted segmentation result being taken as the final segmentation result.

According to yet another aspect of the present disclosure, an electronic device comprises: a memory storing computer-executable instructions; and a processor that executes the computer-executable instructions stored in the memory, so that the processor performs the above method.

According to still another aspect of the present disclosure, a readable storage medium stores computer-executable instructions which, when executed by a processor, implement the above method.

Brief description of the drawings

The accompanying drawings illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain its principles. They are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification.

Fig. 1 is a schematic flowchart of a word segmentation processing method according to one embodiment of the present disclosure.

Fig. 2 is a schematic flowchart of a word segmentation processing method according to one embodiment of the present disclosure.

Fig. 3 is a schematic block diagram of a word segmentation processing device according to one embodiment of the present disclosure.

Fig. 4 is a schematic block diagram of a word segmentation processing device according to one embodiment of the present disclosure.

Fig. 5 is a schematic view of an electronic device according to one embodiment of the present disclosure.

Detailed description of the embodiments

The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant content and do not limit the present disclosure. It should also be noted that, for ease of description, the drawings show only the parts relevant to the present disclosure.

It should be noted that, where no conflict arises, the embodiments of the present disclosure and the features in those embodiments may be combined with one another. The present disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.

Natural language understanding (NLU) covers many areas, including sentence detection, word segmentation, part-of-speech tagging, syntactic analysis, text classification/clustering, text angle, information extraction/automatic summarization, machine translation, automatic question answering, and text generation.

A conditional random field (CRF) is a conditional probability distribution model of one set of output random variables given another set of input random variables.
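The disclosure does not fix a particular form of CRF. For orientation only, the linear-chain CRF commonly used for sequence labeling can be written as below. The symbols are introduced here for illustration and do not appear in the disclosure: x is the input character sequence of length T, y is the label sequence, f_k are feature functions, λ_k are their learned weights, and Z(x) is the normalizing factor.

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```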

According to one embodiment of the present disclosure, a word segmentation processing method is provided. As shown in Fig. 1, the word segmentation processing method 10 may include: step S11, performing sequence labeling on the characters in an input sentence; step S12, segmenting based on the sequence labeling result; and step S13, adjusting the segmentation result according to one or more domain word dictionaries and taking the adjusted segmentation result as the final segmentation result.

In step S11, sequence labeling is performed on the input sentence, which is made up of characters. The labeling may be done with, for example, a CRF, a hidden Markov model (HMM), or a maximum entropy Markov model (MEMM). In the present disclosure, a CRF is preferred. This step converts the word segmentation problem into a sequence labeling problem.

In the present disclosure, optionally, a label is assigned to each character. A suitable label scheme may be used for the sequence labeling, for example IO, BIO, BMEWO, or BMEWO+. BMEWO is used as the example below. Labeling the text "张小明去内蒙古出差" ("Zhang Xiaoming goes to Inner Mongolia on a business trip") with the BMEWO scheme gives: 张/B-PER, 小/M-PER, 明/E-PER, 去/O, 内/B-LOC, 蒙/M-LOC, 古/E-LOC, 出/O, 差/O. Here B-PER (B-PERSON) marks the beginning of a person name, M-PER (M-PERSON) the middle of a person name, and E-PER (E-PERSON) the end of a person name; O (OTHER) marks anything else; B-LOC (B-LOCATION) marks the beginning of a place name, M-LOC (M-LOCATION) the middle of a place name, and E-LOC (E-LOCATION) the end of a place name. Other label schemes may also be used for the labeling in the present disclosure.

In step S12, segmentation is performed based on the sequence labeling result. For example, when every character has been labeled, the characters that carry the same label in the result of, say, CRF sequence labeling can be treated as one word, ignoring position prefixes such as those of BMEWO. In the example above, the three characters "内", "蒙", and "古" are all LOCATION once prefixes such as "B" are ignored, so they can be treated as the single word "内蒙古" (Inner Mongolia). In this way, word segmentation by a machine learning method such as CRF can be achieved.
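The grouping rule of step S12 can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: the function names are hypothetical, the labels are written by hand rather than produced by a CRF, and the sentence 张小明去内蒙古出差 is assumed to be the original of the example sentence above.

```python
from itertools import groupby


def label_type(tag):
    """Strip a BMEWO-style position prefix: 'B-LOC' -> 'LOC', 'O' -> 'O'."""
    return tag.split("-", 1)[1] if "-" in tag else tag


def words_from_labels(chars, tags):
    """Merge adjacent characters whose labels share the same type into one word."""
    pairs = zip(chars, tags)
    return ["".join(ch for ch, _ in group)
            for _, group in groupby(pairs, key=lambda pair: label_type(pair[1]))]


# Hand-written labels standing in for the output of the sequence labeling step.
chars = list("张小明去内蒙古出差")
tags = ["B-PER", "M-PER", "E-PER", "O", "B-LOC", "M-LOC", "E-LOC", "O", "O"]
segmented = words_from_labels(chars, tags)
print(segmented)
```

Because the prefix is ignored, two adjacent entities of the same type would be merged into one word under this rule; a variant that starts a new word at every "B" prefix is possible but is not what the text describes.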

In step S13, the segmentation result is adjusted according to one or more domain word dictionaries, and the adjusted result is taken as the final segmentation result. The domain word dictionaries can be chosen according to actual needs, and there may be one or more of them; the segmentation result of step S12 is adjusted by string matching against them. The string matching may use forward maximum matching. For example, a word or character produced in step S12 is compared, in forward order, with the words in the domain word dictionary. If the word exists in the dictionary, it is recorded. The candidate is then extended forward by one character and compared again; if the extended word exists in the dictionary, it is recorded as well. This repeats until the dictionary contains no corresponding word, at which point the comparison ends. After the comparison, the longest recorded match can be output as the adjusted segmentation result.
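One way to read this adjustment step is sketched below. Several points are assumptions rather than statements of the disclosure: the function name, the choice to scan candidates up to the length of the longest dictionary entry (instead of stopping at the first miss), and the fallback that leaves unmatched text grouped as step S12 produced it. The input segmentation, with "内蒙古" wrongly split, is a hypothetical example.

```python
def adjust_with_dictionary(words, dictionary):
    """Re-segment `words` so that dictionary entries become single words.

    words      -- segmentation from the labeling step, e.g. ["内", "蒙古"]
    dictionary -- set of domain words (hypothetical contents)
    """
    text = "".join(words)
    max_len = max((len(w) for w in dictionary), default=0)

    # End offsets of the original words, used to keep unmatched text as it was.
    boundaries, offset = set(), 0
    for word in words:
        offset += len(word)
        boundaries.add(offset)

    result, pending, i = [], "", 0
    while i < len(text):
        # Grow the candidate one character at a time and record every hit.
        recorded = [text[i:j]
                    for j in range(i + 1, min(len(text), i + max_len) + 1)
                    if text[i:j] in dictionary]
        if recorded:
            if pending:
                result.append(pending)
                pending = ""
            longest = max(recorded, key=len)  # longest recorded match wins
            result.append(longest)
            i += len(longest)
        else:
            pending += text[i]
            i += 1
            if i in boundaries:  # close the word where step S12 closed it
                result.append(pending)
                pending = ""
    if pending:
        result.append(pending)
    return result


adjusted = adjust_with_dictionary(["张小明", "去", "内", "蒙古", "出差"], {"内蒙古"})
print(adjusted)
```

Keeping the original boundaries for unmatched text means the dictionary only overrides the machine-learned segmentation where it has something to say, which matches the stated goal of adjusting rather than replacing the step S12 result.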

According to another embodiment of the present disclosure, as shown in Fig. 2, a word segmentation processing method 20 is also provided, comprising: step S21, performing sequence labeling on the characters in an input sentence; step S22, segmenting based on the sequence labeling result; step S23, adjusting the segmentation result according to one or more domain word dictionaries and taking the adjusted result as the final segmentation result; and step S24, constructing the one or more domain word dictionaries. Note that step S24 may be executed before steps S21 and S23; that is, the domain word dictionaries may be built in advance, so the step numbers do not indicate the order of execution.

Steps S21 and S22 may perform the same processing as steps S11 and S12, respectively.

In step S24, one or more domain word dictionaries are constructed. In the music domain, for example, domain word dictionaries for singers, albums, songs, and the like may be needed. The words of a domain word dictionary can be obtained using existing techniques. When a dictionary is constructed, each word in it can be given a weight, a confidence, or a similar value that indicates how important the word is. Factors that may be considered when setting the weight or confidence include the word's popularity, its relationships, and its frequency of occurrence. For popularity, a word with a higher popularity value can be given a correspondingly higher weight or confidence. For relationships, the relations between entities in a knowledge graph can be considered: if a word (entity) in the domain word dictionary is related to another entity of high importance, the word is itself considered more important on the strength of that relation, and its weight or confidence can be set higher. For frequency of occurrence, if a word in the domain word dictionary appears often in articles from the domain, its weight or confidence can likewise be set to a high value.
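The disclosure names the three factors but gives no formula for combining them. The sketch below assumes a simple weighted sum; the coefficients, the normalization of frequency by the largest count, the field names, and the placeholder entries "歌手A" and "歌手B" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EntryStats:
    popularity: float          # popularity value, assumed normalized to 0..1
    related_importance: float  # importance of the most important related entity, 0..1
    frequency: int             # occurrences in articles from the domain


def entry_weight(stats, max_frequency, coeffs=(0.4, 0.3, 0.3)):
    """Combine the three factors into one weight in 0..1 (assumed linear mix)."""
    freq_score = stats.frequency / max_frequency if max_frequency else 0.0
    a, b, c = coeffs
    return a * stats.popularity + b * stats.related_importance + c * freq_score


def build_dictionary(raw_stats):
    """Map each domain word to its weight."""
    max_freq = max((s.frequency for s in raw_stats.values()), default=0)
    return {word: entry_weight(s, max_freq) for word, s in raw_stats.items()}


# Placeholder statistics for a hypothetical singer dictionary.
singer_dictionary = build_dictionary({
    "歌手A": EntryStats(popularity=0.9, related_importance=0.8, frequency=120),
    "歌手B": EntryStats(popularity=0.2, related_importance=0.1, frequency=30),
})
print(singer_dictionary)
```

Any monotone combination would fit the text equally well; the only property taken from the disclosure is that higher popularity, more important related entities, and higher frequency each push the weight up.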

In step S23, the segmentation of step S22 can be adjusted according to the one or more domain word dictionaries constructed in step S24, by string matching against those dictionaries. The string matching may use forward maximum matching. During forward maximum matching, every matched word is recorded in the manner described above. A suitable word is then selected from the recorded words according to their weights or confidences. For example, when the weights or confidences of the words differ greatly, the match with the highest weight or confidence can be selected; when they differ little, the longest match can be selected. In this way, when the matched words conflict, the segmentation result is adjusted using the words' weights or confidences; where there is ambiguity, for instance, it can be resolved as far as possible.
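The selection rule among recorded matches can be sketched as below. The threshold that separates "differ greatly" from "differ little" is not given in the disclosure, so `gap_threshold` and the example weights are assumptions, as is the function name.

```python
def choose_match(candidates, weights, gap_threshold=0.2):
    """Pick one word from the matches recorded at the same start position.

    candidates    -- dictionary words that matched
    weights       -- word -> weight or confidence
    gap_threshold -- assumed cut-off between "large" and "small" differences
    """
    if not candidates:
        raise ValueError("no recorded matches to choose from")
    ranked = sorted(candidates, key=lambda w: weights[w], reverse=True)
    spread = weights[ranked[0]] - weights[ranked[-1]]
    if spread > gap_threshold:
        return ranked[0]             # weights differ a lot: trust the weight
    return max(candidates, key=len)  # weights are close: prefer the longest match


recorded = ["长江", "长江大桥"]
close_choice = choose_match(recorded, {"长江": 0.5, "长江大桥": 0.55})
far_choice = choose_match(recorded, {"长江": 0.95, "长江大桥": 0.3})
print(close_choice, far_choice)
```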

The method of the present disclosure is further illustrated below with a simple example.

For example, "南京市长江大桥" (Nanjing Yangtze River Bridge) is given as input to the CRF, and the CRF processing yields the segmentation "南京/市长/江大桥" (Nanjing / mayor / Jiang Daqiao). String matching is then performed against the domain word dictionaries "city dictionary" and "place name dictionary". Because the two dictionaries contain the words "南京市" (Nanjing City) and "长江大桥" (Yangtze River Bridge), the segmentation adjusted by the domain word dictionaries becomes "南京市/长江大桥", which is output as the final segmentation result. From the foregoing description, those skilled in the art will understand that the weight or confidence of each word can also be taken into account when adjusting with the domain word dictionaries, to optimize the segmentation result.
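The example can be traced with a short sketch that also reports which dictionary each adjusted word came from. The dictionary contents are assumed from the example (reading the two dictionary words as 南京市 and 长江大桥), the function names are hypothetical, and unmatched characters are emitted one at a time here, which is a simplification.

```python
# Hypothetical contents of the two domain word dictionaries in the example.
DICTIONARIES = {
    "city": {"南京市"},
    "place_name": {"长江大桥"},
}


def lookup(piece):
    """Return the name of the first dictionary containing `piece`, else None."""
    for name, words in DICTIONARIES.items():
        if piece in words:
            return name
    return None


def adjust(text):
    """Forward maximum matching over all dictionaries, longest candidate first."""
    longest = max((len(w) for words in DICTIONARIES.values() for w in words),
                  default=1)
    out, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            source = lookup(piece)
            if source is not None or size == 1:
                out.append((piece, source))
                i += size
                break
    return out


crf_words = ["南京", "市长", "江大桥"]  # segmentation assumed to come from the CRF
final = adjust("".join(crf_words))
print("/".join(word for word, _ in final))
```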

According to a further embodiment of the present disclosure, a word segmentation processing device is provided. As shown in Fig. 3, the word segmentation processing device 30 may include a sequence labeling module 31, a label-based segmentation module 32, and a segmentation adjustment module 33.

The sequence labeling module 31 performs sequence labeling on the input sentence, which is made up of characters. The labeling may be done with, for example, a CRF, a hidden Markov model (HMM), or a maximum entropy Markov model (MEMM). In the present disclosure, a CRF is preferred. This module converts the word segmentation problem into a sequence labeling problem. Optionally, a label is assigned to each character. A suitable label scheme may be used for the sequence labeling, for example IO, BIO, BMEWO, or BMEWO+.

The label-based segmentation module 32 segments based on the sequence labeling result. For example, when every character has been labeled, the characters that carry the same label in the result of, say, CRF sequence labeling can be treated as one word, ignoring position prefixes such as those of BMEWO. With this module, word segmentation by a machine learning method such as CRF can be achieved.

The segmentation adjustment module 33 adjusts the segmentation result according to one or more domain word dictionaries and takes the adjusted result as the final segmentation result. The domain word dictionaries can be chosen according to actual needs, and there may be one or more of them; the segmentation result of the label-based segmentation module 32 (the counterpart of step S12) is adjusted by string matching against them. The string matching may use forward maximum matching. For example, a word or character produced by the label-based segmentation module 32 is compared, in forward order, with the words in the domain word dictionary. If the word exists in the dictionary, it is recorded. The candidate is then extended forward by one character and compared again; if the extended word exists in the dictionary, it is recorded as well. This repeats until the dictionary contains no corresponding word, at which point the comparison ends. After the comparison, the longest recorded match can be output as the adjusted segmentation result.

According to yet another embodiment of the present disclosure, as shown in Fig. 4, a word segmentation processing device 40 is also provided, comprising: a sequence labeling module 41, a label-based segmentation module 42, a segmentation adjustment module 43, and a dictionary construction module 44. Note that the processing of the dictionary construction module 44 may be carried out before the processing of the sequence labeling module 41 and the label-based segmentation module 42; that is, the one or more domain word dictionaries may be built in advance.

The sequence labeling module 41 and the label-based segmentation module 42 may perform the same processing as the sequence labeling module 31 and the label-based segmentation module 32, respectively.

The dictionary construction module 44 constructs one or more domain word dictionaries. In the music domain, for example, domain word dictionaries for singers, albums, songs, and the like may be needed. The words of a domain word dictionary can be obtained using existing techniques. When a dictionary is constructed, each word in it can be given a weight, a confidence, or a similar value that indicates how important the word is. Factors that may be considered when setting the weight or confidence include the word's popularity, its relationships, and its frequency of occurrence. For popularity, a word with a higher popularity value can be given a correspondingly higher weight or confidence. For relationships, the relations between entities in a knowledge graph can be considered: if a word (entity) in the domain word dictionary is related to another entity of high importance, the word is itself considered more important on the strength of that relation, and its weight or confidence can be set higher. For frequency of occurrence, if a word in the domain word dictionary appears often in articles from the domain, its weight or confidence can likewise be set to a high value.

The segmentation adjustment module 43 can adjust the segmentation of the label-based segmentation module 42 according to the one or more domain word dictionaries constructed by the dictionary construction module 44, by string matching against those dictionaries. The string matching may use forward maximum matching. During forward maximum matching, every matched word is recorded in the manner described above. A suitable word is then selected from the recorded words according to their weights or confidences. For example, when the weights or confidences of the words differ greatly, the match with the highest weight or confidence can be selected; when they differ little, the longest match can be selected. In this way, when the matched words conflict, the segmentation result is adjusted using the words' weights or confidences; where there is ambiguity, for instance, it can be resolved as far as possible.

In the present disclosure, the shortcomings of existing segmentation algorithms are addressed in light of practical domain requirements: a machine learning algorithm is combined with domain-customized dictionaries, which on the one hand improves segmentation accuracy and on the other hand improves adaptability to the domain in real application scenarios. By combining the traditional string-matching segmentation method with a machine learning method, and taking the practical application scenario into account, the disclosure both compensates for the machine learning method's poor recognition of domain words and solves the difficulty that pure string matching has in recognizing unregistered words. Segmentation accuracy can thus be effectively improved, and the segmentation result is better suited to the domain.

The present disclosure also provides an electronic device. As shown in Fig. 5, the device includes a communication interface 1000, a memory 2000, and a processor 3000. The communication interface 1000 communicates with external devices to exchange data. The memory 2000 stores a computer program that can run on the processor 3000. The processor 3000 implements the methods of the above embodiments when it executes the computer program. There may be one or more memories 2000 and processors 3000.

The memory 2000 may include high-speed RAM and may further include non-volatile memory, for example at least one disk storage device.

If the communication interface 1000, the memory 2000, and the processor 3000 are implemented independently, they can be connected to one another and communicate over a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, it is drawn as a single thick line in the figure, but this does not mean that there is only one bus or one type of bus.

Optionally, in a specific implementation, if the communication interface 1000, the memory 2000, and the processor 3000 are integrated on a single chip, they can communicate with one another through internal interfaces.

Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present disclosure includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the present disclosure belong. The processor performs each of the methods and processes described above. For example, the method embodiments of the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium, such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via the memory and/or the communication interface. When the software program is loaded into the memory and executed by the processor, one or more steps of the methods described above can be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the above methods in any other suitable way (for example, by means of firmware).

The logic and/or steps represented in a flowchart or otherwise described herein may be embodied in any readable storage medium, for use by or in connection with an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them).

For the purposes of this specification, a "readable storage medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The readable storage medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a memory.

It should be understood that the parts of the present disclosure may be implemented in hardware, software, or a combination of the two. In the above embodiments, multiple steps or methods may be implemented in software that is stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques known in the art: a discrete logic circuit with logic gates for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art will understand that all or part of the steps of the methods of the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination of them.

In addition, the functional units in the embodiments of the present disclosure may be integrated in one processing module, may each exist physically on their own, or may be integrated two or more to a module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

In this specification, a description that refers to "one embodiment/mode", "some embodiments/modes", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment/mode or example is included in at least one embodiment/mode or example of the present application. Such expressions do not necessarily refer to the same embodiment/mode or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments/modes or examples. In addition, where no contradiction arises, those skilled in the art may combine the different embodiments/modes or examples described in this specification, and their features.

In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means at least two, for example two or three, unless specifically defined otherwise.

Those skilled in the art will understand that the above embodiments serve only to illustrate the present disclosure clearly and do not limit its scope. Those skilled in the art may make other changes or modifications on the basis of the above disclosure, and such changes or modifications remain within the scope of the present disclosure.

Claims

1. A word segmentation processing method, characterized by comprising:

performing sequence labeling on the characters in an input sentence;

segmenting based on the sequence labeling result; and

adjusting the segmentation result according to one or more domain word dictionaries, and taking the adjusted segmentation result as the final segmentation result.

2. The method of claim 1, characterized in that, when segmenting based on the sequence labeling result, adjacent characters with the same label are taken as one word.

3. The method of claim 2, characterized in that, when the segmentation result is adjusted according to the one or more domain word dictionaries, the segmentation obtained from the sequence labeling result is adjusted by forward maximum matching against the words present in the one or more domain word dictionaries.

4. The method of any one of claims 1 to 3, characterized by further comprising constructing the one or more domain word dictionaries, wherein, when the one or more domain word dictionaries are constructed, a weight or confidence in the domain is determined for each word in them.

5. The method of claim 4, characterized in that, when the segmentation result is adjusted according to the one or more domain word dictionaries, it is adjusted according to the weights or confidences of the words present in the one or more domain word dictionaries.

6. The method of any one of claims 1 to 5, characterized in that sequence labeling is performed on the characters in the input sentence by a conditional random field.

7. The method of claim 4 or 5, wherein, when the weight or confidence in the domain is determined for each word in the one or more domain word dictionaries, it is determined according to the popularity, relationships, and/or frequency of occurrence of the word.

8. A word segmentation processing device, characterized by comprising:

a sequence labeling module that performs sequence labeling on the characters in an input sentence;

a label-based segmentation module that segments based on the sequence labeling result; and

a segmentation adjustment module that adjusts the segmentation result according to one or more domain word dictionaries and takes the adjusted segmentation result as the final segmentation result.

9. An electronic device, characterized by comprising:

a memory storing executable instructions; and

a processor that executes the executable instructions stored in the memory, so that the processor performs the method of any one of claims 1 to 7.

10. A readable storage medium, characterized in that executable instructions are stored in the readable storage medium and, when executed by a processor, implement the method of any one of claims 1 to 7.