CN110457683A - Model optimization method, apparatus, computer equipment and storage medium - Google Patents

Model optimization method, apparatus, computer equipment and storage medium Download PDF

Info

Publication number
CN110457683A
CN110457683A (application CN201910636482.2A)
Authority
CN
China
Prior art keywords
label
text fragments
level
contextual window
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910636482.2A
Other languages
Chinese (zh)
Other versions
CN110457683B (en)
Inventor
孙辉丰
孙叔琦
孙珂
杨煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910636482.2A priority Critical patent/CN110457683B/en
Publication of CN110457683A publication Critical patent/CN110457683A/en
Application granted granted Critical
Publication of CN110457683B publication Critical patent/CN110457683B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a model optimization method and apparatus, a computer device, and a storage medium. The method may include: obtaining a trained sequence labeling model; labeling each sentence in a predetermined large-scale corpus with the sequence labeling model; determining, based on the labeling results, mislabeled sentences according to a predetermined policy; correcting the mislabeled sentences and using the corrected sentences as training data; and optimizing the sequence labeling model according to the training data. With the disclosed scheme, problems in the sequence labeling model can be discovered automatically and optimized in a targeted manner, thereby improving model accuracy.

Description

Model optimization method, apparatus, computer equipment and storage medium
[technical field]
The present invention relates to computer application technology, and in particular to a model optimization method and apparatus, a computer device, and a storage medium.
[background technique]
Sequence labeling models are commonly used in the field of natural language processing (NLP). Many important research directions, such as word segmentation, part-of-speech tagging, and named entity recognition, can be abstracted as sequence labeling problems.
In research on sequence labeling problems, the basic approach is to train a sequence labeling model on manually annotated corpora (i.e., training data); the performance of the model depends on the quality and quantity of the annotations.
Manual annotation, however, is costly and slow, and specialized corpora such as part-of-speech-tagged corpora can only be produced by domain experts. Constrained by labor cost and the like, training corpora are therefore usually not very large, which limits the accuracy of the trained model.
[summary of the invention]
In view of this, the present invention provides a model optimization method and apparatus, a computer device, and a storage medium.
The specific technical solution is as follows:
A model optimization method, comprising:
obtaining a trained sequence labeling model;
labeling each sentence in a predetermined large-scale corpus with the sequence labeling model;
determining, based on the labeling results, mislabeled sentences according to a predetermined policy;
correcting the mislabeled sentences, and using the corrected sentences as training data;
optimizing the sequence labeling model according to the training data.
According to a preferred embodiment of the present invention, determining the mislabeled sentences according to the predetermined policy comprises:
screening, from the text fragments that have been given labels, the text fragments satisfying the following condition: the same text fragment has been given different labels in different context windows;
screening, from the context windows, the context windows satisfying the following condition: different text fragments within the same context window have been given different labels;
determining labels for the screened text fragments and context windows, respectively;
determining the mislabeled sentences according to the determined labels.
According to a preferred embodiment of the present invention, the labels comprise first-level labels and second-level labels;
the method further comprises: if any screened text fragment has no first-level label, discarding the text fragment; if any screened context window has no first-level label, discarding the context window.
According to a preferred embodiment of the present invention, determining the labels of the screened text fragments comprises:
for each screened text fragment, performing the following processing respectively:
counting the number of times the text fragment has been given a label, to obtain a first statistical result;
obtaining all the labels given to the text fragment; for each distinct label, counting the number of times the text fragment was given that label, to obtain a second statistical result; dividing the second statistical result by the first statistical result; if the obtained quotient is greater than a first threshold, taking the label as a first-level label of the text fragment, and otherwise taking the label as a second-level label of the text fragment.
According to a preferred embodiment of the present invention, determining the labels of the screened context windows comprises:
for each screened context window, performing the following processing respectively:
counting the number of times the text fragments in the context window have been given labels, to obtain a third statistical result;
obtaining all the labels given to the text fragments in the context window; for each distinct label, counting the number of times the text fragments in the context window were given that label, to obtain a fourth statistical result; dividing the fourth statistical result by the third statistical result; if the obtained quotient is greater than a second threshold, taking the label as a first-level label of the context window, and otherwise taking the label as a second-level label of the context window.
According to a preferred embodiment of the present invention, the method further comprises: for each screened context window, if it is determined that the context window does not meet a predetermined confidence requirement, discarding the context window; otherwise, determining the labels of the context window.
According to a preferred embodiment of the present invention, determining the mislabeled sentences according to the determined labels comprises:
for each second-level label of each context window, performing the following processing respectively:
if it is determined that any sentence contains the context window and the text fragment in the context window has been given the second-level label, then, when the text fragment belongs to the screened text fragments, the second-level label is also a second-level label of the text fragment, and the first-level label of the context window is consistent with the first-level label of the text fragment, taking the sentence as a mislabeled sentence.
According to a preferred embodiment of the present invention, correcting the mislabeled sentence comprises: modifying the label of the text fragment in the context window to the first-level label of the text fragment.
A model optimization apparatus, comprising: an acquiring unit, a labeling unit, a correcting unit, and an optimizing unit;
the acquiring unit is configured to obtain a trained sequence labeling model;
the labeling unit is configured to label each sentence in a predetermined large-scale corpus with the sequence labeling model;
the correcting unit is configured to determine, based on the labeling results, mislabeled sentences according to a predetermined policy, correct the mislabeled sentences, and use the corrected sentences as training data;
the optimizing unit is configured to optimize the sequence labeling model according to the training data.
According to a preferred embodiment of the present invention, the correcting unit screens, from the text fragments that have been given labels, the text fragments satisfying the following condition: the same text fragment has been given different labels in different context windows; screens, from the context windows, the context windows satisfying the following condition: different text fragments within the same context window have been given different labels; determines labels for the screened text fragments and context windows, respectively; and determines the mislabeled sentences according to the determined labels.
According to a preferred embodiment of the present invention, the labels comprise first-level labels and second-level labels;
the correcting unit is further configured to discard any screened text fragment that has no first-level label, and to discard any screened context window that has no first-level label.
According to a preferred embodiment of the present invention, for each screened text fragment, the correcting unit performs the following processing respectively: counting the number of times the text fragment has been given a label, to obtain a first statistical result; obtaining all the labels given to the text fragment; for each distinct label, counting the number of times the text fragment was given that label, to obtain a second statistical result; dividing the second statistical result by the first statistical result; if the obtained quotient is greater than the first threshold, taking the label as a first-level label of the text fragment, and otherwise taking the label as a second-level label of the text fragment.
According to a preferred embodiment of the present invention, for each screened context window, the correcting unit performs the following processing respectively: counting the number of times the text fragments in the context window have been given labels, to obtain a third statistical result; obtaining all the labels given to the text fragments in the context window; for each distinct label, counting the number of times the text fragments in the context window were given that label, to obtain a fourth statistical result; dividing the fourth statistical result by the third statistical result; if the obtained quotient is greater than the second threshold, taking the label as a first-level label of the context window, and otherwise taking the label as a second-level label of the context window.
According to a preferred embodiment of the present invention, the correcting unit is further configured to, for each screened context window, discard the context window if it is determined that the context window does not meet a predetermined confidence requirement, and otherwise determine the labels of the context window.
According to a preferred embodiment of the present invention, for each second-level label of each context window, the correcting unit performs the following processing respectively: if it is determined that any sentence contains the context window and the text fragment in the context window has been given the second-level label, then, when the text fragment belongs to the screened text fragments, the second-level label is also a second-level label of the text fragment, and the first-level label of the context window is consistent with the first-level label of the text fragment, taking the sentence as a mislabeled sentence.
According to a preferred embodiment of the present invention, the correcting unit modifies the label of the text fragment in the context window to the first-level label of the text fragment.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described above when executing the program.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method described above.
As can be seen from the above, with the disclosed scheme, an existing trained sequence labeling model can be used to automatically label a large-scale corpus; mislabeled sentences are determined based on the labeling results and then corrected, and the corrected sentences serve as training data with which the sequence labeling model is optimized. In this way, problems in the sequence labeling model can be discovered automatically and optimized in a targeted manner, thereby improving model accuracy.
[Description of the drawings]
Fig. 1 is a flowchart of an embodiment of the model optimization method of the present invention.
Fig. 2 is a schematic diagram of the overall implementation process of the model optimization method of the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of the model optimization apparatus of the present invention.
Fig. 4 is a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention.
[Detailed description of embodiments]
To make the technical solution of the present invention clearer, the disclosed scheme is further described below with reference to the drawings and embodiments.
Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, that A and B exist simultaneously, or that B exists alone. The character "/" herein generally indicates an "or" relationship between the objects before and after it.
Fig. 1 is a flowchart of an embodiment of the model optimization method of the present invention. As shown in Fig. 1, it comprises the following specific implementation.
In 101, a trained sequence labeling model is obtained.
In 102, each sentence in a predetermined large-scale corpus is labeled with the sequence labeling model.
In 103, based on the labeling results, mislabeled sentences are determined according to a predetermined policy.
In 104, the mislabeled sentences are corrected, and the corrected sentences are used as training data.
In 105, the sequence labeling model is optimized according to the training data.
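Steps 101 to 105 amount to a single optimization pass over the corpus. A minimal Python sketch of that pass follows, under the assumption of a generic `label(sentence)` model interface; `ToyModel`, `is_error`, and `fix` are illustrative stand-ins, not components named by the patent:

```python
class ToyModel:
    """Stand-in for a trained sequence labeling model (step 101)."""
    def label(self, sentence):
        # Deliberately mislabels "Shigezhuang" as PER so the pass has
        # something to correct.
        return [(tok, "PER" if tok == "Shigezhuang" else "O")
                for tok in sentence.split()]

def optimization_pass(model, corpus, is_error, fix):
    annotations = {s: model.label(s) for s in corpus}        # step 102
    wrong = [s for s in corpus if is_error(annotations[s])]  # step 103
    return [fix(annotations[s]) for s in wrong]              # step 104: training data

corpus = ["navigate to Shigezhuang", "navigate to Beijing"]
is_error = lambda tags: any(label == "PER" for _, label in tags)
fix = lambda tags: [(tok, "LOC" if label == "PER" else label)
                    for tok, label in tags]
training_data = optimization_pass(ToyModel(), corpus, is_error, fix)
# training_data would then drive step 105 (retraining or fine-tuning)
```

The corrected sentences returned by the pass are exactly the training data that step 105 feeds back into the model.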
The initial sequence labeling model can be trained in an existing way, and can then be optimized in the manner described in this embodiment.
Each sentence in the predetermined large-scale corpus is labeled with the sequence labeling model. Which corpora the predetermined large-scale corpus contains, and its specific scale, can be determined according to actual needs; an article, a web page, etc. can each serve as a corpus, and each sentence in each corpus can be labeled separately.
The following takes a named entity recognition model as an example of the sequence labeling model.
For example, for the sentence "navigate to Beijing", the following labeling result may be obtained: navigate to [Beijing]LOC, where the label given to "Beijing" is "LOC", indicating a place name. For another example, for the sentence "his hometown is in Shigezhuang", the following labeling result may be obtained: his hometown is in [Shigezhuang]LOC, where the label given to "Shigezhuang" is "LOC".
Based on the labeling results, the mislabeled sentences can be determined according to the predetermined policy.
Specifically, the following two kinds of data can first be screened out:
1) From the text fragments that have been given labels, screen out the text fragments satisfying the following condition: the same text fragment has been given different labels in different context windows.
Any content that has been given a label can serve as a text fragment to be screened; for example, "Beijing" and "Shigezhuang" above can both serve as text fragments to be screened.
From the text fragments to be screened, those satisfying the following condition can be screened out: the same text fragment has been given different labels in different context windows. For example, the text fragment "Shigezhuang" appears in different sentences, such as "his hometown is in Shigezhuang" and "I like beautiful Shigezhuang". In "his hometown is in Shigezhuang" the fragment "Shigezhuang" was given the label "LOC", while in "I like beautiful Shigezhuang" it was given the label "PER", where "PER" indicates a person name. That is, the same text fragment "Shigezhuang" was given different labels in different context windows, so "Shigezhuang" can be taken as a screened text fragment.
In a sequence labeling task, although it is possible for the same text fragment to legitimately carry different labels in different context windows, in most cases this indicates a labeling error; that is, such cases are mislabeled with high probability, and can therefore be screened out.
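The first screening condition can be sketched as follows; representing each labeled occurrence as a (fragment, context_window, label) triple is an assumption made for illustration, not a data structure from the patent:

```python
from collections import defaultdict

def conflicting_fragments(occurrences):
    """Screen out text fragments that received different labels in
    different context windows (screening condition 1).
    `occurrences`: iterable of (fragment, context_window, label)."""
    seen = defaultdict(set)                  # fragment -> {(window, label)}
    for fragment, window, label in occurrences:
        seen[fragment].add((window, label))
    return {
        frag for frag, pairs in seen.items()
        if any(w1 != w2 and l1 != l2
               for (w1, l1) in pairs for (w2, l2) in pairs)
    }

occurrences = [
    ("Shigezhuang", "his hometown is in **", "LOC"),
    ("Shigezhuang", "I like beautiful **", "PER"),
    ("Beijing", "navigate to **", "LOC"),
]
flagged = conflicting_fragments(occurrences)
```

Here "Shigezhuang" is flagged because its two occurrences sit in different windows with different labels, while "Beijing" is not.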
2) From the context windows, screen out the context windows satisfying the following condition: different text fragments within the same context window have been given different labels.
For example, "navigate to **" in the sentence "navigate to Beijing" is a context window.
For example, the different sentences "navigate to Beijing" and "navigate to Shigezhuang" share the context window "navigate to **", but the text fragment "Beijing" in "navigate to Beijing" was given the label "LOC", while the text fragment "Shigezhuang" in "navigate to Shigezhuang" was given the label "PER". That is, different text fragments in the same context window "navigate to **" were given different labels, so "navigate to **" can be taken as a screened context window.
Within the same context window, different text fragments usually play the same semantic role and therefore, with high probability, share the same label; if the labels differ, a labeling error is likely, so such windows can be screened out.
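The second screening condition is symmetric, keyed by context window instead of fragment; the same assumed (fragment, context_window, label) triple representation is used here purely for illustration:

```python
from collections import defaultdict

def conflicting_windows(occurrences):
    """Screen out context windows in which different text fragments
    received different labels (screening condition 2).
    `occurrences`: iterable of (fragment, context_window, label)."""
    seen = defaultdict(set)                  # window -> {(fragment, label)}
    for fragment, window, label in occurrences:
        seen[window].add((fragment, label))
    return {
        win for win, pairs in seen.items()
        if any(f1 != f2 and l1 != l2
               for (f1, l1) in pairs for (f2, l2) in pairs)
    }

occurrences = [
    ("Beijing", "navigate to **", "LOC"),
    ("Shigezhuang", "navigate to **", "PER"),
    ("Shigezhuang", "his hometown is in **", "LOC"),
]
flagged = conflicting_windows(occurrences)
```

"navigate to **" is flagged because two different fragments inside it carry two different labels; the window seen only once is not.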
After screening 1) and 2) are completed, the labels of the screened text fragments and context windows can be determined respectively, and the mislabeled sentences can then be determined according to the determined labels.
The labels may include first-level labels and second-level labels. If a screened text fragment has no first-level label, it can be discarded; similarly, if a screened context window has no first-level label, it can be discarded.
For each screened text fragment, its first-level and second-level labels can be determined as follows: count the number of times the text fragment has been given a label, obtaining a first statistical result; obtain all the labels given to the text fragment; for each distinct label, count the number of times the text fragment was given that label, obtaining a second statistical result; divide the second statistical result by the first statistical result; if the obtained quotient is greater than a first threshold, take the label as a first-level label of the text fragment, and otherwise as a second-level label of the text fragment.
For example, suppose the text fragment "Shigezhuang" appears in 20 different sentences in total and has been labeled 20 times, with labels "LOC" and "PER": 16 times as "LOC" and 4 times as "PER". For the label "LOC", 16/20 = 80% can be computed, which is greater than the first threshold (e.g., 50%), so "LOC" serves as the first-level label of "Shigezhuang"; for the label "PER", 4/20 = 20% is less than 50%, so "PER" serves as a second-level label of "Shigezhuang".
The specific value of the first threshold can be determined according to actual needs; preferably it is the aforementioned 50%. If the first threshold is 50%, each text fragment has only one first-level label, and may have one or more second-level labels.
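The first and second statistical results and the threshold test can be sketched as one small counting routine; the same routine, called with a threshold of 0.7, also covers the context-window case (third and fourth statistical results) described next. The function name and return shape are illustrative assumptions:

```python
from collections import Counter

def split_labels(observed_labels, threshold=0.5):
    """Return (first_level, second_level) labels for one item.
    A label is first-level when its share of all labelings strictly
    exceeds `threshold`; with threshold >= 0.5 at most one label can
    qualify, matching the text's remark about uniqueness."""
    counts = Counter(observed_labels)
    total = sum(counts.values())           # first statistical result
    first_level, second_level = None, []
    for label, n in counts.items():        # n: second statistical result
        if n / total > threshold:          # quotient vs. first threshold
            first_level = label
        else:
            second_level.append(label)
    return first_level, second_level

# The "Shigezhuang" example: 16 x LOC, 4 x PER out of 20 labelings.
first, second = split_labels(["LOC"] * 16 + ["PER"] * 4, threshold=0.5)
```

With the example counts, 16/20 = 80% > 50% makes "LOC" the first-level label, and 20% puts "PER" among the second-level labels.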
For each screened context window, its first-level and second-level labels can be determined as follows: count the number of times the text fragments in the context window have been given labels, obtaining a third statistical result; obtain all the labels given to the text fragments in the context window; for each distinct label, count the number of times the text fragments in the context window were given that label, obtaining a fourth statistical result; divide the fourth statistical result by the third statistical result; if the obtained quotient is greater than a second threshold, take the label as a first-level label of the context window, and otherwise as a second-level label of the context window.
For example, suppose the context window "navigate to **" appears in 20 different sentences, so the text fragments in the window have been labeled 20 times in total, with labels "LOC" and "PER": 16 times as "LOC" and 4 times as "PER". For the label "LOC", 16/20 = 80% can be computed, which is greater than the second threshold (e.g., 70%), so "LOC" serves as the first-level label of "navigate to **"; for the label "PER", 4/20 = 20% is less than 70%, so "PER" serves as a second-level label of "navigate to **".
The specific value of the second threshold can be determined according to actual needs; it may be the same as or different from the first threshold, and preferably is the aforementioned 70%. If the second threshold is 70%, each context window has only one first-level label, and may have one or more second-level labels.
In this embodiment, for each screened context window, it can first be determined whether the window meets a predetermined confidence requirement; if not, the window can be discarded, and otherwise its labels are determined in the manner above.
For example, the number of occurrences of each screened context window can be counted; if it is very low, e.g., the window occurs only twice in the whole corpus, the window is considered not to follow any regular pattern, to lack generality, and to have low confidence, so it can be discarded, reducing the subsequent workload.
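The confidence requirement can be sketched as a simple frequency cutoff; the cutoff of 3 occurrences and the second window string are illustrative assumptions (the text only says that a window seen twice in the whole corpus is too rare to trust):

```python
def confident_windows(window_counts, min_occurrences=3):
    """Keep only context windows frequent enough to be trusted; rare
    windows lack generality and are discarded before their labels
    are determined."""
    return {win for win, n in window_counts.items()
            if n >= min_occurrences}

kept = confident_windows({"navigate to **": 20, "send a fax to **": 2})
```

The window seen 20 times survives; the one seen only twice is dropped, which reduces the subsequent workload.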
After the first-level and second-level labels of the text fragments and context windows are determined, the mislabeled sentences can be determined according to the determined labels.
Specifically, for each second-level label of each context window, the following can be done: if a sentence contains the context window and the text fragment in that window in the sentence has been given the second-level label, then, when the text fragment belongs to the screened text fragments, the second-level label is also a second-level label of the text fragment, and the first-level label of the context window is consistent with the first-level label of the text fragment, the sentence is determined to be a mislabeled sentence.
For example, suppose the context window "navigate to **" has first-level label "LOC" and second-level label "PER", and a sentence is "navigate to Shigezhuang". The sentence contains the context window "navigate to **"; the text fragment "Shigezhuang" in the sentence has been given the label "PER"; "Shigezhuang" belongs to the screened text fragments; "PER" is also a second-level label of "Shigezhuang"; and the first-level labels of both "navigate to **" and "Shigezhuang" are "LOC". The sentence can therefore be determined to be a mislabeled sentence.
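The four-part check and the correction that follows it can be sketched together; the dictionaries mapping each fragment or window to a (first_level, second_level_set) pair are an assumed representation for illustration:

```python
def mislabel_check(fragment, label, window, frag_labels, win_labels):
    """True when a sentence containing `window`, with `fragment` given
    `label`, matches the error pattern: the label is a second-level
    label of both the window and the fragment, and their first-level
    labels agree."""
    if fragment not in frag_labels or window not in win_labels:
        return False
    f_first, f_second = frag_labels[fragment]
    w_first, w_second = win_labels[window]
    return label in w_second and label in f_second and f_first == w_first

frag_labels = {"Shigezhuang": ("LOC", {"PER"})}
win_labels = {"navigate to **": ("LOC", {"PER"})}

# "navigate to Shigezhuang" with "Shigezhuang" labeled PER is flagged...
flagged = mislabel_check("Shigezhuang", "PER", "navigate to **",
                         frag_labels, win_labels)
# ...and the correction replaces PER with the fragment's first-level label.
corrected_label = frag_labels["Shigezhuang"][0] if flagged else "PER"
```

The flagged sentence's fragment label is then rewritten to the fragment's first-level label, which is exactly the correction step described next.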
Further, the mislabeled sentence can be corrected. Specifically, the label of the text fragment in the above context window can be modified to the first-level label of that text fragment.
For example, in the sentence "navigate to Shigezhuang", the label of the text fragment "Shigezhuang", originally "PER", can be modified to the fragment's first-level label, i.e., "LOC".
The corrected sentences can be used as training data. In this way, multiple pieces of training data can be obtained, and the sequence labeling model can be further optimized according to the training data.
The specific manner of optimization is not limited. For example, the acquired training data can be added to the previous training data, and the sequence labeling model retrained with the updated training data; alternatively, the model can be fine-tuned on its original basis with the acquired training data. After the sequence labeling model is optimized, the process shown in Fig. 1 can be repeated to continuously improve the model.
Based on the above, Fig. 2 is a schematic diagram of the overall implementation process of the model optimization method of the present invention. As shown in Fig. 2, after the sequence labeling model is obtained, it is used to automatically label the large-scale corpus, producing labeling results. Based on the labeling results, the qualifying text fragments and context windows are screened out: from the text fragments that have been given labels, those satisfying the condition that the same text fragment has been given different labels in different context windows; and from the context windows, those satisfying the condition that different text fragments within the same context window have been given different labels. For each screened text fragment, its first-level and second-level labels can be determined, and any text fragment with no first-level label is discarded. For each screened context window, if it does not meet the predetermined confidence requirement it is discarded; otherwise its first-level and second-level labels are determined, and any context window with no first-level label is discarded. Then, based on the determined first-level and second-level labels of the text fragments and context windows, the mislabeled sentences are determined and corrected, yielding corrected training data with which the sequence labeling model is optimized. After the sequence labeling model is optimized, the process shown in Fig. 2 is repeated to continuously improve the model. For implementation details, refer to the related description of the embodiment shown in Fig. 1, which is not repeated here.
It should be noted that, for brevity, the foregoing method embodiment is described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described sequence of actions: according to the present invention, certain steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In short, with the disclosed scheme, problems in the sequence labeling model can be discovered automatically and optimized in a targeted manner, thereby improving model accuracy, and the model can be optimized continuously.
Following the above introduction of the method embodiment, the disclosed scheme is further described below through an apparatus embodiment.
Fig. 3 is a schematic structural diagram of an embodiment of the model optimization apparatus of the present invention. As shown in Fig. 3, it comprises: an acquiring unit 301, a labeling unit 302, a correcting unit 303, and an optimizing unit 304.
The acquiring unit 301 is configured to obtain a trained sequence labeling model.
The labeling unit 302 is configured to label each sentence in a predetermined large-scale corpus with the sequence labeling model.
The correcting unit 303 is configured to determine, based on the labeling results, mislabeled sentences according to a predetermined policy, correct the mislabeled sentences, and use the corrected sentences as training data.
The optimizing unit 304 is configured to optimize the sequence labeling model according to the training data.
An initial sequence labeling model may be obtained through training in an existing manner, and each sentence in the predetermined large-scale corpus may then be labeled using the sequence labeling model.

The correction unit 303 may determine the mislabeled sentences according to the predetermined policy based on the labeling results. Specifically, the correction unit 303 may filter out, from the text fragments that have been labeled, the text fragments satisfying the following condition: the same text fragment has been given different labels in different context windows. It may further filter out, from the context windows, the context windows satisfying the following condition: different text fragments within the same context window have been given different labels. The correction unit 303 then determines the labels of the filtered-out text fragments and context windows respectively, and determines the mislabeled sentences according to the determined labels.
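The two filtering conditions above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the labeling results are available as (fragment, window, label) records, where `window` is some identifier of the context surrounding the fragment; all names and the data layout are hypothetical.

```python
from collections import defaultdict

def filter_candidates(records):
    """records: iterable of (fragment, window, label) tuples produced by
    running the sequence labeling model over the corpus."""
    # Condition 1: the same text fragment received different labels
    # in different context windows.
    frag_labels = defaultdict(set)
    for frag, win, label in records:
        frag_labels[frag].add(label)
    fragments = {f for f, labels in frag_labels.items() if len(labels) > 1}

    # Condition 2: text fragments inside the same context window
    # received different labels.
    win_labels = defaultdict(set)
    for frag, win, label in records:
        win_labels[win].add(label)
    windows = {w for w, labels in win_labels.items() if len(labels) > 1}
    return fragments, windows
```

Only the fragments and windows returned here need the per-label statistics computed in the subsequent steps, which keeps the later processing small relative to the corpus.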
The labels may include first-level labels and second-level labels. If any filtered-out text fragment has no first-level label, the correction unit 303 may discard that text fragment; similarly, if any filtered-out context window has no first-level label, the correction unit 303 may discard that context window.
For each filtered-out text fragment, the correction unit 303 may determine the fragment's first-level and second-level labels in the following manner: counting the number of times the text fragment has been labeled, obtaining a first statistical result; acquiring all labels the text fragment has been given, and, for each distinct label, counting the number of times the text fragment has been given that label, obtaining a second statistical result; and dividing the second statistical result by the first statistical result. If the resulting quotient is greater than a first threshold, that label may be taken as a first-level label of the text fragment; otherwise, it may be taken as a second-level label of the text fragment.
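The first and second statistical results and the threshold comparison amount to a label-frequency ratio test. A sketch, with the input layout and the threshold value chosen purely for illustration:

```python
from collections import Counter

def classify_labels(occurrences, threshold):
    """occurrences: list of labels a single text fragment received across
    the corpus; its length is the 'first statistical result'.
    Returns (first_level, second_level) label lists."""
    total = len(occurrences)  # first statistical result
    first, second = [], []
    for label, count in Counter(occurrences).items():
        # count is the 'second statistical result' for this label
        if count / total > threshold:
            first.append(label)   # dominant label for the fragment
        else:
            second.append(label)  # minority label for the fragment
    return first, second
```

The same procedure applies per context window (the third and fourth statistical results below), with the counts taken over all fragments inside the window and a second threshold in place of the first.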
For each filtered-out context window, the correction unit 303 may likewise determine the window's first-level and second-level labels in the following manner: counting the number of times the text fragments within the context window have been labeled, obtaining a third statistical result; acquiring all labels given to the text fragments within the context window, and, for each distinct label, counting the number of times the text fragments within the window have been given that label, obtaining a fourth statistical result; and dividing the fourth statistical result by the third statistical result. If the resulting quotient is greater than a second threshold, that label may be taken as a first-level label of the context window; otherwise, it may be taken as a second-level label of the context window.

For each filtered-out context window, the correction unit 303 may also first determine whether the context window meets a predetermined confidence requirement. If it does not, the context window may be discarded; otherwise, its labels may be determined in the manner described above.

For example, for each filtered-out context window, the correction unit 303 may count the number of occurrences of the context window. If that number is very low, for example the window appears only twice in the entire corpus, the context window may be considered to match no fixed pattern, to lack generality, and to have low confidence, and may therefore be discarded, thereby reducing the subsequent processing workload.
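The occurrence-count confidence check in the example reduces to a simple frequency cutoff; `min_count` is an illustrative parameter, not a value given in the patent:

```python
from collections import Counter

def confident_windows(window_occurrences, min_count=3):
    """window_occurrences: list of context-window identifiers, one entry
    per occurrence in the corpus. Windows appearing fewer than min_count
    times are treated as low-confidence and dropped."""
    counts = Counter(window_occurrences)
    return {w for w, c in counts.items() if c >= min_count}
```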
After the first-level and second-level labels of the text fragments and context windows have been determined, the correction unit 303 may determine the mislabeled sentences according to the determined labels.

Specifically, for each second-level label of each context window, the correction unit 303 may perform the following processing: if it is determined that a sentence contains the context window and that the text fragment within the context window in that sentence has been given this second-level label, and if the text fragment belongs to the filtered-out text fragments, this second-level label is likewise a second-level label of the text fragment, and the first-level label of the context window is consistent with the first-level label of the text fragment, the sentence is determined to be a mislabeled sentence.

Further, the correction unit 303 may correct the mislabeled sentence. Specifically, the label of the text fragment within the above context window may be corrected to the first-level label of the text fragment.
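The detection condition and the correction rule can be combined in one routine. A sketch under stated assumptions: all names are hypothetical, the label dictionaries are assumed to come from the ratio test sketched earlier, and "consistent first-level labels" is modeled here as a non-empty intersection of the two first-level label sets:

```python
def correct_sentence(sentence_labels, window, fragment,
                     frag_first, frag_second, win_first, win_second):
    """sentence_labels: dict mapping fragments in one sentence to labels.
    Returns corrected labels if the mislabeling condition holds, else None."""
    label = sentence_labels.get(fragment)
    if (label in win_second.get(window, ())        # window's second-level label
            and fragment in frag_second            # fragment was filtered out
            and label in frag_second[fragment]     # same second-level label
            and set(win_first.get(window, ()))
                & set(frag_first.get(fragment, ()))):  # consistent first-level labels
        # Correct to the fragment's first-level label.
        corrected = dict(sentence_labels)
        corrected[fragment] = frag_first[fragment][0]
        return corrected
    return None
```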
The correction unit 303 may use the corrected sentences as training data. In this way, multiple pieces of training data can be obtained, and accordingly the optimization unit 304 may optimize the sequence labeling model according to the training data.

The specific optimization manner is not limited. For example, the acquired training data may be added to the previous training data and the sequence labeling model retrained using the updated training data; alternatively, the acquired training data may be used to fine-tune the sequence labeling model on its existing basis.
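The two optimization options, full retraining versus fine-tuning, could be structured as below. `StubLabeler` is a stand-in used only to make the flow concrete; a real model would wrap an actual sequence labeling trainer, and none of these names come from the patent:

```python
class StubLabeler:
    """Minimal stand-in for a sequence labeling model; records the data
    it was trained on instead of learning anything."""
    def __init__(self):
        self.seen = []

    def reset(self):
        self.seen = []

    def train(self, data):
        self.seen.extend(data)

def optimize(model, old_data, new_data, mode="finetune"):
    """mode='retrain': rebuild from scratch on the combined data.
    mode='finetune': continue training from the current state."""
    if mode == "retrain":
        model.reset()
        model.train(old_data + new_data)
    else:
        model.train(new_data)
    return model
```

Retraining gives the corrected sentences equal weight with the original data; fine-tuning is cheaper and keeps the existing model as the starting point. Which is preferable depends on how much corrected data the previous steps produced.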
For the specific workflow of the apparatus embodiment shown in Fig. 3, please refer to the related description in the foregoing method embodiments; it is not repeated here.
Fig. 4 shows a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention. The computer system/server 12 shown in Fig. 4 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.

As shown in Fig. 4, the computer system/server 12 takes the form of a general-purpose computing device. Components of the computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 connecting different system components (including the memory 28 and the processor 16).

The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
The computer system/server 12 typically comprises a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer system/server 12, including volatile and non-volatile media, and removable and non-removable media.

The memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, a storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly referred to as a "hard disk drive"). Although not shown in Fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disk drive for reading from and writing to a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM, or other optical media), may be provided. In these cases, each drive may be connected to the bus 18 via one or more data media interfaces. The memory 28 may include at least one program product having a set of (e.g., at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40, having a set of (at least one) program modules 42, may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods in the embodiments described in the present invention.

The computer system/server 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any device (such as a network card, a modem, etc.) that enables the computer system/server 12 to communicate with one or more other computing devices. Such communication may take place via an input/output (I/O) interface 22. Moreover, the computer system/server 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 20. As shown in Fig. 4, the network adapter 20 communicates with the other modules of the computer system/server 12 via the bus 18. It should be understood that, although not shown in the drawings, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 16 executes various functional applications and data processing by running the programs stored in the memory 28, for example, implementing the method in the embodiment shown in Fig. 1.

The present invention also discloses a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method in the embodiment shown in Fig. 1.
Any combination of one or more computer-readable media may be used. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example (but not limited to), an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus, or device.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.

Program code contained on a computer-readable medium may be transmitted over any suitable medium, including, but not limited to, wireless, wire, optical cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatuses, methods, and the like may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only one kind of logical functional division, and there may be other division manners in actual implementation.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solution of this embodiment.

In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

The above integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. Such a software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods of the various embodiments of the present invention. The aforementioned storage media include various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disk.

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (18)

1. A model optimization method, comprising:
acquiring a sequence labeling model obtained through training;
labeling each sentence in a predetermined large-scale corpus using the sequence labeling model;
determining mislabeled sentences according to a predetermined policy based on the labeling results;
correcting the mislabeled sentences, and using the corrected sentences as training data;
optimizing the sequence labeling model according to the training data.
2. The method according to claim 1, wherein
determining the mislabeled sentences according to the predetermined policy comprises:
filtering out, from the text fragments that have been labeled, the text fragments satisfying the following condition: the same text fragment has been given different labels in different context windows;
filtering out, from the context windows, the context windows satisfying the following condition: different text fragments within the same context window have been given different labels;
determining the labels of the filtered-out text fragments and context windows respectively;
determining the mislabeled sentences according to the determined labels.
3. The method according to claim 2, wherein
the labels include first-level labels and second-level labels;
the method further comprises: if any filtered-out text fragment has no first-level label, discarding that text fragment; if any filtered-out context window has no first-level label, discarding that context window.
4. The method according to claim 3, wherein
determining the labels of the filtered-out text fragments comprises:
performing the following processing for each filtered-out text fragment respectively:
counting the number of times the text fragment has been labeled, obtaining a first statistical result;
acquiring all labels the text fragment has been given, and, for each distinct label, counting the number of times the text fragment has been given that label, obtaining a second statistical result; dividing the second statistical result by the first statistical result; and, if the resulting quotient is greater than a first threshold, taking the label as a first-level label of the text fragment, otherwise taking the label as a second-level label of the text fragment.
5. The method according to claim 4, wherein
determining the labels of the filtered-out context windows comprises:
performing the following processing for each filtered-out context window respectively:
counting the number of times the text fragments within the context window have been labeled, obtaining a third statistical result;
acquiring all labels given to the text fragments within the context window, and, for each distinct label, counting the number of times the text fragments within the context window have been given that label, obtaining a fourth statistical result; dividing the fourth statistical result by the third statistical result; and, if the resulting quotient is greater than a second threshold, taking the label as a first-level label of the context window, otherwise taking the label as a second-level label of the context window.
6. The method according to claim 5, wherein
the method further comprises: for each filtered-out context window, if it is determined that the context window does not meet a predetermined confidence requirement, discarding the context window; otherwise, determining the labels of the context window.
7. The method according to claim 5, wherein
determining the mislabeled sentences according to the determined labels comprises:
performing the following processing for each second-level label of each context window respectively:
if it is determined that a sentence contains the context window and that the text fragment within the context window has been given said second-level label, then, when the text fragment belongs to the filtered-out text fragments, said second-level label is likewise a second-level label of the text fragment, and the first-level label of the context window is consistent with the first-level label of the text fragment, taking the sentence as a mislabeled sentence.
8. The method according to claim 7, wherein
correcting the mislabeled sentence comprises: correcting the label of the text fragment within the context window to the first-level label of the text fragment.
9. A model optimization apparatus, comprising: an acquiring unit, a labeling unit, a correction unit, and an optimization unit;
wherein the acquiring unit is configured to acquire a sequence labeling model obtained through training;
the labeling unit is configured to label each sentence in a predetermined large-scale corpus using the sequence labeling model;
the correction unit is configured to determine mislabeled sentences according to a predetermined policy based on the labeling results, correct the mislabeled sentences, and use the corrected sentences as training data;
and the optimization unit is configured to optimize the sequence labeling model according to the training data.
10. The apparatus according to claim 9, wherein
the correction unit filters out, from the text fragments that have been labeled, the text fragments satisfying the following condition: the same text fragment has been given different labels in different context windows; filters out, from the context windows, the context windows satisfying the following condition: different text fragments within the same context window have been given different labels; determines the labels of the filtered-out text fragments and context windows respectively; and determines the mislabeled sentences according to the determined labels.
11. The apparatus according to claim 10, wherein
the labels include first-level labels and second-level labels;
the correction unit is further configured to discard any filtered-out text fragment that has no first-level label, and to discard any filtered-out context window that has no first-level label.
12. The apparatus according to claim 11, wherein
the correction unit performs the following processing for each filtered-out text fragment respectively: counting the number of times the text fragment has been labeled, obtaining a first statistical result; acquiring all labels the text fragment has been given, and, for each distinct label, counting the number of times the text fragment has been given that label, obtaining a second statistical result; dividing the second statistical result by the first statistical result; and, if the resulting quotient is greater than a first threshold, taking the label as a first-level label of the text fragment, otherwise taking the label as a second-level label of the text fragment.
13. The apparatus according to claim 12, wherein
the correction unit performs the following processing for each filtered-out context window respectively: counting the number of times the text fragments within the context window have been labeled, obtaining a third statistical result; acquiring all labels given to the text fragments within the context window, and, for each distinct label, counting the number of times the text fragments within the context window have been given that label, obtaining a fourth statistical result; dividing the fourth statistical result by the third statistical result; and, if the resulting quotient is greater than a second threshold, taking the label as a first-level label of the context window, otherwise taking the label as a second-level label of the context window.
14. The apparatus according to claim 13, wherein
the correction unit is further configured to, for each filtered-out context window, discard the context window if it is determined that the context window does not meet a predetermined confidence requirement, and otherwise determine the labels of the context window.
15. The apparatus according to claim 13, wherein
the correction unit performs the following processing for each second-level label of each context window respectively: if it is determined that a sentence contains the context window and that the text fragment within the context window has been given said second-level label, then, when the text fragment belongs to the filtered-out text fragments, said second-level label is likewise a second-level label of the text fragment, and the first-level label of the context window is consistent with the first-level label of the text fragment, taking the sentence as a mislabeled sentence.
16. The apparatus according to claim 15, wherein
the correction unit corrects the label of the text fragment within the context window to the first-level label of the text fragment.
17. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1 to 8.
18. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 8.
CN201910636482.2A 2019-07-15 2019-07-15 Model optimization method and device, computer equipment and storage medium Active CN110457683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910636482.2A CN110457683B (en) 2019-07-15 2019-07-15 Model optimization method and device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN110457683A true CN110457683A (en) 2019-11-15
CN110457683B CN110457683B (en) 2023-04-07

Family

ID=68481237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910636482.2A Active CN110457683B (en) 2019-07-15 2019-07-15 Model optimization method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110457683B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866337A (en) * 2009-04-14 2010-10-20 日电(中国)有限公司 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model
CN108228557A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 A kind of method and device of sequence labelling
US20180267997A1 (en) * 2017-03-20 2018-09-20 Adobe Systems Incorporated Large-scale image tagging using image-to-topic embedding
CN109271630A (en) * 2018-09-11 2019-01-25 成都信息工程大学 A kind of intelligent dimension method and device based on natural language processing
CN109299296A (en) * 2018-11-01 2019-02-01 郑州云海信息技术有限公司 A kind of interactive image text marking method and system
CN109460551A (en) * 2018-10-29 2019-03-12 北京知道创宇信息技术有限公司 Signing messages extracting method and device
US20190205794A1 (en) * 2017-12-29 2019-07-04 Oath Inc. Method and system for detecting anomalies in data labels
CN109992763A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 Language marks processing method, system, electronic equipment and computer-readable medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zeng Daojian et al.: "Open Entity Attribute Extraction for Unstructured Text", Journal of Jiangxi Normal University (Natural Science Edition) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110955433A (en) * 2019-11-27 2020-04-03 中国银行股份有限公司 Method and device for generating automatic deployment script
CN110955433B (en) * 2019-11-27 2023-08-29 中国银行股份有限公司 Automatic deployment script generation method and device
CN112528671A (en) * 2020-12-02 2021-03-19 北京小米松果电子有限公司 Semantic analysis method, semantic analysis device and storage medium

Also Published As

Publication number Publication date
CN110457683B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN107992596B (en) Text clustering method, text clustering device, server and storage medium
CN108170749B (en) Dialog method, device and computer readable medium based on artificial intelligence
CN107204184B (en) Audio recognition method and system
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN103400577B (en) The acoustic model method for building up of multilingual speech recognition and device
CN109614625B (en) Method, device and equipment for determining title text relevancy and storage medium
CN108537176B (en) Target barrage identification method and device, terminal and storage medium
CN108460011B (en) Entity concept labeling method and system
AU2017408800B2 (en) Method and system of mining information, electronic device and readable storable medium
CN108062388A (en) Interactive reply generation method and device
CN107678561A (en) Phonetic entry error correction method and device based on artificial intelligence
EP1619620A1 (en) Adaptation of Exponential Models
CN110245348A (en) A kind of intension recognizing method and system
US20060020448A1 (en) Method and apparatus for capitalizing text using maximum entropy
CN108304375A (en) A kind of information identifying method and its equipment, storage medium, terminal
CN107544726A (en) Method for correcting error of voice identification result, device and storage medium based on artificial intelligence
CN109599095A (en) A kind of mask method of voice data, device, equipment and computer storage medium
WO2021208727A1 (en) Text error detection method and apparatus based on artificial intelligence, and computer device
CN109947924B (en) Dialogue system training data construction method and device, electronic equipment and storage medium
CN112287698B (en) Chapter translation method and device, electronic equipment and storage medium
US20170004206A1 (en) Natural language interpretation of hierarchical data
CN115099239B (en) Resource identification method, device, equipment and storage medium
CN111858905A (en) Model training method, information identification method, device, electronic equipment and storage medium
CN113053367A (en) Speech recognition method, model training method and device for speech recognition
CN110457683A (en) Model optimization method, apparatus, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant