CN110457683A - Model optimization method, apparatus, computer equipment and storage medium - Google Patents
- Publication number: CN110457683A
- Application number: CN201910636482.2A
- Authority
- CN
- China
- Prior art keywords
- label
- text fragments
- level
- contextual window
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a model optimization method, apparatus, computer equipment and storage medium. The method may include: obtaining a trained sequence labeling model; labeling each sentence in a predetermined large-scale corpus with the sequence labeling model; determining, based on the labeling results, incorrectly labeled sentences according to a predetermined policy; correcting the incorrectly labeled sentences and taking the corrected sentences as training data; and optimizing the sequence labeling model according to the training data. With the disclosed scheme, problems of the sequence labeling model can be discovered automatically and optimized in a targeted manner, thereby improving model accuracy.
Description
[technical field]
The present invention relates to computer application technology, and in particular to a model optimization method, apparatus, computer equipment and storage medium.
[background technique]
Sequence labeling models are commonly used in the field of natural language processing (NLP). Many important research directions, such as word segmentation, part-of-speech tagging and named entity recognition, can be abstracted as sequence labeling problems.
In research on sequence labeling problems, the basic approach is to train a sequence labeling model on manually labeled training corpora (i.e., training data); the effectiveness of the model depends on the quality and quantity of the labeled data.
Manual labeling, however, is costly and time-consuming, and specialized corpora such as part-of-speech tagging corpora require domain experts to complete. Constrained by labor costs and the like, training corpora are therefore usually not very large, which limits the accuracy of the trained model.
[summary of the invention]
In view of this, the present invention provides a model optimization method, apparatus, computer equipment and storage medium.
The specific technical solution is as follows:
A model optimization method, comprising:
obtaining a trained sequence labeling model;
labeling each sentence in a predetermined large-scale corpus with the sequence labeling model;
determining, based on the labeling results, incorrectly labeled sentences according to a predetermined policy;
correcting the incorrectly labeled sentences, and taking the corrected sentences as training data; and
optimizing the sequence labeling model according to the training data.
According to a preferred embodiment of the present invention, determining the incorrectly labeled sentences according to the predetermined policy comprises:
screening, from the text fragments that have been labeled, the text fragments meeting the following condition: the same text fragment has been given different labels in different context windows;
screening, from the context windows, the context windows meeting the following condition: different text fragments in the same context window have been given different labels;
determining the labels of the screened text fragments and context windows respectively; and
determining the incorrectly labeled sentences according to the determined labels.
According to a preferred embodiment of the present invention, the labels include primary labels and secondary labels; the method further comprises: if any screened text fragment has no primary label, discarding the text fragment; if any screened context window has no primary label, discarding the context window.
According to a preferred embodiment of the present invention, determining the labels of the screened text fragments comprises performing the following processing for each screened text fragment:
counting the number of times the text fragment has been labeled, to obtain a first statistical result;
obtaining all labels the text fragment has been given, and for each distinct label, counting the number of times the text fragment has been given that label, to obtain a second statistical result; dividing the second statistical result by the first statistical result; if the resulting quotient is greater than a first threshold, taking the label as the primary label of the text fragment, and otherwise taking the label as a secondary label of the text fragment.
According to a preferred embodiment of the present invention, determining the labels of the screened context windows comprises performing the following processing for each screened context window:
counting the number of times the text fragments in the context window have been labeled, to obtain a third statistical result;
obtaining all labels the text fragments in the context window have been given, and for each distinct label, counting the number of times the text fragments in the context window have been given that label, to obtain a fourth statistical result; dividing the fourth statistical result by the third statistical result; if the resulting quotient is greater than a second threshold, taking the label as the primary label of the context window, and otherwise taking the label as a secondary label of the context window.
According to a preferred embodiment of the present invention, the method further comprises: for each screened context window, if it is determined that the context window does not meet a predetermined confidence requirement, discarding the context window; otherwise, determining the label of the context window.
According to a preferred embodiment of the present invention, determining the incorrectly labeled sentences according to the determined labels comprises performing the following processing for each secondary label of each context window: if it is determined that a sentence contains the context window and the text fragment in the context window has been given the secondary label, then, when the text fragment belongs to the screened text fragments, the secondary label is also a secondary label of the text fragment, and the primary label of the context window is consistent with the primary label of the text fragment, taking the sentence as an incorrectly labeled sentence.
According to a preferred embodiment of the present invention, correcting the incorrectly labeled sentence comprises: modifying the label of the text fragment in the context window to the primary label of the text fragment.
A model optimization apparatus, comprising: an acquiring unit, a labeling unit, a correction unit and an optimization unit;
the acquiring unit is configured to obtain a trained sequence labeling model;
the labeling unit is configured to label each sentence in a predetermined large-scale corpus with the sequence labeling model;
the correction unit is configured to determine, based on the labeling results, incorrectly labeled sentences according to a predetermined policy, correct the incorrectly labeled sentences, and take the corrected sentences as training data;
the optimization unit is configured to optimize the sequence labeling model according to the training data.
According to a preferred embodiment of the present invention, the correction unit screens out, from the text fragments that have been labeled, the text fragments meeting the following condition: the same text fragment has been given different labels in different context windows; screens out, from the context windows, the context windows meeting the following condition: different text fragments in the same context window have been given different labels; determines the labels of the screened text fragments and context windows respectively; and determines the incorrectly labeled sentences according to the determined labels.
According to a preferred embodiment of the present invention, the labels include primary labels and secondary labels; the correction unit is further configured to discard any screened text fragment that has no primary label, and to discard any screened context window that has no primary label.
According to a preferred embodiment of the present invention, the correction unit performs the following processing for each screened text fragment: counting the number of times the text fragment has been labeled, to obtain a first statistical result; obtaining all labels the text fragment has been given, and for each distinct label, counting the number of times the text fragment has been given that label, to obtain a second statistical result; dividing the second statistical result by the first statistical result; if the resulting quotient is greater than a first threshold, taking the label as the primary label of the text fragment, and otherwise taking the label as a secondary label of the text fragment.
According to a preferred embodiment of the present invention, the correction unit performs the following processing for each screened context window: counting the number of times the text fragments in the context window have been labeled, to obtain a third statistical result; obtaining all labels the text fragments in the context window have been given, and for each distinct label, counting the number of times the text fragments in the context window have been given that label, to obtain a fourth statistical result; dividing the fourth statistical result by the third statistical result; if the resulting quotient is greater than a second threshold, taking the label as the primary label of the context window, and otherwise taking the label as a secondary label of the context window.
According to a preferred embodiment of the present invention, the correction unit is further configured to, for each screened context window, discard the context window if it is determined that the context window does not meet a predetermined confidence requirement, and otherwise determine the label of the context window.
According to a preferred embodiment of the present invention, the correction unit performs the following processing for each secondary label of each context window: if it is determined that a sentence contains the context window and the text fragment in the context window has been given the secondary label, then, when the text fragment belongs to the screened text fragments, the secondary label is also a secondary label of the text fragment, and the primary label of the context window is consistent with the primary label of the text fragment, taking the sentence as an incorrectly labeled sentence.
According to a preferred embodiment of the present invention, the correction unit modifies the label of the text fragment in the context window to the primary label of the text fragment.
A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method described above when executing the program.
A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described above.
It can be seen from the above introduction that, with the disclosed scheme, an existing trained sequence labeling model can be used to automatically label a large-scale corpus, incorrectly labeled sentences can be determined based on the labeling results and then corrected, the corrected sentences can be taken as training data, and the sequence labeling model can be optimized according to the training data. Problems of the sequence labeling model can thus be discovered automatically and optimized in a targeted manner, thereby improving model accuracy.
[Brief description of the drawings]
Fig. 1 is a flowchart of an embodiment of the model optimization method of the present invention.
Fig. 2 is a schematic diagram of the overall implementation process of the model optimization method of the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of the model optimization apparatus of the present invention.
Fig. 4 is a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention.
[Detailed description]
To make the technical solution of the present invention clearer, the disclosed scheme is further described below with reference to the drawings and embodiments.
Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist. For example, "A and/or B" may indicate three cases: A alone, both A and B, and B alone. The character "/" herein generally indicates an "or" relationship between the associated objects before and after it.
Fig. 1 is a flowchart of an embodiment of the model optimization method of the present invention. As shown in Fig. 1, the method comprises the following implementation.
In 101, a trained sequence labeling model is obtained.
In 102, each sentence in a predetermined large-scale corpus is labeled with the sequence labeling model.
In 103, based on the labeling results, incorrectly labeled sentences are determined according to a predetermined policy.
In 104, the incorrectly labeled sentences are corrected, and the corrected sentences are taken as training data.
In 105, the sequence labeling model is optimized according to the training data.
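As an illustration only, steps 101 to 105 can be sketched as a single optimization round; the callables below (`label`, `find_errors`, `correct`, `retrain`) are hypothetical placeholders for the model and the screening/correction policy, not names used in the patent:

```python
def optimization_round(label, find_errors, correct, retrain, corpus):
    """One round of steps 101-105 over a corpus of sentences."""
    annotated = [label(s) for s in corpus]     # 102: auto-label each sentence
    wrong = find_errors(annotated)             # 103: screen labeling errors
    train_data = [correct(s) for s in wrong]   # 104: correct -> training data
    return retrain(train_data)                 # 105: optimize the model
```

Repeating this round, as the later figures describe, continuously improves the model.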
An initial sequence labeling model can be trained in an existing way, and then optimized in the manner described in this embodiment.
Each sentence in the predetermined large-scale corpus is labeled with the sequence labeling model. Which corpora the predetermined large-scale corpus contains, its specific scale, and so on can be determined according to actual needs; an article, a web page, etc. can each serve as a corpus, and each sentence in each corpus can be labeled separately.
The following description takes the sequence labeling model being a named entity recognition model as an example.
For example, for the sentence "navigate to Beijing", the following labeling result can be obtained: navigate to Beijing/LOC, where the label given to "Beijing" is "LOC", indicating a place name. For another example, for the sentence "his hometown is in Shigezhuang", the following labeling result can be obtained: his hometown is in Shigezhuang/LOC, where the label given to "Shigezhuang" is "LOC".
Based on the labeling results, the incorrectly labeled sentences can be determined according to a predetermined policy.
Specifically, the following two classes of data can be screened out first:
1) From the text fragments that have been labeled, screen out the text fragments meeting the following condition: the same text fragment has been given different labels in different context windows.
The labeled spans can be treated as the text fragments to be screened; for example, "Beijing" and "Shigezhuang" above, both of which have been labeled, can serve as text fragments to be screened.
From the text fragments to be screened, those meeting the following condition can be screened out: the same text fragment has been given different labels in different context windows. For example, the text fragment "Shigezhuang" appears in different sentences, such as "his hometown is in Shigezhuang" and "I like the beautiful Shigezhuang". In the sentence "his hometown is in Shigezhuang", the text fragment "Shigezhuang" has been given the label "LOC", while in the sentence "I like the beautiful Shigezhuang" it has been given the label "PER", indicating a person name. That is, the same text fragment "Shigezhuang" has been given different labels in different context windows, so the text fragment "Shigezhuang" can be taken as a screened text fragment.
In a sequence labeling task, although it is possible for the same text fragment to legitimately carry different labels in different context windows, in most cases this indicates a labeling error; that is, such cases have a high probability of being labeling errors and can therefore be screened out.
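A minimal sketch of this screening step, assuming the labeling results are available as (fragment, context window, label) triples; the function name and data layout are illustrative, not from the patent:

```python
from collections import defaultdict

def screen_fragments(annotations):
    """annotations: iterable of (fragment, context_window, label) triples
    produced by the sequence labeling model. Returns the fragments that
    received more than one distinct label -- likely labeling errors."""
    labels_by_fragment = defaultdict(set)
    for fragment, window, label in annotations:
        labels_by_fragment[fragment].add(label)
    return {f for f, labels in labels_by_fragment.items() if len(labels) > 1}
```

On the running example, "Shigezhuang" (labeled both LOC and PER) would be screened out while "Beijing" would not.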
2) From the context windows, screen out the context windows meeting the following condition: different text fragments in the same context window have been given different labels.
For example, "navigate to **" in the sentence "navigate to Beijing" is a context window.
For example, for the different sentences "navigate to Beijing" and "navigate to Shigezhuang", the context window is the same, namely "navigate to **", but the text fragment "Beijing" in the sentence "navigate to Beijing" has been given the label "LOC", while the text fragment "Shigezhuang" in the sentence "navigate to Shigezhuang" has been given the label "PER". That is, different text fragments in the same context window "navigate to **" have been given different labels, so the context window "navigate to **" can be taken as a screened context window.
In the same context window, different text fragments usually play the same semantic role and therefore have a high probability of sharing the same label; if their labels differ, there is a high probability of a labeling error, so such windows can be screened out.
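The symmetric screening of context windows can be sketched in the same illustrative form, again over assumed (fragment, context window, label) triples:

```python
from collections import defaultdict

def screen_windows(annotations):
    """Returns the context windows in which the contained text fragments
    received more than one distinct label -- likely labeling errors."""
    labels_by_window = defaultdict(set)
    for fragment, window, label in annotations:
        labels_by_window[window].add(label)
    return {w for w, labels in labels_by_window.items() if len(labels) > 1}
```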
After processing 1) and 2) is completed, the labels of the screened text fragments and context windows can be determined respectively, and the incorrectly labeled sentences can then be determined according to the determined labels.
The labels may include primary labels and secondary labels. If any screened text fragment has no primary label, the text fragment can be discarded; similarly, if any screened context window has no primary label, the context window can be discarded.
For each screened text fragment, its primary label and secondary labels can be determined as follows: count the number of times the text fragment has been labeled, to obtain a first statistical result; obtain all labels the text fragment has been given, and for each distinct label, count the number of times the text fragment has been given that label, to obtain a second statistical result; divide the second statistical result by the first statistical result; if the resulting quotient is greater than a first threshold, the label can be taken as the primary label of the text fragment, and otherwise as a secondary label of the text fragment.
For example, suppose the text fragment "Shigezhuang" appears in 20 different sentences in total and has been labeled 20 times, with the labels "LOC" (16 times) and "PER" (4 times). For the label "LOC", 16/20 = 80% can be calculated, which is greater than a first threshold of, say, 50%, so "LOC" can be taken as the primary label of the text fragment "Shigezhuang"; for the label "PER", 4/20 = 20% can be calculated, which is less than 50%, so "PER" can be taken as a secondary label of the text fragment "Shigezhuang".
The specific value of the first threshold can be determined according to actual needs; preferably, it can be 50% as above. If the first threshold is 50%, each text fragment has at most one primary label and may have one or more secondary labels.
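This primary/secondary split can be sketched as follows (names and data layout are illustrative, not from the patent); the same routine with the second threshold, e.g. 70%, applies to the context-window labels described next:

```python
from collections import Counter

def split_labels(label_occurrences, threshold=0.5):
    """label_occurrences: all labels a screened item was given, one entry
    per occurrence, e.g. ["LOC"] * 16 + ["PER"] * 4. A label whose share
    strictly exceeds the threshold becomes the primary label; the rest
    become secondary labels. With threshold >= 0.5, at most one label
    can qualify as primary."""
    total = len(label_occurrences)            # first statistical result
    primary, secondaries = None, []
    for label, count in Counter(label_occurrences).items():
        if count / total > threshold:         # second / first statistical result
            primary = label
        else:
            secondaries.append(label)
    return primary, secondaries
```

An item whose `primary` comes back as `None` corresponds to the "no primary label" case and would be discarded.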
For each screened context window, its primary label and secondary labels can be determined as follows: count the number of times the text fragments in the context window have been labeled, to obtain a third statistical result; obtain all labels the text fragments in the context window have been given, and for each distinct label, count the number of times the text fragments in the context window have been given that label, to obtain a fourth statistical result; divide the fourth statistical result by the third statistical result; if the resulting quotient is greater than a second threshold, the label can be taken as the primary label of the context window, and otherwise as a secondary label of the context window.
For example, suppose the context window "navigate to **" appears in 20 different sentences in total, so the text fragments in the context window have been labeled 20 times, with the labels "LOC" (16 times) and "PER" (4 times). For the label "LOC", 16/20 = 80% can be calculated, which is greater than a second threshold of, say, 70%, so "LOC" can be taken as the primary label of the context window "navigate to **"; for the label "PER", 4/20 = 20% can be calculated, which is less than 70%, so "PER" can be taken as a secondary label of the context window "navigate to **".
The specific value of the second threshold can be determined according to actual needs; it may be the same as or different from the first threshold, and preferably can be 70% as above. If the second threshold is 70%, each context window has at most one primary label and may have one or more secondary labels.
In this embodiment, for each screened context window, it can also first be determined whether the context window meets a predetermined confidence requirement. If it does not, the context window can be discarded; otherwise, its label can be determined in the manner described above.
For example, for each screened context window, its number of occurrences can be counted. If this number is very low, e.g., the window occurs only twice in the entire corpus, the context window is considered not to follow any regular pattern; it lacks generality and has low confidence, so it can be discarded, thereby reducing the workload of subsequent processing.
After the primary and secondary labels of the text fragments and context windows are determined, the incorrectly labeled sentences can be determined according to the determined labels.
Specifically, for each secondary label of each context window, the following processing can be performed: if it is determined that a sentence contains the context window, and the text fragment in the context window in that sentence has been given the secondary label, then, when the text fragment belongs to the screened text fragments, the secondary label is also a secondary label of the text fragment, and the primary label of the context window is consistent with the primary label of the text fragment, the sentence is determined to be an incorrectly labeled sentence.
For example, suppose the context window is "navigate to **", its primary label is "LOC" and its secondary label is "PER", and a sentence is "navigate to Shigezhuang". The sentence contains the context window "navigate to **"; the text fragment "Shigezhuang" in the sentence has been given the label "PER"; the text fragment "Shigezhuang" belongs to the screened text fragments; "PER" is also a secondary label of the text fragment "Shigezhuang"; and the primary label of the context window "navigate to **" and the primary label of the text fragment "Shigezhuang" are both "LOC". The sentence can then be determined to be an incorrectly labeled sentence.
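The conditions above can be sketched as a predicate over the previously determined labels; all names and the (primary, secondaries) layout are illustrative assumptions:

```python
def is_mislabeled(window, fragment, label,
                  window_labels, fragment_labels, screened_fragments):
    """window_labels / fragment_labels map a context window or text
    fragment to a (primary, secondaries) pair; screened_fragments is the
    set from the fragment-screening step. Returns True if a sentence in
    which `fragment` received `label` inside `window` meets all of the
    conditions for a labeling error."""
    if window not in window_labels:
        return False
    w_primary, w_secondaries = window_labels[window]
    if label not in w_secondaries:        # fragment got a secondary label of the window
        return False
    if fragment not in screened_fragments:
        return False
    f_primary, f_secondaries = fragment_labels[fragment]
    # the label must also be a secondary label of the fragment, and the
    # primary labels of window and fragment must agree
    return label in f_secondaries and w_primary == f_primary
```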
Further, the incorrectly labeled sentence can be corrected. Specifically, the label of the text fragment in the context window can be modified to the primary label of the text fragment.
For example, in the sentence "navigate to Shigezhuang", the label of the text fragment "Shigezhuang", previously "PER", can be modified to the primary label of the text fragment "Shigezhuang", i.e., "LOC".
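A sketch of this correction step, assuming a labeled sentence is represented as (span, label) pairs (an illustrative layout, not from the patent):

```python
def correct_labels(tagged_sentence, fragment, primary_label):
    """Replaces the label of the erroneous fragment with the fragment's
    primary label, e.g. ('Shigezhuang', 'PER') -> ('Shigezhuang', 'LOC')."""
    return [(span, primary_label if span == fragment else label)
            for span, label in tagged_sentence]
```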
The corrected sentences can be taken as training data. In this way, multiple pieces of training data can be obtained, and the sequence labeling model can be further optimized according to the training data.
The specific optimization approach is not limited. For example, the acquired training data can be added to the previous training data and the sequence labeling model retrained with the updated training data; alternatively, the acquired training data can be used to fine-tune the sequence labeling model on its original basis. After the sequence labeling model is optimized, the process shown in Fig. 1 can be repeated to continuously improve the model.
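Both optimization options can be sketched with a stand-in model; `DummyModel` and its `fit`/`finetune` methods are hypothetical, only to show the retrain-vs-fine-tune distinction:

```python
class DummyModel:
    """Stand-in for a sequence labeling model (hypothetical API)."""
    def __init__(self):
        self.seen = []
    def fit(self, data):         # full retraining from scratch
        self.seen = list(data)
        return self
    def finetune(self, data):    # continue training on new data only
        self.seen += list(data)
        return self

def optimize(model, old_data, new_data, mode="retrain"):
    """mode='retrain': merge new data into the old training set and
    retrain; mode='finetune': adjust the existing model on the new data."""
    if mode == "retrain":
        return model.fit(old_data + new_data)
    return model.finetune(new_data)
```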
Based on the above introduction, Fig. 2 is a schematic diagram of the overall implementation process of the model optimization method of the present invention. As shown in Fig. 2, after the sequence labeling model is obtained, the large-scale corpus is automatically labeled with the sequence labeling model, yielding labeling results. Based on the labeling results, the qualifying text fragments and context windows can be screened out: from the labeled text fragments, the text fragments meeting the following condition can be screened out: the same text fragment has been given different labels in different context windows; from the context windows, the context windows meeting the following condition can be screened out: different text fragments in the same context window have been given different labels. For each screened text fragment, its primary and secondary labels can be determined, where any text fragment with no primary label can be discarded. For each screened context window, if it is determined that the context window does not meet the predetermined confidence requirement, the context window can be discarded; otherwise, its primary and secondary labels can be determined, where any context window with no primary label can be discarded. Afterwards, based on the determined primary and secondary labels of the text fragments and context windows, the incorrectly labeled sentences can be determined and corrected, yielding corrected training data, and the sequence labeling model can then be optimized according to the training data. After the sequence labeling model is optimized, the process shown in Fig. 2 can be repeated to continuously improve the model. For the specific implementation, please refer to the related description of the embodiment shown in Fig. 1, which is not repeated here.
It should be noted that, for simplicity of description, the foregoing method embodiment is described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described sequence of actions, because according to the present invention, certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In short, with the disclosed scheme, problems of the sequence labeling model can be discovered automatically and optimized in a targeted manner, thereby improving model accuracy and allowing the model to be continuously optimized.
Following the introduction to the method embodiment above, the disclosed scheme is further described below through an apparatus embodiment.
Fig. 3 is a schematic structural diagram of an embodiment of the model optimization apparatus of the present invention. As shown in Fig. 3, the apparatus comprises: an acquiring unit 301, a labeling unit 302, a correction unit 303 and an optimization unit 304.
The acquiring unit 301 is configured to obtain a trained sequence labeling model.
The labeling unit 302 is configured to label each sentence in a predetermined large-scale corpus with the sequence labeling model.
The correction unit 303 is configured to determine, based on the labeling results, incorrectly labeled sentences according to a predetermined policy, correct the incorrectly labeled sentences, and take the corrected sentences as training data.
The optimization unit 304 is configured to optimize the sequence labeling model according to the training data.
An initial sequence labeling model can be trained in an existing way, and the sequence labeling model can be used to label each sentence in the predetermined large-scale corpus.
The correction unit 303 can determine, based on the labeling results, the incorrectly labeled sentences according to the predetermined policy. Specifically, the correction unit 303 can screen out, from the labeled text fragments, the text fragments meeting the following condition: the same text fragment has been given different labels in different context windows; screen out, from the context windows, the context windows meeting the following condition: different text fragments in the same context window have been given different labels; determine the labels of the screened text fragments and context windows respectively; and determine the incorrectly labeled sentences according to the determined labels.
The labels may include primary labels and secondary labels. If any screened text fragment has no primary label, the correction unit 303 can discard the text fragment; similarly, if any screened context window has no primary label, the correction unit 303 can discard the context window.
For each text fragments filtered out, amending unit 303 can determine this article this film in the following way respectively
The level-one label and second level label of section: statistics text segment is marked the number of label, obtains the first statistical result;Obtaining should
All labels that text fragments are marked count text segment respectively and are noted as different labels each of are marked
The number of the label obtains the second statistical result, with the second statistical result divided by the first statistical result, if obtained quotient is greater than the
One threshold value, then can be using the label as the level-one label of text segment, otherwise, can be using the label as the two of text segment
Grade label.
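A minimal sketch of this ratio test for a single text fragment; the value of the first threshold is an illustrative assumption, as the patent does not fix it:

```python
from collections import Counter

def fragment_label_levels(observed_labels, first_threshold=0.9):
    """Split the labels observed for one text fragment into a level-one
    label (share of occurrences above `first_threshold`) and second-level
    labels (the rest)."""
    total = len(observed_labels)              # first statistical result
    level_one, level_two = None, []
    for label, count in Counter(observed_labels).items():
        if count / total > first_threshold:   # second result / first result
            level_one = label
        else:
            level_two.append(label)
    return level_one, level_two
```

Intuitively, the dominant label becomes the level-one label (the presumed correct one) and rare alternatives become second-level labels (suspected errors).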
For each screened-out contextual window, the amending unit 303 may determine the level-one label and second-level labels of the contextual window as follows: count the total number of times the text fragments in the contextual window have been marked with a label, obtaining a third statistical result; obtain all labels the text fragments in the contextual window have been marked with, and for each distinct label, count the number of times the text fragments in the contextual window have been marked with that label, obtaining a fourth statistical result; divide the fourth statistical result by the third statistical result. If the resulting quotient is greater than a second threshold, the label is taken as a level-one label of the contextual window; otherwise, it is taken as a second-level label of the contextual window.
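The same ratio test applied per contextual window can be sketched as follows (same assumed (fragment, window, label) triple format; the second threshold value is illustrative):

```python
from collections import Counter, defaultdict

def window_label_levels(annotations, second_threshold=0.9):
    """For each contextual window, split the labels of the text fragments
    occurring in it into a level-one label and second-level labels.
    `annotations` is a list of (fragment, window, label) triples."""
    labels_by_window = defaultdict(list)
    for _fragment, window, label in annotations:
        labels_by_window[window].append(label)
    levels = {}
    for window, labels in labels_by_window.items():
        total = len(labels)                       # third statistical result
        level_one, level_two = None, []
        for label, count in Counter(labels).items():
            if count / total > second_threshold:  # fourth result / third result
                level_one = label
            else:
                level_two.append(label)
        levels[window] = (level_one, level_two)
    return levels
```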
For each screened-out contextual window, the amending unit 303 may also first determine whether the contextual window meets a predetermined confidence requirement. If it does not, the contextual window may be discarded; otherwise, the labels of the contextual window are determined in the manner described above.
For example, for each screened-out contextual window, the amending unit 303 may count the number of occurrences of the contextual window. If the number of occurrences is very low, for example the window occurs only twice in the entire corpus, the contextual window is considered not to conform to an established pattern: it lacks generality and its confidence is low, so it may be discarded, which also reduces the subsequent processing workload.
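The confidence screening by occurrence count might look like this; the minimum count is an illustrative assumption, since the text only notes that a window occurring twice is unreliable:

```python
from collections import Counter

def drop_low_confidence_windows(window_occurrences, min_count=3):
    """Keep only contextual windows that occur at least `min_count` times
    in the corpus; rarer windows are discarded as lacking generality."""
    counts = Counter(window_occurrences)
    return {w for w, c in counts.items() if c >= min_count}
```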
After the level-one and second-level labels of the text fragments and contextual windows have been determined, the amending unit 303 may determine the sentences with marking errors according to the determined labels.
Specifically, for each second-level label of each contextual window, the amending unit 303 may perform the following processing: if any sentence contains the contextual window and the text fragment in the contextual window in that sentence is marked with the second-level label, then, provided that the text fragment belongs to the screened-out text fragments, the second-level label is also a second-level label of that text fragment, and the level-one label of the contextual window is consistent with the level-one label of the text fragment, the sentence is determined to be a sentence with a marking error.
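A sketch of this check, assuming the screened-out items have been collected into dictionaries mapping each fragment or window to its (level-one label, second-level labels) pair — a hypothetical data shape chosen for illustration:

```python
def sentence_is_mislabeled(window_levels, fragment_levels, window, fragment, label):
    """Apply the condition above: a sentence is flagged when `label` is a
    second-level label of both the contextual window and the text fragment,
    and their level-one labels agree. Each `*_levels` entry maps an item to
    (level_one_label, list_of_second_level_labels)."""
    if window not in window_levels or fragment not in fragment_levels:
        return False  # the window/fragment was not among those screened out
    w_one, w_two = window_levels[window]
    f_one, f_two = fragment_levels[fragment]
    return label in w_two and label in f_two and w_one == f_one
```

The agreement of the two level-one labels is what makes the correction safe: both statistics point at the same presumably correct label.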
Further, the amending unit 303 may correct the sentences with marking errors. Specifically, the label of the text fragment in the above contextual window may be corrected to the level-one label of the text fragment.
The amending unit 303 may use the corrected sentences as training data. In this way, multiple pieces of training data can be obtained; accordingly, the optimization unit 304 may optimize the serializing marking model according to the training data.
The specific optimization manner is not limited. For example, the acquired training data may be added to the previous training data and the serializing marking model retrained with the updated training data; alternatively, the acquired training data may be used to fine-tune the serializing marking model on its original basis.
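The correction step can be sketched as follows, again under the hypothetical representation of a sentence's annotations as (fragment, window, label) triples:

```python
def correct_label(sentence_annotations, window, fragment, level_one_label):
    """Correct a flagged sentence: the fragment's label inside the given
    contextual window is replaced by the fragment's level-one label. The
    corrected annotation list can then serve as one piece of training data
    for retraining or fine-tuning the model."""
    return [
        (f, w, level_one_label if (f == fragment and w == window) else label)
        for f, w, label in sentence_annotations
    ]
```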
For the specific workflow of the apparatus embodiment shown in Fig. 3, please refer to the related description in the foregoing method embodiment; it is not repeated here.
Fig. 4 shows a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention. The computer system/server 12 shown in Fig. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in Fig. 4, the computer system/server 12 takes the form of a general-purpose computing device. The components of the computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 connecting the different system components (including the memory 28 and the processors 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer system/server 12 typically comprises a variety of computer-system-readable media. These media may be any available media accessible by the computer system/server 12, including volatile and non-volatile media and removable and non-removable media.
The memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, a storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly referred to as a "hard disk drive"). Although not shown in Fig. 4, a disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (e.g., at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.
The computer system/server 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any device (such as a network card, a modem, etc.) that enables the computer system/server 12 to communicate with one or more other computing devices. Such communication may take place through input/output (I/O) interfaces 22. Moreover, the computer system/server 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in Fig. 4, the network adapter 20 communicates with the other modules of the computer system/server 12 through the bus 18. It should be understood that, although not shown in the drawings, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processor 16 performs various functional applications and data processing by running the programs stored in the memory 28, for example implementing the method in the embodiment shown in Fig. 1.
The present invention also discloses a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method in the embodiment shown in Fig. 1 is implemented.
Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code carried therein. Such a propagated data signal may take various forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including, but not limited to, wireless, wireline, optical cable, RF, etc., or any suitable combination thereof.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatuses, methods, and the like may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other division manners are possible in actual implementation.
The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; they may be located in one place, or they may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform part of the steps of the methods of the embodiments of the present invention. The aforementioned storage media include various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disk.
The foregoing describes merely preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (18)
1. A model optimization method, characterized by comprising:
obtaining a serializing marking model obtained through training;
labeling each sentence in a predetermined large-scale corpus using the serializing marking model;
determining sentences with marking errors according to a predetermined policy based on the labeling results;
correcting the sentences with marking errors, and using the corrected sentences as training data;
optimizing the serializing marking model according to the training data.
2. The method according to claim 1, characterized in that
determining the sentences with marking errors according to the predetermined policy comprises:
screening out, from the text fragments that have been marked with labels, the text fragments meeting the following condition: the same text fragment has been marked with different labels in different contextual windows;
screening out, from the contextual windows, the contextual windows meeting the following condition: different text fragments in the same contextual window have been marked with different labels;
determining the labels of the screened-out text fragments and contextual windows respectively;
determining the sentences with marking errors according to the labels thus determined.
3. The method according to claim 2, characterized in that
the labels include level-one labels and second-level labels;
the method further comprises: if any screened-out text fragment has no level-one label, discarding that text fragment; if any screened-out contextual window has no level-one label, discarding that contextual window.
4. The method according to claim 3, characterized in that
determining the labels of the screened-out text fragments comprises:
for each screened-out text fragment, performing the following processing:
counting the number of times the text fragment has been marked with a label, obtaining a first statistical result;
obtaining all labels the text fragment has been marked with, and for each distinct label, counting the number of times the text fragment has been marked with that label, obtaining a second statistical result; dividing the second statistical result by the first statistical result; if the resulting quotient is greater than a first threshold, taking the label as a level-one label of the text fragment, otherwise taking the label as a second-level label of the text fragment.
5. The method according to claim 4, characterized in that
determining the labels of the screened-out contextual windows comprises:
for each screened-out contextual window, performing the following processing:
counting the number of times the text fragments in the contextual window have been marked with a label, obtaining a third statistical result;
obtaining all labels the text fragments in the contextual window have been marked with, and for each distinct label, counting the number of times the text fragments in the contextual window have been marked with that label, obtaining a fourth statistical result; dividing the fourth statistical result by the third statistical result; if the resulting quotient is greater than a second threshold, taking the label as a level-one label of the contextual window, otherwise taking the label as a second-level label of the contextual window.
6. The method according to claim 5, characterized in that
the method further comprises: for each screened-out contextual window, if it is determined that the contextual window does not meet a predetermined confidence requirement, discarding the contextual window; otherwise, determining the labels of the contextual window.
7. The method according to claim 5, characterized in that
determining the sentences with marking errors according to the labels thus determined comprises:
for each second-level label of each contextual window, performing the following processing:
if it is determined that any sentence contains the contextual window and the text fragment in the contextual window is marked with the second-level label, then, when the text fragment belongs to the screened-out text fragments, the second-level label is also a second-level label of the text fragment, and the level-one label of the contextual window is consistent with the level-one label of the text fragment, taking the sentence as a sentence with a marking error.
8. The method according to claim 7, characterized in that
correcting the sentence with the marking error comprises: correcting the label of the text fragment in the contextual window to the level-one label of the text fragment.
9. A model optimization apparatus, characterized by comprising: an acquiring unit, a marking unit, an amending unit, and an optimization unit;
the acquiring unit being configured to obtain a serializing marking model obtained through training;
the marking unit being configured to label each sentence in a predetermined large-scale corpus using the serializing marking model;
the amending unit being configured to determine sentences with marking errors according to a predetermined policy based on the labeling results, correct the sentences with marking errors, and use the corrected sentences as training data;
the optimization unit being configured to optimize the serializing marking model according to the training data.
10. The apparatus according to claim 9, characterized in that
the amending unit screens out, from the text fragments that have been marked with labels, the text fragments meeting the following condition: the same text fragment has been marked with different labels in different contextual windows; screens out, from the contextual windows, the contextual windows meeting the following condition: different text fragments in the same contextual window have been marked with different labels; determines the labels of the screened-out text fragments and contextual windows respectively; and determines the sentences with marking errors according to the labels thus determined.
11. The apparatus according to claim 10, characterized in that
the labels include level-one labels and second-level labels;
the amending unit is further configured to discard any screened-out text fragment that has no level-one label, and to discard any screened-out contextual window that has no level-one label.
12. The apparatus according to claim 11, characterized in that
for each screened-out text fragment, the amending unit performs the following processing: counting the number of times the text fragment has been marked with a label, obtaining a first statistical result; obtaining all labels the text fragment has been marked with, and for each distinct label, counting the number of times the text fragment has been marked with that label, obtaining a second statistical result; dividing the second statistical result by the first statistical result; if the resulting quotient is greater than a first threshold, taking the label as a level-one label of the text fragment, otherwise taking the label as a second-level label of the text fragment.
13. The apparatus according to claim 12, characterized in that
for each screened-out contextual window, the amending unit performs the following processing: counting the number of times the text fragments in the contextual window have been marked with a label, obtaining a third statistical result; obtaining all labels the text fragments in the contextual window have been marked with, and for each distinct label, counting the number of times the text fragments in the contextual window have been marked with that label, obtaining a fourth statistical result; dividing the fourth statistical result by the third statistical result; if the resulting quotient is greater than a second threshold, taking the label as a level-one label of the contextual window, otherwise taking the label as a second-level label of the contextual window.
14. The apparatus according to claim 13, characterized in that
the amending unit is further configured to, for each screened-out contextual window, discard the contextual window if it is determined that the contextual window does not meet a predetermined confidence requirement, and otherwise determine the labels of the contextual window.
15. The apparatus according to claim 13, characterized in that
for each second-level label of each contextual window, the amending unit performs the following processing: if it is determined that any sentence contains the contextual window and the text fragment in the contextual window is marked with the second-level label, then, when the text fragment belongs to the screened-out text fragments, the second-level label is also a second-level label of the text fragment, and the level-one label of the contextual window is consistent with the level-one label of the text fragment, taking the sentence as a sentence with a marking error.
16. The apparatus according to claim 15, characterized in that
the amending unit corrects the label of the text fragment in the contextual window to the level-one label of the text fragment.
17. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method according to any one of claims 1 to 8.
18. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910636482.2A CN110457683B (en) | 2019-07-15 | 2019-07-15 | Model optimization method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110457683A true CN110457683A (en) | 2019-11-15 |
CN110457683B CN110457683B (en) | 2023-04-07 |
Family
ID=68481237
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101866337A (en) * | 2009-04-14 | 2010-10-20 | 日电(中国)有限公司 | Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model |
CN108228557A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | A kind of method and device of sequence labelling |
US20180267997A1 (en) * | 2017-03-20 | 2018-09-20 | Adobe Systems Incorporated | Large-scale image tagging using image-to-topic embedding |
CN109271630A (en) * | 2018-09-11 | 2019-01-25 | 成都信息工程大学 | A kind of intelligent dimension method and device based on natural language processing |
CN109299296A (en) * | 2018-11-01 | 2019-02-01 | 郑州云海信息技术有限公司 | A kind of interactive image text marking method and system |
CN109460551A (en) * | 2018-10-29 | 2019-03-12 | 北京知道创宇信息技术有限公司 | Signing messages extracting method and device |
US20190205794A1 (en) * | 2017-12-29 | 2019-07-04 | Oath Inc. | Method and system for detecting anomalies in data labels |
CN109992763A (en) * | 2017-12-29 | 2019-07-09 | 北京京东尚科信息技术有限公司 | Language marks processing method, system, electronic equipment and computer-readable medium |
Non-Patent Citations (1)
Title |
---|
Zeng Daojian et al.: "Open Entity Attribute Extraction for Unstructured Text", Journal of Jiangxi Normal University (Natural Science Edition) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110955433A (en) * | 2019-11-27 | 2020-04-03 | 中国银行股份有限公司 | Method and device for generating automatic deployment script |
CN110955433B (en) * | 2019-11-27 | 2023-08-29 | 中国银行股份有限公司 | Automatic deployment script generation method and device |
CN113919348A (en) * | 2020-07-07 | 2022-01-11 | 阿里巴巴集团控股有限公司 | Named entity recognition method and device, electronic equipment and computer storage medium |
CN112528671A (en) * | 2020-12-02 | 2021-03-19 | 北京小米松果电子有限公司 | Semantic analysis method, semantic analysis device and storage medium |
CN113761939A (en) * | 2021-09-07 | 2021-12-07 | 北京明略昭辉科技有限公司 | Method, system, medium, and electronic device for defining text range of contextual window |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |