CN110457683B - Model optimization method and device, computer equipment and storage medium - Google Patents

Model optimization method and device, computer equipment and storage medium

Info

Publication number
CN110457683B
Authority
CN
China
Prior art keywords
label
labels
context
text
window
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910636482.2A
Other languages
Chinese (zh)
Other versions
CN110457683A (en)
Inventor
孙辉丰
孙叔琦
孙珂
杨煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910636482.2A priority Critical patent/CN110457683B/en
Publication of CN110457683A publication Critical patent/CN110457683A/en
Application granted granted Critical
Publication of CN110457683B publication Critical patent/CN110457683B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a model optimization method and apparatus, a computer device, and a storage medium, the method comprising: acquiring a sequence labeling model obtained by training; labeling each sentence in a predetermined large-scale corpus using the sequence labeling model; determining mislabeled sentences according to a predetermined strategy based on the labeling results; correcting the mislabeled sentences and using the corrected sentences as training data; and optimizing the sequence labeling model according to the training data. By applying the scheme of the invention, problems in the sequence labeling model can be discovered automatically and optimized in a targeted manner, thereby improving model precision.

Description

Model optimization method and device, computer equipment and storage medium
[ technical field ]
The present invention relates to computer application technologies, and in particular, to a model optimization method, apparatus, computer device, and storage medium.
[ background of the invention ]
The sequence labeling model is a common model in the field of natural language processing (NLP); many important research directions, such as word segmentation, part-of-speech tagging, and named entity recognition, can be abstracted as sequence labeling problems.
In research on the sequence labeling problem, the basic approach is to train the sequence labeling model on a manually labeled training corpus (i.e., training data); the effectiveness of the model depends on the quantity and quality of the labeled data.
Manual labeling is costly and slow, and specialized corpora, such as part-of-speech-labeled corpora, can only be produced by domain experts. Because manual effort is limited, the training corpus cannot be made very large, which limits the accuracy of the trained model.
[ summary of the invention ]
In view of the above, the invention provides a model optimization method, apparatus, computer device and storage medium.
The specific technical scheme is as follows:
a method of model optimization, comprising:
acquiring a serialized annotation model obtained by training;
marking each statement in a preset large-scale corpus by using the serialized marking model;
determining a statement with a wrong annotation according to a preset strategy based on the annotation result;
correcting the wrongly labeled sentences, and taking the corrected sentences as training data;
and optimizing the serialized annotation model according to the training data.
According to a preferred embodiment of the present invention, determining mislabeled sentences according to the predetermined strategy comprises:
screening out, from the labeled text segments, text segments that meet the following condition: the same text segment is labeled with different labels in different context windows;
and screening out, from the context windows, context windows that meet the following condition: different text segments in the same context window are labeled with different labels;
determining the labels of the screened text segments and context windows respectively;
and determining mislabeled sentences according to the determined labels.
According to a preferred embodiment of the present invention, the labels include primary labels and secondary labels;
the method further comprises: if no primary label exists for any screened text segment, discarding the text segment; and if no primary label exists for any screened context window, discarding the context window.
According to a preferred embodiment of the present invention, determining the labels of the screened text segments comprises:
for each screened text segment, performing the following:
counting the number of times the text segment is labeled, to obtain a first statistical result;
and obtaining all labels assigned to the text segment; for each distinct label, counting the number of times the text segment is labeled with that label, to obtain a second statistical result, and dividing the second statistical result by the first statistical result; if the resulting quotient is greater than a first threshold, taking that label as a primary label of the text segment, and otherwise as a secondary label of the text segment.
According to a preferred embodiment of the present invention, determining the labels of the screened context windows comprises:
for each screened context window, performing the following:
counting the number of times the text segments in the context window are labeled, to obtain a third statistical result;
and obtaining all labels assigned to the text segments in the context window; for each distinct label, counting the number of times the text segments in the context window are labeled with that label, to obtain a fourth statistical result, and dividing the fourth statistical result by the third statistical result; if the resulting quotient is greater than a second threshold, taking that label as a primary label of the context window, and otherwise as a secondary label of the context window.
According to a preferred embodiment of the invention, the method further comprises: for each screened context window, if the context window is determined not to meet a predetermined confidence requirement, discarding the context window; otherwise, determining the labels of the context window.
According to a preferred embodiment of the present invention, determining mislabeled sentences according to the determined labels comprises:
for each secondary label of each context window, performing the following:
if any sentence is determined to contain the context window and the text segment in the context window is labeled with the secondary label, then, when the text segment belongs to the screened text segments, the secondary label is also a secondary label of the text segment, and the primary label of the context window matches the primary label of the text segment, taking the sentence as a mislabeled sentence.
According to a preferred embodiment of the present invention, correcting the mislabeled sentences comprises: modifying the label of the text segment in the context window to the primary label of the text segment.
A model optimization apparatus, comprising: an acquisition unit, a labeling unit, a correction unit, and an optimization unit;
the acquisition unit is configured to acquire a sequence labeling model obtained by training;
the labeling unit is configured to label each sentence in a predetermined large-scale corpus using the sequence labeling model;
the correction unit is configured to determine mislabeled sentences according to a predetermined strategy based on the labeling results, correct the mislabeled sentences, and use the corrected sentences as training data;
and the optimization unit is configured to optimize the sequence labeling model according to the training data.
According to a preferred embodiment of the present invention, the correction unit screens out, from the labeled text segments, text segments that meet the following condition: the same text segment is labeled with different labels in different context windows; screens out, from the context windows, context windows that meet the following condition: different text segments in the same context window are labeled with different labels; determines the labels of the screened text segments and context windows respectively; and determines mislabeled sentences according to the determined labels.
According to a preferred embodiment of the present invention, the labels include primary labels and secondary labels;
the correction unit is further configured to discard any screened text segment for which no primary label exists, and to discard any screened context window for which no primary label exists.
According to a preferred embodiment of the present invention, the correction unit performs the following processing for each screened text segment: counting the number of times the text segment is labeled, to obtain a first statistical result; obtaining all labels assigned to the text segment; for each distinct label, counting the number of times the text segment is labeled with that label, to obtain a second statistical result, and dividing the second statistical result by the first statistical result; if the resulting quotient is greater than a first threshold, taking that label as a primary label of the text segment, and otherwise as a secondary label of the text segment.
According to a preferred embodiment of the present invention, the correction unit performs the following processing for each screened context window: counting the number of times the text segments in the context window are labeled, to obtain a third statistical result; obtaining all labels assigned to the text segments in the context window; for each distinct label, counting the number of times the text segments in the context window are labeled with that label, to obtain a fourth statistical result, and dividing the fourth statistical result by the third statistical result; if the resulting quotient is greater than a second threshold, taking that label as a primary label of the context window, and otherwise as a secondary label of the context window.
According to a preferred embodiment of the present invention, the correction unit is further configured to, for each screened context window, discard the context window if it is determined not to meet a predetermined confidence requirement, and otherwise determine the labels of the context window.
According to a preferred embodiment of the present invention, the correction unit performs the following processing for each secondary label of each context window: if any sentence is determined to contain the context window and the text segment in the context window is labeled with the secondary label, then, when the text segment belongs to the screened text segments, the secondary label is also a secondary label of the text segment, and the primary label of the context window matches the primary label of the text segment, taking the sentence as a mislabeled sentence.
According to a preferred embodiment of the present invention, the correction unit modifies the label of the text segment in the context window to the primary label of the text segment.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth above.
As can be seen from the above description, with the scheme of the invention, an existing trained sequence labeling model can be used to automatically label a large-scale corpus, mislabeled sentences can be determined based on the labeling results and corrected, the corrected sentences can be used as training data, and the sequence labeling model can be optimized according to that training data. In this way, problems in the sequence labeling model can be discovered automatically and optimized in a targeted manner, improving model precision.
[ description of the drawings ]
FIG. 1 is a flowchart of an embodiment of a model optimization method according to the present invention.
Fig. 2 is a schematic diagram of an overall implementation process of the model optimization method according to the present invention.
Fig. 3 is a schematic structural diagram of a model optimization apparatus according to an embodiment of the present invention.
FIG. 4 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention.
[ detailed description ]
In order to make the technical solution of the present invention clearer and more obvious, the solution of the present invention is further described below by referring to the drawings and examples.
It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.
FIG. 1 is a flowchart of an embodiment of a model optimization method according to the present invention. As shown in fig. 1, the following detailed implementation is included.
At 101, a sequence labeling model obtained by training is acquired.
At 102, each sentence in a predetermined large-scale corpus is labeled using the sequence labeling model.
At 103, mislabeled sentences are determined according to a predetermined strategy based on the labeling results.
At 104, the mislabeled sentences are corrected, and the corrected sentences are used as training data.
At 105, the sequence labeling model is optimized according to the training data.
An initial sequence labeling model can be obtained by training in an existing manner, and the sequence labeling model can then be optimized according to the method described in this embodiment.
Each sentence in the predetermined large-scale corpus is labeled using the sequence labeling model. The specific corpora and their scale can be determined according to actual needs; articles, web pages, and the like can serve as corpora, and each sentence in each corpus is labeled.
The following description takes a named entity recognition model as an example of the sequence labeling model.
For example, for the sentence "navigate to Beijing", the following labeling result can be obtained: "navigate to Beijing\LOC", where "Beijing" is labeled "LOC", denoting a place name. As another example, for the sentence "his hometown is in Shigezhuang", the following labeling result can be obtained: "his hometown is in Shigezhuang\LOC", where "Shigezhuang" is labeled "LOC".
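To make the later screening steps concrete, the following is a minimal sketch in Python of one possible representation of a labeled occurrence and of how its context window can be derived. The data structure and names are illustrative assumptions, not something prescribed by the patent.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class LabeledSpan:
        sentence: str  # the full sentence, e.g. "navigate to Beijing"
        segment: str   # the labeled text segment, e.g. "Beijing"
        label: str     # the predicted label, e.g. "LOC"
        start: int     # character offset of the segment within the sentence
        end: int       # offset one past the segment's last character

    def context_window(span: LabeledSpan, placeholder: str = "___") -> str:
        """Derive the context window of a labeled occurrence by replacing the
        labeled segment with a placeholder, e.g. "navigate to ___"."""
        return span.sentence[:span.start] + placeholder + span.sentence[span.end:]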
Mislabeled sentences are then determined according to a predetermined strategy based on the labeling results.
Specifically, the following two types of data may be screened first:
1) From the labeled text segments, screen out the text segments that meet the following condition: the same text segment is labeled with different labels in different context windows.
Any content that has been labeled can serve as a candidate text segment; for example, the labeled "Beijing" and "Shigezhuang" above are candidate text segments.
From the candidate text segments, those meeting the following condition are screened out: the same text segment is labeled with different labels in different context windows. For example, the text segment "Shigezhuang" appears in different sentences, such as "his hometown is in Shigezhuang" and "I love beautiful Shigezhuang". In the sentence "his hometown is in Shigezhuang", "Shigezhuang" is labeled "LOC", while in the sentence "I love beautiful Shigezhuang", it is labeled "PER", which denotes a person name; that is, the same text segment "Shigezhuang" is labeled with different labels in different context windows, so the text segment "Shigezhuang" is screened out.
In a sequence labeling task, the same text segment can legitimately carry different labels in different context windows, but such cases are more often labeling errors; because the probability of a labeling error is higher in this situation, the text segment is screened out.
2) From the context windows, screen out the context windows that meet the following condition: different text segments in the same context window are labeled with different labels.
For example, in the sentence "navigate to Beijing", "navigate to ___" (the sentence with the labeled segment removed) is a context window.
For example, the text segment "Beijing" in the sentence "navigate to Beijing" and the text segment "Shigezhuang" in the sentence "navigate to Shigezhuang" occur in the same context window "navigate to ___" in different sentences; "Beijing" is labeled "LOC" in its sentence, while "Shigezhuang" is labeled "PER" in its sentence. That is, different text segments in the same context window "navigate to ___" are labeled with different labels, so the context window "navigate to ___" is screened out.
In the same context window, different text segments very likely play the same semantic role and usually carry the same label; if their labels differ, a labeling error is highly probable, so the context window is screened out.
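Given all labeled occurrences produced over the corpus, the two screening conditions can be checked in a single pass that collects the set of distinct labels seen per text segment and per context window; an item with more than one distinct label is screened out (for text segments this treats any label conflict as the condition 1) situation, a slight simplification). A sketch building on the LabeledSpan helper above, with illustrative names:

    from collections import defaultdict

    def screen(spans):
        """Screen out (1) text segments labeled with different labels in
        different context windows and (2) context windows whose text segments
        are labeled with different labels, per conditions 1) and 2) above."""
        labels_per_segment = defaultdict(set)
        labels_per_window = defaultdict(set)
        for s in spans:
            labels_per_segment[s.segment].add(s.label)
            labels_per_window[context_window(s)].add(s.label)
        ambiguous_segments = {seg for seg, ls in labels_per_segment.items() if len(ls) > 1}
        ambiguous_windows = {win for win, ls in labels_per_window.items() if len(ls) > 1}
        return ambiguous_segments, ambiguous_windows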
After the processing of 1) and 2) is completed, the labels of the screened text segments and context windows can be determined respectively, and mislabeled sentences can then be determined according to the determined labels.
The labels may include primary labels and secondary labels. If no primary label exists for any screened text segment, that text segment can be discarded; similarly, if no primary label exists for any screened context window, that context window can be discarded.
For each screened text segment, its primary and secondary labels can be determined as follows: count the number of times the text segment is labeled, to obtain a first statistical result; obtain all labels assigned to the text segment and, for each distinct label, count the number of times the text segment is labeled with that label, to obtain a second statistical result; divide the second statistical result by the first statistical result; if the resulting quotient is greater than a first threshold, take that label as a primary label of the text segment, and otherwise as a secondary label of the text segment.
For example, suppose the text segment "Shigezhuang" appears in 20 different sentences and is therefore labeled 20 times, with the labels "LOC" and "PER": 16 times as "LOC" and 4 times as "PER". For the label "LOC", 16/20 = 80% is computed, which is greater than a first threshold such as 50%, so "LOC" can serve as the primary label of the text segment "Shigezhuang"; for the label "PER", 4/20 = 20% is computed, which is less than 50%, so "PER" can serve as a secondary label of the text segment "Shigezhuang".
The specific value of the first threshold can be determined according to actual needs, and is preferably 50% as above; with a first threshold of 50%, each text segment has at most one primary label and may have one or more secondary labels.
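The same frequency-share computation serves both text segments (with the first threshold, e.g. 50%) and context windows (with the second threshold, e.g. 70%, described next), so a single helper can cover both cases; a sketch, with the correspondence to the statistical results noted in comments:

    from collections import Counter

    def classify_labels(label_counts: Counter, threshold: float):
        """Split an item's labels into at most one primary label and a list of
        secondary labels. label_counts maps each label to how often the item
        received it (the second/fourth statistical result); the total is the
        first/third statistical result."""
        total = sum(label_counts.values())
        primary, secondary = None, []
        for label, count in label_counts.items():
            if count / total > threshold:
                primary = label
            else:
                secondary.append(label)
        return primary, secondary

    # The example above: 16 of 20 labels are "LOC", and 16/20 = 80% > 50%.
    assert classify_labels(Counter({"LOC": 16, "PER": 4}), 0.5) == ("LOC", ["PER"])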
For each screened context window, its primary and secondary labels can be determined as follows: count the number of times the text segments in the context window are labeled, to obtain a third statistical result; obtain all labels assigned to the text segments in the context window and, for each distinct label, count the number of times the text segments in the context window are labeled with that label, to obtain a fourth statistical result; divide the fourth statistical result by the third statistical result; if the resulting quotient is greater than a second threshold, take that label as a primary label of the context window, and otherwise as a secondary label of the context window.
For example, suppose the context window "navigate to ___" occurs in 20 different sentences, so the text segments in this context window are labeled 20 times, with the labels "LOC" and "PER": 16 times as "LOC" and 4 times as "PER". For the label "LOC", 16/20 = 80% is computed, which is greater than a second threshold such as 70%, so "LOC" can serve as the primary label of the context window "navigate to ___"; for the label "PER", 4/20 = 20% is computed, which is less than 70%, so "PER" can serve as a secondary label of the context window "navigate to ___".
The specific value of the second threshold can be determined according to actual needs and may be the same as or different from the first threshold; it is preferably 70% as above. With a second threshold of 70%, each context window has at most one primary label and may have one or more secondary labels.
In this embodiment, for each screened context window, it can further be determined whether the context window meets a predetermined confidence requirement; if not, the context window can be discarded; otherwise, its labels can be determined in the manner described above.
For example, for each screened context window, the number of its occurrences can be counted. If it occurs very rarely, for example only twice in the entire corpus, the context window can be considered not to follow any recurring pattern, to lack generality, and to have low confidence; it can therefore be discarded to reduce the workload of subsequent processing.
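This check can be expressed as a simple occurrence-count filter. The cutoff of three occurrences below is an assumed value chosen for illustration; the embodiment only requires that some confidence criterion be applied:

    from collections import Counter

    def confident_windows(spans, ambiguous_windows, min_occurrences=3):
        """Keep only screened context windows that recur often enough in the
        corpus; windows seen fewer times show no common pattern and are dropped."""
        occurrences = Counter(context_window(s) for s in spans)
        return {w for w in ambiguous_windows if occurrences[w] >= min_occurrences}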
After the primary and secondary labels of the text segments and context windows are determined, mislabeled sentences can be determined according to the determined labels.
Specifically, for each secondary label of each context window, the following processing can be performed: if any sentence is determined to contain the context window and the text segment in the context window is labeled with the secondary label, then, when the text segment belongs to the screened text segments, the secondary label is also a secondary label of that text segment, and the primary label of the context window matches the primary label of the text segment, the sentence is determined to be mislabeled.
For example, suppose the context window "navigate to ___" has the primary label "LOC" and the secondary label "PER", and a sentence is "navigate to Shigezhuang". The sentence contains the context window "navigate to ___"; the text segment "Shigezhuang" in the sentence is labeled "PER"; "Shigezhuang" belongs to the screened text segments; "PER" is also a secondary label of "Shigezhuang"; and the primary labels of the context window "navigate to ___" and the text segment "Shigezhuang" are both "LOC". The sentence can therefore be determined to be mislabeled.
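Assuming the primary label and secondary labels of each surviving (non-discarded) text segment and context window are held in dictionaries, the detection rule just described can be sketched as:

    def is_mislabeled(span, seg_primary, seg_secondary, win_primary, win_secondary):
        """Return True if this occurrence's label is a secondary label of both
        its context window and its text segment, and the primary labels of the
        window and the segment agree."""
        window = context_window(span)
        return (window in win_secondary
                and span.label in win_secondary[window]
                and span.segment in seg_secondary
                and span.label in seg_secondary[span.segment]
                and win_primary[window] == seg_primary[span.segment])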
Further, the mislabeled sentence can be corrected; specifically, the label of the text segment in the context window can be modified to the primary label of that text segment.
For example, the label "PER" of the text segment "Shigezhuang" in the sentence "navigate to Shigezhuang" can be modified to "LOC", the primary label of the text segment "Shigezhuang".
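The correction itself is then a single label substitution, sketched as:

    from dataclasses import replace

    def correct(span: LabeledSpan, seg_primary: dict) -> LabeledSpan:
        """Relabel a mislabeled occurrence with its text segment's primary
        label, e.g. "PER" -> "LOC" for "Shigezhuang" in "navigate to Shigezhuang"."""
        return replace(span, label=seg_primary[span.segment])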
The corrected sentences can be used as training data. In this way, multiple pieces of training data are obtained, and the sequence labeling model can be further optimized according to this training data.
The specific optimization method is not limited; for example, the obtained training data can be added to the previous training data and the sequence labeling model retrained on the updated training data, or the obtained training data can be used to fine-tune the sequence labeling model. After the sequence labeling model is optimized, the process shown in Fig. 1 can be executed repeatedly to continuously improve the model. A sketch of one such round, combining the helpers above, follows.
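Below, the helpers above are assembled into one round of the flow in Fig. 1. Here model.label(sentence) is a hypothetical interface standing in for whatever prediction API the sequence labeling model exposes; the returned corrected occurrences would then feed retraining or fine-tuning:

    from collections import Counter, defaultdict

    def one_optimization_round(model, corpus, first_threshold=0.5, second_threshold=0.7):
        """Label the corpus, screen ambiguous items, assign primary/secondary
        labels, detect mislabeled occurrences, and return corrected training data."""
        spans = [s for sentence in corpus for s in model.label(sentence)]
        ambiguous_segments, ambiguous_windows = screen(spans)
        ambiguous_windows = confident_windows(spans, ambiguous_windows)

        seg_counts, win_counts = defaultdict(Counter), defaultdict(Counter)
        for s in spans:
            if s.segment in ambiguous_segments:
                seg_counts[s.segment][s.label] += 1
            w = context_window(s)
            if w in ambiguous_windows:
                win_counts[w][s.label] += 1

        seg_primary, seg_secondary, win_primary, win_secondary = {}, {}, {}, {}
        for seg, counts in seg_counts.items():
            primary, secondary = classify_labels(counts, first_threshold)
            if primary is not None:  # segments without a primary label are discarded
                seg_primary[seg], seg_secondary[seg] = primary, secondary
        for win, counts in win_counts.items():
            primary, secondary = classify_labels(counts, second_threshold)
            if primary is not None:  # windows without a primary label are discarded
                win_primary[win], win_secondary[win] = primary, secondary

        return [correct(s, seg_primary) for s in spans
                if is_mislabeled(s, seg_primary, seg_secondary, win_primary, win_secondary)]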
Based on the above description, Fig. 2 is a schematic diagram of the overall implementation of the model optimization method of the present invention. As shown in Fig. 2, after the sequence labeling model is obtained, it can be used to automatically label the large-scale corpus. Based on the labeling results, qualifying text segments and context windows can be screened out: from the labeled text segments, those in which the same text segment is labeled with different labels in different context windows; and from the context windows, those in which different text segments in the same context window are labeled with different labels. For each screened text segment, its primary and secondary labels are determined; a text segment without a primary label is discarded. For each screened context window, if it does not meet the predetermined confidence requirement it is discarded; otherwise its primary and secondary labels are determined, and a context window without a primary label is likewise discarded. Mislabeled sentences are then determined based on the primary and secondary labels of the text segments and context windows, and corrected to obtain training data, from which the sequence labeling model is optimized. After optimization, the process shown in Fig. 2 can be executed repeatedly to continuously improve the model. For implementation details, refer to the description of the embodiment shown in Fig. 1, which is not repeated here.
It should be noted that for simplicity of description, the aforementioned method embodiments are described as a series of combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated ordering of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In short, with the scheme of the invention, problems in the sequence labeling model can be discovered automatically and optimized in a targeted manner, thereby improving model precision and continuously improving model performance.
The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.
Fig. 3 is a schematic structural diagram of a model optimization apparatus according to an embodiment of the present invention. As shown in Fig. 3, the apparatus includes: an acquisition unit 301, a labeling unit 302, a correction unit 303, and an optimization unit 304.
The acquisition unit 301 is configured to acquire the sequence labeling model obtained by training.
The labeling unit 302 is configured to label each sentence in the predetermined large-scale corpus using the sequence labeling model.
The correction unit 303 is configured to determine mislabeled sentences according to a predetermined strategy based on the labeling results, correct the mislabeled sentences, and use the corrected sentences as training data.
The optimization unit 304 is configured to optimize the sequence labeling model according to the training data.
An initial sequence labeling model can be obtained by training in an existing manner, and each sentence in the predetermined large-scale corpus can be labeled using this model.
The correction unit 303 can determine mislabeled sentences according to a predetermined strategy based on the labeling results. Specifically, the correction unit 303 can screen out, from the labeled text segments, text segments that meet the following condition: the same text segment is labeled with different labels in different context windows; screen out, from the context windows, context windows that meet the following condition: different text segments in the same context window are labeled with different labels; determine the labels of the screened text segments and context windows respectively; and determine mislabeled sentences according to the determined labels.
The labels can include primary labels and secondary labels. If no primary label exists for any screened text segment, the correction unit 303 can discard that text segment; similarly, if no primary label exists for any screened context window, the correction unit 303 can discard that context window.
For each screened text segment, the correction unit 303 can determine its primary and secondary labels as follows: count the number of times the text segment is labeled, to obtain a first statistical result; obtain all labels assigned to the text segment and, for each distinct label, count the number of times the text segment is labeled with that label, to obtain a second statistical result; divide the second statistical result by the first statistical result; if the resulting quotient is greater than a first threshold, take that label as a primary label of the text segment, and otherwise as a secondary label of the text segment.
For each screened context window, the correction unit 303 can determine its primary and secondary labels as follows: count the number of times the text segments in the context window are labeled, to obtain a third statistical result; obtain all labels assigned to the text segments in the context window and, for each distinct label, count the number of times the text segments in the context window are labeled with that label, to obtain a fourth statistical result; divide the fourth statistical result by the third statistical result; if the resulting quotient is greater than a second threshold, take that label as a primary label of the context window, and otherwise as a secondary label of the context window.
For each screened context window, the correction unit 303 can further first determine whether the context window meets a predetermined confidence requirement; if not, the context window can be discarded; otherwise, its labels can be determined in the manner described above.
For example, for each screened context window, the correction unit 303 can count the number of its occurrences. If it occurs very rarely, for example only twice in the entire corpus, the context window can be considered not to follow any recurring pattern, to lack generality, and to have low confidence; it can therefore be discarded to reduce the workload of subsequent processing.
After the primary and secondary labels of the text segments and context windows are determined, the correction unit 303 can determine mislabeled sentences according to the determined labels.
Specifically, for each secondary label of each context window, the correction unit 303 can perform the following processing: if any sentence is determined to contain the context window and the text segment in the context window is labeled with the secondary label, then, when the text segment belongs to the screened text segments, the secondary label is also a secondary label of that text segment, and the primary label of the context window matches the primary label of the text segment, the sentence is determined to be mislabeled.
Further, the correction unit 303 can correct the mislabeled sentence; specifically, the label of the text segment in the context window can be modified to the primary label of that text segment.
The correction unit 303 can use the corrected sentences as training data. In this way, multiple pieces of training data are obtained, and the optimization unit 304 can optimize the sequence labeling model according to that training data.
The specific optimization method is not limited; for example, the obtained training data can be added to the previous training data and the sequence labeling model retrained on the updated training data, or the obtained training data can be used to fine-tune the sequence labeling model.
For a specific work flow of the apparatus embodiment shown in fig. 3, reference is made to the related description in the foregoing method embodiment, and details are not repeated.
FIG. 4 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention. The computer system/server 12 shown in FIG. 4 is only one example and should not be taken to limit the scope of use or functionality of embodiments of the present invention.
As shown in FIG. 4, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processors 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 20. As shown in FIG. 4, network adapter 20 communicates with the other modules of computer system/server 12 via bus 18. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 16 executes various functional applications and data processing, such as implementing the method of the embodiment shown in fig. 1, by executing programs stored in the memory 28.
The invention also discloses a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, will carry out the method as in the embodiment shown in fig. 1.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method, etc., can be implemented in other manners. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (12)

1. A method of model optimization, comprising:
acquiring a sequence labeling model obtained by training;
labeling each sentence in a predetermined large-scale corpus using the sequence labeling model;
determining mislabeled sentences according to a predetermined strategy based on the labeling results;
correcting the mislabeled sentences and using the corrected sentences as training data;
optimizing the sequence labeling model according to the training data;
wherein determining mislabeled sentences according to the predetermined strategy comprises:
screening out qualifying text segments and context windows based on the labeling results;
for each screened text segment, counting the number of times the text segment is labeled to obtain a first statistical result, obtaining all labels assigned to the text segment, counting, for each distinct label, the number of times the text segment is labeled with that label to obtain a second statistical result, and dividing the second statistical result by the first statistical result; if the resulting quotient is greater than a first threshold, taking that label as a primary label of the text segment, and otherwise as a secondary label of the text segment;
for each screened context window, counting the number of times the text segments in the context window are labeled to obtain a third statistical result, obtaining all labels assigned to the text segments in the context window, counting, for each distinct label, the number of times the text segments in the context window are labeled with that label to obtain a fourth statistical result, and dividing the fourth statistical result by the third statistical result; if the resulting quotient is greater than a second threshold, taking that label as a primary label of the context window, and otherwise as a secondary label of the context window;
for each secondary label of each context window, performing the following: if any sentence is determined to contain the context window and the text segment in the context window is labeled with the secondary label, then, when the text segment belongs to the screened text segments, the secondary label is also a secondary label of the text segment, and the primary label of the context window matches the primary label of the text segment, taking the sentence as a mislabeled sentence.
2. The method of claim 1,
wherein screening out the qualifying text segments and context windows comprises:
screening out, from the labeled text segments, text segments that meet the following condition: the same text segment is labeled with different labels in different context windows;
and screening out, from the context windows, context windows that meet the following condition: different text segments in the same context window are labeled with different labels.
3. The method of claim 1,
further comprising: if no primary label exists for any screened text segment, discarding the text segment; and if no primary label exists for any screened context window, discarding the context window.
4. The method of claim 1,
further comprising: for each screened context window, if the context window is determined not to meet a predetermined confidence requirement, discarding the context window; otherwise, determining the labels of the context window.
5. The method of claim 1,
wherein correcting the mislabeled sentences comprises: modifying the label of the text segment in the context window to the primary label of the text segment.
6. A model optimization apparatus, comprising: an acquisition unit, a labeling unit, a correction unit, and an optimization unit;
the acquisition unit is configured to acquire a sequence labeling model obtained by training;
the labeling unit is configured to label each sentence in a predetermined large-scale corpus using the sequence labeling model;
the correction unit is configured to determine mislabeled sentences according to a predetermined strategy based on the labeling results, correct the mislabeled sentences, and use the corrected sentences as training data; wherein determining mislabeled sentences according to the predetermined strategy comprises: screening out qualifying text segments and context windows based on the labeling results; for each screened text segment, counting the number of times the text segment is labeled to obtain a first statistical result, obtaining all labels assigned to the text segment, counting, for each distinct label, the number of times the text segment is labeled with that label to obtain a second statistical result, and dividing the second statistical result by the first statistical result, taking that label as a primary label of the text segment if the resulting quotient is greater than a first threshold and otherwise as a secondary label of the text segment; for each screened context window, counting the number of times the text segments in the context window are labeled to obtain a third statistical result, obtaining all labels assigned to the text segments in the context window, counting, for each distinct label, the number of times the text segments in the context window are labeled with that label to obtain a fourth statistical result, and dividing the fourth statistical result by the third statistical result, taking that label as a primary label of the context window if the resulting quotient is greater than a second threshold and otherwise as a secondary label of the context window; and for each secondary label of each context window, performing the following: if any sentence is determined to contain the context window and the text segment in the context window is labeled with the secondary label, then, when the text segment belongs to the screened text segments, the secondary label is also a secondary label of the text segment, and the primary label of the context window matches the primary label of the text segment, taking the sentence as a mislabeled sentence;
and the optimization unit is configured to optimize the sequence labeling model according to the training data.
7. The apparatus of claim 6,
wherein the correction unit screens out, from the labeled text segments, text segments that meet the following condition: the same text segment is labeled with different labels in different context windows; and screens out, from the context windows, context windows that meet the following condition: different text segments in the same context window are labeled with different labels.
8. The apparatus of claim 6,
wherein the correction unit is further configured to discard any screened text segment for which no primary label exists, and to discard any screened context window for which no primary label exists.
9. The apparatus of claim 6,
wherein the correction unit is further configured to, for each screened context window, discard the context window if it is determined not to meet a predetermined confidence requirement, and otherwise determine the labels of the context window.
10. The apparatus of claim 6,
wherein the correction unit modifies the label of the text segment in the context window to the primary label of the text segment.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 5 when executing the program.
12. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 5.
CN201910636482.2A 2019-07-15 2019-07-15 Model optimization method and device, computer equipment and storage medium Active CN110457683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910636482.2A CN110457683B (en) 2019-07-15 2019-07-15 Model optimization method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910636482.2A CN110457683B (en) 2019-07-15 2019-07-15 Model optimization method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110457683A CN110457683A (en) 2019-11-15
CN110457683B (en) 2023-04-07

Family

ID=68481237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910636482.2A Active CN110457683B (en) 2019-07-15 2019-07-15 Model optimization method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110457683B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110955433B (en) * 2019-11-27 2023-08-29 中国银行股份有限公司 Automatic deployment script generation method and device
CN113919348A (en) * 2020-07-07 2022-01-11 阿里巴巴集团控股有限公司 Named entity recognition method and device, electronic equipment and computer storage medium
CN112149417A (en) * 2020-09-16 2020-12-29 北京小米松果电子有限公司 Part-of-speech tagging method and device, storage medium and electronic equipment
CN112528671A (en) * 2020-12-02 2021-03-19 北京小米松果电子有限公司 Semantic analysis method, semantic analysis device and storage medium
CN113761939A (en) * 2021-09-07 2021-12-07 北京明略昭辉科技有限公司 Method, system, medium, and electronic device for defining text range of contextual window

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460551A (en) * 2018-10-29 2019-03-12 北京知道创宇信息技术有限公司 Signing messages extracting method and device
CN109992763A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 Language marks processing method, system, electronic equipment and computer-readable medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866337B (en) * 2009-04-14 2014-07-02 日电(中国)有限公司 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model
CN108228557B (en) * 2016-12-14 2021-12-07 北京国双科技有限公司 Sequence labeling method and device
US10216766B2 (en) * 2017-03-20 2019-02-26 Adobe Inc. Large-scale image tagging using image-to-topic embedding
US11238365B2 (en) * 2017-12-29 2022-02-01 Verizon Media Inc. Method and system for detecting anomalies in data labels
CN109271630B (en) * 2018-09-11 2022-07-05 成都信息工程大学 Intelligent labeling method and device based on natural language processing
CN109299296A (en) * 2018-11-01 2019-02-01 郑州云海信息技术有限公司 A kind of interactive image text marking method and system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992763A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 Language marks processing method, system, electronic equipment and computer-readable medium
CN109460551A (en) * 2018-10-29 2019-03-12 北京知道创宇信息技术有限公司 Signing messages extracting method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Open entity attribute extraction for unstructured text; Zeng Daojian et al.; Journal of Jiangxi Normal University (Natural Science Edition), No. 03, pp. 279-283, 305 *

Also Published As

Publication number Publication date
CN110457683A (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN110457683B (en) Model optimization method and device, computer equipment and storage medium
CN108491373B (en) Entity identification method and system
CN113807098B (en) Model training method and device, electronic equipment and storage medium
CN111581976B (en) Medical term standardization method, device, computer equipment and storage medium
CN107544726B (en) Speech recognition result error correction method and device based on artificial intelligence and storage medium
CN107908635B (en) Method and device for establishing text classification model and text classification
CN108090043B (en) Error correction report processing method and device based on artificial intelligence and readable medium
CN107273356B (en) Artificial intelligence based word segmentation method, device, server and storage medium
CN111221983A (en) Time sequence knowledge graph generation method, device, equipment and medium
CN111985229B (en) Sequence labeling method and device and computer equipment
CN107038157B (en) Artificial intelligence-based recognition error discovery method and device and storage medium
US20170308790A1 (en) Text classification by ranking with convolutional neural networks
CN109522552B (en) Normalization method and device of medical information, medium and electronic equipment
CN109599095B (en) Method, device and equipment for marking voice data and computer storage medium
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN107301248B (en) Word vector construction method and device of text, computer equipment and storage medium
CN108897869B (en) Corpus labeling method, apparatus, device and storage medium
CN110569335B (en) Triple verification method and device based on artificial intelligence and storage medium
CN112860919B (en) Data labeling method, device, equipment and storage medium based on generation model
CN113704429A (en) Semi-supervised learning-based intention identification method, device, equipment and medium
CN110377750B (en) Comment generation method, comment generation device, comment generation model training device and storage medium
CN109815481B (en) Method, device, equipment and computer storage medium for extracting event from text
CN111597800B (en) Method, device, equipment and storage medium for obtaining synonyms
CN111241302B (en) Position information map generation method, device, equipment and medium
CN113204667A (en) Method and device for training audio labeling model and audio labeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant