CN113961725A - Automatic label labeling method, system, equipment and storage medium - Google Patents
- Publication number
- CN113961725A (application CN202111240212.3A)
- Authority
- CN
- China
- Prior art keywords
- label
- labeling
- automatic
- word
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/381—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using identifiers, e.g. barcodes, RFIDs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Library & Information Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses an automatic label labeling method and system. The method comprises: a word grouping step of establishing a plurality of independent word banks based on the business corpus and grouping the words of each word bank; a label and label rule defining step of defining label groups according to business elements based on each word bank, selecting words from the word bank or custom-defining words as labels, and defining label rules that specify how labels are matched during automatic labeling; and a label labeling step of matching the words in the text to be labeled against the words in the word groups according to the label rules, labeling the words that satisfy the matching condition, and storing the automatic labeling result in a label result table. The invention maintains a word bank, label, and rule system for each industry, labels text automatically through an algorithm model, and supports evaluation so that the rules can be trained step by step to improve labeling accuracy.
Description
Technical Field
The present application relates to the field of data analysis, and in particular, to a method, a system, a computer device, and a computer-readable storage medium for automatically labeling a tag.
Background
Currently, in online service scenarios, service quality is difficult to monitor and measure because the scenarios are uncertain. The dialogue in the service process can be represented as text, and service quality can be represented as corresponding labels. By labeling the text automatically and accurately, the label patterns of customers under different scenarios and dialogue topics can be extracted effectively, which provides data support for sales and business analysis models and a digital, visual basis for customer supervision.
Text label evaluation is moving from manual, offline evaluation in spreadsheets to automatic calculation of the evaluation result, the label coverage rate, and the label accuracy rate.
The related art has the following bottlenecks, for which no effective solution has yet been proposed:
(1) Labor cost is high, and work cannot be divided and coordinated effectively.
(2) Evaluation accuracy cannot be guaranteed: because labels are diverse, accurate manual comparison is difficult and errors result.
(3) The coverage rate and the accuracy rate are difficult to calculate manually.
(4) The "cold start" stage, in which knowledge of a new business scenario is insufficient, is not addressed.
To solve these problems, the invention maintains a word bank, label, and rule system for each industry; performs automatic labeling through an algorithm model; supports multi-person evaluation, with collaboration and division of labor organized by task, to improve efficiency; provides calculation and recalculation of the coverage rate and accuracy rate indicators so that the labels are optimized through step-by-step training; and, for the cold start stage of a brand-new business scenario, provides a discovery function for supplementing the word bank and the label rules.
Disclosure of Invention
The embodiments of the application provide an automatic label labeling method that is based on a word bank, label, and rule system, performs automatic labeling through an algorithm model, and supports multi-person evaluation with collaboration and division of labor organized by task.
In a first aspect, an embodiment of the present application provides an automatic label labeling method, including:
a word grouping step: establishing a plurality of independent word banks based on the business corpus, and grouping the words of each word bank;
a label and label rule defining step: based on each word bank, defining label groups according to business elements, then selecting words from the word bank or custom-defining words as labels, and defining label rules that specify the matching mode for automatic labeling;
a label labeling step: matching the words in the text to be labeled against the words in the word groups based on the label rules, labeling the words in the text that satisfy the matching condition, and storing the automatic labeling result in a label result table to complete the automatic labeling.
In some embodiments of the present invention, the automatic label labeling method further includes:
a label labeling result evaluating step: evaluating the automatic labeling result to obtain a label evaluation result, and performing the result calculation for the evaluation task based on the label evaluation result;
a label rule optimizing step: optimizing the label rules based on the output of the result calculation;
a recalculating step: initiating recalculation based on the optimized label rules, so that the system's automatic labeling is performed again.
In some embodiments of the present invention, the automatic label labeling method further includes:
a label system cold start step: segmenting the new business corpus into words, calculating word frequency and importance, extracting keywords, and supplementing the keywords into the word bank and the labels; and clustering the new business corpus, extracting new label rules based on the clustering result, and supplementing them into the label rules.
In some embodiments of the present invention, the label labeling result evaluating step includes:
an evaluation result calculating step: performing the evaluation calculation for the evaluation task based on the evaluated labeling result to obtain the coverage rate and the accuracy rate of the automatic labeling.
In a second aspect, an embodiment of the present application provides an automatic label labeling system, which employs the above automatic label labeling method, and includes:
a word grouping module: establishing a plurality of independent word banks based on the business corpus, and grouping the words of each word bank;
a label and label rule defining module: based on each word bank, defining label groups according to business elements, then selecting words from the word bank or custom-defining words as labels, and defining label rules that specify the matching mode for automatic labeling;
a label labeling module: matching the words in the text to be labeled against the words in the word groups based on the label rules, labeling the words in the text that satisfy the matching condition, and storing the automatic labeling result in a label result table to complete the automatic labeling.
In some embodiments of the present invention, the automatic label labeling system further includes:
a label labeling result evaluating module: evaluating the automatic labeling result to obtain a label evaluation result, and performing the result calculation for the evaluation task based on the label evaluation result;
a label rule optimization module: optimizing the label rules based on the output of the result calculation;
a recalculation module: initiating recalculation based on the optimized label rules, so that the system's automatic labeling is performed again.
In some embodiments of the present invention, the automatic label labeling system further includes:
a label system cold start module: segmenting the new business corpus into words, calculating word frequency and importance, extracting keywords, and supplementing the keywords into the word bank and the labels; and clustering the new business corpus, extracting new label rules based on the clustering result, and supplementing them into the label rules.
In some embodiments of the present invention, in the automatic label labeling system, the label labeling result evaluating module includes:
an evaluation result calculation module: performing the evaluation calculation for the evaluation task based on the evaluated labeling result to obtain the coverage rate and the accuracy rate of the automatic labeling.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the automatic label labeling method according to the first aspect is implemented.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the automatic label labeling method according to the first aspect.
Compared with the prior art, the invention provides a methodology for training and optimizing label rules: a word bank, label, and rule system is maintained for each industry, automatic labeling is performed through an algorithm model, and manual evaluation is provided to improve labeling accuracy.
For the cold start stage of a brand-new business scenario, a discovery function is provided for supplementing the word bank and the label rules.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of an automatic labeling method of labels according to the present invention;
FIG. 2 is a schematic diagram of an automatic labeling system for labels according to the present invention;
FIG. 3 is a diagram illustrating a word library structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a tag structure according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a general architecture of an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a tag evaluation flow according to an embodiment of the present invention;
FIG. 7 is a hardware structure diagram of a computer device according to an embodiment of the present application.
In the above figures:
100. automatic label labeling system; 10. word grouping module; 20. label and label rule defining module; 30. label labeling module;
81. processor; 82. memory; 83. communication interface; 80. bus.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The application provides a methodology for training and optimizing label rules: a word bank, label, and rule system is maintained for each industry, automatic labeling is performed through an algorithm model according to that system, and manual evaluation is provided to improve labeling accuracy; for the cold start stage of a brand-new business scenario, a discovery function is provided for supplementing the word bank and the label rules.
FIG. 1 is a schematic flow chart of the automatic label labeling method of the present invention. As shown in FIG. 1, the present embodiment provides an automatic label labeling method, including:
word grouping step S10: establishing a plurality of independent word banks based on the business corpus, and grouping the words of each word bank;
label and label rule defining step S20: based on each word bank, defining label groups according to business elements, then selecting words from the word bank or custom-defining words as labels, and defining label rules that specify the matching mode for automatic labeling;
label labeling step S30: matching the words in the text to be labeled against the words in the word groups based on the label rules, labeling the words in the text that satisfy the matching condition, and storing the automatic labeling result in a label result table to complete the automatic labeling.
In some embodiments of the present invention, the automatic label labeling method further includes:
a label labeling result evaluating step: evaluating the automatic labeling result to obtain a label evaluation result, and performing the result calculation for the evaluation task based on the label evaluation result;
a label rule optimizing step: optimizing the label rules based on the output of the result calculation;
a recalculating step: initiating recalculation based on the optimized label rules, so that the system's automatic labeling is performed again.
In some embodiments of the present invention, the automatic label labeling method further includes:
a label system cold start step: segmenting the new business corpus into words, calculating word frequency and importance, extracting keywords, and supplementing the keywords into the word bank and the labels; and clustering the new business corpus, extracting new label rules based on the clustering result, and supplementing them into the label rules.
In some embodiments of the present invention, the label labeling result evaluating step includes:
an evaluation result calculating step: performing the evaluation calculation for the evaluation task based on the evaluated labeling result to obtain the coverage rate and the accuracy rate of the automatic labeling.
In a second aspect, an embodiment of the present application provides an automatic label labeling system 100 that employs the above automatic label labeling method. FIG. 2 is a schematic diagram of the automatic label labeling system according to the present invention. As shown in FIG. 2, the system includes:
the word grouping module 10: establishing a plurality of independent word banks based on the business corpus, and grouping the words of each word bank;
the label and label rule defining module 20: based on each word bank, defining label groups according to business elements, then selecting words from the word bank or custom-defining words as labels, and defining label rules that specify the matching mode for automatic labeling;
the label labeling module 30: matching the words in the text to be labeled against the words in the word groups based on the label rules, labeling the words in the text that satisfy the matching condition, and storing the automatic labeling result in a label result table to complete the automatic labeling.
In some embodiments of the present invention, the automatic label labeling system further includes:
a label labeling result evaluating module: evaluating the automatic labeling result to obtain a label evaluation result, and performing the result calculation for the evaluation task based on the label evaluation result;
a label rule optimization module: optimizing the label rules based on the output of the result calculation;
a recalculation module: initiating recalculation based on the optimized label rules, so that the system's automatic labeling is performed again.
In some embodiments of the present invention, the automatic label labeling system further includes:
a label system cold start module: segmenting the new business corpus into words, calculating word frequency and importance, extracting keywords, and supplementing the keywords into the word bank and the labels; and clustering the new business corpus, extracting new label rules based on the clustering result, and supplementing them into the label rules.
In some embodiments of the present invention, in the automatic label labeling system, the label labeling result evaluating module includes:
an evaluation result calculation module: performing the evaluation calculation for the evaluation task based on the evaluated labeling result to obtain the coverage rate and the accuracy rate of the automatic labeling.
Specific embodiments of the invention are described in detail below with reference to the accompanying drawings:
The automatic label labeling method provided by the invention constructs rules for structuring text and evaluates the structured text those rules produce, so that the rules can be further optimized. Labeling is performed automatically by an algorithm model, which adds the corresponding labels to the text according to the configured rules. The specific rules are defined in the rule module: according to the role to which the text corresponds in the voice or corpus source and a preset regular expression, certain keywords in the text are matched and identified, and the corresponding labels are applied.
The system of the invention builds a complete word bank, label, and rule system and designs a complete label evaluation flow, through which the label rules are trained continuously until they reach a level usable for business analysis.
The functional modules of the system include a word bank module, a label module, a rule module, an evaluation module, a new word module, a new rule module, and the like.
(1) Word bank module:
The word banks are divided into a general word bank, an industry word bank, an enterprise word bank, and the like, with each level more specific than the one before. The general word bank contains keywords that are independent of industry, such as "hello", "goodbye", and "what do you need". The industry word bank holds keywords for a particular industry; for the automotive industry, for example, vehicle model, tire, and so on. The enterprise word bank stores keywords from the client's business scenarios, customized for the client enterprise. The three word banks are kept separate so that, while serving a specific enterprise, the system can also supplement the industry and general word banks, which makes the system more robust.
The three word banks are stored separately from one another, so that the general and industry word banks are not polluted. The three word banks have similar structures. FIG. 3 is a schematic diagram of the word bank structure according to the embodiment of the present invention; as shown in FIG. 3, the word bank structure includes word groups and words.
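The three-tier word bank layout described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `WordBank`, the helper `lookup`, the bank names, and the sample words are hypothetical and do not come from the application, and the most-specific-first lookup order is one possible policy rather than the one the system necessarily uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class WordBank:
    """One independent word bank: word-group name -> words in that group."""
    name: str
    groups: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, group: str, word: str) -> None:
        # Create the word group on first use, then add the word to it.
        self.groups.setdefault(group, set()).add(word)

    def find_group(self, word: str) -> Optional[str]:
        # Return the name of the first group containing the word, if any.
        for group, words in self.groups.items():
            if word in words:
                return group
        return None


def lookup(word: str, banks: List[WordBank]) -> Optional[Tuple[str, str]]:
    """Search the banks in the given order; return (bank name, group name)."""
    for bank in banks:
        group = bank.find_group(word)
        if group is not None:
            return bank.name, group
    return None


# Hypothetical sample data; each bank is a separate object, so adding
# enterprise words never modifies the general or industry bank.
general = WordBank("general")
general.add("greetings", "hello")
general.add("greetings", "goodbye")

industry = WordBank("industry:automotive")
industry.add("vehicle_parts", "tire")

enterprise = WordBank("enterprise:example")
enterprise.add("products", "model-x1")

# Assumed policy: consult the most specific bank first.
banks = [enterprise, industry, general]
print(lookup("tire", banks))
```

Keeping each bank as its own object mirrors the separate storage described in the text: enterprise-specific words stay out of the shared general and industry banks.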
(2) Label module:
The system defines label groups according to the enterprise's business elements, and then selects suitable words from the word bank, or custom-defines a word, as a label. FIG. 4 is a schematic structural diagram of a label according to an embodiment of the present invention. As shown in FIG. 4, the label structure consists of label groups (topic, theme, keyword) and labels.
(3) Rule module:
A rule represents a labeling matching mode: text is matched according to a role and a regular expression, and the label set by the rule is applied to the matched text. FIG. 5 is a schematic diagram of the rule structure of a specific embodiment of the present invention. As shown in FIG. 5, the rule structure includes: regular-expression match, role, topic, keyword, and weight.
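A minimal sketch of rule-based labeling along these lines is given below. The field names follow the rule structure listed above (regular expression, role, topic, keyword, weight), but the class `LabelRule`, the function `label_utterance`, the sample rules, and the policy that the highest-weight matching rule wins within a topic are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LabelRule:
    pattern: str   # regular expression matched against the text
    role: str      # speaker role the rule applies to
    topic: str     # label group that the rule fills
    keyword: str   # label value applied on a match
    weight: int    # rules with higher weight are tried first


def label_utterance(text: str, role: str, rules: List[LabelRule]) -> Dict[str, str]:
    """Return {topic: keyword} for the rules that match this text and role."""
    labels: Dict[str, str] = {}
    # Try rules in descending order of weight.
    for rule in sorted(rules, key=lambda r: r.weight, reverse=True):
        if rule.role != role:
            continue
        if rule.topic in labels:
            # Assumed policy: a higher-weight rule already labeled this topic.
            continue
        if re.search(rule.pattern, text):
            labels[rule.topic] = rule.keyword
    return labels


# Hypothetical rules for a customer-service dialogue.
rules = [
    LabelRule(r"tire|tyre", "customer", "product", "tire", 5),
    LabelRule(r"price|cost|how much", "customer", "intent", "price_inquiry", 10),
    LabelRule(r"discount", "customer", "intent", "discount_inquiry", 3),
    LabelRule(r"hello|good morning", "agent", "etiquette", "greeting", 1),
]

print(label_utterance("How much does the tire cost with a discount?", "customer", rules))
```

In this example the discount rule is skipped because the price rule, which has a higher weight, has already filled the same topic; the agent-only rule is skipped because the role does not match.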
(4) Evaluation module:
This module defines a label evaluation flow. The whole flow is organized by task, and each stage of a task is controlled by a state. Evaluators manually modify or confirm the system's labeling result; the manual result is compared with the original system result to obtain the coverage rate and accuracy rate of label matching; the problematic label rules are modified; and labeling is then recalculated. Coverage and accuracy are optimized continuously in this way, which trains the label rules.
FIG. 6 is a schematic diagram of the label evaluation flow according to an embodiment of the present invention. FIG. 6 shows the entity relationships designed for the whole flow (only core attributes are shown).
The base corpora in the system are shared among tasks; that is, the processing of a corpus in task A is synchronized with the processing of the same corpus in task B. This design ensures that only one copy of each corpus is stored, which saves storage and prevents evaluators from evaluating the same corpus repeatedly.
The accuracy rate and coverage rate of a task are obtained by comparing the system result with the manual result. The system result is the result of matching the label rules under the current system, and the manual result is the final result after the evaluators' modifications. The two have the same structure and correspond to the corpora one to one, so whether the system labeled a corpus correctly can be determined by comparing the label values within the same label group.
The coverage rate is calculated as: number of corpora labeled by the system / number of corpora that have labels after manual evaluation.
The accuracy rate is calculated as: number of corpora labeled by the system and confirmed correct by manual evaluation / number of corpora labeled by the system.
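The two ratios can be computed as in the following sketch. It assumes each result table is a mapping from corpus ID to a `{label group: label value}` dictionary, and it treats a corpus as correctly labeled only when the system labels equal the manual labels in every label group; the application does not spell out that criterion, so it is an assumption, as are the function name `evaluate` and the sample data.

```python
from typing import Dict, Tuple

Labels = Dict[str, str]  # label group -> label value


def evaluate(system: Dict[int, Labels], manual: Dict[int, Labels]) -> Tuple[float, float]:
    """Return (coverage, accuracy) for one evaluation task."""
    system_labeled = {cid for cid, labels in system.items() if labels}
    manual_labeled = {cid for cid, labels in manual.items() if labels}
    # Assumed correctness criterion: system labels match manual labels exactly.
    correct = {cid for cid in system_labeled if system[cid] == manual.get(cid)}

    coverage = len(system_labeled) / len(manual_labeled) if manual_labeled else 0.0
    accuracy = len(correct) / len(system_labeled) if system_labeled else 0.0
    return coverage, accuracy


# Hypothetical results for four corpora.
system_result = {
    1: {"intent": "price_inquiry"},
    2: {"intent": "complaint"},
    3: {},                          # the system found no label
    4: {"product": "tire"},
}
manual_result = {
    1: {"intent": "price_inquiry"},
    2: {"intent": "refund"},        # evaluator corrected the system label
    3: {"intent": "greeting"},      # evaluator added a missing label
    4: {"product": "tire"},
}

coverage, accuracy = evaluate(system_result, manual_result)
print(f"coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```

Here the system labeled three of the four corpora that carry labels after manual evaluation, and two of those three labelings were confirmed, so coverage is 3/4 and accuracy is 2/3.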
The key points of the whole flow are as follows:
Pulling business data: business corpora are pulled from the data source and deduplicated on storage to ensure that each corpus is unique; the relationship between the evaluation task and the corpora under the task is then maintained.
Matching label rules: a word segmentation script is called to segment the text; the rules are then sorted in descending order of weight and matched against the text one by one with regular expressions, and the matching results are stored in the system labeling result table.
Manual evaluation: evaluators assess the label results of each corpus in the interface, which supports modifying a label, deleting a label, adding a label, passing the result directly, and similar operations. After each operation, the label result of the corpus is stored in the manual evaluation label result table.
Calculation/recalculation: after evaluation is finished, the task is calculated to obtain the coverage rate and the accuracy rate. After the label rules are optimized, recalculation can be initiated so that system labeling is performed again; the previous manual evaluation result is still in place, so the updated coverage rate and accuracy rate can be obtained by running the calculation directly.
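The data-pulling step, with deduplicated storage and a task-to-corpus relationship, might look like the sketch below. Using a truncated SHA-256 digest of the trimmed text as the corpus ID is an assumption chosen for illustration; the class `CorpusStore` and the task names are likewise hypothetical.

```python
import hashlib
from typing import Dict, List, Set


class CorpusStore:
    """Stores each distinct corpus once and tracks which tasks reference it."""

    def __init__(self) -> None:
        self.texts: Dict[str, str] = {}               # corpus id -> text
        self.task_corpora: Dict[str, Set[str]] = {}   # task id -> corpus ids

    def pull(self, task_id: str, texts: List[str]) -> List[str]:
        """Add texts to a task, deduplicating on storage; return corpus ids."""
        ids = []
        for text in texts:
            normalized = text.strip()
            # Assumed ID scheme: content hash, so equal texts share one ID.
            cid = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
            self.texts.setdefault(cid, normalized)
            self.task_corpora.setdefault(task_id, set()).add(cid)
            ids.append(cid)
        return ids


store = CorpusStore()
ids_a = store.pull("task-A", ["hello", "how much is the tire"])
ids_b = store.pull("task-B", ["hello "])  # same corpus, reused by another task

print(len(store.texts))  # number of distinct corpora stored
```

Because both tasks point at the same corpus ID, a manual evaluation result keyed by that ID is visible from either task, and it stays in place when system labeling is recalculated.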
New word module / new rule module:
In the cold start stage of a brand-new business scenario, the system has no word bank, labels, or rules for the industry. A batch of keywords and rules must be extracted from the current corpus as the basis for initial training.
New word discovery: a script segments the corpus into words and calculates word frequency and importance, and keywords are then extracted. Whether the keywords should be supplemented into the enterprise word bank and the labels is confirmed manually.
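One way to sketch new word discovery is shown below. The application does not name the importance measure, so this example assumes a TF-IDF style score (corpus-wide term frequency multiplied by a smoothed inverse document frequency). The regular-expression tokenizer is a stand-in for a real word segmentation script, and the function names and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Set


def tokenize(text: str) -> List[str]:
    # Stand-in for a word segmentation script: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(docs: List[str], known_words: Set[str], top_k: int = 5) -> List[str]:
    """Rank words not yet in the word bank by frequency times importance."""
    doc_tokens = [tokenize(d) for d in docs]
    tf = Counter(t for tokens in doc_tokens for t in tokens)       # word frequency
    df = Counter(t for tokens in doc_tokens for t in set(tokens))  # document frequency
    n = len(docs)
    scores = {
        w: tf[w] * (math.log((1 + n) / (1 + df[w])) + 1.0)  # smoothed IDF
        for w in tf
        if w not in known_words
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


docs = [
    "the tire pressure is low",
    "the tire is worn",
    "the warranty covers the engine",
    "the engine warranty expired",
]
known = {"the", "is"}  # words already present in the word bank

candidates = extract_keywords(docs, known, top_k=3)
print(candidates)
```

The returned list is only a set of candidates; as the text states, a person still confirms which of them are added to the enterprise word bank and the labels.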
And discovering a new rule: clustering corpora through scripts, and clustering the corpora together according to the similarity between sentences so as to extract each class of label rules. The rules are manually confirmed whether or not to be supplemented into the label rules.
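The two discovery steps above can be sketched as follows. The application does not name the importance measure or the clustering algorithm, so this sketch assumes a TF-IDF style score for keywords and a greedy grouping by Jaccard similarity for sentences. All names (`tokenize`, `top_keywords`, `cluster`) and the threshold value are hypothetical, and whitespace splitting stands in for the word segmentation script.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Whitespace split stands in for the word segmentation script.
    return sentence.lower().split()


def top_keywords(corpus, k=3):
    """Rank words by frequency times inverse document frequency (assumed measure)."""
    docs = [tokenize(s) for s in corpus]
    tf = Counter(w for d in docs for w in d)
    df = Counter(w for d in docs for w in set(d))
    score = {w: tf[w] * math.log(len(docs) / df[w]) for w in tf}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(score, key=lambda w: (-score[w], w))[:k]


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(corpus, threshold=0.4):
    """Greedy clustering: join the first cluster whose first sentence is similar enough."""
    clusters = []
    for s in corpus:
        for c in clusters:
            if jaccard(tokenize(s), tokenize(c[0])) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


corpus = [
    "please refund my order",
    "please refund the order",
    "please find my parcel",
    "please find the parcel",
]
keywords = top_keywords(corpus)
clusters = cluster(corpus)
print(keywords)
print(clusters)
```

In this sketch a word that appears in every sentence, such as "please", scores zero and is never proposed as a keyword, and each resulting cluster is a candidate from which a reviewer could derive one label rule.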
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the automatic label labeling method according to the first aspect is implemented.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the automatic label labeling method according to the first aspect.
In addition, the automatic label labeling method of the embodiment of the present application described in conjunction with fig. 1 may be implemented by a computer device. Fig. 7 is a hardware structure diagram of a computer device according to an embodiment of the present application.
The computer device may comprise a processor 81 and a memory 82 in which computer program instructions are stored.
Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
Memory 82 may include mass storage for data or instructions. By way of example and not limitation, memory 82 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 82 may include removable or non-removable (fixed) media, where appropriate. Memory 82 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, memory 82 is non-volatile memory. In particular embodiments, memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these. The RAM may be Static Random-Access Memory (SRAM) or Dynamic Random-Access Memory (DRAM), where the DRAM may be Fast Page Mode DRAM (FPM DRAM), Extended Data Out DRAM (EDO DRAM), Synchronous DRAM (SDRAM), and the like.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 reads and executes the computer program instructions stored in the memory 82 to implement any one of the automatic labeling methods in the above embodiments.
In some of these embodiments, the computer device may also include a communication interface 83 and a bus 80. As shown in fig. 7, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.
The communication interface 83 is used for implementing communication between modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also carry out data communication with other components, such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.
Based on a label system, the computer device can implement the automatic label labeling method described in conjunction with fig. 1.
In addition, in combination with the automatic label labeling method in the foregoing embodiments, the embodiments of the present application may provide a computer-readable storage medium for its implementation. The computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any one of the automatic label labeling methods in the above embodiments.
Compared with the prior art, the method for training and optimizing label rules provided by the invention maintains corresponding word banks, labels, and rule systems by industry, performs automatic labeling through an algorithm model, and provides manual evaluation, thereby improving labeling accuracy. For the cold-start stage of a brand-new service scenario, a discovery function is provided to supplement the word bank and the label rules.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. An automatic label labeling method, characterized by comprising the following steps:
a word grouping step: establishing a plurality of independent word banks based on the business corpus, and grouping the words in each word bank;
a label and label rule defining step: based on each word bank, after defining label groups according to business elements, selecting words from the word bank or customizing words as labels, and defining label rules based on the matching mode of automatic label labeling;
a label labeling step: matching the words in the text to be labeled against the words in the word groups based on the label rules, labeling the words in the text to be labeled that meet the matching condition, and storing the automatic label labeling result in a label result table to complete the automatic labeling of labels.
2. The automatic labeling method of claim 1, further comprising:
a label labeling result evaluating step: after the automatic label labeling result is evaluated, obtaining a label evaluation result, and performing result evaluation calculation on the evaluation task based on the label evaluation result;
a label rule optimizing step: optimizing the label rules based on the output of the result evaluation calculation;
a recalculating step: initiating recalculation based on the optimized label rules, and performing automatic system labeling again.
3. The automatic labeling method of claim 1, further comprising:
a label system cold start step: segmenting the new service corpus, calculating word frequency and importance, extracting keywords, and supplementing the keywords to the word bank and the labels; and clustering the new service corpus, extracting new label rules based on the clustering result, and supplementing the new label rules.
4. The automatic label labeling method according to claim 2, wherein the label labeling result evaluating step comprises:
an evaluation result calculation step: performing evaluation calculation on the evaluation task based on the evaluated label labeling result to obtain the coverage rate and the accuracy rate of the automatic labeling of labels.
5. An automatic label labeling system adopting the automatic label labeling method according to any one of claims 1 to 4, comprising:
a word grouping module: establishing a plurality of independent word banks based on the business corpus, and grouping the words in each word bank;
a label and label rule definition module: based on each word bank, after defining label groups according to business elements, selecting words from the word bank or customizing words as labels, and defining label rules based on the matching mode of automatic label labeling;
a label labeling module: matching the words in the text to be labeled against the words in the word groups based on the label rules, labeling the words in the text to be labeled that meet the matching condition, and storing the automatic label labeling result in a label result table to complete the automatic labeling of labels.
6. The automatic labeling system of claim 5, further comprising:
a label labeling result evaluating module: after the automatic label labeling result is evaluated, obtaining a label evaluation result, and performing result evaluation calculation on the evaluation task based on the label evaluation result;
a label rule optimization module: optimizing the label rules based on the output of the result evaluation calculation;
a recalculation module: initiating recalculation based on the optimized label rules, and performing automatic system labeling again.
7. The automatic labeling system of claim 5, further comprising:
a label system cold start module: segmenting the new service corpus, calculating word frequency and importance, extracting keywords, and supplementing the keywords to the word bank and the labels; and clustering the new service corpus, extracting new label rules based on the clustering result, and supplementing the new label rules.
8. The automatic labeling system of claim 6, wherein the label labeling result evaluating module comprises:
an evaluation result calculation module: performing evaluation calculation on the evaluation task based on the evaluated label labeling result to obtain the coverage rate and the accuracy rate of the automatic labeling of labels.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the automatic label labeling method of any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the automatic label labeling method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111240212.3A CN113961725A (en) | 2021-10-25 | 2021-10-25 | Automatic label labeling method, system, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113961725A (en) | 2022-01-21 |
Family
ID=79466695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111240212.3A Pending CN113961725A (en) | 2021-10-25 | 2021-10-25 | Automatic label labeling method, system, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113961725A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107423264A (en) * | 2017-07-10 | 2017-12-01 | 广东华联建设投资管理股份有限公司 | A kind of engineering material borrowing-word extracting method |
CN108363725A (en) * | 2018-01-08 | 2018-08-03 | 浙江大学 | A kind of method of the extraction of user comment viewpoint and the generation of viewpoint label |
CN109858018A (en) * | 2018-12-25 | 2019-06-07 | 中国科学院信息工程研究所 | A kind of entity recognition method and system towards threat information |
CN109857957A (en) * | 2019-01-29 | 2019-06-07 | 掌阅科技股份有限公司 | Establish method, electronic equipment and the computer storage medium of tag library |
CN109918662A (en) * | 2019-03-04 | 2019-06-21 | 腾讯科技(深圳)有限公司 | A kind of label of e-sourcing determines method, apparatus and readable medium |
US20190220486A1 (en) * | 2017-12-08 | 2019-07-18 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for mining general tag, server, and medium |
CN110147499A (en) * | 2019-05-21 | 2019-08-20 | 智者四海(北京)技术有限公司 | Label method, recommended method and recording medium |
CN110222709A (en) * | 2019-04-29 | 2019-09-10 | 上海暖哇科技有限公司 | A kind of multi-tag intelligence marking method and system |
CN110825876A (en) * | 2019-11-07 | 2020-02-21 | 上海德拓信息技术股份有限公司 | Movie comment viewpoint emotion tendency analysis method |
CN112445897A (en) * | 2021-01-28 | 2021-03-05 | 京华信息科技股份有限公司 | Method, system, device and storage medium for large-scale classification and labeling of text data |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116860979A (en) * | 2023-09-04 | 2023-10-10 | 上海柯林布瑞信息技术有限公司 | Medical text labeling method and device based on label knowledge base |
CN116860979B (en) * | 2023-09-04 | 2023-12-08 | 上海柯林布瑞信息技术有限公司 | Medical text labeling method and device based on label knowledge base |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804641B (en) | Text similarity calculation method, device, equipment and storage medium | |
CN104750705B (en) | Information replying method and device | |
US20190102373A1 (en) | Model-based automatic correction of typographical errors | |
EP2581843B1 (en) | Bigram Suggestions | |
CN106033416A (en) | A string processing method and device | |
CN110209790B (en) | Question-answer matching method and device | |
CN113836925B (en) | Training method and device for pre-training language model, electronic equipment and storage medium | |
CN107291775B (en) | Method and device for generating repairing linguistic data of error sample | |
CN105608113B (en) | Judge the method and device of POI data in text | |
CN112256845A (en) | Intention recognition method, device, electronic equipment and computer readable storage medium | |
CN110990563A (en) | Artificial intelligence-based traditional culture material library construction method and system | |
CN112380348B (en) | Metadata processing method, apparatus, electronic device and computer readable storage medium | |
CN112183102A (en) | Named entity identification method based on attention mechanism and graph attention network | |
CN113704623A (en) | Data recommendation method, device, equipment and storage medium | |
CN112784572A (en) | Marketing scene conversational analysis method and system | |
CN113961725A (en) | Automatic label labeling method, system, equipment and storage medium | |
CN113779364A (en) | Searching method based on label extraction and related equipment thereof | |
CN111339287B (en) | Abstract generation method and device | |
CN110188274B (en) | Search error correction method and device | |
CN109947947B (en) | Text classification method and device and computer readable storage medium | |
US10467291B2 (en) | Method and system for providing query suggestions | |
CN107203512B (en) | Method for extracting key elements from natural language input of user | |
CN114298028B (en) | BIM semantic disambiguation method and system | |
CN112650837B (en) | Text quality control method and system combining classification algorithm and unsupervised algorithm | |
CN111967257B (en) | Word segmentation method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||