CN111125312A

CN111125312A - Text labeling method and system

Info

Publication number: CN111125312A
Application number: CN201911354042.4A
Authority: CN
Inventors: 刘宝强; 肖云飞
Original assignee: Shenzhen Skieer Information Technology Co ltd
Current assignee: Shenzhen Skieer Information Technology Co ltd
Priority date: 2019-12-24
Filing date: 2019-12-24
Publication date: 2020-05-08

Abstract

The invention belongs to the technical field of natural language processing and discloses a text labeling method and a text labeling system. Invalid texts are filtered out using custom rules; the effective text is split and refined into words and short sentences; the refined words and short sentences are divided into corresponding attribute labels and emotions; similar attribute labels are processed; and association relations between the attribute labels and the emotions are formed, so as to generate effective data for supervised learning of a model. The text labeling system comprises a data filtering module, a labeling module, a data tracking and counting module, a data review module, a user configuration module and a self-starting model training module. The method and system are suitable for various text labeling scenarios and provide a simpler and more efficient way of labeling; they simplify the user's operations and the information filtering process; and in the process from inputting text to generating a model they form a pipeline, which improves overall working efficiency.

Description

Text labeling method and system

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a text labeling method and system.

Background

The closest prior art is as follows. In recent years, driven by technologies and demands such as search, information extraction and machine translation, natural language processing has developed rapidly into an independent discipline and has attracted much attention. However, interacting with a computer through natural language is still very difficult: the computer must not only be taught how to recognize natural language, but its misrecognitions must also be corrected. How to make machines understand natural language better is a problem that experts and scholars have long been trying to solve.

Put simply, for a computer to understand natural language is, more precisely, for it to understand the meaning of a corpus. However, corpora are numerous, and language data with the same meaning can be expressed in different ways, which makes computer understanding harder. It is therefore important to label text manually to form a standardized corpus before a computer tries to understand it. With labeled language data, learning becomes easier for the computer and the results are better.

In present-day text, people often express an emotional tendency toward a certain aspect of a thing. For example: "The sound insulation effect is not particularly good, but it should be considered fairly good among vehicles of the same price." This sentence shows attention to the sound insulation of the automobile, and through the comparison it expresses both a negative and a positive emotion toward it. If the association relation between attribute labels and emotions is extracted during labeling, the computer can better infer how good or bad the attribute is. This illustrates the importance of text annotation in the field of natural language processing.

In summary, the problems of the prior art are as follows:

In the prior art, all attribute labels and their emotions in a text cannot be effectively marked to form a standardized corpus; during labeling, the user's operations and the information filtering process cannot be simplified; and in the process from inputting text to generating a model, pipeline operation cannot be formed, so overall working efficiency cannot be improved.

Disclosure of Invention

To address the problems in the prior art, the invention provides a text labeling method and a text labeling system. The invention provides an efficient and visual labeling method for text processing, generates effective training, testing and verification data sets for a natural language processing model, and provides a one-stop solution for data processing, model training and model verification.

The invention is realized in such a way that a text labeling method comprises the following steps:

step one, customizing a text filtering rule and screening texts to obtain effective text data;

step two, splitting the text and refining it into words and short sentences;

and step three, dividing the words and short sentences into attribute labels and emotions, and then extracting the relationship between the attribute labels and the emotions.

Further, in the first step, the text filtering rule is customized by a user, and the user defines different rules according to different text data sources to provide an autonomous and controllable data filtering scheme.

Furthermore, in the third step, the attribute tags and the emotions have an association relationship and are used for defining the emotional tendency of the attribute tags.

Further, in the third step, the same short sentence has different attribute labels, emotions and corresponding association relations.

Further, in the third step, the association relations between the attribute labels and the emotions may span words and short sentences, and include cases where multiple words and short sentences under the same attribute label carry the same or different emotional tendencies.

Another object of the present invention is to provide a text labeling system comprising:

the data filtering module is used by a user to customize filtering rules and, once the rules are set, screens out effective text data according to them;

the labeling module is connected with the data filtering module and is used for taking out and displaying text from the effective text data screened out by the data filtering module, letting the user mark the text with the corresponding attribute labels, emotions and the relations among them according to predefined indexes, and finally storing all labeling results;

the data tracking and counting module is connected with the labeling module and is used for distinguishing the text data labeled by different users in the labeling module, recording the corresponding user labeling numbers as users take texts from the screened effective text data and store labeling results, and counting each user's labeling workload and tracking the data;

and the data review module is connected with the data tracking and counting module and, after the user has labeled the data processed by the labeling module and the data tracking and counting module, checks the data and the data volume, randomly samples the labeling results in proportion, returns wrong and invalid labeling results, and finally judges whether the data is usable according to the quality of the labeling results.

Further, the data review module is also used for performing spot check on data of a certain attribute label, emotion or a certain user mark to verify whether the data meets the standard.

Further, the text labeling system further comprises:

the user configuration module is connected with the data filtering module and used for receiving and managing the user-defined text filtering rules;

and the self-starting model training module is connected with the data review module and is used for starting model training once the effective labeled data reaches the standard, and outputting a verification report after training.
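The role of the data tracking and counting module can be illustrated with a minimal sketch. It assumes that each labeling action is recorded as an event; the `LabelEvent` type, its field names and the status values are hypothetical and are not specified by the source.

```python
from collections import Counter
from typing import List, NamedTuple


class LabelEvent(NamedTuple):
    """One recorded action on a text (hypothetical record layout)."""
    text_id: int
    user: str
    status: str  # assumed values: "labeled" or "returned"


def workload_by_user(events: List[LabelEvent]) -> Counter:
    """Count how many labeling actions each user has completed."""
    return Counter(e.user for e in events if e.status == "labeled")


def history_of(events: List[LabelEvent], text_id: int) -> List[LabelEvent]:
    """Return every recorded event for one text, in the order recorded."""
    return [e for e in events if e.text_id == text_id]


events = [
    LabelEvent(1, "user_a", "labeled"),
    LabelEvent(2, "user_a", "labeled"),
    LabelEvent(3, "user_b", "labeled"),
    LabelEvent(2, "reviewer", "returned"),
]
workload = workload_by_user(events)
trace = history_of(events, 2)
```

Keeping the events as an append-only list is one simple way to serve both the workload statistics and the per-text tracking described above.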

The invention also aims to provide an information data processing terminal for realizing the text labeling method.

Another object of the present invention is to provide a computer-readable storage medium, comprising instructions which, when run on a computer, cause the computer to perform the text annotation method.

In summary, the advantages and positive effects of the invention are as follows: the invention provides an efficient and visual text labeling method and system that are suitable for various text labeling scenarios, provide a simpler and more efficient way of labeling, and can better mark all attribute labels and their emotions in a text to form a standardized corpus.

According to the text labeling method and system provided by the invention, invalid texts are automatically filtered out according to text filtering rules preset by the user; the texts are split and refined to the granularity of words and short sentences; and attribute labels, emotions and the association relations between them are labeled on the basis of those words and short sentences. The user's operations and the information filtering process are simplified. In the process from inputting text to generating a model, a pipeline is formed, which improves overall working efficiency.

Drawings

Fig. 1 is a flowchart of a text annotation method according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a text annotation system according to an embodiment of the present invention.

In the figure: 1. data filtering module; 2. labeling module; 3. data tracking and counting module; 4. data review module; 5. user configuration module; 6. self-starting model training module.

Fig. 3 is a schematic diagram of an implementation of the text annotation method according to the embodiment of the present invention.

Fig. 4 is a schematic diagram of an application example of the text annotation method according to the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In the prior art, all attribute labels and their emotions in a text cannot be effectively marked to form a standardized corpus, and the user's operations and the information filtering process cannot be simplified during labeling. In the process from inputting text to generating a model, pipeline operation cannot be formed, so overall working efficiency cannot be improved.

In view of the problems in the prior art, the present invention provides a text labeling method and system, which are described in detail below with reference to the accompanying drawings.

As shown in fig. 1, a text annotation method provided in an embodiment of the present invention includes:

S101: customizing a text filtering rule and screening the text to obtain effective text data.

S102: splitting the text and refining it into words and short sentences.

S103: dividing the words and short sentences into attribute labels and emotions, and extracting the relationship between the attribute labels and the emotions.

In the embodiment of the present invention, in step S101, the text filtering rule is customized by a user, and the user may define different rules according to different text data sources, thereby implementing a completely autonomous controllable filtering scheme.

In the embodiment of the present invention, in step S103, the relationship between the attribute tag and the emotion is extracted, and the emotional tendency of the attribute tag is clarified.

Where several attribute labels or emotions exist in the same short sentence, all of them can be marked.

Where the association relation between an attribute label and an emotion spans several words or short sentences, it can be labeled across those words and short sentences.

As shown in fig. 2, the text labeling system provided in the embodiment of the present invention includes:

The data filtering module 1, which screens out effective text data according to user-defined filtering rules.

The labeling module 2, which labels the filtered effective text data with attribute labels, emotions and the relations between them.

The data tracking and counting module 3, which tracks the flow of each piece of text data and compiles statistics according to different indexes.

The data review module 4, which randomly samples the labeled data in proportion for each attribute label for review and returns wrong and invalid labeled data. It is also used for spot-checking the data of a certain attribute label, a certain emotion or a certain user to verify whether the data meets the standard.

The user configuration module 5, which receives and manages the user-defined text filtering rules.

The self-starting model training module 6, which automatically starts model training once the labeled data reaches the standard and outputs a verification report after training.
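The proportional spot check of module 4 and the trigger of module 6 can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the proportion is applied or what "reaching the standard" means, so the per-label sampling ratio, the "at least one item per label" rule, the item layout and the count threshold are all hypothetical.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List


def sample_for_review(items: List[dict], ratio: float, seed: int = 0) -> List[dict]:
    """Draw the same proportion of labeled items from each attribute label."""
    by_label: Dict[str, List[dict]] = defaultdict(list)
    for item in items:
        by_label[item["attribute"]].append(item)
    rng = random.Random(seed)  # seeded so a review batch can be reproduced
    sample: List[dict] = []
    for label in sorted(by_label):
        group = by_label[label]
        k = max(1, math.ceil(len(group) * ratio))  # assumed: at least one per label
        sample.extend(rng.sample(group, k))
    return sample


def should_start_training(approved_count: int, threshold: int) -> bool:
    """Assumed standard: enough approved labeled items have accumulated."""
    return approved_count >= threshold


# Hypothetical labeled items: 10 about sound insulation, 5 about price.
items = [{"id": i, "attribute": "sound_insulation"} for i in range(10)]
items += [{"id": 10 + i, "attribute": "price"} for i in range(5)]
review_batch = sample_for_review(items, ratio=0.2)
```

Sampling per label rather than over the whole pool keeps rare attribute labels from being skipped in a review round.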

The invention is further described with reference to specific examples.

Examples

As shown in fig. 3, the text annotation method provided in the embodiment of the present invention includes:

S01: screening out effective text data according to the text filtering rules.

After the data is imported, it is first filtered and cleaned to screen out effective text data. Invalid text is not deleted; it is retained but is not used for labeling.

In addition, the text filtering rules include some preset rules for handling common invalid text data, for example text whose whole sentence is "666", "23333", "good", "normal", etc.
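Step S01 can be sketched as below. The preset sentences come from the example above; everything else (the `TextRecord` type, the rule signature, the per-source rule table and the source names) is a hypothetical illustration rather than the patented implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A rule returns True when the text should be treated as invalid.
Rule = Callable[[str], bool]

# Preset whole-sentence examples taken from the description.
PRESET_INVALID_SENTENCES = {"666", "23333", "good", "normal"}


def is_preset_invalid(text: str) -> bool:
    """Flag text whose whole content is one of the preset invalid sentences."""
    return text.strip() in PRESET_INVALID_SENTENCES


@dataclass
class TextRecord:
    text: str
    source: str
    valid: bool = True  # invalid records are kept, only excluded from labeling


def apply_filters(records: List[TextRecord],
                  rules_by_source: Dict[str, List[Rule]]) -> List[TextRecord]:
    """Mark records invalid in place and return only the valid ones."""
    for record in records:
        rules = [is_preset_invalid] + rules_by_source.get(record.source, [])
        record.valid = not any(rule(record.text) for rule in rules)
    return [r for r in records if r.valid]


# Hypothetical user-defined rule for a hypothetical "forum" source.
rules_by_source = {"forum": [lambda t: len(t.strip()) < 4]}

records = [
    TextRecord("666", "forum"),
    TextRecord("ok", "forum"),
    TextRecord("The sound insulation is not particularly good.", "forum"),
    TextRecord("ok", "review_site"),
]
valid = apply_filters(records, rules_by_source)
```

Flagging records instead of deleting them mirrors the statement that invalid text is retained but not used for labeling.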

S02: splitting the text and refining it into words and short sentences. The effective text screened out in step S01 is split into several words or short sentences according to grammar rules. Words and short sentences split from the same text are still displayed together in the labeling stage, because context dependencies may exist between them.
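A minimal sketch of step S02 follows. The source only says the split follows grammar rules, so splitting on clause punctuation is a simplifying assumption, and the delimiter set and function name are hypothetical.

```python
import re
from typing import List

# Assumed delimiters: common Chinese and English clause punctuation.
_CLAUSE_DELIMITERS = re.compile(r"[,.;!?，。；！？]+")


def split_into_phrases(text: str) -> List[str]:
    """Split text into short sentences, dropping empty fragments."""
    parts = _CLAUSE_DELIMITERS.split(text)
    return [p.strip() for p in parts if p.strip()]


text = ("The sound insulation effect is not particularly good, "
        "but should be considered better in a vehicle of the same price.")
phrases = split_into_phrases(text)

# Phrases from one text stay grouped so the labeler sees them in context.
grouped = {"text": text, "phrases": phrases}
```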

S03: dividing attribute labels and emotions.

The data from step S02 is classified: keywords in the words and short sentences are marked as a certain attribute label or a certain emotion. The same short sentence may contain several keywords, which may belong to the same attribute label or to different ones, and all of them are marked; the same applies to emotions.

The attribute labels and emotions are preset, and the meaning of each index is standardized, which avoids deviations in understanding among users during labeling.

S04: extracting the relations.

The relations between attribute labels and emotions are found from the division result of step S03. One attribute label may correspond to several emotion words, or several attribute labels may correspond to one emotion word. An attribute label and an emotion may also have no relation at all; in that case the process goes directly to step S05, and the labeled data is stored in the database.
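The output of steps S03 and S04 can be pictured as a small record per text. The sketch below is one possible layout, not the one used by the invention: the `Span` and `Annotation` types, the index-pair encoding of relations and the label names are all hypothetical. It uses the sound insulation example from the Background section, where one attribute label is related to two emotions found in different short sentences.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    phrase_index: int  # which short sentence the keyword comes from
    keyword: str       # the marked keyword
    label: str         # a preset attribute label or a preset emotion


@dataclass
class Annotation:
    phrases: List[str]
    attributes: List[Span] = field(default_factory=list)
    emotions: List[Span] = field(default_factory=list)
    # Each relation pairs an index into attributes with an index into emotions,
    # which allows one-to-many, many-to-one and cross-phrase relations.
    relations: List[Tuple[int, int]] = field(default_factory=list)

    def emotions_for(self, attribute_index: int) -> List[str]:
        """List the emotion labels related to one attribute."""
        return [self.emotions[e].label
                for a, e in self.relations if a == attribute_index]


ann = Annotation(phrases=[
    "the sound insulation effect is not particularly good",
    "but should be considered better in a vehicle of the same price",
])
ann.attributes.append(Span(0, "sound insulation effect", "sound_insulation"))
ann.emotions.append(Span(0, "not particularly good", "negative"))
ann.emotions.append(Span(1, "better", "positive"))
ann.relations += [(0, 0), (0, 1)]  # the second relation crosses short sentences
```

An annotation with an empty `relations` list corresponds to the case where no relation exists and the data is stored directly.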

The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented wholly or partially in software, they may take the form of a computer program product that includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A text labeling method is characterized by comprising the following steps:

screening texts by using a text filtering rule to obtain effective text data;

2. The method for annotating text as recited in claim 1, wherein in step one, the text filtering rules are different rules according to different sources of text data, and are used for providing an autonomously controllable data filtering scheme.

3. The method for labeling text according to claim 1, wherein in the third step, the attribute tags and the emotion have an association relationship for clarifying the emotional tendency of the attribute tags.

4. The method for labeling texts according to claim 1, wherein in the third step, the same short sentence has different attribute labels, emotions and corresponding association relations.

5. The text labeling method of claim 1, wherein in the third step, the association relations between the attribute labels and the emotions span words and short sentences, and the association relations comprise the same or different emotional tendencies of a plurality of words and short sentences under the same attribute label.

6. A text annotation system for implementing the text annotation method of claim 1, wherein the text annotation system comprises:

the data filtering module is used for screening out effective text data through a filtering rule;

the labeling module is connected with the data filtering module and is used for taking out and displaying text from the effective text data screened out by the data filtering module, marking the text with the corresponding attribute labels, emotions and the relations among them according to predefined indexes, and finally storing all labeling results;

the data tracking and counting module is connected with the labeling module and is used for recording the corresponding labeling numbers as texts are taken from the screened effective text data and labeling results are stored, and for counting the labeling workload and tracking the data;

and the data review module is connected with the data tracking and counting module, checks the data and the data volume of the data processed by the labeling module and the data tracking and counting module, randomly samples the labeling results in proportion, returns wrong and invalid labeling results, and finally judges whether the data is usable according to the quality of the labeling results.

7. The system of claim 6, wherein the data review module is further configured to perform a spot check on data of a certain attribute tag, emotion, or a certain label to verify whether the data meets a criterion.

8. The text annotation system of claim 6, wherein said text annotation system further comprises:

the user configuration module is connected with the data filtering module and used for receiving and managing the text filtering rules;

and the self-starting model training module is connected with the data reviewing module and used for starting model training and outputting a verification report after training when the effective marked data reach the standard.

9. An information data processing terminal for implementing the text labeling method according to any one of claims 1 to 5.

10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the text annotation method of any one of claims 1-5.