CN111125312A - Text labeling method and system - Google Patents

Publication number
CN111125312A
Authority
CN
China
Prior art keywords
text
data
module
labeling
emotions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911354042.4A
Other languages
Chinese (zh)
Inventor
刘宝强
肖云飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Skieer Information Technology Co ltd
Original Assignee
Shenzhen Skieer Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Skieer Information Technology Co ltd
Priority to CN201911354042.4A
Publication of CN111125312A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles

Abstract

The invention belongs to the technical field of natural language processing and discloses a text labeling method and a text labeling system. In the method, invalid texts are filtered out using custom rules; the valid text is split and refined into words and phrases; the refined words and phrases are classified into attribute labels and emotions; similar attribute labels are processed; and association relations between the attribute labels and the emotions are formed, so as to generate valid data for supervised learning of a model. The text labeling system comprises a data filtering module, a labeling module, a data tracking and counting module, a data review module, a user configuration module and a self-starting model training module. The method and system are suitable for a variety of text labeling scenarios and provide a simpler and more efficient way of labeling; they simplify the user's operations and the information filtering process; and the process from text input to model generation forms a pipeline, which improves overall working efficiency.

Description

Text labeling method and system
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a text labeling method and system.
Background
The closest prior art at present is as follows. In recent years, driven by technologies and demand for search, information extraction, machine translation and the like, natural language processing has developed rapidly into an independent discipline and has attracted wide attention. However, interacting with a computer through natural language remains very difficult: the computer must not only be taught how to recognize natural language, but its recognition errors must also be corrected. How to make machines understand natural language better is a problem that experts and scholars have long been trying to solve.
Put simply, for a computer to understand natural language means understanding the meaning of a corpus. However, corpora are large, and text with the same meaning can be expressed in many different ways, which makes it harder for the computer to understand. It is therefore important to label text manually to form a standardized corpus before the computer attempts to understand it. With a labeled corpus, learning becomes easier for the computer and the results are better.
In current texts, people often express an emotional tendency toward a particular aspect of something. For example: "the sound insulation effect is not particularly good, but it should be considered fairly good for a vehicle of the same price." This sentence shows attention to the sound insulation of the automobile, and the comparison expresses both a negative and a positive emotion toward it. If the association relation between the attribute labels and the emotions is extracted during labeling, the computer can better infer how good or bad each attribute is. This illustrates the importance of text annotation in the technical field of natural language processing.
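The labeling result for the sound insulation example can be pictured as a small set of records. The following is a minimal sketch only: the `Relation` class, its field names, the emotion values "positive" and "negative", and the paraphrased English sentence are assumptions made for illustration and are not defined in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One attribute-emotion association (hypothetical record layout)."""
    attribute: str  # attribute label, e.g. "sound insulation"
    phrase: str     # the phrase that carries the emotion
    emotion: str    # assumed values: "positive" or "negative"


# Paraphrase of the example sentence, used only for illustration.
sentence = ("The sound insulation is not particularly good, "
            "but it is fairly good for a vehicle at this price.")

# One attribute label associated with two opposite emotions.
annotations = [
    Relation("sound insulation", "not particularly good", "negative"),
    Relation("sound insulation",
             "fairly good for a vehicle at this price", "positive"),
]


def emotions_for(attribute, relations):
    """Return the sorted set of emotions attached to one attribute label."""
    return sorted({r.emotion for r in relations if r.attribute == attribute})
```

Calling `emotions_for("sound insulation", annotations)` returns both emotions, which is the kind of mixed tendency that a single sentence-level sentiment label would lose.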
In summary, the problems of the prior art are as follows:
In the prior art, not all attribute labels and their emotions in a text can be effectively labeled to form a standardized corpus; during labeling, the user's operations and the information filtering process cannot be simplified; and the process from text input to model generation cannot form a pipeline, so overall working efficiency cannot be improved.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a text labeling method and a text labeling system. The invention provides an efficient and visual labeling method for text processing, generates effective training, test and validation data sets for natural language processing models, and provides a one-stop solution for data processing, model training and model verification.
The invention is realized as follows. A text labeling method comprises the following steps:
step one, customizing a text filtering rule and screening texts to obtain valid text data;
step two, splitting the text and refining it into words and phrases;
step three, classifying the words and phrases into attribute labels and emotions, and then extracting the relations between the attribute labels and the emotions.
Further, in step one, the text filtering rule is customized by the user, who defines different rules for different text data sources, which provides an autonomous and controllable data filtering scheme.
Further, in step three, the attribute labels and the emotions have an association relation, which is used to make clear the emotional tendency of each attribute label.
Further, in step three, the same phrase may have several different attribute labels, emotions and corresponding association relations.
Further, in step three, the association relations between attribute labels and emotions may span words and phrases, including the case where several words and phrases under the same attribute label carry the same or different emotional tendencies.
Another object of the present invention is to provide a text labeling system comprising:
a data filtering module, in which the user customizes filtering rules and which, once the rules are set, screens out valid text data according to them;
a labeling module, connected to the data filtering module, which takes texts from the valid text data screened out by the data filtering module and displays them, lets the user label each text with the corresponding attribute labels, emotions and the relations between them according to predefined indicators, and finally stores all labeling results;
a data tracking and counting module, connected to the labeling module, which, as different users take texts from the screened valid text data and store their labeling results in the labeling module, records the corresponding user label numbers, so as to count each user's labeling workload and track the data;
a data review module, connected to the data tracking and counting module, which, after users have labeled the data handled by the labeling module and the data tracking and counting module, checks the data and its volume, randomly samples the labeling results in proportion, returns wrong and invalid labeling results, and finally judges whether the data is usable according to the quality of the labeling results.
Further, the data review module is also used to spot-check the data for a particular attribute label, a particular emotion or a particular user's labels, to verify whether the data meets the standard.
Further, the text labeling system further comprises:
a user configuration module, connected to the data filtering module, for receiving and managing the user-defined text filtering rules;
and a self-starting model training module, connected to the data review module, for starting model training when the valid labeled data reaches the standard and outputting a verification report after training.
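The behavior of the self-starting model training module can be sketched as a simple gate. The patent does not say what "reaching the standard" means, so this sketch assumes a plain count threshold; `maybe_start_training`, `fake_train` and the report fields are hypothetical names, and the training job is a stand-in.

```python
from typing import Callable, Optional


def maybe_start_training(
    valid_labeled: int,
    required: int,
    train: Callable[[], dict],
) -> Optional[dict]:
    """Start training only once enough valid labeled data is available.

    Assumption: the "standard" is a minimum number of valid labeled items.
    """
    if valid_labeled < required:
        return None   # not enough data yet; labeling continues
    return train()    # the training job returns its verification report


def fake_train() -> dict:
    # Stand-in for a real training job; the report content is a placeholder.
    return {"status": "trained", "report": "verification report placeholder"}
```

With a threshold of 100, `maybe_start_training(10, 100, fake_train)` returns `None`, while `maybe_start_training(100, 100, fake_train)` runs the stand-in job and returns its report.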
Another object of the present invention is to provide an information data processing terminal that implements the text labeling method.
Another object of the present invention is to provide a computer-readable storage medium, comprising instructions which, when run on a computer, cause the computer to perform the text annotation method.
In summary, the advantages and positive effects of the invention are as follows: the invention provides an efficient and visual text labeling method and system that are suitable for a variety of text labeling scenarios, provide a simpler and more efficient way of labeling, and can better label all attribute labels and their emotions in a text to form a standardized corpus.
According to the text labeling method and system provided by the invention, invalid texts are automatically filtered out according to the user's preset text filtering rules, the texts are split and refined to the granularity of words and phrases, and attribute labels, emotions and the association relations between them are labeled on that basis. This simplifies the user's operations and the information filtering process. The process from text input to model generation forms a pipeline, which improves overall working efficiency.
Drawings
Fig. 1 is a flowchart of a text annotation method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a text annotation system according to an embodiment of the present invention.
In the figure: 1. data filtering module; 2. labeling module; 3. data tracking and counting module; 4. data review module; 5. user configuration module; 6. self-starting model training module.
Fig. 3 is a schematic diagram of an implementation of the text annotation method according to the embodiment of the present invention.
Fig. 4 is a schematic diagram of an application example of the text annotation method according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the prior art, not all attribute labels and their emotions in a text can be effectively labeled to form a standardized corpus, and during labeling the user's operations and the information filtering process cannot be simplified. The process from text input to model generation cannot form a pipeline, so overall working efficiency cannot be improved.
In view of the problems in the prior art, the present invention provides a text labeling method and system, which are described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the text annotation method provided in an embodiment of the present invention includes:
S101, customizing a text filtering rule and screening the text to obtain valid text data.
S102, splitting the text and refining it into words and phrases.
S103, classifying the words and phrases into attribute labels and emotions, and extracting the relations between the attribute labels and the emotions.
In the embodiment of the present invention, in step S101, the text filtering rule is customized by the user, who may define different rules for different text data sources, giving a fully autonomous and controllable filtering scheme.
In the embodiment of the present invention, in step S103, the relations between the attribute labels and the emotions are extracted to make clear the emotional tendency of each attribute label.
Where several attribute labels or emotions exist in the same phrase, all of them can be labeled.
Where the association relation between an attribute label and an emotion spans words and phrases, it can be labeled across those words and phrases.
As shown in fig. 2, the text labeling system provided in the embodiment of the present invention includes:
A data filtering module 1, which screens out valid text data according to custom filtering rules.
A labeling module 2, which labels the filtered valid text data with attribute labels, emotions and the relations between them.
A data tracking and counting module 3, which tracks the flow of each piece of text data and compiles statistics by different indicators.
A data review module 4, which randomly samples labeled data in proportion by attribute label for review and returns wrong and invalid labeled data. It is also used to spot-check the data for a particular attribute label, emotion or user, to verify whether the data meets the standard.
A user configuration module 5, which receives and manages the user-defined text filtering rules.
A self-starting model training module 6, which automatically starts model training once the labeled data reaches the standard and outputs a verification report after training.
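The proportional random review performed by the data review module 4 can be sketched as per-label sampling. This is an illustrative sketch under stated assumptions: the record layout (`id`, `attribute`), the function name `sample_for_review`, and the choice to review at least one record per attribute label are not specified in the patent.

```python
import math
import random
from collections import defaultdict


def sample_for_review(records, ratio, seed=None):
    """Draw roughly the same fraction of records from every attribute label."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["attribute"]].append(rec)
    picked = []
    for group in by_label.values():
        # Assumption: review at least one record per label, never more
        # than the group contains.
        k = min(len(group), max(1, math.ceil(len(group) * ratio)))
        picked.extend(rng.sample(group, k))
    return picked


# Hypothetical labeled records: 20 for one label, 5 for another.
records = (
    [{"id": i, "attribute": "sound insulation"} for i in range(20)]
    + [{"id": 100 + i, "attribute": "price"} for i in range(5)]
)
review_batch = sample_for_review(records, ratio=0.1, seed=0)
```

Sampling within each label rather than over the whole pool keeps rare attribute labels from being skipped by the review, which matches the description of extracting data "according to different attribute labels in proportion".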
The invention is further described with reference to specific examples.
Examples
As shown in fig. 3, the text annotation method provided in the embodiment of the present invention includes:
S01: screening out valid text data according to the text filtering rules.
After the data is imported, it is first filtered and cleaned to screen out valid text data. Invalid text is not deleted; it is retained but not used for labeling.
In addition, some text filtering rules are preset to handle common invalid text data, for example a whole sentence that consists only of "666", "23333", "good", "normal", and the like.
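Step S01 can be sketched as a list of rule functions applied to each text. This is a minimal sketch, not the patent's implementation: the rule representation (a callable that returns True for invalid text), the names `PRESET_RULES` and `screen`, and the two concrete rules are assumptions derived from the examples "666", "23333", "good" and "normal".

```python
import re

# Hypothetical preset rules: each returns True when the text is invalid.
PRESET_RULES = [
    lambda t: re.fullmatch(r"\d+", t) is not None,  # digits only: "666", "23333"
    lambda t: t.lower() in {"good", "normal"},      # whole text is a generic word
]


def screen(texts, rules=PRESET_RULES):
    """Split texts into (valid, invalid) without deleting anything."""
    valid, invalid = [], []
    for text in texts:
        stripped = text.strip()
        if not stripped or any(rule(stripped) for rule in rules):
            invalid.append(text)  # retained, just not used for labeling
        else:
            valid.append(text)
    return valid, invalid
```

A user-defined rule for a particular data source can be added by passing a longer list, for example `screen(texts, PRESET_RULES + [lambda t: len(t) < 3])`.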
S02: splitting the text and refining it into words and phrases. The valid text screened out in step S01 is split into several words or phrases according to grammar rules. Words and phrases split from the same text are still displayed together in the labeling stage, because they may depend on one another for context.
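Step S02 can be sketched as follows. The patent splits by grammar rules that it does not spell out, so this sketch substitutes a much simpler technique, splitting at punctuation marks; the function name `split_text` and the returned dictionary layout are assumptions.

```python
import re

# Assumption: phrase boundaries are approximated by punctuation marks
# (both ASCII and full-width), not by real grammatical analysis.
_BOUNDARY = re.compile(r"[,.;!?，。；！？]+")


def split_text(text):
    """Split one text into phrases, kept together with their source text."""
    phrases = [p.strip() for p in _BOUNDARY.split(text) if p.strip()]
    return {"source": text, "phrases": phrases}


result = split_text(
    "The sound insulation is not great, but it is fine for the price."
)
```

Keeping the phrases in one record alongside the source text reflects the requirement that phrases from the same text be displayed together during labeling, since an emotion in one phrase may refer to an attribute in another.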
S03: classifying attribute labels and emotions.
The data from step S02 is classified. The keywords in the words and phrases are marked as a particular attribute label or a particular emotion. Several keywords may exist in the same phrase; whether they belong to the same attribute label or to different ones, all of them are marked. The same applies to emotions.
The attribute labels and emotions are preset, and the meaning of each indicator is standardized to avoid differences in understanding among users during labeling.
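Because the attribute labels and emotions are preset, a labeling tool can reject any mark that falls outside the preset lists. The sketch below illustrates this under stated assumptions: the `PRESETS` contents, the `mark` function and the record fields are hypothetical and are not taken from the patent.

```python
# Hypothetical preset vocabularies; a real deployment would define its own.
PRESETS = {
    "attribute": {"sound insulation", "price"},
    "emotion": {"positive", "negative"},
}


def mark(phrase, keyword, kind, value):
    """Record one keyword mark, rejecting values outside the preset lists."""
    if keyword not in phrase:
        raise ValueError(f"{keyword!r} does not occur in the phrase")
    if kind not in PRESETS:
        raise ValueError(f"unknown kind {kind!r}")
    if value not in PRESETS[kind]:
        raise ValueError(f"{value!r} is not a preset {kind}")
    return {"phrase": phrase, "keyword": keyword, "kind": kind, "value": value}


# Several keywords in the same phrase are all marked.
phrase = "the sound insulation is poor but the price is fair"
marks = [
    mark(phrase, "sound insulation", "attribute", "sound insulation"),
    mark(phrase, "price", "attribute", "price"),
    mark(phrase, "poor", "emotion", "negative"),
    mark(phrase, "fair", "emotion", "positive"),
]
```

Validating against a fixed vocabulary is one way to keep different users' labels comparable, which is the stated purpose of standardizing the indicators.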
S04: extracting relations.
The relations between attribute labels and emotions are identified from the classification results of step S03. One attribute label may correspond to several emotion words, or several attribute labels may correspond to one emotion word. An attribute label and an emotion may also have no relation at all; in that case the process goes directly to step S05, and the labeled data is stored in the database.
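The many-to-many relations of step S04 can be sketched as explicit pairs between the marked attribute labels and emotions. The names `link` and `to_record`, the index-pair representation and the record layout are assumptions for illustration; the patent does not define a storage format.

```python
def link(attribute_marks, emotion_marks, pairs):
    """Build relations from explicit (attribute index, emotion index) pairs."""
    return [
        {"attribute": attribute_marks[a], "emotion": emotion_marks[e]}
        for a, e in pairs
    ]


def to_record(text, attributes, emotions, relations):
    """Assemble the record to be stored; an empty relation list is valid."""
    return {
        "text": text,
        "attributes": attributes,
        "emotions": emotions,
        "relations": relations,
    }


attributes = ["sound insulation", "price"]
emotions = ["negative", "positive"]
# One attribute linked to two emotions, and two attributes sharing one emotion.
relations = link(attributes, emotions, [(0, 0), (0, 1), (1, 1)])
```

Storing relations as a separate list, rather than as a field on each attribute, lets one attribute label point to several emotion words and several attribute labels point to the same emotion word, and it represents the "no relation" case as an empty list.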
Fig. 4 is a schematic diagram of an application example of the text annotation method according to the embodiment of the present invention.
The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented wholly or partially in software, they may take the form of a computer program product that includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A text labeling method, characterized by comprising the following steps:
step one, screening texts using a text filtering rule to obtain valid text data;
step two, splitting the text and refining it into words and phrases;
step three, classifying the words and phrases into attribute labels and emotions, and then extracting the relations between the attribute labels and the emotions.
2. The text labeling method according to claim 1, wherein in step one, different text filtering rules are used for different sources of text data, providing an autonomously controllable data filtering scheme.
3. The text labeling method according to claim 1, wherein in step three, the attribute labels and the emotions have an association relation that makes clear the emotional tendency of the attribute labels.
4. The text labeling method according to claim 1, wherein in step three, the same phrase has different attribute labels, emotions and corresponding association relations.
5. The text labeling method according to claim 1, wherein in step three, the attribute labels and the emotions have association relations that span words and phrases, including several words and phrases under the same attribute label having the same or different emotional tendencies.
6. A text labeling system for implementing the text labeling method of claim 1, wherein the text labeling system comprises:
a data filtering module for screening out valid text data using a filtering rule;
a labeling module, connected to the data filtering module, for taking texts from the valid text data screened out by the data filtering module and displaying them, labeling each text with the corresponding attribute labels, emotions and the relations between them according to predefined indicators, and finally storing all labeling results;
a data tracking and counting module, connected to the labeling module, for recording the corresponding labeling numbers as texts are taken from the screened valid text data and labeling results are stored, so as to count the labeling workload and track the data;
and a data review module, connected to the data tracking and counting module, for checking the data handled by the labeling module and the data tracking and counting module and its volume, randomly sampling the labeling results in proportion, returning wrong and invalid labeling results, and finally judging whether the data is usable according to the quality of the labeling results.
7. The text labeling system of claim 6, wherein the data review module is further configured to spot-check the data for a particular attribute label, emotion, or label to verify whether the data meets the standard.
8. The text labeling system of claim 6, wherein the text labeling system further comprises:
a user configuration module, connected to the data filtering module, for receiving and managing the text filtering rules;
and a self-starting model training module, connected to the data review module, for starting model training when the valid labeled data reaches the standard and outputting a verification report after training.
9. An information data processing terminal for implementing the text labeling method according to any one of claims 1 to 5.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the text annotation method of any one of claims 1-5.
CN201911354042.4A 2019-12-24 2019-12-24 Text labeling method and system Pending CN111125312A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911354042.4A CN111125312A (en) 2019-12-24 2019-12-24 Text labeling method and system

Publications (1)

Publication Number Publication Date
CN111125312A true CN111125312A (en) 2020-05-08

Family

ID=70502867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911354042.4A Pending CN111125312A (en) 2019-12-24 2019-12-24 Text labeling method and system

Country Status (1)

Country Link
CN (1) CN111125312A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492310A (en) * 2022-02-16 2022-05-13 平安科技(深圳)有限公司 Text labeling method, text labeling device, electronic equipment and storage medium
CN117194615A (en) * 2023-11-02 2023-12-08 国网浙江省电力有限公司 Enterprise compliance data processing method and platform

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894102A (en) * 2010-07-16 2010-11-24 浙江工商大学 Method and device for analyzing emotion tendentiousness of subjective text
CN105117428A (en) * 2015-08-04 2015-12-02 电子科技大学 Web comment sentiment analysis method based on word alignment model
CN107102980A (en) * 2016-02-19 2017-08-29 北京国双科技有限公司 The extracting method and device of emotion information
CN110298033A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling trains extracting tool
CN110362822A (en) * 2019-06-18 2019-10-22 中国平安财产保险股份有限公司 Text marking method, apparatus, computer equipment and storage medium for model training

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492310A (en) * 2022-02-16 2022-05-13 平安科技(深圳)有限公司 Text labeling method, text labeling device, electronic equipment and storage medium
CN114492310B (en) * 2022-02-16 2023-06-23 平安科技(深圳)有限公司 Text labeling method, text labeling device, electronic equipment and storage medium
CN117194615A (en) * 2023-11-02 2023-12-08 国网浙江省电力有限公司 Enterprise compliance data processing method and platform
CN117194615B (en) * 2023-11-02 2024-02-20 国网浙江省电力有限公司 Enterprise compliance data processing method and platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 518000 1st floor, building 10, new material port, high tech middle first road, science and Technology Park community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen Shukuo Information Technology Co.,Ltd.

Address before: 518000 1st floor, building 10, new material port, high tech middle first road, science and Technology Park community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: SHENZHEN SKIEER INFORMATION TECHNOLOGY CO.,LTD.