CN109582925A - Human-machine cooperative corpus annotation method and system - Google Patents

Info

Publication number
CN109582925A
CN109582925A CN201811323385.XA CN201811323385A CN109582925A CN 109582925 A CN109582925 A CN 109582925A CN 201811323385 A CN201811323385 A CN 201811323385A CN 109582925 A CN109582925 A CN 109582925A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811323385.XA
Other languages
Chinese (zh)
Other versions
CN109582925B (en)
Inventor
张泽明
肖龙源
蔡振华
李稀敏
刘晓葳
谭玉坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Corp ltd
Original Assignee
Xiamen Kuaishangtong Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Corp ltd
Priority to CN201811323385.XA
Publication of CN109582925A
Application granted
Publication of CN109582925B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/103: Formatting, i.e. changing of presentation of documents
    • G06F40/117: Tagging; Marking up; Designating a block; Setting of attributes

Abstract

The invention discloses a human-machine cooperative corpus annotation method and system. Corpus data to be annotated is obtained and inspected manually; key corpus is located in the corpus data according to position information entered by the user; the located key corpus is given a highlight mark, yielding marked corpus; the marked corpus is extracted from the corpus data by a screening algorithm; and the marked corpus is annotated with corpus categories, yielding annotated corpus. This human-machine cooperative annotation helps annotators work more efficiently, reduces their workload, and offers a degree of interactivity that makes the work less monotonous.

Description

Human-machine cooperative corpus annotation method and system
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a human-machine cooperative corpus annotation method and a system applying the method.
Background
A corpus is the basic resource of corpus linguistics research and the main resource of empirical approaches to language research. Traditionally, corpora have mainly been used in lexicography, language teaching, and conventional language research, and in statistics-based or example-based research in natural language processing. With the development of internet big data and artificial intelligence technology, corpora have also come into wide use.
A corpus stores language material that actually occurred in real language use, such as user messages and customer service conversations collected directly from web pages. A corpus is the basic resource that carries linguistic knowledge, but it is not the same as linguistic knowledge: raw corpus data must be processed before it becomes a useful resource. Such processing may include removing dirty data, semantic annotation, part-of-speech tagging, and so on, and annotating a corpus generally requires each item of corpus data to be annotated manually or by machine learning.
In practice, however, large-scale data as collected is often not entirely useful to the people who need it, and the processing and annotation of a large corpus cannot be completed by machine alone; much of it requires human effort. This consumes human or financial resources and can even reduce the efficiency of a development team.
Therefore, if this difficulty can be reduced and people freed from it, the efficiency and progress of a project will improve to some extent.
Summary of the invention
To solve the above problems, the present invention provides a human-machine cooperative corpus annotation method and system that help annotators improve annotation efficiency and reduce their workload.
To achieve the above object, the present invention adopts the following technical solution:
A human-machine cooperative corpus annotation method, comprising the following steps:
a. obtaining corpus data to be annotated and inspecting it manually;
b. locating key corpus in the corpus data according to position information entered by the user;
c. applying a highlight mark to the located key corpus to obtain marked corpus;
d. extracting the marked corpus from the corpus data by a screening algorithm;
e. annotating the marked corpus with corpus categories to obtain annotated corpus.
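The five steps can be sketched as a small data flow. The following is only an illustrative Python outline and not the patented implementation: the `Entry` class, its field names, the function names, and the use of 0-based list indices as "position information" are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """One corpus item; `highlighted` and `category` are hypothetical field names."""
    text: str
    highlighted: bool = False       # set in step c
    category: Optional[str] = None  # set in step e


def locate(corpus, positions):
    """Step b: map user-supplied positions (0-based indices here) to entries."""
    return [corpus[i] for i in positions if 0 <= i < len(corpus)]


def highlight(entries):
    """Step c: mark the located key corpus."""
    for entry in entries:
        entry.highlighted = True


def screen(corpus):
    """Step d: keep only the marked corpus."""
    return [entry for entry in corpus if entry.highlighted]


def annotate(entries, categories):
    """Step e: attach a category to each marked entry, in order."""
    for entry, category in zip(entries, categories):
        entry.category = category
    return entries


# Step a: the corpus to be annotated, as inspected by the annotator.
corpus = [Entry("hello"), Entry("how much is it"), Entry("bye"), Entry("where is the shop")]
highlight(locate(corpus, [1, 3]))
annotated = annotate(screen(corpus), ["price", "address"])
```

In this sketch the unmarked entries stay in `corpus` untouched, which mirrors the idea that only the key corpus picked out by the annotator needs a category.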
Preferably, in step a, the corpus data is table text; in step b, the key corpus is located by cell, and the key corpus corresponding to a cell is obtained from the row and column information entered by the user.
Alternatively, in step a, the corpus data is document text; in step b, the key corpus is located by line number, and the key corpus corresponding to a line number is obtained from the line-number information entered by the user.
Preferably, in step b, the position information is entered through a command window, and a prompt for the position information is shown to the user in the command window.
Preferably, in step c, the highlight marking means giving the marked corpus a font color or background color different from that of the original corpus data.
Preferably, in step d, the screening algorithm means extracting the marked corpus from the corpus data according to a color condition.
Preferably, in step e, the marked corpus is annotated with corpus categories either manually or by using machine learning to train on the marked corpus for corpus categories.
Correspondingly, the present invention also provides a human-machine cooperative corpus annotation system, comprising:
a data acquisition module, for obtaining corpus data to be annotated and presenting it for manual inspection;
a corpus locating module, for locating key corpus in the corpus data according to position information entered by the user;
a corpus marking module, for applying a highlight mark to the located key corpus to obtain marked corpus;
a corpus screening module, for extracting the marked corpus from the corpus data by a screening algorithm;
a corpus annotation module, for annotating the marked corpus with corpus categories to obtain annotated corpus.
The beneficial effects of the present invention are:
(1) Through manual inspection, corpus locating, corpus marking, corpus extraction, and corpus annotation, the invention realizes human-machine cooperative corpus annotation, which helps annotators improve annotation efficiency and reduces their workload;
(2) The corpus data is table text or document text and is located by cell or by line number, so key corpus can be located and extracted quickly;
(3) Key corpus is highlighted by color marking, and the marked corpus is screened and extracted according to a color condition, which is more intuitive and improves accuracy;
(4) The user enters position information through a command window that shows a prompt for the position information, which provides a degree of interactivity and makes the work less monotonous.
Brief description of the drawings
The drawings described here are provided for further understanding of the present invention and form a part of it. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a general flowchart of the human-machine cooperative corpus annotation method of the first embodiment of the invention;
Fig. 2 is a schematic diagram of the command window of the first embodiment (table text);
Fig. 3 is a schematic diagram of the corpus marking result of the first embodiment (table text);
Fig. 4 is a schematic diagram of the screening result for the marked corpus of the first embodiment (table text);
Fig. 5 is a schematic diagram of the corpus category annotation result of the first embodiment (table text);
Fig. 6 is a schematic diagram of the command window of the second embodiment (document text).
Detailed description of the embodiments
To make the technical problems to be solved, the technical solutions, and the advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
First embodiment (table text)
As shown in Fig. 1, the human-machine cooperative corpus annotation method of the present invention comprises the following steps:
a. obtaining corpus data to be annotated and inspecting it manually;
b. locating key corpus in the corpus data according to position information entered by the user;
c. applying a highlight mark to the located key corpus to obtain marked corpus;
d. extracting the marked corpus from the corpus data by a screening algorithm;
e. annotating the marked corpus with corpus categories to obtain annotated corpus.
In this embodiment, the corpus data is table text; the key corpus is located by cell, and the key corpus corresponding to a cell is obtained from the row and column information entered by the user.
In step b, the position information is entered through a command window, and a prompt for the position information is shown to the user in the command window. As shown in Fig. 2, in this embodiment the row and column information is given by first specifying a column and then specifying one or more rows within that column. The prompt first asks the user to enter the column and then asks for one or more rows, so the column does not have to be entered repeatedly, which saves operating time.
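A column-then-rows prompt of this kind might look like the sketch below. It is an assumption-laden illustration: the prompt wording, the function names `parse_rows`, `cell_refs`, and `prompt_for_cells`, and the spreadsheet-style references such as `B2` are not taken from the patent.

```python
import re


def parse_rows(raw):
    """Parse a row entry such as '2, 5 7' into a list of positive row numbers."""
    tokens = [t for t in re.split(r"[,\s]+", raw.strip()) if t]
    if not tokens or not all(re.fullmatch(r"[1-9][0-9]*", t) for t in tokens):
        raise ValueError(f"invalid row input: {raw!r}")
    return [int(t) for t in tokens]


def cell_refs(column, raw_rows):
    """Combine one column letter with several rows into cell references."""
    column = column.strip().upper()
    if not re.fullmatch(r"[A-Z]+", column):
        raise ValueError(f"invalid column input: {column!r}")
    return [f"{column}{row}" for row in parse_rows(raw_rows)]


def prompt_for_cells(ask=input):
    """Ask for the column once, then for one or more rows in that column."""
    column = ask("Column of the key corpus (e.g. B): ")
    rows = ask("Row(s) in that column, separated by commas or spaces: ")
    return cell_refs(column, rows)
```

Passing the `ask` callable as a parameter keeps the parsing testable without a terminal; for example, `cell_refs("b", "2, 5 7")` yields the references for rows 2, 5, and 7 of column B.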
In step c, the highlight marking means giving the marked corpus a font color or background color different from that of the original corpus data. As shown in Fig. 3, in this embodiment the marked corpus is marked in red by setting its font color to red.
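As a rough illustration of color marking, the sketch below models a sheet as a plain dictionary of cells rather than a real spreadsheet file. The `Cell` class, the `mark_cells` function, and the hexadecimal color strings are assumptions introduced for the example.

```python
from dataclasses import dataclass

DEFAULT_COLOR = "000000"  # assumed color of unmarked text (black)
MARK_COLOR = "FF0000"     # red, as in this embodiment


@dataclass
class Cell:
    ref: str
    text: str
    font_color: str = DEFAULT_COLOR


def mark_cells(sheet, refs, color=MARK_COLOR):
    """Give the located cells a font color that differs from the default."""
    if color == DEFAULT_COLOR:
        raise ValueError("mark color must differ from the default color")
    for ref in refs:
        sheet[ref].font_color = color
    return sheet


cells = [Cell("B2", "hello"), Cell("B3", "how much is it"), Cell("B4", "bye")]
sheet = {cell.ref: cell for cell in cells}
mark_cells(sheet, ["B3"])
```

The guard against reusing the default color reflects the requirement that the mark be distinguishable from the original corpus data.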
In step d, the screening algorithm means extracting the marked corpus from the corpus data according to a color condition. As shown in Fig. 4, in this embodiment the marked corpus is screened with the built-in filter function of Excel, so that the other corpus is hidden and the interface is cleaner.
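The embodiment relies on the spreadsheet's own filter. Outside a spreadsheet, the same color condition could be applied programmatically, as in this minimal sketch; the row dictionaries and the `font_color` key are hypothetical stand-ins for real cell formatting.

```python
def filter_by_color(rows, color, key="font_color"):
    """Keep only the rows whose color attribute matches the condition."""
    return [row for row in rows if row.get(key) == color]


rows = [
    {"ref": "B2", "text": "hello", "font_color": "000000"},
    {"ref": "B3", "text": "how much is it", "font_color": "FF0000"},
    {"ref": "B4", "text": "bye", "font_color": "000000"},
    {"ref": "B5", "text": "where is the shop", "font_color": "FF0000"},
]
marked = filter_by_color(rows, "FF0000")
```

The original `rows` list is left unchanged, so the unmarked corpus is only hidden from the result, not deleted.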
In step e, the marked corpus is annotated with corpus categories either manually or by using machine learning to train on the marked corpus for corpus categories. As shown in Fig. 5, a separate column is used to record the corpus category of the marked corpus, and the other corpus may be left unprocessed.
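Recording the category in a separate column can be sketched with the standard `csv` module. The column names `text` and `category` and the `write_annotations` helper are assumptions for illustration; rows that were never marked simply keep an empty category.

```python
import csv
import io


def write_annotations(texts, categories, out):
    """Write each text plus a category column; unmarked rows stay blank."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["text", "category"])
    for index, text in enumerate(texts):
        writer.writerow([text, categories.get(index, "")])


texts = ["hello", "how much is it", "bye"]
buffer = io.StringIO()
write_annotations(texts, {1: "price"}, buffer)
```

A file produced this way pairs each annotated text with its category, which is the kind of input a later machine-learning step could train on.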
Second embodiment (document text)
This embodiment differs from the first embodiment mainly as follows: in step a, the corpus data is document text; in step b, the key corpus is located by line number, and the key corpus corresponding to a line number is obtained from the line-number information entered by the user.
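Line-number locating in a plain-text document can be sketched as follows. The function name `lines_by_number` and the choice of 1-based line numbers are assumptions; the patent does not state how lines are counted.

```python
def lines_by_number(text, line_numbers):
    """Return {line_number: line} for the 1-based line numbers the user entered."""
    lines = text.splitlines()
    selected = {}
    for number in line_numbers:
        if not 1 <= number <= len(lines):
            raise IndexError(f"line {number} is outside 1..{len(lines)}")
        selected[number] = lines[number - 1]
    return selected


document = "hello\nhow much is it\nbye\nwhere is the shop"
key_corpus = lines_by_number(document, [2, 4])
```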
This embodiment also provides an improved command window. As shown in Fig. 6, the command window not only shows the user a prompt for the position information but also shows feedback confirming the position information the user has entered, such as "OK", "correct", or "input error", so that the user receives feedback promptly and the interaction is better.
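The confirmation feedback could be produced by a small validation routine like the one below. This is a sketch under stated assumptions: the function name `check_line_input`, the accepted separators, and the rule that any out-of-range number rejects the whole entry are illustrative choices, while the feedback strings "OK" and "input error" follow the examples given in the text.

```python
def check_line_input(raw, line_count):
    """Return (line_numbers, feedback) for one line of user input."""
    tokens = raw.replace(",", " ").split()
    if not tokens or not all(t.isascii() and t.isdigit() for t in tokens):
        return [], "input error"
    numbers = [int(t) for t in tokens]
    if any(n < 1 or n > line_count for n in numbers):
        return [], "input error"
    return numbers, "OK"
```

Returning the feedback string together with the parsed numbers lets the command window echo the message immediately and re-prompt when the list comes back empty.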
The rest of the annotation process in this embodiment is essentially the same as in the first embodiment and is not repeated here.
Third embodiment (annotation system)
In addition, the present invention provides a system corresponding to the human-machine cooperative corpus annotation method, comprising:
a data acquisition module, for obtaining corpus data to be annotated and presenting it for manual inspection;
a corpus locating module, for locating key corpus in the corpus data according to position information entered by the user;
a corpus marking module, for applying a highlight mark to the located key corpus to obtain marked corpus;
a corpus screening module, for extracting the marked corpus from the corpus data by a screening algorithm;
a corpus annotation module, for annotating the marked corpus with corpus categories to obtain annotated corpus.
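One way to picture the five modules is as interchangeable callables wired together by a small container. The sketch below is hypothetical: the `AnnotationSystem` class, its field names, and the lambda stand-ins are assumptions, and a real system would replace them with user input, spreadsheet access, and a human annotator or trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class AnnotationSystem:
    acquire: Callable[[], List[str]]                 # data acquisition module
    locate: Callable[[List[str]], List[int]]         # corpus locating module
    mark: Callable[[List[int]], Set[int]]            # corpus marking module
    screen: Callable[[List[str], Set[int]], List[str]]  # corpus screening module
    annotate: Callable[[List[str]], Dict[str, str]]  # corpus annotation module

    def run(self):
        """Run the modules in the order given by the method steps."""
        corpus = self.acquire()
        positions = self.locate(corpus)
        marked = self.mark(positions)
        screened = self.screen(corpus, marked)
        return self.annotate(screened)


system = AnnotationSystem(
    acquire=lambda: ["hello", "how much is it", "bye"],
    locate=lambda corpus: [1],  # stands in for position information typed by the user
    mark=lambda positions: set(positions),
    screen=lambda corpus, marked: [t for i, t in enumerate(corpus) if i in marked],
    annotate=lambda texts: {t: "price" for t in texts},  # stands in for a human or a model
)
result = system.run()
```

Keeping each module behind a plain callable makes it easy to swap the table-text and document-text variants without touching the overall flow.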
It should be noted that the embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Because the system embodiment is essentially similar to the method embodiments, it is described briefly; for related details, see the description of the method embodiments.
Also, in this document, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element. In addition, those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above description shows and describes preferred embodiments of the present invention. It should be understood that the invention is not limited to the forms disclosed here, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described here through the above teachings or the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention shall fall within the protection scope of the appended claims.

Claims (8)

1. A human-machine cooperative corpus annotation method, characterized by comprising the following steps:
a. obtaining corpus data to be annotated and inspecting it manually;
b. locating key corpus in the corpus data according to position information entered by a user;
c. applying a highlight mark to the located key corpus to obtain marked corpus;
d. extracting the marked corpus from the corpus data by a screening algorithm;
e. annotating the marked corpus with corpus categories to obtain annotated corpus.
2. The human-machine cooperative corpus annotation method according to claim 1, characterized in that: in step a, the corpus data is table text; in step b, the key corpus is located by cell, and the key corpus corresponding to the cell is obtained from the row and column information entered by the user.
3. The human-machine cooperative corpus annotation method according to claim 1, characterized in that: in step a, the corpus data is document text; in step b, the key corpus is located by line number, and the key corpus corresponding to the line number is obtained from the line-number information entered by the user.
4. The human-machine cooperative corpus annotation method according to any one of claims 1 to 3, characterized in that: in step b, the position information is entered through a command window, and a prompt for the position information is shown to the user in the command window.
5. The human-machine cooperative corpus annotation method according to any one of claims 1 to 3, characterized in that: in step c, the highlight marking means giving the marked corpus a font color or background color different from that of the original corpus data.
6. The human-machine cooperative corpus annotation method according to claim 5, characterized in that: in step d, the screening algorithm means extracting the marked corpus from the corpus data according to a color condition.
7. The human-machine cooperative corpus annotation method according to any one of claims 1 to 3, characterized in that: in step e, the marked corpus is annotated with corpus categories either manually or by using machine learning to train on the marked corpus for corpus categories.
8. A human-machine cooperative corpus annotation system, characterized by comprising:
a data acquisition module, for obtaining corpus data to be annotated and presenting it for manual inspection;
a corpus locating module, for locating key corpus in the corpus data according to position information entered by a user;
a corpus marking module, for applying a highlight mark to the located key corpus to obtain marked corpus;
a corpus screening module, for extracting the marked corpus from the corpus data by a screening algorithm;
a corpus annotation module, for annotating the marked corpus with corpus categories to obtain annotated corpus.
CN201811323385.XA 2018-11-08 2018-11-08 Man-machine combined corpus labeling method and system Active CN109582925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811323385.XA CN109582925B (en) 2018-11-08 2018-11-08 Man-machine combined corpus labeling method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811323385.XA CN109582925B (en) 2018-11-08 2018-11-08 Man-machine combined corpus labeling method and system

Publications (2)

Publication Number Publication Date
CN109582925A (en) 2019-04-05
CN109582925B (en) 2023-02-14

Family ID: 65921772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811323385.XA Active CN109582925B (en) 2018-11-08 2018-11-08 Man-machine combined corpus labeling method and system

Country Status (1)

Country Link
CN (1) CN109582925B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347921A (en) * 2019-07-04 2019-10-18 有光创新(北京)信息技术有限公司 A kind of the label abstracting method and device of multi-modal data information

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831131A (en) * 2011-06-16 2012-12-19 富士通株式会社 Method and device for establishing labeling webpage linguistic corpus
CN102982036A (en) * 2011-09-05 2013-03-20 北大方正集团有限公司 Method of corpus structuralization and device
CN106355628A (en) * 2015-07-16 2017-01-25 中国石油化工股份有限公司 Image-text knowledge point marking method and device and image-text mark correcting method and system
US20180004730A1 (en) * 2016-06-29 2018-01-04 Shenzhen Gowild Robotics Co., Ltd. Corpus generation device and method, human-machine interaction system
CN106782509A (en) * 2016-12-02 2017-05-31 乐视控股(北京)有限公司 A kind of corpus labeling method and device and terminal
CN107729921A (en) * 2017-09-20 2018-02-23 厦门快商通科技股份有限公司 A kind of machine Active Learning Method and learning system
CN108255816A (en) * 2018-03-12 2018-07-06 北京神州泰岳软件股份有限公司 A kind of name entity recognition method, apparatus and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
董爱华: "语料库中语料的标注", 《北京印刷学院学报》 *

Also Published As

Publication number Publication date
CN109582925B (en) 2023-02-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant