CN109582925A - Human-machine collaborative corpus annotation method and system - Google Patents
- Publication number
- CN109582925A (application CN201811323385.XA)
- Authority
- CN
- China
- Prior art keywords
- corpus
- label
- data
- man
- mark
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
Abstract
The invention discloses a human-machine collaborative corpus annotation method and system. Corpus data to be annotated is obtained and reviewed manually; key corpus entries are located in the corpus data according to position information entered by the user; the located key entries are highlighted to obtain a marked corpus; the marked corpus is extracted from the corpus data by a filtering algorithm; and the marked corpus is annotated with corpus categories to obtain an annotated corpus. The resulting human-machine collaborative annotation helps annotators work more efficiently, reduces their workload, and offers a degree of interactivity that makes the task less monotonous.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a human-machine collaborative corpus annotation method and a system applying the method.
Background
Corpora are the basic resource of corpus linguistics and the primary resource of empirical approaches to language research. Traditionally, corpora have been used mainly in lexicography, language teaching, conventional linguistic research, and statistics-based or example-based research in natural language processing. With the development of internet big data and artificial intelligence, corpora have come into even wider use.
A corpus stores language material that actually occurred in real language use, such as user messages or customer-service dialogues collected directly from web pages. A corpus is a basic resource that carries linguistic knowledge, but it is not the same thing as linguistic knowledge. Raw corpus material becomes a useful resource only after processing, which may include removing dirty data, semantic annotation, part-of-speech tagging, and so on. Annotating a corpus generally requires each corpus entry to be annotated manually or by machine learning.
However, large-scale data collected in practice is often not entirely useful to the people who need it, and the processing and annotation of large-scale corpora cannot in practice be left to machines alone; a certain amount of human effort is usually needed to complete the annotation. This costs human or financial resources and can even lower a development team's efficiency.
Therefore, reducing this difficulty and freeing human resources from it would improve a project's efficiency and progress to a certain extent.
Summary of the invention
To solve the above problems, the present invention provides a human-machine collaborative corpus annotation method and system that can help annotators improve annotation efficiency and reduce their workload.
To achieve the above object, the present invention adopts the following technical solution:
A human-machine collaborative corpus annotation method, comprising the following steps:
a. obtaining corpus data to be annotated and reviewing it manually;
b. locating key corpus entries in the corpus data according to position information entered by the user;
c. highlighting the located key corpus entries to obtain a marked corpus;
d. extracting the marked corpus from the corpus data by a filtering algorithm;
e. annotating the marked corpus with corpus categories to obtain an annotated corpus.
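Steps a to e can be sketched as a small in-memory pipeline. This is only an illustration under assumptions: the `Entry` class, the function names, the boolean `marked` flag (standing in for a color highlight), and the sample sentences are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    text: str
    marked: bool = False            # step c: set when the entry is highlighted
    category: Optional[str] = None  # step e: filled in for marked entries only


def locate(corpus, positions):
    """Step b: return the entries at the 1-based positions the user supplied."""
    return [corpus[p - 1] for p in positions if 1 <= p <= len(corpus)]


def mark(entries):
    """Step c: flag the located entries (stands in for a color highlight)."""
    for entry in entries:
        entry.marked = True


def extract_marked(corpus):
    """Step d: filter the corpus down to the marked entries."""
    return [entry for entry in corpus if entry.marked]


def annotate(entries, categories):
    """Step e: attach one category to each marked entry, in order."""
    for entry, category in zip(entries, categories):
        entry.category = category
    return entries


# Step a: the corpus is loaded and reviewed by the annotator (hypothetical data).
corpus = [Entry("hello"), Entry("how much is shipping?"), Entry("thanks"),
          Entry("where is my order?")]
mark(locate(corpus, [2, 4]))
annotated = annotate(extract_marked(corpus), ["shipping", "order status"])
```

Entries that were never located stay unmarked and uncategorized, which mirrors the idea that the annotator only spends effort on the key entries.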
Preferably, in step a, the corpus data is table text; in step b, the key corpus entries are located by cell: the key corpus entry in a cell is obtained according to the row and column information entered by the user.
Alternatively, in step a, the corpus data is document text; in step b, the key corpus entries are located by line number: the key corpus entry on a line is obtained according to the line number information entered by the user.
Preferably, in step b, the position information is entered through a command window, and a prompt for the position information is displayed to the user in the command window.
Preferably, in step c, highlighting means giving the marked corpus a font color or background color different from that of the original corpus data.
Preferably, in step d, the filtering algorithm extracts the marked corpus from the corpus data according to a color condition.
Preferably, in step e, the marked corpus is annotated with corpus categories either manually or by using machine learning to train corpus classification on the marked corpus.
Correspondingly, the present invention also provides a human-machine collaborative corpus annotation system, comprising:
a data acquisition module for obtaining corpus data to be annotated and reviewing it manually;
a corpus positioning module for locating key corpus entries in the corpus data according to position information entered by the user;
a corpus marking module for highlighting the located key corpus entries to obtain a marked corpus;
a corpus screening module for extracting the marked corpus from the corpus data by a filtering algorithm;
a corpus annotation module for annotating the marked corpus with corpus categories to obtain an annotated corpus.
The beneficial effects of the present invention are:
(1) by combining manual review, corpus positioning, corpus marking, corpus extraction, and corpus annotation, the invention achieves human-machine collaborative corpus annotation, helping annotators improve annotation efficiency and reducing their workload;
(2) the corpus data is table text or document text, and cell positioning or line-number positioning allows key corpus entries to be located and extracted quickly;
(3) key corpus entries are highlighted by color marking, and the marked corpus is filtered and extracted according to a color condition, which is more intuitive and improves accuracy;
(4) a command window lets the user enter position information and displays a prompt for that information, providing a degree of interactivity and making the task less monotonous.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form a part of it. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a general flowchart of the human-machine collaborative corpus annotation method of the first embodiment of the invention;
Fig. 2 is a schematic diagram of the command window of the first embodiment (table text);
Fig. 3 is a schematic diagram of the corpus marking result of the first embodiment (table text);
Fig. 4 is a schematic diagram of the filtering result for the marked corpus of the first embodiment (table text);
Fig. 5 is a schematic diagram of the corpus category annotation result of the first embodiment (table text);
Fig. 6 is a schematic diagram of the command window of the second embodiment (document text).
Detailed description of the embodiments
To make the technical problems to be solved, the technical solutions, and the beneficial effects of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
First embodiment (table text)
As shown in Fig. 1, the human-machine collaborative corpus annotation method of the present invention comprises the following steps:
a. obtaining corpus data to be annotated and reviewing it manually;
b. locating key corpus entries in the corpus data according to position information entered by the user;
c. highlighting the located key corpus entries to obtain a marked corpus;
d. extracting the marked corpus from the corpus data by a filtering algorithm;
e. annotating the marked corpus with corpus categories to obtain an annotated corpus.
In this embodiment, the corpus data is table text. The key corpus entries are located by cell: the key corpus entry in a cell is obtained according to the row and column information entered by the user.
In step b, the position information is entered through a command window, and a prompt for the position information is displayed to the user in the command window. As shown in Fig. 2, in this embodiment the row and column information is given by first specifying a column and then specifying one or more rows within that column. The prompt first asks the user to enter the column and then asks the user to enter one or more rows, so the column does not have to be entered repeatedly, which saves operating time.
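The column-first, rows-second input can be illustrated with a small helper that turns one column entry and one rows entry into cell references. The function name `cells_for`, the comma-separated row format, and the A1-style references are assumptions made for this sketch; the patent does not specify the input syntax.

```python
import re


def cells_for(column, rows_text):
    """Combine one column letter with a comma-separated list of row numbers."""
    column = column.strip().upper()
    if not re.fullmatch(r"[A-Z]+", column):
        raise ValueError(f"invalid column: {column!r}")
    # int() tolerates surrounding whitespace and raises ValueError on non-numbers.
    rows = [int(part) for part in rows_text.split(",") if part.strip()]
    if not rows or any(row < 1 for row in rows):
        raise ValueError(f"invalid rows: {rows_text!r}")
    return [f"{column}{row}" for row in rows]


# The column is typed once, then any number of rows in that column.
targets = cells_for("b", "2, 5,7")
```

Entering the column once and reusing it for every row is what saves the repeated keystrokes described above.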
In step c, highlighting means giving the marked corpus a font color or background color different from that of the original corpus data. As shown in Fig. 3, in this embodiment the marked corpus is given a red font color.
In step d, the filtering algorithm extracts the marked corpus from the corpus data according to a color condition. As shown in Fig. 4, in this embodiment the marked corpus is filtered with Excel's built-in filter function, so that the other corpus entries are hidden and the interface is less cluttered.
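Steps c and d amount to setting a color on the located cells and then selecting cells by that color. The sketch below models this with plain Python objects instead of Excel's built-in filter, which is what the embodiment actually uses; the `Cell` class, the hex color constants, and the sample texts are hypothetical.

```python
from dataclasses import dataclass

DEFAULT_COLOR = "000000"  # assumed original font color (black)
MARK_COLOR = "FF0000"     # red, as in the embodiment


@dataclass
class Cell:
    ref: str
    text: str
    font_color: str = DEFAULT_COLOR


def highlight(cells, refs, color=MARK_COLOR):
    """Step c: give the located cells a font color that differs from the original."""
    wanted = set(refs)
    for cell in cells:
        if cell.ref in wanted:
            cell.font_color = color


def filter_by_color(cells, color=MARK_COLOR):
    """Step d: keep only the cells whose font color matches the color condition."""
    return [cell for cell in cells if cell.font_color == color]


column_b = [Cell("B2", "hi"), Cell("B3", "refund please"), Cell("B4", "ok"),
            Cell("B5", "cancel my order")]
highlight(column_b, ["B3", "B5"])
marked = filter_by_color(column_b)
```

Because the color is the only selection criterion, the original cell order is preserved and unmarked cells are left untouched.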
In step e, the marked corpus is annotated with corpus categories either manually or by using machine learning to train corpus classification on the marked corpus. As shown in Fig. 5, a separate column is used to record the corpus category of each marked entry; the other corpus entries need no processing.
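Recording the category in a separate column can be sketched as follows. The row layout (dicts with `text`, `marked`, and `category` keys) and the function name are assumptions for illustration; the category values would come from a human annotator or a classifier, which is not modeled here.

```python
def record_categories(rows, categories):
    """Write each category into the 'category' column of its marked row.

    `rows` is a list of dicts with 'text' and 'marked' keys; `categories`
    maps a row index to the label chosen by the annotator or a classifier.
    """
    for index, category in categories.items():
        if not rows[index]["marked"]:
            raise ValueError(f"row {index} is not a marked entry")
        rows[index]["category"] = category
    return rows


rows = [
    {"text": "hi", "marked": False},
    {"text": "refund please", "marked": True},
    {"text": "cancel my order", "marked": True},
]
record_categories(rows, {1: "refund", 2: "cancellation"})
```

Unmarked rows never receive a `category` key, matching the statement that the other entries need no processing.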
Second embodiment (document text)
This embodiment differs from the first embodiment mainly in that, in step a, the corpus data is document text, and in step b, the key corpus entries are located by line number: the key corpus entry on a line is obtained according to the line number information entered by the user.
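Line-number positioning for document text can be sketched as a lookup into the document's lines. The function name, the 1-based numbering, and the sample document are assumptions; the patent only states that the user supplies line number information.

```python
def locate_lines(document, line_numbers):
    """Return (line_number, text) pairs for the requested 1-based line numbers."""
    lines = document.splitlines()
    located = []
    for number in line_numbers:
        if not 1 <= number <= len(lines):
            raise IndexError(f"line {number} is outside 1..{len(lines)}")
        located.append((number, lines[number - 1]))
    return located


document = "hello\nwhere is my parcel?\nthanks\nI want a refund"
key_corpus = locate_lines(document, [2, 4])
```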
In addition, this embodiment provides an improved command window. As shown in Fig. 6, the command window not only displays the prompt for the position information but also displays a feedback message confirming the position information the user entered, such as "OK", "correct", or "input error", so that the user receives timely feedback and interactivity is improved.
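A command window with confirmation feedback might look like the loop below. This is a sketch under assumptions: the prompt wording, the blank-line convention for finishing, and the function names are hypothetical; only the "OK" and "input error" messages are taken from the text. The `read` and `write` parameters default to the console but can be swapped for testing.

```python
def check_input(raw, total_lines):
    """Validate a typed line number and return (feedback, line_number_or_None)."""
    try:
        number = int(raw)
    except ValueError:
        return "input error", None
    if not 1 <= number <= total_lines:
        return "input error", None
    return "OK", number


def prompt_loop(total_lines, read=input, write=print):
    """Prompt until the user enters a blank line; echo feedback each time."""
    chosen = []
    while True:
        raw = read(f"Line number (1-{total_lines}, blank to finish): ")
        if not raw.strip():
            return chosen
        feedback, number = check_input(raw, total_lines)
        write(feedback)
        if number is not None:
            chosen.append(number)
```

Calling `prompt_loop(5)` in a terminal would prompt repeatedly and print one feedback message per entry.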
The rest of the annotation process in this embodiment is essentially the same as in the first embodiment and is not repeated here.
3rd embodiment (labeling system)
In addition, the present invention also provides a system corresponding to the human-machine collaborative corpus annotation method, comprising:
a data acquisition module for obtaining corpus data to be annotated and reviewing it manually;
a corpus positioning module for locating key corpus entries in the corpus data according to position information entered by the user;
a corpus marking module for highlighting the located key corpus entries to obtain a marked corpus;
a corpus screening module for extracting the marked corpus from the corpus data by a filtering algorithm;
a corpus annotation module for annotating the marked corpus with corpus categories to obtain an annotated corpus.
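One way to picture the five modules is as pluggable callables wired together by a small container. This is a hypothetical arrangement, not the patent's architecture: the class name, field names, type signatures, and the lambda stand-ins are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class CorpusLabelingSystem:
    acquire: Callable[[], List[str]]                    # data acquisition module
    locate: Callable[[List[str]], List[int]]            # corpus positioning module
    mark: Callable[[List[int]], Set[int]]               # corpus marking module
    screen: Callable[[List[str], Set[int]], List[str]]  # corpus screening module
    annotate: Callable[[List[str]], Dict[str, str]]     # corpus annotation module

    def run(self) -> Dict[str, str]:
        corpus = self.acquire()
        marked = self.mark(self.locate(corpus))
        return self.annotate(self.screen(corpus, marked))


system = CorpusLabelingSystem(
    acquire=lambda: ["hi", "refund please", "bye"],
    locate=lambda corpus: [1],          # stands in for user-entered positions (0-based)
    mark=lambda indices: set(indices),  # stands in for the color highlight
    screen=lambda corpus, marked: [t for i, t in enumerate(corpus) if i in marked],
    annotate=lambda texts: {t: "refund" for t in texts},  # stands in for a human or model
)
result = system.run()
```

Keeping each module behind a plain callable makes it easy to replace, for example, the manual annotation step with a trained classifier.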
It should be noted that the embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and for parts that are the same or similar the embodiments may be referred to one another. Because the system embodiment is basically similar to the method embodiments, it is described relatively briefly; for related details, see the description of the method embodiments.
Also, in this document, the terms "include", "comprise", and any variants thereof are intended to be non-exclusive, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Absent further limitation, an element introduced by "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it. In addition, those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc.
The above description shows and describes preferred embodiments of the present invention. It should be understood that the invention is not limited to the forms disclosed here; these should not be regarded as excluding other embodiments, and the invention can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described here through the above teachings or the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.
Claims (8)
1. A human-machine collaborative corpus annotation method, characterized by comprising the following steps:
a. obtaining corpus data to be annotated and reviewing it manually;
b. locating key corpus entries in the corpus data according to position information entered by the user;
c. highlighting the located key corpus entries to obtain a marked corpus;
d. extracting the marked corpus from the corpus data by a filtering algorithm;
e. annotating the marked corpus with corpus categories to obtain an annotated corpus.
2. The human-machine collaborative corpus annotation method according to claim 1, characterized in that: in step a, the corpus data is table text; in step b, the key corpus entries are located by cell, and the key corpus entry in a cell is obtained according to the row and column information entered by the user.
3. The human-machine collaborative corpus annotation method according to claim 1, characterized in that: in step a, the corpus data is document text; in step b, the key corpus entries are located by line number, and the key corpus entry on a line is obtained according to the line number information entered by the user.
4. The human-machine collaborative corpus annotation method according to any one of claims 1 to 3, characterized in that: in step b, the position information is entered through a command window, and a prompt for the position information is displayed to the user in the command window.
5. The human-machine collaborative corpus annotation method according to any one of claims 1 to 3, characterized in that: in step c, highlighting means giving the marked corpus a font color or background color different from that of the original corpus data.
6. The human-machine collaborative corpus annotation method according to claim 5, characterized in that: in step d, the filtering algorithm extracts the marked corpus from the corpus data according to a color condition.
7. The human-machine collaborative corpus annotation method according to any one of claims 1 to 3, characterized in that: in step e, the marked corpus is annotated with corpus categories either manually or by using machine learning to train corpus classification on the marked corpus.
8. A human-machine collaborative corpus annotation system, characterized by comprising:
a data acquisition module for obtaining corpus data to be annotated and reviewing it manually;
a corpus positioning module for locating key corpus entries in the corpus data according to position information entered by the user;
a corpus marking module for highlighting the located key corpus entries to obtain a marked corpus;
a corpus screening module for extracting the marked corpus from the corpus data by a filtering algorithm;
a corpus annotation module for annotating the marked corpus with corpus categories to obtain an annotated corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811323385.XA CN109582925B (en) | 2018-11-08 | 2018-11-08 | Man-machine combined corpus labeling method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811323385.XA CN109582925B (en) | 2018-11-08 | 2018-11-08 | Man-machine combined corpus labeling method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109582925A true CN109582925A (en) | 2019-04-05 |
CN109582925B CN109582925B (en) | 2023-02-14 |
Family
ID=65921772
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811323385.XA Active CN109582925B (en) | 2018-11-08 | 2018-11-08 | Man-machine combined corpus labeling method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109582925B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110347921A (en) * | 2019-07-04 | 2019-10-18 | 有光创新(北京)信息技术有限公司 | A kind of the label abstracting method and device of multi-modal data information |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831131A (en) * | 2011-06-16 | 2012-12-19 | 富士通株式会社 | Method and device for establishing labeling webpage linguistic corpus |
CN102982036A (en) * | 2011-09-05 | 2013-03-20 | 北大方正集团有限公司 | Method of corpus structuralization and device |
CN106355628A (en) * | 2015-07-16 | 2017-01-25 | 中国石油化工股份有限公司 | Image-text knowledge point marking method and device and image-text mark correcting method and system |
CN106782509A (en) * | 2016-12-02 | 2017-05-31 | 乐视控股(北京)有限公司 | A kind of corpus labeling method and device and terminal |
US20180004730A1 (en) * | 2016-06-29 | 2018-01-04 | Shenzhen Gowild Robotics Co., Ltd. | Corpus generation device and method, human-machine interaction system |
CN107729921A (en) * | 2017-09-20 | 2018-02-23 | 厦门快商通科技股份有限公司 | A kind of machine Active Learning Method and learning system |
CN108255816A (en) * | 2018-03-12 | 2018-07-06 | 北京神州泰岳软件股份有限公司 | A kind of name entity recognition method, apparatus and system |
Non-Patent Citations (1)
Title |
---|
董爱华: "语料库中语料的标注", 《北京印刷学院学报》 * |
Also Published As
Publication number | Publication date |
---|---|
CN109582925B (en) | 2023-02-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||