CN111339325A - Data marking system and data marking method - Google Patents

Info

Publication number
CN111339325A
Authority
CN
China
Prior art keywords
data
information
tagging
marking
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811596379.1A
Other languages
Chinese (zh)
Inventor
张如莹
林柏霖
潘桓毅
谢佳恩
黄玟瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial Technology Research Institute ITRI
Original Assignee
Industrial Technology Research Institute ITRI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial Technology Research Institute ITRI
Publication of CN111339325A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/435 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data marking system and a data marking method are provided. The data marking system comprises a marking database, an unmarked database, a marking data amplification module, and an operating platform. The operating platform is in signal connection with the marking database, the unmarked database, and the marking data amplification module, and comprises a marking mode editing interface through which data can be input and editing operations performed to generate at least one confirmed marking mode. The marking data amplification module performs an operation on the unmarked database according to the at least one confirmed marking mode to generate at least one new marking data, and stores the at least one new marking data in the marking database.

Description

Data marking system and data marking method
Technical Field
The present invention relates to a data processing system and a data processing method.
Background
Most artificial intelligence works through training and learning, usually with labeled data as the training samples. As market demand pushes artificial intelligence toward more complicated problems, a larger amount of labeled data is needed to maintain a given accuracy, and the labeled data required differs from one application field to another. Wider application and better performance of artificial intelligence therefore depend on a large amount of labeled data as a foundation.
Data marking is commonly performed manually, which consumes labor and time, so automatic marking technology has been developed as an aid to shorten development time or reduce cost. At present, automatic marking operates iteratively: after a marking system performs automatic marking prediction on a text, the prediction result for the whole text is manually inspected and corrected, and the corrected prediction result is fed back to the system to build a prediction module.
Disclosure of Invention
The invention provides a data marking system and a data marking method.
In an exemplary embodiment, the present invention provides a data tagging system, comprising a tag database, an unmarked database, a tag data amplification module, and an operating platform, wherein the operating platform is in signal connection with the tag database, the unmarked database, and the tag data amplification module, and comprises a tag pattern editing interface, and the tag pattern editing interface is capable of inputting data and performing editing operation to generate at least one confirmed tag pattern, wherein the tag data amplification module performs operation to generate at least one new tag data according to the at least one confirmed tag pattern and the unmarked database, and stores the at least one new tag data into the tag database.
In an exemplary embodiment, the present invention provides a data marking method, which is applicable to a data marking system and includes receiving data or editing operation, generating at least one confirmed marking mode according to the received data or editing operation, performing an operation with an unmarked database according to the at least one confirmed marking mode to generate at least one new marking data, and storing the at least one new marking data in a marking database.
Based on the above, in the data tagging system and the data tagging method provided in the embodiments of the present invention, the operating platform can receive editing operations to generate a confirmed tagging mode, and the tagging data amplification module performs an operation according to that confirmed tagging mode to generate new tagging data, which amplifies the tagging database while also improving the accuracy of the data in the tagging database.
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
FIG. 1 is a block diagram of a data marking system according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data marking method according to an embodiment of the invention;
FIG. 3 is a block diagram of a data marking system according to another embodiment of the present invention;
FIG. 4 is a flowchart illustrating a data marking method according to another embodiment of the present invention;
FIG. 5 is a block diagram of a data marking system according to another embodiment of the present invention;
FIG. 6 is a flowchart illustrating a data marking method according to another embodiment of the present invention;
FIG. 7 is a block diagram of a data marking system according to another embodiment of the present invention;
FIG. 8 is a flowchart illustrating a data marking method according to another embodiment of the present invention;
FIG. 9 is a block diagram of a data marking system according to another embodiment of the present invention;
FIG. 10 is a flowchart illustrating a data marking method according to another embodiment of the present invention;
FIG. 11 is a block diagram of a data marking system according to another embodiment of the present invention;
FIG. 12 is a flowchart illustrating a data marking method according to another embodiment of the invention.
Description of the symbols:
1 data marking system
2 tag database
3 unmarked database
4 tag data amplification module
41 amplification unit
42 tag pattern data set
43 tag pattern generation unit
5 operation platform
51 tag pattern editing interface
52 data tag prediction interface
53 manual tagging interface
6 automatic data tagging module
S1-S5
S11, S21-S24, S31, S51, S52, S61-S63, S71 and S72
Detailed Description
The term "signal connection" as used throughout this specification, including the claims, may refer to any direct or indirect connection means. For example, if a processor is described herein as being signally connected to memory, it is to be understood that the processor may be directly connected to the memory, or the processor may be indirectly connected to the memory via other devices or some connection means. Further, wherever possible, the same reference numbers will be used throughout the drawings and the description to refer to the same or like parts. Components/parts/steps in different embodiments using the same reference numerals or using the same terms may be referred to one another in relation to the description.
FIG. 1 shows a data marking system 1 according to an embodiment of the present invention. The data marking system 1 comprises a tag database 2, an unlabeled database 3, a tag data amplification module 4, and an operation platform 5 in signal connection with the tag database 2, the unlabeled database 3, and the tag data amplification module 4. The tag database 2 stores labeled data, and the unlabeled database 3 stores unlabeled data.
The operation platform 5 comprises a tag pattern editing interface 51, through which data can be input and editing operations performed to generate at least one confirmed tag pattern. In this embodiment, a user can input data and perform editing, adding, and deleting on the operation platform 5; the operation platform 5 can also receive input data and editing, adding, and deleting operations through an application programming interface.
The tag data amplification module 4 performs an operation on the unlabeled database 3 according to the at least one confirmed tag pattern to generate at least one new tag data, and stores the at least one new tag data in the tag database 2 to amplify the tag database 2.
Referring further to FIG. 2, a flow chart of a data marking method according to an embodiment of the invention is shown; the method is applicable to the data marking system 1 shown in FIG. 1. The steps of the data marking method of this embodiment are as follows. In step S1, the tag pattern editing interface 51 of the operation platform 5 receives data or an editing operation. In step S2, the tag pattern editing interface 51 generates at least one confirmed tag pattern according to the received data or editing operation. A confirmed tag pattern is pattern data that defines a rule for performing data marking.
Then, in step S3, the tag data amplification module 4 performs an operation on the unlabeled database 3 according to the at least one confirmed tag pattern to generate at least one new tag data, and in step S4 the at least one new tag data is stored in the tag database 2. Specifically, in steps S3 and S4 of the present embodiment, the tag data amplification module 4 runs a tagging algorithm on the data in the unlabeled database 3 according to the at least one confirmed tag pattern, thereby labeling that data and generating the at least one new tag data, and then stores the new tag data in the tag database 2 to amplify the tag database 2.
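The flow of steps S1 to S4 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class and method names are hypothetical, a pattern is reduced to a literal phrase plus a label, and plain substring matching stands in for whatever tagging algorithm an actual implementation would use.

```python
from dataclasses import dataclass, field


@dataclass
class MarkPattern:
    """Hypothetical confirmed pattern: a literal phrase and the label it assigns."""
    phrase: str
    label: str


@dataclass
class DataMarkingSystem:
    """Hypothetical stand-in for the system: two in-memory 'databases'."""
    unlabeled_db: list[str]
    tag_db: list[tuple[str, str]] = field(default_factory=list)

    def confirm_patterns(self, edits: list[tuple[str, str]]) -> list[MarkPattern]:
        # Steps S1-S2: turn the received input into confirmed patterns.
        return [MarkPattern(phrase, label) for phrase, label in edits]

    def amplify(self, patterns: list[MarkPattern]) -> int:
        # Steps S3-S4: label every unlabeled item that matches a pattern
        # and store the result; returns the number of new labeled items.
        added = 0
        for text in self.unlabeled_db:
            for pattern in patterns:
                if pattern.phrase in text:  # assumption: substring match
                    self.tag_db.append((text, pattern.label))
                    added += 1
                    break  # first matching pattern wins in this sketch
        return added


system = DataMarkingSystem(
    ["two year warranty", "free shipping", "one year warranty included"]
)
confirmed = system.confirm_patterns([("warranty", "WARRANTY")])
new_count = system.amplify(confirmed)
```

In this sketch a single confirmed pattern labels two of the three unlabeled items, which is the sense in which the pattern "amplifies" the tag database.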
The data tagging system 1 and the data tagging method shown in FIG. 1 and FIG. 2 can be used for processing corpus data, image data, or sound data. In the case of processing corpus data, the unlabeled database 3 is a corpus database, and the tag database 2 is a labeled corpus, i.e., it stores labeled corpus data. The confirmed tag pattern is a rule for marking the corpus data and may include at least one of morphological information, syntactic information, and semantic information. As for the specific data format of the tag pattern, taking a tag pattern related to the product retention of a 3C product as an example, it may be [. Number _ mean '. Year', 'Limited', 'quantity' ], where Number denotes a numeric semantic category.
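To make the idea of a pattern that mixes literal tokens with a semantic category concrete, the following Python sketch matches a token sequence against such a pattern. The representation is an assumption: the element name `NUMBER`, the rule that a numeric token is one made only of digits, and the example tokens are all hypothetical and are not taken from the pattern format quoted in the text.

```python
NUMBER = "Number"  # hypothetical marker for the numeric semantic category


def element_matches(element: str, token: str) -> bool:
    """A category element matches any numeric token; others match literally."""
    if element == NUMBER:
        return token.isdigit()  # assumption: digits-only tokens are numeric
    return element == token


def pattern_matches(pattern: list[str], tokens: list[str]) -> bool:
    """Return True if some contiguous window of tokens matches the pattern."""
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(element_matches(e, t) for e, t in zip(pattern, window)):
            return True
    return False


# Hypothetical pattern: a number followed by two literal tokens.
example_pattern = [NUMBER, "year", "warranty"]
hit = pattern_matches(example_pattern, "includes 2 year warranty".split())
miss = pattern_matches(example_pattern, "two year warranty".split())
```

The second call fails only because the digits-only rule is deliberately crude; a real semantic category would presumably also cover spelled-out numerals.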
In the case that the data marking system 1 and the data marking method shown in fig. 1 and 2 are used for processing image data, the unmarked database 3 is an image database, and the marked database 2 is a marked image database, that is, the marked image data is stored. The confirmation marking mode is a rule for marking the image data, and may include at least one of feature information, line information, light source information, contour information, color information, and material information.
In the case that the data tagging system 1 and the data tagging method shown in fig. 1 and 2 are used for processing sound data, the unmarked database 3 is a sound database, and the tagged database 2 is a tagged sound database, i.e. stores tagged sound data. The confirmation mark pattern is a rule for marking the sound data, and may include at least one of energy information, frequency information, rhythm information, and language information.
Referring to fig. 3, a data marking system 1 according to another embodiment of the invention is shown. The data labeling system 1 of the present embodiment is similar to that shown in fig. 1, and also includes a label database 2, an unlabeled database 3, a label data amplification module 4, and an operation platform 5 connected to the label database 2, the unlabeled database 3, and the label data amplification module 4. The data tagging system 1 of the present embodiment further comprises an automatic data tagging module 6 in signal connection with the tag database 2. The marked database 2 is used for storing marked data, and the unmarked database 3 is used for storing unmarked data.
The tagged data amplification module 4 of this embodiment can access the unlabeled database 3 and the tagged database 2, and includes a tagged pattern data set 42 for storing tagged patterns and an amplification unit 41 for executing a tagging algorithm.
The operation platform 5 includes a tag pattern editing interface 51 in signal connection with the amplification unit 41, and a data tag prediction interface 52 in signal connection with the automatic data tag module 6 and for inputting data. The operation platform 5 of this embodiment is used for the user to input data and perform editing, adding and deleting, and the operation platform 5 can also input data through the application programming interface or perform editing, adding and deleting.
The tagged mode editing interface 51 is used for inputting data and performing editing operations, and the data tagged prediction interface 52 is used for inputting data and displaying prediction results. The automatic data tagging module 6 of the present embodiment may be configured to perform data tagging predictions.
Referring to FIG. 3 and FIG. 4, FIG. 4 is a flowchart of a data marking method according to another embodiment of the present invention, applicable to the data marking system 1 shown in FIG. 3. The steps of the data marking method of this embodiment are as follows. In step S1, the tag pattern editing interface 51 of the operation platform 5 receives data or an editing operation, and in step S21, the tag pattern editing interface 51 receives at least one tag pattern. Then, in step S22, the tag pattern editing interface 51 sets the received at least one tag pattern as the at least one confirmed tag pattern. In the present embodiment, the user edits on the tag pattern editing interface 51 to input the tag pattern, and the tag pattern editing interface 51 takes the tag pattern input by the user as the confirmed tag pattern. It should be noted that, in other embodiments, the tag pattern editing interface 51 can also receive input data or editing, adding, and deleting operations via an external application programming interface.
After the at least one confirmed tag pattern is obtained, step S23 is executed, in which the tag data amplification module 4 stores the at least one confirmed tag pattern in the tag pattern data set 42. In this embodiment, the tag data amplification module 4 checks the at least one confirmed tag pattern against the tag patterns already in the tag pattern data set 42, and stores the at least one confirmed tag pattern in, and updates, the tag pattern data set 42 once the check passes. The check determines whether the at least one confirmed tag pattern duplicates or conflicts with the data in the tag pattern data set 42, so as to eliminate duplication or conflict between tag patterns. In other embodiments, the amplification unit 41 may perform the check.
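The duplicate and conflict check of step S23 might look like the following Python sketch. The text does not say how a conflict is defined or resolved, so this sketch makes assumptions: a pattern is a phrase mapped to a label, a duplicate is the same phrase with the same label, a conflict is the same phrase with a different label, and conflicting patterns are rejected and reported instead of stored. The function name is hypothetical.

```python
def merge_confirmed_patterns(
    pattern_set: dict[str, str],
    confirmed: list[tuple[str, str]],
) -> tuple[dict[str, str], list[tuple[str, str, str]]]:
    """Merge confirmed patterns into a copy of the pattern data set.

    Returns the updated set and a list of rejected conflicts as
    (phrase, proposed_label, existing_label).
    """
    updated = dict(pattern_set)
    rejected = []
    for phrase, label in confirmed:
        existing = updated.get(phrase)
        if existing is None:
            updated[phrase] = label  # new pattern: store it
        elif existing != label:
            rejected.append((phrase, label, existing))  # conflict: keep old
        # existing == label is a duplicate, so there is nothing to store
    return updated, rejected


stored, conflicts = merge_confirmed_patterns(
    {"warranty": "WARRANTY"},
    [("warranty", "WARRANTY"), ("warranty", "PRICE"), ("refund", "RETURN")],
)
```

Returning the conflicts instead of silently dropping them would let an operating platform show them to the user, although the text does not state that this happens.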
Then, step S31 is executed, in which the amplification unit 41 of the tag data amplification module 4 executes the tagging algorithm according to the tag pattern data set 42 and the unlabeled database 3 to generate at least one new tag data, and step S4 is executed to store the at least one new tag data in the tag database 2. Specifically, in step S31 of the present embodiment, the amplification unit 41 runs the tagging algorithm on the data in the unlabeled database 3 according to both the at least one confirmed tag pattern and the tag patterns previously stored in the tag pattern data set 42, thereby labeling the data in the unlabeled database 3 and generating the at least one new tag data. After step S31, the amplification unit 41 stores the generated new tag data in the tag database 2 to amplify the tag database 2. The tagging algorithm of the present embodiment may be a string-searching algorithm or a long-term-priority (longest-term-first) algorithm.
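One plausible reading of the "long term priority" marking algorithm is longest-match-first string matching: when several patterns match at the same position, the longest one wins. The Python sketch below illustrates that reading only; the function name, the phrase-to-label pattern format, and the character-offset output are assumptions.

```python
def label_longest_first(
    text: str, patterns: dict[str, str]
) -> list[tuple[int, int, str]]:
    """Scan left to right, preferring the longest pattern at each position.

    Returns labeled spans as (start, end, label) character offsets.
    """
    ordered = sorted(patterns, key=len, reverse=True)  # longest phrase first
    spans = []
    i = 0
    while i < len(text):
        for phrase in ordered:
            if phrase and text.startswith(phrase, i):
                spans.append((i, i + len(phrase), patterns[phrase]))
                i += len(phrase)  # skip past the labeled span
                break
        else:
            i += 1  # no pattern starts here; move one character on
    return spans


example_patterns = {"year": "UNIT", "year warranty": "WARRANTY"}
example_spans = label_longest_first("2 year warranty", example_patterns)
```

Here both "year" and "year warranty" match at offset 2, and the longer phrase is chosen, so the text receives a single WARRANTY span.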
If the tag pattern data set 42 contains no data, step S23 of the present embodiment may optionally be skipped, in which case, in step S31, the amplification unit 41 labels the data in the unlabeled database 3 according only to the at least one confirmed tag pattern generated in step S22.
The data marking method of this embodiment may further perform step S51, in which the data tag prediction interface 52 of the operation platform 5 receives unlabeled data, and step S52, in which the automatic data tagging module 6 performs data tagging prediction on the unlabeled data according to the tag database 2 and transmits the corresponding prediction result to the operation platform 5. The automatic data tagging module 6 of this embodiment may use algorithms such as Conditional Random Field, Maximum-Entropy Markov Model, and Recurrent Neural Network. The operation platform 5 then displays the prediction result corresponding to the unlabeled data.
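The shape of steps S51 and S52, learning from the tag database and then predicting a label for new unlabeled data, can be shown with a deliberately simple stand-in. The Python sketch below is a token-frequency voting baseline, not one of the sequence models named in the text, and all names in it are hypothetical.

```python
from collections import Counter, defaultdict


def train_baseline(tag_db: list[tuple[str, str]]) -> dict[str, str]:
    """Map each token to the label it most often appears with."""
    votes: dict[str, Counter] = defaultdict(Counter)
    for text, label in tag_db:
        for token in text.split():
            votes[token][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in votes.items()}


def predict(model: dict[str, str], text: str, default: str = "UNKNOWN") -> str:
    """Predict the label that most known tokens in the text vote for."""
    tally = Counter(model[t] for t in text.split() if t in model)
    return tally.most_common(1)[0][0] if tally else default


baseline = train_baseline(
    [("two year warranty", "WARRANTY"), ("free shipping", "SHIPPING")]
)
prediction = predict(baseline, "year warranty")
```

Because the model is trained on the tag database, each round of amplification gives the predictor more labeled data to learn from, which is the dependency the embodiment describes.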
The data tagging system 1 and the data tagging method shown in FIG. 3 and FIG. 4 can be used for processing corpus data, image data, or sound data. In the case of processing corpus data, the unlabeled database 3 is a corpus database, and the tag database 2 is a labeled corpus, i.e., it stores labeled corpus data, and the unlabeled data is corpus data. The confirmed tag pattern is a rule for marking the corpus data and may include at least one of morphological information, syntactic information, and semantic information. As for the specific data format of the tag pattern, taking a tag pattern related to the product retention of a 3C product as an example, it may be [. Number _ mean '. Year', 'Limited', 'quantity' ], where Number denotes a numeric semantic category.
In the case that the data tagging system 1 and the data tagging method shown in fig. 3 and 4 are used for processing image data, the unmarked database 3 is an image database, and the tagged database 2 is a tagged image database, i.e. storing tagged image data, and furthermore, the unmarked data is image data. The confirmation marking mode is a rule for marking the image data, and may include at least one of feature information, line information, light source information, contour information, color information, and material information.
In the case that the data tagging system 1 and the data tagging method shown in FIG. 3 and FIG. 4 are used for processing sound data, the unmarked database 3 is a sound database, and the tagged database 2 is a tagged sound database, i.e., it stores tagged sound data; furthermore, the unmarked data is sound data. The confirmation mark pattern is a rule for marking the sound data, and may include at least one of energy information, frequency information, rhythm information, and language information.
Referring to fig. 5, a data marking system 1 according to another embodiment of the invention is shown. The data tagging system 1 of the present embodiment is similar to the embodiment shown in fig. 3, and also includes a tag database 2, an unlabeled database 3, a tag data amplification module 4, an automatic data tagging module 6 in signal connection with the tag database 2, and an operation platform 5 in signal connection with the tag database 2, the unlabeled database 3 and the tag data amplification module 4. Similarly, the tag data amplification module 4 of the present embodiment includes an amplification unit 41 and a tag pattern data set 42. The main difference between the present embodiment and the embodiment shown in fig. 3 is that the operation platform 5 of the present embodiment includes a tag pattern editing interface 51 in signal connection with an amplification unit 41 of the tag data amplification module 4, a data tag prediction interface 52 in signal connection with the automatic data tag module 6 and capable of inputting data, and a manual tag interface 53 in signal connection with the tag database 2, wherein the manual tag interface 53 is capable of inputting data and performing data tagging.
Referring to fig. 5 and fig. 6, fig. 6 is a flowchart of a data marking method according to another embodiment of the present invention, and is suitable for the data marking system 1 shown in fig. 5. Similar to the embodiment shown in fig. 4, the data tagging method of the present embodiment also performs steps S1, S21 to S23, S31 and S4 to generate at least one confirmed tag pattern and at least one new tag-added data to be stored in the tag database 2, and also performs steps S51 and S52 to perform data tag prediction on an unmarked data.
The data marking method of the embodiment may further perform step S61, in which the manual marking interface 53 of the operation platform 5 receives and displays the unmarked data, and perform step S62, in which the manual marking interface 53 receives at least one data marking operation corresponding to the unmarked data and generates a marking result corresponding to the unmarked data; that is, the user may enter unmarked data that is expected to be manually marked into the manual marking interface 53 and perform a data marking operation on the manual marking interface 53. Then, in step S63, the manual tagging interface 53 stores the tagging result corresponding to the unmarked data in the tagging database 2.
In other embodiments, the manual tagging interface 53 may be integrated with the data tag prediction interface 52, so that a single interface accepts unmarked data and can be used for either manual tagging or tagging prediction.
The data tagging system 1 and the data tagging method shown in FIG. 5 and FIG. 6 can be used for processing corpus data, image data, or sound data. In the case of processing corpus data, the unlabeled database 3 is a corpus database, and the tag database 2 is a labeled corpus, i.e., it stores labeled corpus data, and the input unlabeled data is corpus data. The confirmed tag pattern is a rule for marking the corpus data and may include at least one of morphological information, syntactic information, and semantic information. As for the specific data format of the tag pattern, taking a tag pattern related to the product retention of a 3C product as an example, it may be [. Number _ mean '. Year', 'Limited', 'quantity' ], where Number denotes a numeric semantic category.
In the case that the data tagging system 1 and the data tagging method shown in fig. 5 and 6 are used for processing image data, the unmarked database 3 is an image database, and the tagged database 2 is a tagged image database, i.e. storing tagged image data, and the input unmarked data is image data. The confirmation marking mode is a rule for marking the image data, and may include at least one of feature information, line information, light source information, contour information, color information, and material information.
In the case that the data tagging system 1 and the data tagging method shown in FIG. 5 and FIG. 6 are used to process sound data, the unmarked database 3 is a sound database, and the tagged database 2 is a tagged sound database, i.e., it stores tagged sound data, and the input unmarked data is sound data. The confirmation mark pattern is a rule for marking the sound data, and may include at least one of energy information, frequency information, rhythm information, and language information.
Referring to fig. 7, a data marking system 1 according to another embodiment of the invention is shown. The data tagging system 1 of the present embodiment is similar to the embodiment shown in fig. 1, and also includes a tag database 2, an unlabeled database 3, a tag data amplification module 4, and an operation platform 5 in signal connection with the tag database 2, the unlabeled database 3, and the tag data amplification module 4. The marked database 2 is used for storing marked data, and the unmarked database 3 is used for storing unmarked data.
The tagged data amplification module 4 of this embodiment can access the unlabeled database 3 and the tagged database 2, and includes a tagged pattern dataset 42 that can store tagged patterns, a tagged pattern generation unit 43 that can execute a pattern generation algorithm, and an amplification unit 41 that can execute a tagging algorithm.
The operation platform 5 includes a tag pattern editing interface 51 in signal connection with the tag pattern generation unit 43 and the amplification unit 41, and a data tag prediction interface 52 for data input. In this embodiment, a user can input data and perform editing, adding, and deleting on the operation platform 5; in addition, the operation platform 5 can also receive input data or editing, adding, and deleting operations through an application programming interface.
Referring further to FIG. 8, a flowchart of a data marking method according to another embodiment of the invention is shown; the method is applicable to the data marking system 1 shown in FIG. 7. The steps of the data marking method of this embodiment are as follows. In step S71, the tag pattern generation unit 43 of the tag data amplification module 4 executes a pattern generation algorithm on the unlabeled database 3 to generate at least one candidate tag pattern. After the tag data amplification module 4 transmits the at least one candidate tag pattern to the operation platform 5, step S72 is executed, in which the operation platform 5 displays the at least one candidate tag pattern. The pattern generation algorithm executed by the tag pattern generation unit 43 of the present embodiment may be N-Gram, the Apriori algorithm, the AprioriAll algorithm, or the AprioriSome algorithm.
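As an illustration of step S71 with the N-Gram option, the Python sketch below counts word n-grams over an unlabeled corpus and proposes the frequent ones as candidate patterns. The function name, whitespace tokenization, and the `min_count` threshold are assumptions made for the example.

```python
from collections import Counter


def candidate_patterns(
    corpus: list[str], n: int = 2, min_count: int = 2
) -> list[tuple[str, ...]]:
    """Return word n-grams that occur at least min_count times, sorted."""
    counts: Counter = Counter()
    for sentence in corpus:
        tokens = sentence.split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return sorted(gram for gram, c in counts.items() if c >= min_count)


example_corpus = ["two year warranty", "one year warranty", "free shipping today"]
candidates = candidate_patterns(example_corpus)
```

Only the bigram that recurs across sentences survives the threshold, so the user is shown a short list of candidates to review rather than every n-gram in the corpus.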
After the at least one candidate tag pattern is generated and displayed, step S11 is executed, in which the operation platform 5 receives data or an editing operation; here, the operation platform 5 receives an editing operation corresponding to the at least one candidate tag pattern. In step S11 of the present embodiment, the user performs an editing operation on the at least one candidate tag pattern via the tag pattern editing interface 51 of the operation platform 5, and can modify, add, and delete candidate tag patterns one or more times. Then, step S24 is executed, in which the operation platform 5 generates at least one confirmed tag pattern according to the at least one editing operation; specifically, the tag pattern editing interface 51 generates the at least one confirmed tag pattern according to the at least one candidate tag pattern and the received editing operation. In other words, the at least one confirmed tag pattern in this embodiment is what results after the user modifies, adds to, and deletes from the at least one candidate tag pattern.
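Steps S11 and S24 amount to applying a sequence of edits to the candidate list and treating the result as the confirmed patterns. A minimal Python sketch follows, assuming a hypothetical edit format of `("add", p)`, `("delete", p)`, and `("modify", old, new)` tuples, with patterns reduced to plain strings.

```python
def apply_edits(candidates: list[str], edits: list[tuple]) -> list[str]:
    """Apply add/delete/modify edits to candidates; return confirmed patterns."""
    confirmed = list(candidates)  # leave the caller's candidate list untouched
    for op, *args in edits:
        if op == "add":
            if args[0] not in confirmed:
                confirmed.append(args[0])
        elif op == "delete":
            if args[0] in confirmed:
                confirmed.remove(args[0])
        elif op == "modify":
            old, new = args
            if old in confirmed:
                confirmed[confirmed.index(old)] = new
        else:
            raise ValueError(f"unknown edit operation: {op!r}")
    return confirmed


confirmed_patterns = apply_edits(
    ["year warranty", "free shipping"],
    [
        ("modify", "year warranty", "year limited warranty"),
        ("delete", "free shipping"),
        ("add", "money back"),
    ],
)
```

Edits that refer to a pattern not in the list are ignored in this sketch; a real interface might instead report them to the user.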
After the at least one confirmed tag pattern is obtained, step S23 is executed, in which the tag data amplification module 4 stores the at least one confirmed tag pattern in the tag pattern data set 42. In this embodiment, the tag data amplification module 4 checks the at least one confirmed tag pattern against the tag patterns already in the tag pattern data set 42, and stores the at least one confirmed tag pattern in, and updates, the tag pattern data set 42 once the check passes. The check determines whether the at least one confirmed tag pattern duplicates or conflicts with the data in the tag pattern data set 42, so as to eliminate duplication or conflict between tag patterns. In other embodiments, the amplification unit 41 may perform the check.
Then, step S31 is executed, in which the amplification unit 41 of the tag data amplification module 4 executes the tagging algorithm according to the tag pattern data set 42 and the unlabeled database 3 to generate at least one new tag data, and step S4 is executed to store the at least one new tag data in the tag database 2. Specifically, in step S31 of the present embodiment, the amplification unit 41 runs the tagging algorithm on the data in the unlabeled database 3 according to both the at least one confirmed tag pattern and the tag patterns originally stored in the tag pattern data set 42, thereby labeling the data in the unlabeled database 3 and generating the at least one new tag data, which is then stored in the tag database 2 to amplify the tag database 2. The tagging algorithm of the present embodiment may be a string-searching algorithm or a long-term-priority (longest-term-first) algorithm.
If the tag pattern data set 42 contains no data, step S23 of the present embodiment may optionally be skipped, in which case, in step S31, the amplification unit 41 labels the data in the unlabeled database 3 according only to the at least one confirmed tag pattern generated in step S24.
The data tagging system 1 and the data tagging method shown in FIG. 7 and FIG. 8 can be used for processing corpus data, image data, or sound data. In the case of processing corpus data, the unlabeled database 3 is a corpus database, and the tag database 2 is a labeled corpus, i.e., it stores labeled corpus data. The candidate tag patterns and the confirmed tag patterns are rules for marking the corpus data and may include at least one of morphological information, syntactic information, and semantic information. As for the specific data format of the tag pattern, taking a tag pattern related to the product retention of a 3C product as an example, it may be [. Number _ mean '. Year', 'Limited', 'quantity' ], where Number denotes a numeric semantic category.
In the case that the data marking system 1 and the data marking method shown in fig. 7 and 8 are used for processing image data, the unmarked database 3 is an image database, and the marked database 2 is a marked image database, that is, the marked image data is stored, and the candidate marking mode and the confirmation marking mode are rules for marking the image data, which may include at least one of feature information, line information, light source information, contour information, color information, and texture information.
In the case of the data tagging system 1 and the data tagging method shown in fig. 7 and 8 for processing voice data, the unmarked database 3 is a voice database, and the tagged database 2 is a tagged voice database, i.e. stores tagged voice data, and the candidate tag patterns and the confirmation tag patterns are rules for tagging voice data, which may include at least one of energy information, frequency information, rhythm information, and language information.
Referring to fig. 9, a data marking system 1 according to another embodiment of the invention is shown. The data tagging system 1 of the present embodiment is similar to the embodiment shown in fig. 7, and also includes a tag database 2, an unlabeled database 3, a tag data amplification module 4, and an operation platform 5 in signal connection with the tag database 2, the unlabeled database 3, and the tag data amplification module 4. Similarly, the tag data amplification module 4 includes an amplification unit 41, a tag pattern data set 42 and a tag pattern generation unit 43, and the operation platform 5 also includes a tag pattern editing interface 51 and a data tag prediction interface 52; the difference between the present embodiment and fig. 7 is that the data tagging system 1 of the present embodiment further comprises an automatic data tagging module 6 in signal connection with the tag database 2, wherein the automatic data tagging module 6 is configured to perform data tagging prediction.
Referring to fig. 9 and 10, fig. 10 is a flowchart illustrating a data marking method according to another embodiment of the present invention, and is applicable to the data marking system 1 shown in fig. 9. Similar to the embodiment shown in fig. 8, the data tagging method of the present embodiment also performs steps S71 and S72 to generate at least one candidate tag pattern, performs steps S11, S24 and S23 to obtain at least one confirmed tag pattern and store the confirmed tag pattern in the tag pattern data set 42, and performs steps S31 and S4 to generate at least one new augmented tag data to augment the tag database 2.
The data marking method of this embodiment further performs step S51, in which the data mark prediction interface 52 of the operation platform 5 receives unmarked data, and step S52, in which the automatic data marking module 6 performs data marking prediction on the unmarked data according to the marking database 2 and transmits a prediction result corresponding to the unmarked data to the operation platform 5. The automatic data marking module 6 of this embodiment may execute a Conditional Random Field (CRF), Maximum-Entropy Markov Model (MEMM), or Recurrent Neural Network (RNN) algorithm. The operation platform 5 then displays the prediction result corresponding to the unmarked data.
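The data flow of steps S51 and S52 (train on the marking database, then predict marks for new unmarked data) can be sketched in Python as follows. To keep the example self-contained, it deliberately swaps in a most-frequent-tag baseline in place of the sequence models named above; it is not a CRF, MEMM or RNN. The function names and the `(tokens, tags)` record layout are assumptions.

```python
from collections import Counter, defaultdict


def train_baseline(tag_db):
    """Learn, for each token, the tag it most often carries in the marking database."""
    counts = defaultdict(Counter)
    for tokens, tags in tag_db:
        for tok, tag in zip(tokens, tags):
            counts[tok.lower()][tag] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}


def predict(model, tokens, default_tag="O"):
    """Predict one tag per token; unseen tokens receive the default tag."""
    return [model.get(tok.lower(), default_tag) for tok in tokens]
```

Because the model is rebuilt from the marking database, any new mark data added by the amplification step becomes available to later predictions once the model is retrained.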
The data tagging system 1 and the data tagging method shown in fig. 9 and 10 can be used for processing corpus data, image data or voice data. In the case of processing corpus data, the unlabeled database 3 is a corpus database, and the labeled database 2 is a labeled corpus, i.e. it stores labeled corpus data. In addition, the unlabeled data is corpus data, and the candidate tagging patterns and the confirmation tagging patterns are rules for tagging corpus data, and may include at least one of morphological information, syntactic information, and semantic information. Taking a mark pattern related to the product retention of a 3C product as an example, the specific data format of the mark pattern may be [. Number _ mean '. Year', 'Limited', 'quantity' ], where Number denotes a numeric semantic category.
In the case that the data marking system 1 and the data marking method shown in fig. 9 and 10 are used to process image data, the unmarked database 3 is an image database, and the marked database 2 is a marked image database, that is, it stores marked image data. In addition, the unmarked data is image data, and the candidate marking mode and the confirmation marking mode are rules for marking the image data, which may include at least one of feature information, line information, light source information, contour information, color information, and material information.
In the case where the data tagging system 1 and the data tagging method shown in fig. 9 and 10 are used to process voice data, the unmarked database 3 is a voice database, and the tagged database 2 is a tagged voice database, i.e., it stores tagged voice data. In addition, the unmarked data is voice data, and the candidate mark pattern and the confirmation mark pattern are rules for marking the voice data, which may include at least one of energy information, frequency information, rhythm information, and language information.
Referring to fig. 11, a data marking system 1 according to another embodiment of the invention is shown. The data tagging system 1 of the present embodiment is similar to the embodiment shown in fig. 9, and also includes a tag database 2, an unlabeled database 3, a tag data amplification module 4, an automatic data tagging module 6 in signal connection with the tag database 2, and an operation platform 5 in signal connection with the tag database 2, the unlabeled database 3 and the tag data amplification module 4. In addition, the tag data amplification module 4 of the present embodiment also includes a tag pattern data set 42, an amplification unit 41, and a tag pattern generation unit 43. The main difference between the present embodiment and the embodiment shown in fig. 9 is that the operation platform 5 of the present embodiment includes a mark pattern editing interface 51 in signal connection with the amplification unit 41, a data mark prediction interface 52 in signal connection with the automatic data mark module 6 and capable of inputting data, and a manual mark interface 53 in signal connection with the mark database 2, wherein the manual mark interface 53 is capable of inputting data and performing data marking.
Referring to fig. 11 and 12, fig. 12 is a flowchart illustrating a data marking method according to another embodiment of the present invention, and is applicable to the data marking system 1 shown in fig. 11. Similar to the embodiment shown in fig. 10, the data tagging method of the present embodiment also performs steps S71 and S72 to generate at least one candidate tag pattern, steps S11, S24 and S23 to obtain at least one confirmed tag pattern and store the confirmed tag pattern in the tag pattern data set 42, steps S31 and S4 to generate at least one new added tag data according to the at least one confirmed tag pattern to augment the tag database 2, and steps S51 and S52 to perform data tag prediction on unmarked data.
The data marking method of the embodiment may further perform step S61, in which the manual marking interface 53 of the operation platform 5 receives and displays an unmarked data, and perform step S62, in which the manual marking interface 53 receives at least one data marking operation corresponding to the unmarked data and generates a marking result corresponding to the unmarked data; that is, the user may enter unmarked data that is expected to be manually marked into the manual marking interface 53 and perform a data marking operation on the manual marking interface 53. Then, in step S63, the manual tagging interface 53 stores the tagging result corresponding to the unmarked data in the tagging database 2.
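Steps S61 to S63 can be sketched as a small Python class. This is only an illustration of the flow (receive, mark, store); the class name `ManualTaggingInterface`, the character-span representation of a mark and the dictionary layout of the stored result are all assumptions.

```python
class ManualTaggingInterface:
    """Hypothetical model of the manual marking flow S61 to S63."""

    def __init__(self, tag_db):
        self.tag_db = tag_db      # stands in for the marking database
        self.current = None       # unmarked data currently displayed
        self.marks = []           # marks entered so far for that data

    def receive(self, unmarked):
        """S61: receive the unmarked data and return it for display."""
        self.current = unmarked
        self.marks = []
        return unmarked

    def mark(self, start, end, tag):
        """S62: record one marking operation and return the marking result so far."""
        if self.current is None:
            raise ValueError("no unmarked data has been received")
        if not 0 <= start < end <= len(self.current):
            raise ValueError("span lies outside the data")
        self.marks.append((start, end, tag))
        return {"data": self.current, "marks": list(self.marks)}

    def store(self):
        """S63: store the marking result in the marking database."""
        if self.current is None:
            raise ValueError("nothing to store")
        result = {"data": self.current, "marks": list(self.marks)}
        self.tag_db.append(result)
        self.current, self.marks = None, []
        return result
```

Results stored this way sit in the same database that the amplification and prediction steps read from, so manually marked items and automatically generated ones can be used together.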
In other embodiments, the manual tagging interface 53 may be integrated with the data mark prediction interface 52 into a single interface that accepts unmarked data as input and can perform either manual tagging or tagging prediction.
The data tagging system 1 and the data tagging method shown in fig. 11 and 12 can be used for processing corpus data, image data or voice data. In the case of processing corpus data, the unlabeled database 3 is a corpus database, and the labeled database 2 is a labeled corpus, i.e. it stores labeled corpus data. In addition, the input unlabeled data is corpus data, and the candidate labeling pattern and the confirmation labeling pattern are rules for labeling the corpus data, and may include at least one of morphological information, syntactic information, and semantic information. Taking a mark pattern related to the product retention of a 3C product as an example, the specific data format of the mark pattern may be [. Number _ mean '. Year', 'Limited', 'quantity' ], where Number denotes a numeric semantic category.
In the case that the data marking system 1 and the data marking method shown in fig. 11 and 12 are used to process image data, the unmarked database 3 is an image database, and the marked database 2 is a marked image database, that is, it stores marked image data. In addition, the input unmarked data is image data, and the candidate marking mode and the confirmation marking mode are rules for marking the image data, which may include at least one of feature information, line information, light source information, contour information, color information, and material information.
In the case where the data tagging system 1 and the data tagging method shown in fig. 11 and 12 are used to process voice data, the unmarked database 3 is a voice database, and the tagged database 2 is a tagged voice database, i.e., it stores tagged voice data. The candidate mark pattern and the confirmation mark pattern are rules for marking the voice data, and may include at least one of energy information, frequency information, rhythm information, and language information.
It should be added that the data marking method of the present invention can be executed by one or more servers, which provide services through the operating platform via the Internet.
Although the present invention has been described with reference to the above embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (26)

1. A data tagging system comprising:
a marking database;
an unmarked database;
a marking data amplification module; and
an operation platform in signal connection with the marking database, the unmarked database and the marking data amplification module, the operation platform comprising a marking mode editing interface that can be used for inputting data and executing editing operations to generate at least one confirmation marking mode;
wherein the marking data amplification module executes an operation according to the at least one confirmation marking mode and the unmarked database to generate at least one new marking data, and stores the at least one new marking data in the marking database.
2. The data tagging system of claim 1, wherein the tagging mode editing interface displays at least one candidate tagging mode and generates the at least one confirmation tagging mode according to the at least one candidate tagging mode and user editing of the at least one candidate tagging mode.
3. The data tagging system of claim 2, wherein the tagged data amplification module comprises:
a marking mode generating unit in signal connection with the marking mode editing interface, which executes a mode generating algorithm according to the unmarked database to generate the at least one candidate marking mode; and
an amplification unit in signal connection with the marking mode editing interface, which executes a marking algorithm according to the unmarked database and the at least one confirmation marking mode to generate the at least one new marking data, and stores the at least one new marking data in the marking database.
4. The data tagging system of claim 1, further comprising an automatic data tagging module in signal connection with the tagging database, wherein the operation platform further comprises:
a data tagging prediction interface for inputting unmarked data and displaying a prediction result;
wherein the automatic data tagging module executes data tagging prediction on the unmarked data according to the tagging database and transmits a prediction result corresponding to the unmarked data to the data tagging prediction interface.
5. The data tagging system of claim 1, wherein the tagging mode editing interface is configured to allow a user to input at least one tagging mode and to set the at least one tagging mode as the at least one confirmation tagging mode.
6. The data tagging system of claim 1, wherein the editing operation comprises at least one of a modification, an addition, and a deletion.
7. The data tagging system of claim 1, wherein the operating platform further comprises a manual tagging interface for inputting unmarked data and performing data tagging, the manual tagging interface being capable of displaying the unmarked data and storing the result of the user performing the data tagging on the unmarked data in the tagging database.
8. The data tagging system of claim 1, wherein the tag database is a corpus tag database, an image tag database or an audio tag database, the unmarked database is a corpus database, an image database or an audio database, and the at least one confirmation tag pattern is a corpus tag pattern, an image tag pattern or an audio tag pattern.
9. The data tagging system of claim 2, wherein the at least one candidate tagging mode is a corpus tagging mode, an image tagging mode, or an audio tagging mode.
10. The data tagging system of claim 1, wherein the at least one confirmation tagging mode comprises at least one of word shape information, syntax information, and semantic information, or at least one of feature information, line information, light source information, contour information, color information, and material information, or at least one of energy information, audio information, rhythm information, and language information.
11. The data tagging system of claim 2, wherein the at least one candidate tagging mode comprises at least one of word shape information, syntax information, and semantic information, or at least one of feature information, line information, light source information, contour information, color information, and material information, or at least one of energy information, audio information, rhythm information, and language information.
12. The data tagging system of claim 5, wherein the at least one tagging mode input by the user comprises at least one of word shape information, syntax information, and semantic information, or at least one of feature information, line information, light source information, contour information, color information, and material information, or at least one of energy information, audio information, rhythm information, and language information.
13. A data tagging system according to claim 4 or 7 wherein the untagged data is text data, video data or audio data.
14. A data marking method, suitable for a data marking system, comprising the following steps:
receiving data or an editing operation;
generating at least one confirmation mark mode according to the received data or editing operation;
performing an operation according to the at least one confirmation mark mode and an unmarked database to generate at least one new mark data; and
storing the at least one new mark data in a mark database.
15. The data tagging method of claim 14, further comprising an operation platform of the data tagging system displaying at least one candidate tagging mode, wherein the operation platform receives at least one editing operation corresponding to the at least one candidate tagging mode and generates the at least one confirmation tagging mode according to the at least one editing operation, and the step of the operation platform generating the at least one confirmation tagging mode according to the at least one editing operation comprises:
the operating platform generates the at least one confirmation marking mode according to the at least one candidate marking mode and the received editing operation.
16. The data tagging method of claim 15, further comprising a tag data amplification module of the data tagging system executing a pattern generation algorithm according to the unlabeled database to generate the at least one candidate tag pattern, wherein the at least one new tag data is generated by the tag data amplification module executing a tagging algorithm according to the at least one confirmed tag pattern and the unlabeled database.
17. The data tagging method of claim 14, further comprising:
an operating platform of the data marking system receives unmarked data;
the automatic data marking module of the data marking system executes data marking prediction on the unmarked data according to the marking database and transmits the prediction result corresponding to the unmarked data to the operation platform; and
the operating platform displays the prediction result corresponding to the unmarked data.
18. The data tagging method of claim 14, wherein the at least one confirmation tag pattern is generated by an operating platform of the data tagging system according to the at least one editing operation, and the step of generating the at least one confirmation tag pattern by the operating platform according to the at least one editing operation comprises:
the operation platform receives at least one marking mode; and
the operating platform sets the received at least one mark mode as the at least one confirmation mark mode.
19. The data tagging method of claim 14 wherein the at least one editing operation is at least one of modification, addition and deletion.
20. The data tagging method of claim 14, further comprising:
the operating platform of the data marking system receives and displays unmarked data;
the operation platform receives at least one data marking operation corresponding to the unmarked data and generates a marking result corresponding to the unmarked data; and
storing the marking result corresponding to the unmarked data in the marking database.
21. The data tagging method of claim 14, wherein the tag database is a corpus tag database, an image tag database or an audio tag database, the unmarked database is a corpus database, an image database or an audio database, and the confirmation tag pattern is a corpus tag pattern, an image tag pattern or an audio tag pattern.
22. The data tagging method of claim 15, wherein the at least one candidate tagging mode is a corpus tagging mode, an image tagging mode, or an audio tagging mode.
23. The data tagging method of claim 14, wherein the at least one confirmation tagging mode comprises at least one of word shape information, syntax information, and semantic information, or at least one of feature information, line information, light source information, contour information, color information, and material information, or at least one of energy information, audio information, rhythm information, and language information.
24. The data tagging method of claim 15, wherein the at least one candidate tagging mode comprises at least one of word shape information, syntax information, and semantic information, or at least one of feature information, line information, light source information, contour information, color information, and material information, or at least one of energy information, audio information, rhythm information, and language information.
25. The data tagging method of claim 18, wherein the at least one tagging mode received by the operating platform comprises at least one of word shape information, syntax information, and semantic information, or at least one of feature information, line information, light source information, contour information, color information, and material information, or at least one of energy information, audio information, rhythm information, and language information.
26. The data tagging method of claim 17 or 20, wherein the untagged data is text data, video data or audio data.
CN201811596379.1A 2018-12-19 2018-12-25 Data marking system and data marking method Pending CN111339325A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
TW107145816 2018-12-19
TW107145816A TWI701565B (en) 2018-12-19 2018-12-19 Data tagging system and method of tagging data

Publications (1)

Publication Number Publication Date
CN111339325A true CN111339325A (en) 2020-06-26

Family

ID=71181954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811596379.1A Pending CN111339325A (en) 2018-12-19 2018-12-25 Data marking system and data marking method

Country Status (2)

Country Link
CN (1) CN111339325A (en)
TW (1) TWI701565B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003016821A1 (en) * 2001-08-10 2003-02-27 Matsushita Electric Industrial Co., Ltd. Mark delivery system, center apparatus, terminal, map data delivery system, center apparatus, and terminal
CN101777131A (en) * 2010-02-05 2010-07-14 西安电子科技大学 Method and device for identifying human face through double models
CN102722719A (en) * 2012-05-25 2012-10-10 西安电子科技大学 Intrusion detection method based on observational learning
US20150095300A1 (en) * 2010-06-20 2015-04-02 Remeztech Ltd. System and method for mark-up language document rank analysis
CN106850591A (en) * 2017-01-13 2017-06-13 北京蓝海讯通科技股份有限公司 Data markers apparatus and method
CN106951422A (en) * 2016-01-07 2017-07-14 腾讯科技(深圳)有限公司 The method and apparatus of webpage training, the method and apparatus of search intention identification
CN107067025A (en) * 2017-02-15 2017-08-18 重庆邮电大学 A kind of data automatic marking method based on Active Learning
CN107895359A (en) * 2016-10-04 2018-04-10 安讯士有限公司 Using image analysis algorithm with give neutral net provide training data
CN108875769A (en) * 2018-01-23 2018-11-23 北京迈格威科技有限公司 Data mask method, device and system and storage medium

Also Published As

Publication number Publication date
TWI701565B (en) 2020-08-11
TW202024946A (en) 2020-07-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200626