CN106897424A - Information labeling system and method - Google Patents
- Publication number
- CN106897424A (application CN201710103917.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/169—Annotation, e.g. comment data or footnotes
Abstract
The invention discloses an information labeling system and method that can at least solve the technical problems of low efficiency and low accuracy in prior-art information labeling. The system includes an information classification module and an information labeling module. The information classification module is adapted to input each piece of acquired information into a preset information classification model, add a corresponding classification label to each piece of acquired information according to the information classification result output by the information classification model, and provide each piece of information to which a classification label has been added to the information labeling module. The information labeling module includes an information distribution sub-module and a plurality of classification labeling sub-modules, each classification labeling sub-module corresponding to at least one type of information classification result. The information distribution sub-module is adapted to distribute each piece of information to which a classification label has been added to the corresponding classification labeling sub-module, and each classification labeling sub-module is adapted to label the information distributed by the information distribution sub-module.
Description
Technical Field
The invention relates to the technical field of communication, in particular to an information labeling system and method.
Background
With the development and popularization of internet technology, the information propagation speed and the information propagation range are dramatically increased, and information labeling is often required to realize efficient management of a large amount of information including internet information (for example, information classification, learning, and the like). The labeled information can be used for multiple purposes, for example, not only as an original corpus to train a classifier for accurate classification, but also directly used for realizing management operations such as learning and editing of related contents.
In the conventional information labeling method, a large amount of information is generally assigned manually to the corresponding labelers. During manual assignment, a person reads the information content to obtain a summary of it and assigns the information according to that summary. This step usually relies on personal experience or intuition.
However, in the process of implementing the invention, the inventor found that the existing information labeling method has at least the following defects. On the one hand, because the amount of information on the internet is huge, obtaining the summary content by manually reading the information consumes a large amount of human resources, is inefficient, and is prone to errors caused by fatigue. On the other hand, the information to be labeled often includes a large amount of highly specialized content, which should be labeled by labelers with the corresponding expertise (for example, military articles contain many pieces of information about military weapons, military situations, and the like, which can only be labeled accurately by personnel trained in military topics). Because the existing information labeling method cannot classify the information in advance, highly specialized information cannot be accurately assigned to the corresponding professionals, which further reduces the efficiency and accuracy of information labeling.
Disclosure of Invention
In view of the above, the present invention has been made to provide an information labeling system and method that overcomes or at least partially solves the above problems.
According to an aspect of the present invention, there is provided an information labeling system including an information classification module and an information labeling module. The information classification module is adapted to input each piece of acquired information into a preset information classification model, add a corresponding classification label to each piece of acquired information according to the information classification result output by the information classification model, and provide each piece of information to which a classification label has been added to the information labeling module. The information labeling module further comprises an information distribution sub-module and a plurality of classification labeling sub-modules, each classification labeling sub-module corresponding to at least one type of information classification result. The information distribution sub-module is adapted to distribute each piece of information to which a classification label has been added to the corresponding classification labeling sub-module, and each classification labeling sub-module is adapted to label the information distributed by the information distribution sub-module.
According to another aspect of the present invention, there is provided an information labeling method, including: inputting each acquired information into a preset information classification model; respectively adding corresponding classification labels to the acquired information according to the information classification result output by the information classification model; and distributing each information added with the classification label to a corresponding classification labeling submodule, and labeling the distributed information by each classification labeling submodule.
The information labeling system and the information labeling method provided by the invention have the advantages that firstly, each acquired information is input into a preset information classification model through an information classification module, and corresponding classification labels are respectively added to each acquired information according to an information classification result output by the information classification model; and then, distributing each piece of information added with the classification label to a corresponding classification labeling submodule through an information distribution submodule in the information labeling module, and finally labeling the distributed information by each classification labeling submodule. Therefore, the scheme of the invention can preliminarily classify the acquired information by replacing the manual reading mode with the preset information classification model, thereby effectively saving human resources; in addition, the scheme of the invention can also add corresponding classification labels to each information according to the result of the primary classification so as to distribute each information to the classification labeling sub-modules of corresponding types, thereby improving the efficiency and the accuracy of information classification labeling and providing a plurality of conveniences for users.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a block diagram illustrating an information annotation system according to a first embodiment of the present invention;
FIG. 2 is a block diagram illustrating an information annotation system according to a second embodiment of the present invention;
FIG. 3 is a flowchart illustrating an information annotation method according to a third embodiment of the present invention;
FIG. 4 is a flowchart illustrating an information annotation method according to a fourth embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The invention provides an information labeling system and method, which can at least solve the technical problems of low efficiency and low accuracy of an information labeling mode in the prior art.
Example one
Fig. 1 shows a block diagram of an information annotation system according to an embodiment of the present invention. As shown in fig. 1, the information labeling system includes an information classification module 11 and an information labeling module 12. The information labeling module 12 further includes an information distribution sub-module 121 and a plurality of classification labeling sub-modules 122.
The information classification module 11 is adapted to input each piece of acquired information into a preset information classification model, add a corresponding classification label to each piece of acquired information according to an information classification result output by the information classification model, and provide each piece of information with the classification label added to the information labeling module 12.
The preset information classification model may be established in various ways, for example, by acquiring a large amount of information from the internet as corpus data and performing machine learning and training on the acquired corpus data. Specifically, the preset information classification model can extract keywords from the input information or generate them (i.e., from the semantics of each sentence in the input information and the semantic relationships between the sentences), analyze and determine the information type of the input information according to the keywords, and then output an information classification result for the input information according to that determination. The information classification result is information that indicates the information type of the input information. For example, if the information type of the input information is military, the information classification result may be "military class", "military", or similar information containing the word "military".
Specifically, after acquiring information, the information classification module 11 first inputs the acquired information into the preset information classification model, obtains the information classification result corresponding to the information from the model, and adds a corresponding classification label to each piece of acquired information according to the obtained information classification result. The classification label may be the information classification result itself, or information that contains classification information related to the information classification result. For example, if the information classification result is "military class", the classification label may be "military class" or similar information containing the word "military". In a specific implementation, the classification label may be "finance class", "military class", "sports class", "entertainment class", or any other label that indicates the type of the information.
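The classify-then-tag step described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the patent does not specify the model, so a simple keyword count stands in for the trained information classification model, and the names `CATEGORY_KEYWORDS`, `Info`, `classify`, and `tag` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists standing in for a trained classification model.
CATEGORY_KEYWORDS = {
    "military": {"missile", "army", "warship"},
    "sports": {"match", "league", "goal"},
    "finance": {"stock", "bank", "market"},
}


@dataclass
class Info:
    info_id: str
    text: str
    label: Optional[str] = None  # classification label, filled in by tag()


def classify(text: str) -> str:
    """Return the category whose keywords occur most often in the text."""
    words = text.lower().split()
    scores = {
        category: sum(word in keywords for word in words)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No keyword matched at all: report an explicit "unknown" type.
    return best if scores[best] > 0 else "unknown"


def tag(info: Info) -> Info:
    """Attach the classification result to the information as its label."""
    info.label = classify(info.text)
    return info
```

In this sketch the classification label is simply the classification result itself, which is one of the two options the description allows.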
The information labeling module 12 further includes an information distribution sub-module 121 and a plurality of classification labeling sub-modules 122, wherein each classification labeling sub-module 122 corresponds to at least one type of information classification result. The information distribution submodule 121 is adapted to distribute each information added with the classification label to the corresponding classification labeling submodule 122. Each classification labeling sub-module 122 is adapted to label the information distributed by the information distribution sub-module 121.
Specifically, the information distribution sub-module 121 is connected to the information classification module 11 and to the plurality of classification labeling sub-modules 122, receives the information to which a classification label has been added from the information classification module 11, and distributes each piece of received information to the corresponding classification labeling sub-module 122. There are a plurality of classification labeling sub-modules 122; FIG. 1 shows only one for illustration, and those skilled in the art will understand that there are two or more. Since each classification labeling sub-module 122 corresponds to at least one type of information classification result, when distributing the information to which classification labels have been added, the information distribution sub-module 121 can establish a mapping relationship between each piece of information and the classification labeling sub-modules 122 according to the information classification result of that information, and distribute each piece of information according to the established mapping relationship. The mapping relationship may be one-to-one, one-to-many, many-to-one, or many-to-many, or it may be a customized mapping relationship that a user sets according to actual needs. In short, the mapping relationship may be set by a person skilled in the art according to the actual situation, and the present invention is not limited in this respect.
Specifically, the classification labeling sub-module 122 labels the received information after receiving the information distributed by the information distribution sub-module 121. In the labeling process, the number and labeling mode of each information label may be set by those skilled in the art according to actual conditions, which is not limited by the present invention. In addition, in order to improve the accuracy of the labeling, a plurality of auditing processes can be further set in the labeling process, and the accuracy of information labeling is ensured by performing multi-layer auditing on the labels in the information.
Therefore, in the information labeling system provided by the invention, the information classification module 11 first inputs each piece of acquired information into a preset information classification model and adds a corresponding classification label to each piece of acquired information according to the information classification result output by the information classification model. The information distribution sub-module 121 in the information labeling module 12 then distributes each piece of information to which a classification label has been added to the corresponding classification labeling sub-module 122, and finally each classification labeling sub-module 122 labels the information distributed to it. The scheme of the invention thus performs a preliminary classification of the acquired information with the preset information classification model instead of manual reading, which effectively saves human resources. In addition, the scheme adds corresponding classification labels to the information according to the result of the preliminary classification, so that the information is distributed to classification labeling sub-modules of the corresponding types, which improves the efficiency and accuracy of information classification labeling and is more convenient for users.
Example two
Fig. 2 is a block diagram illustrating a structure of an information annotation system according to a second embodiment of the present invention. As shown in fig. 2, the system includes: an information classification module 21, an information labeling module 22, an audit statistics module 23 and a training module 24. The information labeling module 22 further includes an information distribution sub-module 221 and a plurality of classification labeling sub-modules 222, wherein each classification labeling sub-module 222 further includes at least two initial review labeling units 2221, a result auditing unit 2222, and a review labeling unit 2223.
The information classification module 21 is adapted to input each piece of acquired information into a preset information classification model, add a corresponding classification label to each piece of acquired information according to an information classification result output by the information classification model, and provide each piece of information with the classification label added to the information labeling module 22.
Specifically, the preset information classification model may be established in various ways, for example, by acquiring a large amount of information from the internet as corpus data and performing machine learning and training on the acquired corpus data. The preset information classification model can extract keywords from the input information or generate them (namely, from the semantics of each sentence in the input information and the semantic relationships between the sentences), analyze and determine the information type of the input information according to the keywords, and then output an information classification result for the input information according to that determination. The information classification result is information that indicates the information type of the input information. For example, if the information type of the input information is military, the information classification result may be "military class", "military", or similar information containing the word "military".
Specifically, after acquiring information, the information classification module 21 first inputs the acquired information into the preset information classification model, obtains the information classification result corresponding to the information from the model, and adds a corresponding classification label to each piece of acquired information according to the information classification result. The classification label added to each piece of information comprises the information classification result of the information and a preset number of keywords extracted from the information. The information classification result is the result output by the preset information classification model. The preset number of keywords are the keywords that the preset information classification model extracts or generates (namely, from the semantics of each sentence in the input information and the semantic relationships between the sentences) after the information is input into it. The keywords are provided to the corresponding classification labeling sub-module 222 so that the classification labeling sub-module 222 can display them during labeling. Displaying the keywords provided by the information classification module 21 helps the classification labeling sub-module 222 quickly narrow down the label or the labeling range of the information being labeled, which saves time and improves labeling efficiency.
Here, the number and order of the extracted and displayed keywords may be set by those skilled in the art according to the needs of the actual labeling work. For example, the number of extracted keywords may be 5, 8, 10, and so on. The display order may be determined by assigning each keyword a weight according to criteria such as its frequency or density of occurrence and then ordering the keywords by weight. In short, the invention does not limit the number or order of the extracted and displayed keywords, as long as they conveniently meet the different requirements of the labeling work and effectively improve the efficiency of the labeling process.
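The weight-based ordering of keywords can be sketched as below. This is an illustration only, assuming that the weight of a keyword is its raw frequency in the text (the description also mentions density as a possible criterion); the function name `top_keywords` and its parameters are hypothetical.

```python
from collections import Counter


def top_keywords(candidates: list[str], text: str, limit: int = 5) -> list[str]:
    """Order candidate keywords by how often they occur in the text.

    The weight of a keyword is its occurrence count; keywords that never
    occur are dropped, and at most `limit` keywords are returned.
    """
    counts = Counter(text.lower().split())
    weighted = [(keyword, counts[keyword.lower()]) for keyword in candidates]
    # sorted() is stable, so keywords with equal weight keep candidate order.
    ranked = sorted(weighted, key=lambda pair: -pair[1])
    return [keyword for keyword, weight in ranked if weight > 0][:limit]
```

The `limit` parameter plays the role of the preset number of keywords (5, 8, 10, and so on) mentioned in the description.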
The information labeling module 22 is connected to the information classification module 21, and is configured to receive each piece of information sent by the information classification module 21 after the classification label is added, and to label each piece of information. The information labeling module 22 further comprises an information distribution sub-module 221 and a plurality of classification labeling sub-modules 222, wherein each classification labeling sub-module 222 corresponds to at least one type of information classification result.
The information distribution sub-module 221 is adapted to distribute each piece of information to which a classification label has been added to the corresponding classification labeling sub-module 222. Specifically, the information distribution sub-module 221 is connected to the information classification module 21 and to the classification labeling sub-modules 222, receives the information with the added classification label from the information classification module 21, and distributes each piece of received information to the corresponding classification labeling sub-module 222. There are a plurality of classification labeling sub-modules 222; FIG. 2 shows only one for illustration, and those skilled in the art will understand that there are two or more. Since each classification labeling sub-module 222 corresponds to at least one type of information classification result, when distributing the information to which classification labels have been added, the information distribution sub-module 221 can establish a mapping relationship between each piece of information and the classification labeling sub-modules 222 according to the classification label of that information, and distribute each piece of information according to the established mapping relationship. The mapping relationship may be one-to-one, one-to-many, many-to-one, or many-to-many, or it may be a customized mapping relationship that a user sets according to actual needs. In short, the mapping relationship may be set by a person skilled in the art according to the actual situation, and the present invention is not limited in this respect.
For example, the mapping relationships mentioned above can be illustrated as follows. Suppose 3 pieces of information are received, namely information A, information B, and information C, where the classification label of information A is "economy class", that of information B is "sports class", and that of information C is "entertainment class". From the multiple classification labeling sub-modules 222, 3 related ones are selected, namely classification labeling sub-module A, classification labeling sub-module B, and classification labeling sub-module C, whose corresponding types are "economy class", "sports class", and "entertainment class", respectively. When distributing information A, information B, and information C, the information distribution sub-module 221 may distribute information A to classification labeling sub-module A, information B to classification labeling sub-module B, and information C to classification labeling sub-module C. Alternatively, it may distribute information A to classification labeling sub-module A, information B to both classification labeling sub-module B and classification labeling sub-module C (some sports stars are also entertainment stars), and information C to classification labeling sub-module C. Alternatively, it may distribute both information A and information C to classification labeling sub-module A (some entertainment information is also economic information, such as research on economic benefit in the movie box-office market), information B to classification labeling sub-module B, and information C to classification labeling sub-module C, and so on.
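The distribution alternatives in the example above can be expressed with a routing table from classification label to sub-modules. The sketch below is hypothetical: `ROUTING` and `distribute` are illustrative names, and the table combines the second and third alternatives of the example (sports information also goes to sub-module C, entertainment information also goes to sub-module A) into a single many-to-many mapping.

```python
from collections import defaultdict

# Hypothetical routing table: classification label -> sub-module names.
# A label may map to several sub-modules, and a sub-module may serve
# several labels, which gives a many-to-many mapping.
ROUTING = {
    "economy": ["A"],
    "sports": ["B", "C"],         # some sports stars are also entertainment stars
    "entertainment": ["C", "A"],  # some entertainment news is also economic news
}


def distribute(items, routing):
    """Group information ids by the sub-modules their labels map to.

    `items` is an iterable of (info_id, label) pairs. Labels with no
    routing entry are skipped in this sketch.
    """
    queues = defaultdict(list)
    for info_id, label in items:
        for sub_module in routing.get(label, []):
            queues[sub_module].append(info_id)
    return dict(queues)
```

A one-to-one mapping is the special case in which every list in the routing table holds exactly one sub-module and no sub-module appears twice.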
Each classification labeling sub-module 222 is adapted to label the information distributed by the information distribution sub-module 221, wherein each classification labeling sub-module 222 further includes at least two initial review labeling units 2221, a result auditing unit 2222, and a review labeling unit 2223.
Specifically, after receiving the information with an added classification label distributed by the information distribution sub-module 221, the classification labeling sub-module 222 first checks the received information, that is, it judges whether the information type of the information distributed by the information distribution sub-module 221 matches the type of the information classification result corresponding to the classification labeling sub-module 222. The information type used in this judgment is the type obtained when the labeler re-judges the information. It may be the same as the type of the information classification result output by the information classification model for the information (in which case the classification by the information classification model is correct), or different from it (in which case the classification by the information classification model is wrong). Accordingly, if the judgment result is yes, the information classification result output by the information classification model is correct and the distribution by the information distribution sub-module is also accurate, and the information distributed by the information distribution sub-module 221 is then labeled. If the judgment result is no, the information classification result output by the information classification model is incorrect and/or the distribution by the information distribution sub-module is incorrect, and an information classification error correction message for redistributing the information is sent to the information distribution sub-module 221. The information classification error correction message comprises the information identifier and the information type of the information.
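The match check and the error correction message can be sketched as follows. The message carries the two fields named in the description (information identifier and information type); the class name `CorrectionMessage`, the function `check_assignment`, and the use of a set of handled types are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CorrectionMessage:
    """Information classification error correction message (hypothetical layout)."""
    info_id: str    # identifier of the misrouted information
    info_type: str  # type as re-judged in the labeling sub-module


def check_assignment(
    info_id: str, judged_type: str, handled_types: set
) -> Optional[CorrectionMessage]:
    """Return None when the information belongs to this sub-module.

    Otherwise return a correction message that the distribution
    sub-module can use to redistribute the information.
    """
    if judged_type in handled_types:
        return None
    return CorrectionMessage(info_id, judged_type)
```

When the function returns `None`, labeling proceeds in the current sub-module; otherwise the returned message would be sent back to the information distribution sub-module.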
Specifically, when the classification labeling sub-module 222 determines that the information type of the information distributed by the information distribution sub-module 221 matches the type of the information classification result corresponding to the classification labeling sub-module 222 (that is, when the determination result is yes), which indicates that the information distributed to the classification labeling sub-module 222 is correct at this time, the information distributed by the information distribution sub-module 221 is labeled, that is, the information distributed to the corresponding classification labeling sub-module 222 is labeled. The specific process can be as follows:
First, each classification labeling sub-module 222 further includes at least two initial review labeling units 2221, a result auditing unit 2222, and a review labeling unit 2223. When the information distributed to the classification labeling sub-module 222 is labeled, the information is first sent to each of the initial review labeling units 2221 in the classification labeling sub-module 222. At least two initial review labeling units 2221 are provided so that labeling errors or omissions, which could go unnoticed if only a single labeling result were used, can be found in time, which effectively improves the accuracy of information labeling. FIG. 2 shows only two initial review labeling units 2221 for illustration, but those skilled in the art will understand that the number of initial review labeling units 2221 is not limited to two and may be more. After the at least two initial review labeling units 2221 label each piece of information distributed by the information distribution sub-module 221 and obtain an initial review labeling result for each piece of information, they send the obtained initial review labeling results to the result auditing unit 2222. To improve the speed and efficiency of labeling, the keywords extracted from the information by the information classification model may be displayed in the display interface of the initial review labeling unit 2221. The initial review labeling result can then be determined by selecting a suitable keyword from those displayed, or the information can be labeled appropriately with reference to the displayed keywords. The purpose of selecting and referring to the displayed keywords is to quickly narrow down the label or labeling range of the information, which effectively improves labeling efficiency.
The result auditing unit 2222 is configured to provide the information to the review labeling unit 2223 for labeling when the initial review labeling results obtained by the at least two initial review labeling units 2221 for the same information differ. Specifically, after receiving the initial review labeling results sent by the two or more initial review labeling units 2221 for the same information, the result auditing unit 2222 first determines whether those results are consistent. If the determination result is yes, that is, the initial review labeling results sent by the initial review labeling units 2221 for the same information are all the same, the initial review labeling result of the information is taken as the final labeling result. If the determination result is no, that is, the initial review labeling results received for the same information differ, the initial review labeling results of the information are sent to the review labeling unit 2223 for labeling.
Optionally, to further improve the accuracy of the final labeling result, the result auditing unit 2222 may send the initial review labeling results to the review labeling unit 2223 even when the determination result is yes, so that consistent initial review labeling results are rechecked as well. Alternatively, the information whose initial review labeling results are consistent may be sampled, and only the sampled initial review labeling results sent to the review labeling unit 2223. The manner of rechecking consistent initial review labeling results can be set by those skilled in the art according to the actual situation, as long as it effectively improves the accuracy of the initial review labeling results.
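The consistency check and the optional sampling of consistent results can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the result auditing unit 2222: the names `AuditDecision`, `audit_initial_results`, and `sample_rate` are hypothetical, and labels are assumed to be plain strings compared for equality.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AuditDecision:
    final_label: Optional[str]  # set when the initial labels are accepted as final
    needs_review: bool          # True when the item goes to the review labeling unit


def audit_initial_results(initial_labels: List[str],
                          sample_rate: float = 0.0,
                          rng: Optional[random.Random] = None) -> AuditDecision:
    """Compare the initial review labels produced for one piece of information."""
    if len(initial_labels) < 2:
        raise ValueError("at least two initial review labels are required")
    rng = rng or random.Random()
    if len(set(initial_labels)) > 1:
        # The initial reviewers disagree: the review labeling unit decides.
        return AuditDecision(final_label=None, needs_review=True)
    # The initial reviewers agree: optionally sample the item for a recheck
    # (sample_rate is an assumed parameter; 0.0 disables sampling).
    if rng.random() < sample_rate:
        return AuditDecision(final_label=None, needs_review=True)
    return AuditDecision(final_label=initial_labels[0], needs_review=False)
```

With `sample_rate=1.0` every consistent item is rechecked, which corresponds to the "always send to review" option; intermediate values correspond to the sampling option.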
Further, the result auditing unit 2222 is also configured to count and query the labeling statistical results corresponding to the initial review labeling units 2221 in the classification labeling sub-module 222. Specifically, after receiving the initial review labeling results sent by the initial review labeling units 2221, the result auditing unit 2222 counts them and obtains the corresponding labeling statistical result. The labeling statistical result comprises at least one of the following: the total number of labeled pieces of information, the number of reviewed pieces among the labeled information, the number of correctly labeled pieces, and the number of incorrectly labeled pieces. The result auditing unit 2222 may obtain the labeling statistical result in real time, or an acquisition period may be set, in which case the result auditing unit 2222 periodically obtains the labeling statistical results of each classification labeling sub-module, initial review labeling unit, and/or review labeling unit according to the set period. The labeling work of each labeling unit within a given time range can then be evaluated from these statistics, and the required quality indicators calculated, such as the number of labeled pieces of information, the labeling error rate, the labeling accuracy, and the review rate.
Moreover, after counting each labeling statistical result, the result auditing unit 2222 may further draw a chart of the counts (for example, a bar chart showing the number of labeled pieces in each case, or a pie chart showing the proportion of each labeling statistical result in the total).
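As a rough illustration of how the quality indicators might be derived from the four counts, consider the sketch below. The text does not give formulas, so the definitions used here (each rate is the corresponding count divided by the total number of labeled pieces) are assumptions, and the class name `LabelingStats` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LabelingStats:
    total: int     # total number of labeled pieces of information
    reviewed: int  # pieces that went through review
    correct: int   # pieces labeled correctly
    wrong: int     # pieces labeled incorrectly

    def _ratio(self, count: int) -> float:
        # Guard against division by zero when nothing has been labeled yet.
        return count / self.total if self.total else 0.0

    @property
    def accuracy(self) -> float:
        return self._ratio(self.correct)

    @property
    def error_rate(self) -> float:
        return self._ratio(self.wrong)

    @property
    def review_rate(self) -> float:
        return self._ratio(self.reviewed)


# Example counts for one initial review labeling unit (made-up numbers).
stats = LabelingStats(total=200, reviewed=30, correct=180, wrong=20)
```

The same record could be kept per sub-module, per initial review labeling unit, or per review labeling unit, and refreshed either in real time or once per acquisition period.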
Finally, the review labeling unit 2223 is configured to review the information provided by the result auditing unit 2222 and determine a review labeling result for the information. The determined review labeling result is the final labeling result of the information. Specifically, the review labeling unit 2223 receives the initial review labeling results sent by the result auditing unit 2222 and reviews and corrects them, thereby obtaining a review labeling result with higher accuracy, which is taken as the final labeling result of the information. Because the final labeling result is determined through at least two rounds of checking, the information labeling system can effectively ensure the accuracy of information labeling; at the same time, the cooperation among the units effectively improves labeling efficiency. Furthermore, to ensure labeling accuracy, those skilled in the art may add further auditing levels for the initial review labeling results, for example a second or third level of auditing, so that accuracy is ensured through additional levels of review.
When the classification labeling sub-module 222 determines that the information type of the information distributed by the information distribution sub-module 221 does not match the type of the information classification result corresponding to the classification labeling sub-module 222 (that is, when the determination result is no), this indicates that the information classification result output by the information classification model is incorrect and/or that the information distribution sub-module distributed the information incorrectly. The classification labeling sub-module 222 then sends the information distribution sub-module 221 an information classification error correction message for redistributing the information. The information classification error correction message comprises the information identifier of the information and the information type of the information as re-judged by the labeling personnel. Specifically, the information identifier may be any identification information capable of uniquely identifying the information, such as the sorting code of the information in the system, and the information type is the re-labeled information type obtained in the classification labeling sub-module 222. This judgment process feeds back and corrects wrong information classification results output by the information classification model, so that the accuracy of the information classification model is continuously improved and the acquired information is preliminarily classified more efficiently.
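The match judgment and the resulting error correction message can be sketched as below. This is a hedged illustration only: `CorrectionMessage` and `check_distribution` are hypothetical names, and representing a sub-module's supported types as a set of strings is an assumption. The two message fields (information identifier and re-judged information type) follow the description above.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class CorrectionMessage:
    info_id: str    # identifier that uniquely identifies the information
    info_type: str  # information type as re-judged by the labeling personnel


def check_distribution(info_id: str,
                       judged_type: str,
                       submodule_types: Set[str]) -> Optional[CorrectionMessage]:
    """Return None when the item belongs to this sub-module, else a correction message."""
    if judged_type in submodule_types:
        return None  # distribution was correct; labeling can proceed
    # Mismatch: ask the distribution sub-module to redistribute the item.
    return CorrectionMessage(info_id=info_id, info_type=judged_type)
```

A caller would label the item when `None` is returned and otherwise forward the message to the information distribution sub-module.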
Correspondingly, because the information classification model may classify information incorrectly or inaccurately, the information classification module 21 is further configured to: acquire the information classification error correction messages sent by the classification labeling sub-modules 222 and correct the information classification model according to them; and/or acquire the labeling results obtained by each classification labeling sub-module 222 and correct the information classification model according to those labeling results.
Specifically, when receiving an information classification error correction message sent by a classification labeling sub-module 222, the information classification module 21 passes the message to the information classification model. The message contains the information identifier and the information type of the information concerned. After receiving the message, the information classification model looks up the corresponding information according to the information identifier and corrects the erroneous classification of that information according to the information type contained in the message, thereby completing the correction of the information classification model. Alternatively, after each classification labeling sub-module 222 finishes labeling the information, the information classification module 21 obtains the labeling results from each classification labeling sub-module 222 and corrects the information classification model according to them.
The information classification module 21 may obtain the labeling results in various ways. For example, they may be obtained periodically, such as once every 3 minutes; or obtained in real time, for example by providing a real-time acquisition module that monitors and obtains the labeling results of each classification labeling sub-module 222 as they are produced; or an acquisition mode setting module may be provided in which a user sets the acquisition mode or acquisition frequency as needed, so that the user's requirements for acquiring the labeling results of different information are met. The obtained labeling results also include the information identifier and the information type. Therefore, after the information classification module 21 obtains the labeling results of each classification labeling sub-module 222, it corrects the information classification model in the same way as described above for information classification error correction messages, and details are not repeated here. The manner of obtaining the labeling results of each classification labeling sub-module 222 may be set by a person skilled in the art according to actual requirements, and the present invention is not limited in this respect.
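The lookup-and-correct step can be illustrated with the sketch below. It assumes, purely for illustration, that the classification results are kept in a dictionary keyed by information identifier and that each correction is an (identifier, type) pair; the function name `apply_corrections` is hypothetical. How the corrected records are then used to update the model itself is not specified in the text and is left out here.

```python
from typing import Dict, Iterable, List, Tuple


def apply_corrections(predicted: Dict[str, str],
                      corrections: Iterable[Tuple[str, str]]) -> List[str]:
    """Overwrite stored classification results with the corrected types.

    Returns the identifiers whose stored type actually changed.
    """
    changed: List[str] = []
    for info_id, corrected_type in corrections:
        if info_id not in predicted:
            continue  # unknown identifier: nothing to correct
        if predicted[info_id] != corrected_type:
            predicted[info_id] = corrected_type
            changed.append(info_id)
    return changed
```

The same function can be fed either with error correction messages or with (identifier, type) pairs taken from periodically acquired labeling results, since both carry the same two fields.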
The auditing statistics module 23 is used for counting and querying the labeling statistical results of each classification labeling sub-module, initial review labeling unit, and/or review labeling unit. The labeling statistical result comprises at least one of the following: the total number of labeled pieces of information, the number of reviewed pieces among the labeled information, the number of correctly labeled pieces, and the number of incorrectly labeled pieces. The auditing statistics module 23 may obtain the labeling statistical result in real time, or an acquisition period may be set, in which case the auditing statistics module 23 periodically obtains the labeling statistical results of each classification labeling sub-module, initial review labeling unit, and/or review labeling unit according to the set period. The labeling work of each labeling unit within a given time range can then be evaluated from these statistics, and the required quality indicators calculated, such as the number of labeled pieces of information, the labeling error rate, the labeling accuracy, and the review rate. Moreover, after counting each labeling statistical result, the auditing statistics module 23 may further draw a chart of the counts (for example, a bar chart showing the number of labeled pieces in each case, or a pie chart showing the proportion of each labeling statistical result in the total).
The training module 24 is an optional module adapted to train a classifier for classifying information according to the labeling results obtained by each classification labeling sub-module 222. Specifically, after each classification labeling sub-module 222 finishes labeling the distributed information, the training module 24 obtains the labeling results from each classification labeling sub-module 222 and uses them as training data; training on this data yields a classifier with higher classification precision, so that information can be classified more finely. In other words, when the information labeling system further includes the training module 24, the labeled information produced by the information classification module 21 and the information labeling module 22 can be automatically added to a training corpus, and a classifier for classifying information can be obtained by training on that corpus. The classifier differs from the above-mentioned information classification model mainly as follows. The classification mode and the number of classes of the information classification model are determined by the number and types of classification labeling sub-modules contained in the information labeling module, and its function is to route the information to be labeled accurately to professionals in the relevant field so that the information is labeled accurately.
The classification mode and the number of classes of the classifier, by contrast, are determined by actual project requirements (for example, the classes can be divided into more levels and more detailed types), and its function is to classify online information accurately according to those requirements. Providing the training module in the information labeling system therefore allows a classifier to be generated automatically from the labeled information, which is convenient for the user. Of course, the training module may be omitted when no classifier needs to be generated from the labeled information.
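To make the idea of "labeling results as training data" concrete, here is a deliberately simple sketch. The text does not name a training algorithm, so the token-count scoring used below is a stand-in chosen for brevity, not the method of the training module 24; the function names `train` and `classify` and the pre-tokenized input format are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def train(corpus: Iterable[Tuple[List[str], str]]) -> Dict[str, Counter]:
    """Count how often each token appears under each final label."""
    model: Dict[str, Counter] = defaultdict(Counter)
    for tokens, label in corpus:
        model[label].update(tokens)
    return dict(model)


def classify(model: Dict[str, Counter], tokens: List[str]) -> str:
    """Pick the label whose training tokens overlap most with the input."""
    scores = {label: sum(counts[t] for t in tokens)
              for label, counts in model.items()}
    return max(scores, key=scores.get)


# Final labeling results collected from the sub-modules (made-up examples).
corpus = [
    (["tank", "army"], "military"),
    (["stock", "bond"], "finance"),
    (["army", "navy"], "military"),
]
model = train(corpus)
```

In practice the set of labels passed to `train` would follow the project's own, possibly finer-grained, class hierarchy rather than the coarse types used for routing.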
Therefore, in the information labeling system provided by the invention, the information classification module 21 first inputs each piece of acquired information into a preset information classification model and adds a corresponding classification label to each piece of acquired information according to the information classification result output by the information classification model. The information distribution sub-module 221 in the information labeling module 22 then distributes each piece of labeled information to the corresponding classification labeling sub-module 222. Finally, the distributed information is labeled by the at least two initial review labeling units 2221, the result auditing unit 2222, and the review labeling unit 2223 included in each classification labeling sub-module 222. After labeling is completed, the auditing statistics module 23 counts and queries the labeling statistical results of each classification labeling sub-module, initial review labeling unit, and/or review labeling unit, and the training module 24 trains a classifier for classifying information according to the labeling results obtained by each classification labeling sub-module 222.
Therefore, in the scheme provided by the invention, on the one hand, the acquired information can be preliminarily classified by a preset information classification model instead of by manual reading, and labeling is performed in groups, which effectively saves human resources and improves labeling efficiency. During use, the information classification model is continuously corrected according to the information classification error correction messages fed back by labeling personnel, so that its classification results become increasingly accurate and provide a solid basis for the accuracy of subsequent labeling. On the other hand, when information is labeled, a multi-layer labeling strategy of initial review followed by review further ensures the accuracy and professionalism of labeling and effectively improves the efficiency and quality of classification labeling. In addition, the scheme of this embodiment can count the number of labeled pieces of information for each case that occurs during labeling and provide aggregate figures for each labeling result, making it easy for the user to assess the quality of the labeling results and related information. Finally, the labeling results can also serve as training data for training a classifier with higher classification precision, which is convenient for the user.
EXAMPLE III
Fig. 3 shows a flowchart of an information annotation method according to a third embodiment of the present invention. As shown in fig. 3, the method comprises the steps of:
Step S310: inputting each piece of acquired information into a preset information classification model.
The preset information classification model may be established in various ways, for example by acquiring a large amount of information from the internet as corpus data and performing machine learning and training on that corpus data. Specifically, the preset information classification model can extract keywords from the input information or generate them (that is, generate them according to the semantics of each sentence in the input information and the semantic relationships between the sentences), analyze and judge the information type of the input information according to the keywords, and output an information classification result for the input information according to the judgment. The information classification result is information that indicates the information type of the input information. For example, if the input information is of the military type, the information classification result may be any expression containing "military", such as "military type" or "military".
Specifically, in this step, the acquired information is input into the preset information classification model, and the information classification result corresponding to the information is obtained from the model, so that in the subsequent step (step S320) a corresponding classification label can be added to each piece of acquired information according to the obtained information classification result.
Step S320: adding a corresponding classification label to each piece of acquired information according to the information classification result output by the information classification model.
Specifically, after the information classification result of each piece of information is obtained in step S310, a corresponding classification label is added to that piece of information according to the information classification result. The classification label may be the information classification result itself, or may contain information corresponding to the information classification result. For example, if the information classification result is "military class", the classification label may be any similar expression containing "military". In a specific implementation, the classification label may be "finance class", "military class", "sports class", "entertainment class", or the like, that is, any label that indicates the type of the information.
Step S330: distributing each piece of information to which a classification label has been added to the corresponding classification labeling submodule, so that each classification labeling submodule labels the information distributed to it.
There are a plurality of classification labeling submodules, and each classification labeling submodule corresponds to at least one type of information classification result. Specifically, after a classification label has been added to each piece of information in step S320, each piece of labeled information is distributed to the corresponding classification labeling submodule. In the distribution process, a mapping relation can be established between the pieces of information and the classification labeling submodules according to the information classification results, and each piece of labeled information is distributed according to the established mapping relation. The mapping relation may be one-to-one, one-to-many, many-to-one, or many-to-many, or a user may define a custom mapping relation according to the actual situation. In short, the mapping relation may be set by a person skilled in the art according to the actual situation, and the present invention is not limited thereto. After receiving the distributed information, the classification labeling submodule labels it, thereby achieving classified labeling of the received information. In addition, to improve the accuracy of the labeling process, a verification step for the labeling result can be provided, that is, the label of each piece of information is verified to ensure that it is accurate. The number of labels for each piece of information and the labeling mode may be set by those skilled in the art according to actual conditions, and the present invention is not limited in this respect.
Therefore, in the information labeling method provided by the invention, each piece of acquired information is first input into a preset information classification model; a corresponding classification label is then added to each piece of acquired information according to the information classification result output by the information classification model; and finally each piece of labeled information is distributed to the corresponding classification labeling submodule, which labels the information distributed to it. The scheme of the invention can therefore preliminarily classify the acquired information with a preset information classification model instead of by manual reading, which effectively saves human resources and reduces the burden on labeling personnel. In addition, because classification labels are added according to the result of the preliminary classification and the information is distributed to classification labeling submodules of the corresponding types, the efficiency and accuracy of classification labeling are improved, which is convenient for users.
EXAMPLE IV
Fig. 4 shows a flowchart of an information annotation method according to the fourth embodiment of the present invention. As shown in fig. 4, the method comprises the steps of:
Step S410: inputting each piece of acquired information into a preset information classification model.
Specifically, the preset information classification model may be established in various ways, for example by acquiring a large amount of information from the internet as corpus data and performing machine learning and training on that corpus data. The preset information classification model can extract keywords from the input information or generate them (that is, generate them according to the semantics of each sentence in the input information and the semantic relationships between the sentences), analyze and judge the information type of the input information according to the keywords, and output an information classification result for the input information according to the judgment. The information classification result is information that indicates the information type of the input information. For example, if the input information is of the military type, the information classification result may be any expression containing "military", such as "military type" or "military".
Specifically, in this step, after the information is acquired, it is first input into the preset information classification model, and the information classification result corresponding to the information is obtained from the model, so that in the subsequent step (step S420) a corresponding classification label can be added to each piece of acquired information according to the information classification result.
Step S420: adding a corresponding classification label to each piece of acquired information according to the information classification result output by the information classification model.
The classification label added to each piece of information comprises the information classification result of the information and a preset number of keywords extracted from the information. Specifically, the information classification result is the result output by the preset information classification model in step S410. The preset number of keywords are the keywords that the preset information classification model extracts or generates (that is, generates according to the semantics of each sentence in the input information and the semantic relationships among the sentences) after the information is input into it. The keywords are provided to the corresponding classification labeling sub-module in the subsequent step (step S430) for display during labeling. That is, the provided keywords can be displayed in the classification labeling submodule to help quickly lock onto the label or label range of the information being labeled, which saves labeling time and improves labeling efficiency. The number and order of the extracted and displayed keywords may be set by those skilled in the art according to the needs of the actual labeling process. For example, the number of extracted keywords may be 5, 8, 10, and so on; for the display order, a weight may be assigned to each keyword according to criteria such as its frequency or density of occurrence, and the keywords then ordered by weight.
In short, the invention does not limit the number or order of the extracted and displayed keywords, as long as the different requirements of the labeling work are conveniently met and the efficiency of the labeling process is effectively improved.
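One possible way to order keywords by a frequency-based weight and keep a preset number of them is sketched below. It assumes the candidate keywords are already available as a token list and uses raw occurrence counts as the weight; the names `ClassificationLabel` and `top_keywords` are hypothetical, and the alphabetical tie-break is an added assumption to make the order deterministic.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ClassificationLabel:
    result: str          # information classification result, e.g. "military"
    keywords: List[str]  # keywords in display order


def top_keywords(tokens: List[str], limit: int = 5) -> List[str]:
    """Rank candidate keywords by frequency of occurrence, highest first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:limit]


tokens = ["tank", "army", "tank", "navy", "army", "tank"]
label = ClassificationLabel(result="military", keywords=top_keywords(tokens, limit=2))
```

A density-based weight or any other criterion could replace the raw count by changing only the sort key.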
Step S430: distributing each piece of information to which a classification label has been added to the corresponding classification labeling submodule, so that each classification labeling submodule labels the information distributed to it.
Specifically, each classification labeling submodule corresponds to at least one type of information classification result. When the labeled information is distributed, a mapping relation can therefore be established between the pieces of information and the classification labeling submodules according to the classification labels, and each piece of labeled information distributed according to the established mapping relation. The mapping relation may be one-to-one, one-to-many, many-to-one, or many-to-many, or a custom mapping selection window may be provided in which a user sets a custom mapping relation according to the actual situation. In short, the mapping relation may be set by a person skilled in the art according to the actual situation, and the present invention is not limited thereto.
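Distribution according to a mapping relation can be sketched as follows. The mapping table, the submodule names, and the `distribute` function are all hypothetical, and sending labels that have no mapping to a fallback queue is an assumption rather than something the text specifies.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping relation: classification label -> submodule name(s).
# A label may map to several submodules (one-to-many), and several labels
# may share one submodule (many-to-one).
MAPPING: Dict[str, List[str]] = {
    "military": ["submodule_A"],
    "finance": ["submodule_B"],
    "sports": ["submodule_C"],
    "entertainment": ["submodule_C"],
}


def distribute(items: Iterable[Tuple[str, str]],
               mapping: Dict[str, List[str]],
               fallback: str = "unassigned") -> Dict[str, List[str]]:
    """Group information identifiers by the submodule(s) their label maps to."""
    queues: Dict[str, List[str]] = defaultdict(list)
    for info_id, label in items:
        # Labels without a mapping go to a fallback queue (an assumption).
        for submodule in mapping.get(label, [fallback]):
            queues[submodule].append(info_id)
    return dict(queues)


queues = distribute(
    [("a", "military"), ("b", "sports"), ("c", "entertainment"), ("d", "weather")],
    MAPPING,
)
```

A user-defined mapping relation would simply be a different dictionary passed as `mapping`.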
When each classification labeling submodule labels the distributed information, the process may specifically be as follows. The received information is first judged, that is, it is determined whether the information type of the received information matches the type of the information classification result corresponding to the classification labeling submodule. If the judgment result is yes, the information classification result corresponding to the classification labeling submodule is consistent with the information classification result output by the information classification model, and the distributed information is labeled. If the judgment result is no, the information classification result corresponding to the classification labeling submodule is inconsistent with the information classification result output by the information classification model, and an information classification error correction message for redistributing the information is sent. The information classification error correction message comprises the information identifier and the information type of the information.
When the information type of the received information matches the type of the information classification result corresponding to the classification labeling submodule, that is, when the judgment result is yes, the information is labeled as follows: at least two initial review labeling units label the distributed information to obtain initial review labeling results for the information; when the initial review labeling results obtained by the at least two initial review labeling units for the same piece of information differ, the information is reviewed and a review labeling result for the information is determined. Specifically, information distributed to a classification labeling submodule is first passed to each of its initial review labeling units. At least two initial review labeling units are used so that a labeling mistake does not go undetected, as it could if only a single labeling result were relied on; this improves the accuracy of the labeling result.
To improve labeling speed and efficiency, the keywords that the information classification model extracted from the information can be shown in a display interface during labeling. The annotator may select suitable displayed keywords directly as the initial review labeling result, or may refer to the displayed keywords when composing a label. Either way, the displayed keywords let the annotator quickly pin down the type and scope of the information, which improves labeling efficiency.
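The classification label that carries these keywords can be pictured as a small record holding the classification result plus a preset number of keywords. The following is a minimal sketch only: the source does not say how keywords are extracted, so the word-frequency heuristic, the stop-word list and the names `ClassificationLabel`, `extract_keywords` and `build_label` are all assumptions made for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "for", "on"}


@dataclass
class ClassificationLabel:
    category: str        # information classification result
    keywords: List[str]  # preset number of keywords shown to the annotator


def extract_keywords(text: str, count: int) -> List[str]:
    """Return up to `count` of the most frequent non-stop-words (assumed heuristic)."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(count)]


def build_label(text: str, category: str, keyword_count: int = 3) -> ClassificationLabel:
    """Bundle the model's category with the keywords to display during labeling."""
    return ClassificationLabel(category, extract_keywords(text, keyword_count))


label = build_label("The team won the match and the team celebrated the win", "sports", 2)
```

An annotator-facing interface would then show `label.keywords` next to the text and let the annotator pick one or use them as a hint.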
After the initial review labeling results for a piece of information are obtained, it is judged whether the results obtained for that information are consistent. If they are, the initial review labeling result is taken as the final labeling result; if they are not, the information is reviewed and the review labeling result is taken as the final labeling result. Optionally, to further increase the accuracy of the final labeling result, consistent initial review labeling results may also be reviewed, either all of them or a sample of them. How consistent initial review labeling results are reviewed may be set by a person skilled in the art according to actual conditions; the present invention does not limit it, provided the accuracy of the initial review labeling results is effectively improved. The final labeling result is thus determined through at least two rounds of auditing and checking, so the information labeling system ensures labeling accuracy while the cooperation between the units keeps labeling efficient.
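The consistency check and the fall-back to review can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function name `resolve_final_label` and the idea of passing the review labeling unit in as a callable are assumptions, and the optional sampling of consistent results is left out.

```python
from typing import Callable, Sequence


def resolve_final_label(
    initial_labels: Sequence[str],
    review: Callable[[Sequence[str]], str],
) -> str:
    """Return the final labeling result for one piece of information.

    `initial_labels` holds one result per initial review labeling unit;
    `review` stands in for the review labeling unit (hypothetical callable).
    """
    if len(initial_labels) < 2:
        raise ValueError("at least two initial review labeling results are required")
    if len(set(initial_labels)) == 1:
        # All initial reviewers agree: their result becomes the final result.
        return initial_labels[0]
    # Reviewers disagree: the review labeling result becomes the final result.
    return review(initial_labels)


# Usage: the reviewer here is a stand-in that always answers "finance".
agreed = resolve_final_label(["sports", "sports"], lambda labels: "finance")
disputed = resolve_final_label(["sports", "finance"], lambda labels: "finance")
```

Passing the reviewer as a callable keeps the consensus rule separate from how the review itself is carried out, which the source leaves open.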
When the information type of the received information does not match the type of the information classification result corresponding to the classification labeling submodule, that is, when the judgment result is no, an information classification error correction message for redistributing the information is sent; the message comprises an information identifier and an information type of the information. The information identifier may be any identification that uniquely identifies the information, such as its serial number in the system, and the information type is the corrected type assigned to the information in the classification labeling submodule. This judgment feeds wrong classification results output by the information classification model back for correction, so that the model's classification accuracy improves continuously and the preliminary classification of acquired information becomes more efficient.
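A sketch of the match check and of the error correction message it produces is given below. The message fields follow the text (an information identifier and an information type); the names `ErrorCorrectionMessage` and `check_assignment`, and the representation of a submodule's types as a set of strings, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class ErrorCorrectionMessage:
    info_id: str    # uniquely identifies the information, e.g. a serial number
    info_type: str  # corrected type assigned in the labeling submodule


def check_assignment(
    info_id: str,
    observed_type: str,
    submodule_types: Set[str],
) -> Optional[ErrorCorrectionMessage]:
    """Return None when the information belongs in this submodule,
    otherwise a message asking for the information to be redistributed."""
    if observed_type in submodule_types:
        return None  # judgment result is yes: proceed to labeling
    return ErrorCorrectionMessage(info_id, observed_type)


# Usage: a sports submodule receives an item that is really financial news.
ok = check_assignment("000123", "sports", {"sports"})
message = check_assignment("000124", "finance", {"sports"})
```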
The information classification error correction messages sent by the classification labeling submodules are acquired, and the information classification model is corrected according to them; and/or the labeling results obtained by the classification labeling submodules are acquired, and the information classification model is corrected according to them.
Specifically, when an information classification error correction message sent by a classification labeling submodule is received, it is forwarded to the information classification model. The message comprises the information identifier and the information type of the misclassified information. After receiving the message, the information classification model looks up the corresponding information by the information identifier and corrects its classification according to the information type contained in the message, which completes the correction of the information classification model. Alternatively, after the classification labeling submodules have labeled the information, the labeling results obtained in each submodule are acquired and the information classification model is corrected according to them. The labeling results can be acquired in various ways: periodically, for example once every 3 minutes; in real time, by monitoring the labeling results of each classification labeling submodule; or in a user-defined manner, in which the user sets the acquisition mode or frequency to suit the labeling results of different information. The manner of acquiring the labeling results of the classification labeling submodules may be set by a person skilled in the art according to actual needs, and the present invention does not limit it.
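The lookup-and-correct step can be sketched as below. The source does not specify how the model itself is updated, so this sketch only corrects the stored type of each identified item, on the assumption that the model can later be refreshed from those corrected records; the name `apply_corrections` and the dictionary store are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def apply_corrections(
    assigned_types: Dict[str, str],
    corrections: Iterable[Tuple[str, str]],
) -> int:
    """Overwrite the stored type for each (info_id, corrected_type) pair.

    Returns the number of records actually changed. Unknown identifiers
    are skipped rather than added (an assumption of this sketch).
    """
    changed = 0
    for info_id, corrected_type in corrections:
        if info_id in assigned_types and assigned_types[info_id] != corrected_type:
            assigned_types[info_id] = corrected_type
            changed += 1
    return changed


# Usage: item 000124 was routed to "sports" but is really "finance".
store = {"000123": "sports", "000124": "sports"}
changed_count = apply_corrections(store, [("000124", "finance"), ("999999", "news")])
```

The same function could be called from a periodic job (for example every 3 minutes) or on each incoming message, matching the acquisition modes described above.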
Step S440: counting and querying the labeling statistical results corresponding to each initial review labeling unit in each classification labeling submodule; and/or counting and querying the labeling statistical results of each classification labeling submodule, each initial review labeling unit and/or each review labeling unit.
Specifically, after the initial review labeling results of the initial review labeling units in each classification labeling submodule are obtained, the labeling statistical results corresponding to each initial review labeling unit are counted and can be queried. The labeling statistical results comprise at least one of the following: the total number of labeled pieces of information, the number of labeled pieces of information that were reviewed, the number labeled correctly, and the number labeled incorrectly. The statistics may be compiled in real time, or an acquisition period may be set and the labeling statistical results of each initial review labeling unit acquired once per period, so that the labeling of each initial review labeling unit within a given time range can be evaluated and quality indicators such as the amount of labeled information, the labeling error rate, the labeling accuracy rate and the review rate can be calculated. After the statistics are compiled, a chart of the labeling counts may be drawn from them, for example a bar chart showing the number of labels in each situation, or a pie chart showing the proportion each labeling statistical result contributes to the total.
Alternatively, after the final labeling results in each classification labeling submodule are obtained, the labeling statistical results of each classification labeling submodule, each initial review labeling unit and/or each review labeling unit are counted and can be queried. The statistics and queries may be performed in real time or once per set period, so that the labeling of each unit within a given time range can be evaluated and quality indicators such as the amount of labeled information, the labeling error rate, the labeling accuracy rate and the review rate can be calculated. A chart of the labeling counts may likewise be drawn from the statistics, for example a bar chart showing the number of labels in each situation, or a pie chart showing the proportion each labeling statistical result contributes to the total.
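The four counts and the derived rates can be sketched for one initial review labeling unit as follows. The source names the indicators but does not define them, so the definitions here are assumptions: a label counts as correct when it equals the final labeling result, and each rate is the corresponding count divided by the total. `LabelingStats` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class LabelingStats:
    total: int = 0     # total number of labeled pieces of information
    reviewed: int = 0  # how many of them went to review
    correct: int = 0   # initial label equals the final label
    wrong: int = 0     # initial label differs from the final label

    def _rate(self, count: int) -> float:
        return count / self.total if self.total else 0.0

    @property
    def accuracy_rate(self) -> float:
        return self._rate(self.correct)

    @property
    def error_rate(self) -> float:
        return self._rate(self.wrong)

    @property
    def review_rate(self) -> float:
        return self._rate(self.reviewed)


def summarize(records: Iterable[Tuple[str, str, bool]]) -> LabelingStats:
    """Each record is (initial_label, final_label, was_reviewed)."""
    stats = LabelingStats()
    for initial_label, final_label, was_reviewed in records:
        stats.total += 1
        if was_reviewed:
            stats.reviewed += 1
        if initial_label == final_label:
            stats.correct += 1
        else:
            stats.wrong += 1
    return stats


stats = summarize([
    ("sports", "sports", False),
    ("sports", "finance", True),
    ("finance", "finance", True),
    ("news", "news", False),
])
```

The counts in `stats` are the values a bar chart or pie chart of the kind mentioned above would plot.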
Step S450: training a classifier for classifying information according to the labeling results obtained by the classification labeling submodules.
Specifically, after the classification labeling submodules finish labeling the distributed information, the labeling results in each submodule are acquired and used as training data, that is, the labeled information is used to train a classifier with higher classification precision, so that information can be classified more finely. This step is optional and may be performed or omitted according to actual needs. In this step, the labeled information can be added automatically to a training corpus, and a classifier for classifying information is obtained by training on that corpus. The classifier differs from the information classification model described above, mainly as follows. The classification scheme and number of classes of the information classification model are determined by the number and types of classification labeling submodules contained in the information labeling module, and its function is to route the information to be labeled to professionals in the relevant field so that the information is labeled accurately. The classification scheme and number of classes of the classifier are determined by actual project requirements (for example, more levels and finer types may be used), and its function is to classify online information accurately as the project requires. The classifier can thus be generated automatically from the labeled information in this step, which is convenient for the user. Of course, this step may be omitted when there is no need to generate a classifier from the labeled information.
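As an illustration of training a classifier from the labeled corpus, the sketch below fits a small multinomial naive Bayes model with add-one smoothing on (text, final label) pairs. The source does not name a learning algorithm; naive Bayes is substituted here only because it is compact, and the class name, the whitespace tokenizer and the example labels are assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import Iterable, Tuple


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, corpus: Iterable[Tuple[str, str]]) -> "NaiveBayesClassifier":
        """Train on (text, final_label) pairs taken from the labeling results."""
        for text, label in corpus:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def predict(self, text: str) -> str:
        if not self.class_counts:
            raise ValueError("classifier has not been trained")
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = "", -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Labeled corpus with finer-grained types than the routing categories.
corpus = [
    ("team wins championship match", "sports/football"),
    ("striker scores late goal", "sports/football"),
    ("central bank raises interest rates", "finance/policy"),
    ("stock market closes higher", "finance/markets"),
]
classifier = NaiveBayesClassifier().fit(corpus)
```

Newly labeled items can be appended to `corpus` and the classifier refitted, which mirrors the automatic addition of labeled information to the training corpus.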
In summary, in the information labeling method provided by the invention, the acquired information is first input into a preset information classification model, and a corresponding classification label is added to each piece of acquired information according to the information classification result output by the model. Each piece of information with its classification label is then distributed to the corresponding classification labeling submodule, which labels it. After labeling, the labeling statistical results corresponding to each initial review labeling unit in each classification labeling submodule are counted and queried, and/or the labeling statistical results of each classification labeling submodule, initial review labeling unit and/or review labeling unit are counted and queried. Finally, a classifier for classifying information is trained according to the labeling results obtained by the classification labeling submodules.
In the scheme provided by the invention, on the one hand, the acquired information is preliminarily classified by a preset information classification model rather than by manual reading, and is then labeled in groups, which saves human resources and improves labeling efficiency. During use, the information classification model is continuously corrected according to the information classification error correction messages fed back by annotators, so its classification results become increasingly accurate, which in turn supports the accuracy of subsequent labeling. On the other hand, the multi-layer labeling strategy of initial review followed by review further ensures that labeling is accurate and professional, and improves both the efficiency and the quality of classified labeling. The scheme of this embodiment can also count the amount of labeled information in each situation that arises during labeling, giving an overall quantitative picture of the labeling results from which the user can assess their quality. Finally, the labeling results can serve as training data for a classifier with higher classification precision, which is convenient for the user.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components of the information labeling system according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.
The invention discloses: A1, an information labeling system, comprising: an information classification module and an information labeling module, wherein,
the information classification module is suitable for inputting each piece of acquired information into a preset information classification model, adding a corresponding classification label to each piece of acquired information according to the information classification result output by the information classification model, and providing each piece of information with its classification label to the information labeling module;
the information labeling module further comprises: an information distribution submodule and a plurality of classification labeling submodules, each classification labeling submodule corresponding to at least one type of information classification result;
the information distribution submodule is suitable for distributing each piece of information with its classification label to the corresponding classification labeling submodule; each classification labeling submodule is suitable for labeling the information distributed by the information distribution submodule.
A2, the system according to A1, wherein each classification labeling submodule is specifically configured to:
judge whether the information type of the information distributed by the information distribution submodule matches the type of the information classification result corresponding to the classification labeling submodule;
when the judgment result is yes, label the information distributed by the information distribution submodule; when the judgment result is no, send to the information distribution submodule an information classification error correction message for redistributing the information; the information classification error correction message comprises an information identifier and an information type of the information.
A3, the system according to A2, wherein the information classification module is further adapted to: acquire the information classification error correction messages sent by the classification labeling submodules, and correct the information classification model according to the information classification error correction messages; and/or,
the information classification module is further adapted to: acquire the labeling results obtained by the classification labeling submodules, and correct the information classification model according to the labeling results.
A4, the system according to any one of A1-A3, wherein the added category label in each piece of information includes: the information classification result of the information and a preset number of keywords extracted from the information; the preset number of keywords are used for being provided for the corresponding classification labeling sub-modules so that the classification labeling sub-modules can display the keywords in the labeling process.
A5, the system according to any one of A1-A4, wherein each classification labeling submodule further comprises: at least two initial review labeling units, a result auditing unit and a review labeling unit; wherein,
the at least two initial review labeling units are used for respectively labeling each piece of information distributed by the information distribution submodule to obtain an initial review labeling result corresponding to each piece of information;
the result auditing unit is used for providing a piece of information to the review labeling unit for labeling when the initial review labeling results obtained by the at least two initial review labeling units for that piece of information differ;
and the review labeling unit is used for reviewing the information provided by the result auditing unit and determining a review labeling result corresponding to the information.
A6, the system according to A5, wherein the result auditing unit is further configured to: when the initial review labeling results obtained by the at least two initial review labeling units for the same piece of information are the same, take the initial review labeling result of the information as the final labeling result;
and when the initial review labeling results obtained by the at least two initial review labeling units for the same piece of information differ, take the review labeling result of the information as the final labeling result.
A7, the system according to A5 or A6, wherein the result auditing unit in each classification labeling submodule is further used for counting and querying the labeling statistical results corresponding to each initial review labeling unit in the classification labeling submodule;
and/or, the system further comprises: an audit statistics module used for counting and querying the labeling statistical results of each classification labeling submodule, initial review labeling unit and/or review labeling unit;
wherein the labeling statistical results comprise at least one of: the total number of labeled pieces of information, the number of labeled pieces of information that were reviewed, the number labeled correctly, and the number labeled incorrectly.
A8, the system of any one of A1-A7, wherein the system further comprises:
and the training module is suitable for training a classifier for classifying information according to the labeling result obtained by each classification labeling submodule.
A9, the system of any of A1-A8, wherein the information comprises at least one of: news text information, commentary information, sports information, entertainment information, and financial information.
The invention also discloses: B10, an information labeling method, comprising:
inputting each acquired information into a preset information classification model;
respectively adding corresponding classification labels to the acquired information according to the information classification result output by the information classification model;
and distributing each information added with the classification label to a corresponding classification labeling submodule, and labeling the distributed information by each classification labeling submodule.
B11, the method according to B10, wherein the step of labeling the distributed information by each classification labeling submodule specifically comprises:
judging whether the information type of the distributed information matches the type of the information classification result corresponding to the classification labeling submodule;
when the judgment result is yes, labeling the distributed information; when the judgment result is no, sending an information classification error correction message for redistributing the information; the information classification error correction message comprises an information identifier and an information type of the information.
B12, the method according to B11, wherein the method further comprises the steps of: acquiring the information classification error correction messages sent by the classification labeling submodules, and correcting the information classification model according to the information classification error correction messages; and/or,
acquiring the labeling results obtained by the classification labeling submodules, and correcting the information classification model according to the labeling results.
B13, the method according to any one of B10-B12, wherein the added classification label in each piece of information comprises: the information classification result of the information and a preset number of keywords extracted from the information; the preset number of keywords are used for being provided for the corresponding classification labeling sub-modules so that the classification labeling sub-modules can display the keywords in the labeling process.
B14, the method according to any one of B10-B13, wherein the step of labeling the distributed information by each classification labeling submodule further comprises:
labeling the distributed information through at least two initial review labeling units to obtain initial review labeling results corresponding to the information;
and when the initial review labeling results obtained by the at least two initial review labeling units for the same piece of information differ, reviewing the information and determining a review labeling result corresponding to the information.
B15, the method according to B14, wherein the step of labeling the distributed information by each classification labeling submodule further comprises: when the initial review labeling results obtained by the at least two initial review labeling units for the same piece of information are the same, taking the initial review labeling result of the information as the final labeling result;
and when the initial review labeling results obtained by the at least two initial review labeling units for the same piece of information differ, taking the review labeling result of the information as the final labeling result.
B16, the method according to B14 or B15, wherein the method further comprises the steps of: counting and querying the labeling statistical results corresponding to each initial review labeling unit in each classification labeling submodule;
and/or counting and querying the labeling statistical results of each classification labeling submodule, each initial review labeling unit and/or each review labeling unit;
wherein the labeling statistical results comprise at least one of: the total number of labeled pieces of information, the number of labeled pieces of information that were reviewed, the number labeled correctly, and the number labeled incorrectly.
B17, the method according to any one of B10-B16, wherein after the step of labeling the distributed information by each classification labeling submodule, the method further comprises:
and training a classifier for classifying the information according to the labeling result obtained by each classification labeling submodule.
B18, the method according to any of B10-B17, wherein the information comprises at least one of: news text information, commentary information, sports information, entertainment information, and financial information.
Claims (10)
1. An information labeling system comprising: an information classification module and an information labeling module, wherein,
the information classification module is suitable for inputting each piece of acquired information into a preset information classification model, adding a corresponding classification label to each piece of acquired information according to the information classification result output by the information classification model, and providing each piece of information with its classification label to the information labeling module;
the information labeling module further comprises: an information distribution submodule and a plurality of classification labeling submodules, each classification labeling submodule corresponding to at least one type of information classification result;
the information distribution submodule is suitable for distributing each piece of information with its classification label to the corresponding classification labeling submodule; each classification labeling submodule is suitable for labeling the information distributed by the information distribution submodule.
2. The system of claim 1, wherein each classification labeling submodule is specifically configured to:
judge whether the information type of the information distributed by the information distribution submodule matches the type of the information classification result corresponding to the classification labeling submodule;
when the judgment result is yes, label the information distributed by the information distribution submodule; when the judgment result is no, send to the information distribution submodule an information classification error correction message for redistributing the information; the information classification error correction message comprises an information identifier and an information type of the information.
3. The system of claim 2, wherein the information classification module is further adapted to: acquire the information classification error correction messages sent by the classification labeling submodules, and correct the information classification model according to the information classification error correction messages; and/or,
the information classification module is further adapted to: acquire the labeling results obtained by the classification labeling submodules, and correct the information classification model according to the labeling results.
4. The system of any of claims 1-3, wherein the classification label added to each piece of information includes: the information classification result of the information and a preset number of keywords extracted from the information; the preset number of keywords are provided to the corresponding classification labeling submodule so that the classification labeling submodule can display the keywords during labeling.
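A classification label that carries both the classification result and a preset number of keywords might look like the sketch below. The claim does not say how keywords are extracted; simple term frequency with a small stop-word list is used here purely as an assumption, and `ClassificationLabel`, `extract_keywords`, and `make_label` are hypothetical names.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List

# Assumed stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "on", "for"}


@dataclass
class ClassificationLabel:
    category: str          # information classification result
    keywords: List[str]    # preset number of keywords shown to annotators


def extract_keywords(text: str, k: int) -> List[str]:
    """Return up to k of the most frequent non-stop-words in the text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(k)]


def make_label(text: str, category: str, k: int = 3) -> ClassificationLabel:
    return ClassificationLabel(category, extract_keywords(text, k))


label = make_label(
    "The striker scored and the striker celebrated the goal", "sports", k=2)
```

The labeling interface could then display `label.keywords` next to the text so that an annotator sees the most prominent terms at a glance.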
5. The system of any of claims 1-4, wherein each classification labeling submodule further comprises: at least two primary review labeling units, a result auditing unit, and a review labeling unit; wherein,
the at least two primary review labeling units are each configured to label each piece of information distributed by the information distribution submodule, to obtain a primary review labeling result corresponding to each piece of information;
the result auditing unit is configured to provide a piece of information to the review labeling unit for labeling when the primary review labeling results obtained by the at least two primary review labeling units for that same piece of information differ;
and the review labeling unit is configured to review the information provided by the result auditing unit and determine a review labeling result corresponding to the information.
6. The system of claim 5, wherein the result auditing unit is further configured to: when the primary review labeling results obtained by the at least two primary review labeling units for the same piece of information are the same, take the primary review labeling result of the information as the final labeling result;
and when the primary review labeling results obtained by the at least two primary review labeling units for the same piece of information differ, take the review labeling result of the information as the final labeling result.
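The decision rule of claims 5 and 6 can be illustrated with a short sketch. The function name `final_label` and the idea of passing the review step as a callable are assumptions made for illustration; the claims only state that the review result is used when the primary results differ.

```python
from typing import Callable, Sequence


def final_label(primary_labels: Sequence[str],
                review: Callable[[], str]) -> str:
    """Pick the final labeling result for one piece of information."""
    if len(primary_labels) < 2:
        raise ValueError("at least two primary review labels are required")
    if len(set(primary_labels)) == 1:
        # All primary review units agree: no review is needed.
        return primary_labels[0]
    # Disagreement: the review labeling unit decides.
    return review()


calls = []


def reviewer() -> str:
    calls.append("review")   # record that the review unit was consulted
    return "neutral"


agreed = final_label(["positive", "positive"], reviewer)
disputed = final_label(["positive", "neutral"], reviewer)
```

Passing the review step as a callable means the review unit is consulted only for disputed items, which matches the intent of routing only disagreements to review.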
7. The system according to claim 5 or 6, wherein the result auditing unit in each classification labeling submodule is further configured to compile and query the labeling statistics corresponding to each primary review labeling unit in that classification labeling submodule;
and/or, the system further comprises: an auditing and statistics module configured to compile and query the labeling statistics of each classification labeling submodule, primary review labeling unit, and/or review labeling unit;
wherein the labeling statistics comprise at least one of: the total number of labeled pieces of information, the number of labeled pieces of information that were sent for review, the number of correctly labeled pieces of information, and the number of incorrectly labeled pieces of information.
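The four statistics listed in claim 7 could be tallied per primary review labeling unit as in the sketch below. The `Record` structure and the `annotation_stats` function are hypothetical, and the rule that a primary label counts as correct when it equals the final labeling result is an assumption, since the claim does not define correctness.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    annotator: str       # primary review labeling unit
    primary_label: str   # label given by that unit
    final_label: str     # final labeling result for the item
    reviewed: bool       # True if the item was sent to the review unit


def annotation_stats(records):
    """Tally total, reviewed, correct, and wrong counts per annotator."""
    stats = defaultdict(
        lambda: {"total": 0, "reviewed": 0, "correct": 0, "wrong": 0})
    for r in records:
        s = stats[r.annotator]
        s["total"] += 1
        s["reviewed"] += int(r.reviewed)
        # Assumption: correct means the primary label equals the final label.
        s["correct" if r.primary_label == r.final_label else "wrong"] += 1
    return dict(stats)


records = [
    Record("u1", "sports", "sports", False),
    Record("u1", "sports", "finance", True),
    Record("u2", "finance", "finance", True),
]
stats = annotation_stats(records)
```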
8. The system of any of claims 1-7, wherein the system further comprises:
a training module adapted to train a classifier for classifying information according to the labeling results obtained by the classification labeling submodules.
9. The system of any of claims 1-8, wherein the information comprises at least one of: news text information, commentary information, sports information, entertainment information, and financial information.
10. An information labeling method, comprising:
inputting each piece of acquired information into a preset information classification model;
adding a corresponding classification label to each piece of acquired information according to the information classification result output by the information classification model;
and distributing each piece of information to which a classification label has been added to a corresponding classification labeling submodule, each classification labeling submodule labeling the information distributed to it.
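The three steps of the method in claim 10 can be sketched end to end. Everything named here is an assumption for illustration: `toy_model` is a keyword rule standing in for the preset information classification model, `routes` maps classification results to hypothetical submodule names, and the `"general"` fallback queue is not part of the claim.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def toy_model(text: str) -> str:
    """Stand-in for the preset information classification model."""
    lowered = text.lower()
    if "match" in lowered or "goal" in lowered:
        return "sports"
    if "stock" in lowered or "bank" in lowered:
        return "finance"
    return "news"


def classify_and_distribute(
    items: Iterable[str],
    model: Callable[[str], str],
    routes: Dict[str, str],
) -> Dict[str, List[Tuple[str, str]]]:
    """Classify each item, attach its label, and queue it per submodule."""
    queues = defaultdict(list)
    for text in items:
        category = model(text)                        # classify
        submodule = routes.get(category, "general")   # pick a submodule
        queues[submodule].append((text, category))    # distribute with label
    return dict(queues)


items = [
    "Late goal decides the match",
    "Bank shares lift the stock index",
    "City opens new library",
]
routes = {"sports": "sports_desk", "finance": "finance_desk"}
queues = classify_and_distribute(items, toy_model, routes)
```

Because `routes` is a plain mapping, several classification results can point to the same submodule, which reflects the claim language that each submodule corresponds to at least one type of classification result.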
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710103917.8A CN106897424A (en) | 2017-02-24 | 2017-02-24 | Information labeling system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106897424A (en) | 2017-06-27 |
Family ID: 59184157
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710103917.8A Pending CN106897424A (en) | 2017-02-24 | 2017-02-24 | Information labeling system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106897424A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104216882A (en) * | 2013-05-29 | 2014-12-17 | 华为技术有限公司 | File labeling method and device |
CN103530282A (en) * | 2013-10-23 | 2014-01-22 | 北京紫冬锐意语音科技有限公司 | Corpus tagging method and equipment |
CN104978328A (en) * | 2014-04-03 | 2015-10-14 | 北京奇虎科技有限公司 | Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device |
CN104317891A (en) * | 2014-10-23 | 2015-01-28 | 华为软件技术有限公司 | Method and device for tagging pages |
CN104462614A (en) * | 2015-01-14 | 2015-03-25 | 苏州大学 | Active learning method and device based on network data |
CN105046274A (en) * | 2015-07-13 | 2015-11-11 | 浪潮软件集团有限公司 | Automatic labeling method for electronic commerce commodity category |
CN105117429A (en) * | 2015-08-05 | 2015-12-02 | 广东工业大学 | Scenario image annotation method based on active learning and multi-label multi-instance learning |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107368990A (en) * | 2017-07-31 | 2017-11-21 | 安徽云软信息科技有限公司 | Import and export cargo statistical system |
CN108509969A (en) * | 2017-09-06 | 2018-09-07 | 腾讯科技(深圳)有限公司 | Data labeling method and terminal |
CN108509969B (en) * | 2017-09-06 | 2021-11-09 | 腾讯科技(深圳)有限公司 | Data labeling method and terminal |
CN108536662A (en) * | 2018-04-16 | 2018-09-14 | 苏州大学 | Data labeling method and device |
CN108647319B (en) * | 2018-05-10 | 2021-07-06 | 思派(北京)网络科技有限公司 | Labeling system and method based on short text clustering |
CN108647319A (en) * | 2018-05-10 | 2018-10-12 | 思派(北京)网络科技有限公司 | Labeling system and method based on short text clustering |
CN108959251A (en) * | 2018-06-27 | 2018-12-07 | 北京明略软件系统有限公司 | Label switching method and device and computer readable storage medium |
CN108959251B (en) * | 2018-06-27 | 2022-10-18 | 北京明略软件系统有限公司 | Label switching method and device and computer readable storage medium |
CN109344397A (en) * | 2018-09-03 | 2019-02-15 | 东软集团股份有限公司 | Text feature word extraction method and device, storage medium and program product |
CN109344397B (en) * | 2018-09-03 | 2023-08-08 | 东软集团股份有限公司 | Text feature word extraction method and device, storage medium and program product |
CN109409412A (en) * | 2018-09-28 | 2019-03-01 | 新华三大数据技术有限公司 | Image processing method and device |
CN109492997A (en) * | 2018-10-31 | 2019-03-19 | 四川长虹电器股份有限公司 | Image labeling platform system based on SpringBoot |
CN109829379A (en) * | 2018-12-28 | 2019-05-31 | 广州华多网络科技有限公司 | Information processing method, device, server and storage medium |
CN109829379B (en) * | 2018-12-28 | 2021-09-21 | 广州华多网络科技有限公司 | Information processing method, information processing apparatus, server, and storage medium |
CN109920066A (en) * | 2019-03-05 | 2019-06-21 | 广东三维家信息科技有限公司 | Automatic labeling typesetting method and device |
CN109920066B (en) * | 2019-03-05 | 2023-07-18 | 广东三维家信息科技有限公司 | Automatic labeling typesetting method and device |
CN110297914A (en) * | 2019-06-14 | 2019-10-01 | 中译语通科技股份有限公司 | Corpus labeling method and device |
CN110544467A (en) * | 2019-09-04 | 2019-12-06 | 中国联合网络通信集团有限公司 | Voice data auditing method, device, equipment and storage medium |
CN110717137A (en) * | 2019-10-09 | 2020-01-21 | 北京明略软件系统有限公司 | Message distribution processing method and message system |
CN110717137B (en) * | 2019-10-09 | 2022-08-09 | 北京明略软件系统有限公司 | Message distribution processing method and message system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106897424A (en) | Information labeling system and method | |
US20200184146A1 (en) | Techniques for combining human and machine learning in natural language processing | |
US12039447B2 (en) | Information processing method and terminal, and computer storage medium | |
CN112115299B (en) | Video searching method, video searching device, video recommending method, electronic equipment and storage medium | |
US10565533B2 (en) | Systems and methods for similarity and context measures for trademark and service mark analysis and repository searches | |
CN102768659B (en) | Method and system for identifying repeated account | |
CN103678564B (en) | Internet product research system based on data mining | |
CN103294778B (en) | A kind of method and system pushing information | |
CN104281622B (en) | Information recommendation method and device in a kind of social media | |
US8799275B2 (en) | Information retrieval based on semantic patterns of queries | |
US9116985B2 (en) | Computer-implemented systems and methods for taxonomy development | |
CN106934254B (en) | Analysis method and device for open source license | |
US20160189052A1 (en) | Systems and methods for recognizing ambiguity in metadata | |
CN108491388B (en) | Data set acquisition method, classification method, device, equipment and storage medium | |
US20200005194A1 (en) | Machine learning for associating skills with content | |
JP2011040069A (en) | Method and apparatus to classify text communications | |
CN109271509B (en) | Live broadcast room topic generation method and device, computer equipment and storage medium | |
CN105653547B (en) | Method and device for extracting text keywords | |
CN108549723B (en) | Text concept classification method and device and server | |
CN106777236B (en) | Method and device for displaying query result based on deep question answering | |
WO2020237872A1 (en) | Method and apparatus for testing accuracy of semantic analysis model, storage medium, and device | |
CN110895586A (en) | Method and device for generating news page, computer equipment and storage medium | |
CN107958270B (en) | Category identification method and device, electronic equipment and computer readable storage medium | |
CN110543484A (en) | prompt word recommendation method and device, storage medium and processor | |
CN115186780B (en) | Discipline knowledge point classification model training method, system, storage medium and equipment |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170627 |