CN115659969B - Document labeling method, device, electronic equipment and storage medium - Google Patents
- Publication number
- CN115659969B (CN202211592980.XA)
- Authority
- CN
- China
- Prior art keywords
- document
- label
- keyword
- keywords
- marked
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of document labeling, and provides a document labeling method, a device, electronic equipment and a storage medium. The method comprises the following steps: acquiring a document to be labeled and a label list; extracting keywords from the document to be labeled to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be labeled; and determining the target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled. Because the target label is determined by combining the keyword-label similarity with the keyword word frequency, the reliability and accuracy of the target label are ensured, and the method is not limited by the number of labeled samples available and is easy to implement.
Description
Technical Field
The present invention relates to the field of document labeling technologies, and in particular, to a method and apparatus for labeling a document, an electronic device, and a storage medium.
Background
Automatic labeling of documents aims at labeling a given document with one or more labels, facilitating subsequent classification, searching, summarization, etc. of the document.
In the prior art, both traditional machine learning document labeling methods and deep learning document labeling methods are supervised learning methods, and model training depends on a large amount of labeled data. In practice, however, some scenarios provide only a set of unlabeled documents and a label list, and in other scenarios only the label list can be obtained because of data privacy and similar constraints. This lack of labeled samples directly affects the reliability of automatic document labeling.
Disclosure of Invention
The invention provides a document labeling method, a device, electronic equipment and a storage medium, which are used for solving the defect that a document labeling method with supervised learning in the prior art depends on a large amount of labeling data for training.
The invention provides a document labeling method, which comprises the following steps:
acquiring a document to be marked and a tag list;
extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequencies of the keywords in the document to be marked;
and determining the target label of the document to be marked based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked.
According to the document labeling method provided by the invention, the determining the target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
and determining the target label of the document to be marked based on the label scores of the labels.
According to the document labeling method provided by the invention, the label scores of a plurality of labels of the document to be labeled are determined based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled, and the method comprises the following steps:
determining label scores of a plurality of labels of the document to be annotated based on the following formula:
$$S_j = \sum_{i=1}^{n} \mathrm{sim}_{ij}\,\tilde{f}_i$$
wherein $S_j$ represents the label score of the $j$-th label of the document to be annotated, $i$ denotes the $i$-th keyword, $j$ denotes the $j$-th label, $n$ represents the total number of keywords, $\mathrm{sim}_{ij}$ is the similarity between the $i$-th keyword and the $j$-th label, $f_i$ is the word frequency of the $i$-th keyword in the document to be annotated, and $\tilde{f}_i$ is the word frequency obtained by normalizing $f_i$.
According to the document labeling method provided by the invention, the determining the target label of the document to be labeled based on the label scores of the labels comprises the following steps:
and screening the plurality of labels based on the label scores of the plurality of labels, the threshold scores and/or the preset label number of the document to be marked, and determining the label obtained by screening as the target label of the document to be marked.
According to the document labeling method provided by the invention, the keyword extraction is carried out on the document to be labeled to obtain a plurality of keywords, and the method comprises the following steps:
keyword extraction is carried out on the document to be marked by applying a keyword extraction model, so that a plurality of keywords are obtained;
the keyword extraction model is trained based on sample texts and sample keywords corresponding to the sample texts.
According to the document labeling method provided by the invention, the sample text and the sample keyword corresponding to the sample text are acquired, and the method comprises the following steps:
acquiring paper documents related to each label in the label list, wherein the paper documents carry paper keywords;
and determining the sample text based on the paper document, and determining the sample keyword corresponding to the sample text based on the paper keyword.
According to the document labeling method provided by the invention, the determining the sample text based on the paper document comprises the following step:
determining the sample text based on the title and the abstract in the paper document.
The invention also provides a document labeling device, which comprises:
the acquisition unit is used for acquiring the document to be marked and the tag list;
the keyword extraction unit is used for extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequency of each keyword in the document to be marked;
and the label determining unit is used for determining the target label of the document to be marked based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the document labeling method as described in any one of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a document annotation method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a document labeling method as described in any one of the above.
According to the document labeling method, the device, the electronic equipment and the storage medium, the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled are combined, the target label of the document to be labeled is determined, the reliability and the accuracy of the determination of the target label are guaranteed through the combination of the similarity and the word frequency, the limitation of the acquisition quantity of labeling samples is avoided, the implementation is easy, and the reliability of the target label is high.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a document labeling method provided by the invention;
FIG. 2 is a flowchart of step 130 in the method for labeling documents according to the present invention;
FIG. 3 is a schematic flow chart of an acquisition step of a sample text and a sample keyword corresponding to the sample text provided by the invention;
FIG. 4 is a second flow chart of the document labeling method according to the present invention;
FIG. 5 is a schematic diagram of a document labeling apparatus according to the present invention;
FIG. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the related art, automatic labeling of documents aims at labeling a given document with one or more labels, so that the documents can be conveniently classified, searched, abstracted and the like. In a document management scenario, such as an artificial intelligence scenario, a big data scenario, a blockchain scenario, etc., there is usually an existing tag library, and when a new document is put in storage, the new document needs to be tagged with a tag in the existing tag library.
The common document labeling method is text classification, which treats text labeling as a multi-class classification task. Traditional text classification methods first obtain text features by methods such as BoW (Bag of Words) and TF-IDF (Term Frequency-Inverse Document Frequency), and then build a text classification model with machine learning algorithms such as Naive Bayes, SVM (Support Vector Machine), and random forest. Since the BERT model was proposed in 2019, deep learning text classification models based on BERT (Bidirectional Encoder Representations from Transformers) have become the mainstream text classification method.
In the labeling scenario of English text, a text classification method that uses only label names, without labeled data, has been proposed. However, that method relies on the BERT model to predict synonyms of the labels. To obtain semantically correct synonyms, each label must be a minimal, indivisible word unit, such as the common words good, bad, commerce, and economy.
In the labeling scenario of Chinese text, however, a label usually has a length of 2 or more characters, for example "artificial intelligence". In the BERT model, "artificial intelligence" is split into 4 tokens, which makes it difficult for the BERT model to produce a phrase with the correct semantics, so the method cannot be applied directly to the labeling scenario of Chinese text.
In view of the above problems, the present invention provides a document labeling method. FIG. 1 is one of the flow charts of the document labeling method provided by the present invention. As shown in FIG. 1, the method includes:
Step 110, acquiring a document to be annotated and a tag list.
Specifically, a document to be annotated and a tag list may be obtained. The document to be annotated is the document for which labels are to be determined. It may be a document formed from text directly input by a user, a document formed from text obtained by voice transcription of collected audio, or a document formed from text obtained by capturing an image with an image acquisition device such as a scanner, mobile phone, or camera and performing OCR (Optical Character Recognition) on the image.
The tag list here refers to the set of all tags. The tag list may be preset or crawled from web pages, which is not particularly limited in the embodiment of the present invention.
Step 120, extracting keywords from the document to be annotated to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be annotated.
Specifically, after the document to be annotated is obtained, keyword extraction can be performed on it to obtain a plurality of keywords. The keyword extraction may be performed with a keyword extraction model, which may be a BERT (Bidirectional Encoder Representations from Transformers) model, an LSTM-CRF (Long Short-Term Memory network with Conditional Random Field) model, a BERT-CRF model, or the like, which is not particularly limited in the embodiment of the present invention.
The keywords reflect the key points of the document to be annotated, and may be "artificial intelligence", "blockchain", "big data", "natural language processing", and the like.
After obtaining the keywords, the word frequency of each keyword in the document to be marked can be counted, where the word frequency refers to the number of times each keyword appears in the document to be marked, for example, the word frequency of each keyword in the document to be marked can be [ ("artificial intelligence", 5), ("big data", 2), ("natural language processing", 1) ] and the like.
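The word-frequency statistics described above can be sketched as follows. This is only a minimal illustration that assumes the keywords have already been extracted and that an occurrence is a plain substring match; the function name `count_keyword_frequencies` and the sample document are hypothetical and do not come from the patent.

```python
from collections import Counter


def count_keyword_frequencies(document, keywords):
    """Count how many times each distinct keyword occurs in the document.

    Assumption: an occurrence is a non-overlapping substring match.
    """
    counts = Counter()
    for keyword in dict.fromkeys(keywords):  # de-duplicate, keep first-seen order
        counts[keyword] = document.count(keyword)
    # Most frequent keyword first, as (keyword, frequency) pairs.
    return counts.most_common()


# Hypothetical document and keyword list, for illustration only.
document = (
    "artificial intelligence and big data drive natural language processing; "
    "artificial intelligence relies on big data, and artificial intelligence "
    "keeps improving"
)
keywords = ["artificial intelligence", "big data", "natural language processing"]
print(count_keyword_frequencies(document, keywords))
```

The result has the same shape as the example above: a list of (keyword, word frequency) pairs.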
Step 130, determining a target label of the document to be annotated based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be annotated.
Specifically, after the word frequency of each keyword in the document to be annotated is obtained through statistics, the target label of the document to be annotated can be determined based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be annotated. The target label refers to the final label of the document to be annotated. There may be one target label, a plurality of target labels, or none, which is not particularly limited in the embodiment of the present invention.
The similarity between each keyword and each tag in the tag list may be calculated by adopting methods such as cosine similarity and Pearson correlation coefficient (Pearson Correlation Coefficient), and before similarity calculation, word2vec embedded representation (Embedding) may be used to perform word encoding on each keyword and each tag in the tag list, and then similarity calculation is performed based on the vector after word encoding.
The degree of similarity between each keyword and each tag in the tag list reflects the degree of matching of each keyword and each tag in the tag list. It can be appreciated that the higher the similarity between each keyword and each tag in the tag list, the more matched each keyword and each tag in the tag list; the lower the similarity between each keyword and each tag in the tag list, the less matching each keyword and each tag in the tag list.
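As a rough sketch of the similarity computation, the snippet below computes cosine similarity between keyword vectors and tag vectors. The three-dimensional vectors are invented toy values standing in for word2vec embeddings, and the helper names `cosine_similarity` and `similarity_matrix` are hypothetical; a real system would obtain the vectors from a trained embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(keywords, tags, embeddings):
    """Return similarity[keyword][tag] for every keyword-tag pair."""
    return {
        keyword: {
            tag: cosine_similarity(embeddings[keyword], embeddings[tag])
            for tag in tags
        }
        for keyword in keywords
    }


# Toy vectors (assumption): stand-ins for word2vec embeddings.
embeddings = {
    "machine learning": [0.9, 0.1, 0.0],
    "artificial intelligence": [0.8, 0.2, 0.1],
    "blockchain": [0.0, 0.1, 0.9],
}
sims = similarity_matrix(
    ["machine learning"], ["artificial intelligence", "blockchain"], embeddings
)
print(sims)
```

With these toy vectors, the keyword "machine learning" is far closer to the tag "artificial intelligence" than to "blockchain", which is the matching behavior described above.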
The word frequency of each keyword in the document to be marked is considered to reflect the frequency of occurrence of each keyword in the document to be marked, and the frequency of occurrence of a certain keyword in the document to be marked can reflect the importance degree of the keyword in the document to be marked.
For example, the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be tagged can be used as the criterion of the target tag of the document to be tagged, so as to obtain the target tag of the document to be tagged.
According to the method provided by the embodiment of the invention, the target label of the document to be marked is determined by combining the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked, the reliability and the accuracy of the determination of the target label are ensured by combining the similarity and the word frequency, the method is not limited by the acquisition quantity of marked samples, the implementation is easy, and the reliability of the target label is strong.
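Based on the above embodiment, FIG. 2 is a schematic flow chart of step 130 in the document labeling method provided by the present invention. As shown in FIG. 2, step 130 includes:
Step 131, determining label scores of a plurality of labels of the document to be annotated based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be annotated;
Step 132, determining the target label of the document to be annotated based on the label scores of the plurality of labels.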
Specifically, after the word frequency of each keyword in the document to be marked is obtained, the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be marked can be weighted to obtain tag scores of a plurality of tags of the document to be marked, wherein the tag scores reflect the score of each tag as a target tag or reflect the probability of each tag as a target tag, which can be 0.5, 0.8, 0.7, and the like.
The word frequency of each keyword in the document to be marked is considered to reflect the frequency of occurrence of each keyword in the document to be marked, and the frequency of occurrence of a certain keyword in the document to be marked can reflect the importance degree of the keyword in the document to be marked. It can be understood that the greater the word frequency of the keyword in the document to be annotated, the more the keyword can influence the label score of the label of the document to be annotated, which is similar to the keyword; the smaller the word frequency of the keyword in the document to be marked is, the less the keyword affects the label scores of labels of the document to be marked, which are similar to the keyword, so that the word frequency of each keyword in the document to be marked can be used as the judgment basis of the label scores of a plurality of labels of the document to be marked.
After the label scores of the plurality of labels of the document to be annotated are obtained, the target label of the document to be annotated can be determined based on the label scores of the plurality of labels. The target label here refers to the final label of the document to be annotated.
For example, the plurality of tags may be filtered based on their tag scores, and those tags having higher scores among the tag scores of the plurality of tags may be determined as target tags of the document to be annotated.
According to the method provided by the embodiment of the invention, the target label of the document to be marked is determined based on the label scores of the labels, and the label scores reflect the scores of the labels as the target labels or reflect the probabilities of the labels as the target labels, so that the reliability and the accuracy of the target label of the document to be marked are ensured.
Based on the above embodiment, step 131 includes:
determining label scores of a plurality of labels of the document to be annotated based on the following formula:
$$S_j = \sum_{i=1}^{n} \mathrm{sim}_{ij}\,\tilde{f}_i$$
wherein $S_j$ represents the label score of the $j$-th label of the document to be annotated, $i$ denotes the $i$-th keyword, $j$ denotes the $j$-th label, $n$ represents the total number of keywords, $\mathrm{sim}_{ij}$ is the similarity between the $i$-th keyword and the $j$-th label, $f_i$ is the word frequency of the $i$-th keyword in the document to be annotated, and $\tilde{f}_i$ is the word frequency obtained by normalizing $f_i$.
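A minimal sketch of the label score computation is given below. It assumes the score of a label is the sum, over all keywords, of the keyword-label similarity multiplied by the normalized word frequency, and it assumes normalization means dividing each word frequency by the total of all word frequencies; the text does not fix the normalization scheme, so that choice, the function name `label_scores`, and the similarity values are illustrative only.

```python
def label_scores(similarity, word_freq):
    """Score each label as a frequency-weighted sum of keyword-label similarities.

    similarity[keyword][label] -> similarity value
    word_freq[keyword]         -> raw word frequency in the document
    Assumption: frequencies are normalized by their sum.
    """
    total = sum(word_freq.values())
    if total == 0:
        return {}
    scores = {}
    for keyword, freq in word_freq.items():
        weight = freq / total  # normalized word frequency
        for label, sim in similarity[keyword].items():
            scores[label] = scores.get(label, 0.0) + sim * weight
    return scores


# Hypothetical similarities; the frequencies reuse the 5/2/1 example pattern.
word_freq = {
    "artificial intelligence": 5,
    "big data": 2,
    "natural language processing": 1,
}
similarity = {
    "artificial intelligence": {"artificial intelligence": 1.0, "data science": 0.4},
    "big data": {"artificial intelligence": 0.3, "data science": 0.9},
    "natural language processing": {"artificial intelligence": 0.6, "data science": 0.2},
}
print(label_scores(similarity, word_freq))
```

Frequent keywords dominate the result: the label closest to the most frequent keyword receives the highest score even if other keywords match a different label more strongly.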
Based on the above embodiment, step 132 includes:
and screening the plurality of labels based on the label scores of the plurality of labels, the threshold scores and/or the preset label number of the document to be marked, and determining the label obtained by screening as the target label of the document to be marked.
Specifically, after the label scores of the plurality of labels are obtained, the labels can be screened in any of three ways, and the labels obtained by screening are determined as the target labels of the document to be annotated: screening based on the label scores and the threshold score; screening based on the label scores and the preset label number of the document to be annotated; or screening based on the label scores together with both the threshold score and the preset label number.
The threshold score here refers to a threshold tag score, and may be preset or set according to actual situations. The preset number of labels of the document to be marked refers to the number of labels required by the document to be marked, which can be preset or set according to actual conditions, and the embodiment of the invention is not limited in particular.
For example, suppose the threshold score is 0.5, the preset label number of the document to be annotated is 5, and the label scores are 0.6 for "artificial intelligence", 0.7 for "support vector machine", and 0.8 for "natural language processing". All three scores exceed the threshold score and the number of labels does not exceed the preset label number, so screening based on the label scores, the threshold score, and/or the preset label number yields the target labels "artificial intelligence", "support vector machine", and "natural language processing".
In addition, before the labels are screened based on the label scores of the labels, the threshold scores and/or the preset label number of the document to be marked, the label scores of the labels can be ranked, and the labels can be screened based on the ranked label scores of the labels. The label scores of the plurality of labels may be ranked from high to low, or from low to high, which is not particularly limited in the embodiment of the present invention.
According to the method provided by the embodiment of the invention, based on the label scores of the labels, the labels are screened by combining the threshold scores and/or the conditions of the preset label number of the document to be marked, and the label obtained by screening is determined to be the target label of the document to be marked, so that the accuracy of determining the target label of the document to be marked is ensured.
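The screening step can be sketched as below. The function name `select_target_labels` is hypothetical, and two details are assumptions rather than statements from the text: a label passes when its score is greater than or equal to the threshold score, and the preset label number is treated as an upper bound applied after ranking from high to low.

```python
def select_target_labels(scores, threshold=None, max_labels=None):
    """Rank labels by score (high to low), then apply the optional filters.

    threshold:  keep only labels whose score is >= threshold (assumed comparison)
    max_labels: keep at most this many of the highest-scoring labels
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(label, score) for label, score in ranked if score >= threshold]
    if max_labels is not None:
        ranked = ranked[:max_labels]
    return [label for label, _ in ranked]


scores = {
    "artificial intelligence": 0.6,
    "support vector machine": 0.7,
    "natural language processing": 0.8,
}
# Threshold score 0.5 and preset label number 5: all three labels are kept.
print(select_target_labels(scores, threshold=0.5, max_labels=5))
# Threshold only, or label number only, can be used independently.
print(select_target_labels(scores, threshold=0.65))
print(select_target_labels(scores, max_labels=1))
```

Passing only `threshold`, only `max_labels`, or both covers the three "and/or" combinations of the screening conditions; an empty result corresponds to a document with no target label.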
Based on the above embodiment, step 120 includes:
step 121, keyword extraction is performed on the document to be annotated by applying a keyword extraction model, so as to obtain a plurality of keywords;
the keyword extraction model is trained based on sample texts and sample keywords corresponding to the sample texts.
Specifically, in order to extract keywords from the document to be annotated, the keyword extraction model needs to be obtained through the following steps before step 121:
Sample texts and the sample keywords corresponding to the sample texts can be collected in advance, and an initial keyword extraction model, that is, the starting model from which the keyword extraction model is trained, can be constructed. Here, the initial keyword extraction model may include a BERT model and a classification layer, where the classification layer may be a softmax layer or a CRF (Conditional Random Field) layer, which is not specifically limited in this embodiment of the present invention.
After the initial keyword extraction model is obtained, the pre-collected sample text and the sample keywords corresponding to the sample text can be applied to train the initial keyword extraction model:
the sample text can be input into an initial keyword extraction model, and the initial keyword extraction model is used for extracting keywords of the sample text to obtain and output predicted keywords of the sample text.
After obtaining the predicted keywords based on the initial keyword extraction model, the predicted keywords can be compared with sample keywords corresponding to the sample texts collected in advance, a loss function value is obtained through calculation according to the difference degree between the predicted keywords and the sample keywords, parameter iteration is carried out on the initial keyword extraction model based on the loss function value, and the initial keyword extraction model after parameter iteration is completed is recorded as a keyword extraction model.
It can be understood that the greater the degree of difference between the predicted keywords and the sample keywords corresponding to the sample text collected in advance, the greater the loss function value; the smaller the degree of difference of the predicted keyword and the sample keyword corresponding to the sample text collected in advance, the smaller the loss function value.
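The relationship between the degree of difference and the loss value can be illustrated with a small cross-entropy sketch. The text does not name a specific loss function, so cross-entropy over per-token keyword/non-keyword probabilities is an assumption here, as are the function name `cross_entropy` and all numeric values.

```python
import math


def cross_entropy(predicted_probs, gold_labels):
    """Mean negative log-probability assigned to the gold class of each token.

    predicted_probs: one [p_not_keyword, p_keyword] pair per token
    gold_labels:     one 0 (not keyword) or 1 (keyword) per token
    """
    eps = 1e-12  # guards against log(0)
    losses = [
        -math.log(max(probs[gold], eps))
        for probs, gold in zip(predicted_probs, gold_labels)
    ]
    return sum(losses) / len(losses)


gold = [1, 1, 0]  # the first two tokens belong to a sample keyword
close_prediction = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]
far_prediction = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]

print(cross_entropy(close_prediction, gold))
print(cross_entropy(far_prediction, gold))
```

The prediction that agrees with the sample keywords yields the smaller loss, so gradient-based parameter iteration pushes the model toward predictions that match the sample keywords.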
That is, during training, the initial keyword extraction model learns to extract keywords from text, so that the trained model can extract, from the document to be annotated, keywords that can be used to determine its target label.
In the related art, when a sample text and a sample keyword corresponding to the sample text are applied to perform keyword extraction model training, the sample keyword corresponding to the sample text is generally difficult to obtain, and in order to solve the above problem, in the embodiment of the present invention, the sample text is determined based on paper documents related to each tag in the tag list, and the sample keyword corresponding to the sample text is a paper keyword carried in the paper documents.
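Based on the above embodiments, FIG. 3 is a schematic flow chart of the step of acquiring a sample text and the sample keywords corresponding to the sample text. As shown in FIG. 3, this acquisition step includes:
Step 310, acquiring paper documents related to each label in the label list, wherein the paper documents carry paper keywords;
Step 320, determining the sample text based on the paper document, and determining the sample keywords corresponding to the sample text based on the paper keywords.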
Specifically, the paper documents related to each label in the label list can be obtained, and the paper documents carry paper keywords, namely the paper keywords do not need to be manually marked, so that a great amount of time cost is saved, and the obtaining efficiency of the follow-up sample text and the sample keywords corresponding to the sample text is improved.
It will be appreciated that, after each tag in the tag list is obtained, paper documents related to each tag may be matched from an open-source dataset, where the open-source dataset may be crawled from the download websites of the paper documents.
After the paper documents related to each tag in the tag list are acquired, a sample text may be determined based on the paper documents. For example, the paper document may be directly taken as sample text, and for example, text that can represent the core ideas in the paper document may be taken as sample text.
Thereupon, sample keywords corresponding to the sample text may be determined based on the paper keywords. For example, the paper keywords carried by the paper document itself can be used as sample keywords corresponding to the sample text.
For example, the sample texts and their corresponding sample keywords may take the form [(sample text 1, [sample keyword 1 of sample text 1, ...]), (sample text 2, [sample keyword 1 of sample text 2, ...]), ..., (sample text N, [sample keyword 1 of sample text N, ...])], or the like.
According to the method provided by the embodiment of the invention, the sample text is determined based on the paper document, the paper document carries the paper keywords, and the sample keywords corresponding to the sample text are determined based on the paper keywords, namely, the sample keywords corresponding to the sample text do not need manual labeling, so that a great amount of time cost is saved.
In the related art, when sample text and the corresponding sample keywords are used to train a keyword extraction model, the whole document is generally used as the sample text, which increases the training cost of the keyword extraction model and reduces its training efficiency.
Based on the above embodiment, step 320 includes:
determining the sample text based on the title and the abstract in the paper document.
Specifically, after the paper documents related to the respective tags in the tag list are acquired, the sample text may be determined based on the titles and abstracts in the paper documents. For example, the title and abstract of a paper document may be taken directly as the sample text.
According to the method provided by the embodiment of the invention, the sample text is determined based on the title and the abstract in the paper document. Compared with the traditional approach of determining the sample text from the whole document, this improves the training efficiency of the keyword extraction model.
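The construction of training samples described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record fields `title`, `abstract`, `full_text`, and `keywords`, and the function name `build_samples`, are hypothetical names chosen for the example.

```python
from typing import Dict, List, Tuple

# One training sample: (sample text, sample keywords for that text).
Sample = Tuple[str, List[str]]


def build_samples(papers: List[Dict], use_title_and_abstract: bool = True) -> List[Sample]:
    """Turn paper records into (sample text, sample keywords) pairs.

    The field names used here are assumptions made for this sketch.
    """
    samples: List[Sample] = []
    for paper in papers:
        if use_title_and_abstract:
            # Shorter sample text: title plus abstract only.
            text = f"{paper['title']}\n{paper['abstract']}"
        else:
            # Alternative: use the whole paper as the sample text.
            text = paper["full_text"]
        # The keywords carried by the paper serve as labels, so no
        # manual annotation is needed.
        samples.append((text, list(paper["keywords"])))
    return samples


papers = [
    {
        "title": "Graph neural networks for text",
        "abstract": "We study text classification with graph models.",
        "full_text": "Graph neural networks for text. We study ... (full body)",
        "keywords": ["graph neural network", "text classification"],
    }
]
samples = build_samples(papers)
```

The resulting list has the same shape as the example structure given earlier: one tuple per sample text, each paired with its list of keywords.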
Based on any one of the above embodiments, the present invention provides a document labeling method, and fig. 4 is a second schematic flow chart of the document labeling method provided by the present invention, as shown in fig. 4, the method includes:
In step 410, a document to be annotated and a tag list may be obtained.
The step of obtaining the sample text and the sample keyword corresponding to the sample text comprises the following steps:
the paper documents related to each label in the label list can be obtained, and the paper documents carry paper keywords;
the sample text may be determined based on the title and the abstract in the paper document, and the sample keywords corresponding to the sample text may be determined based on the paper keywords.
In step 430, tag scores of a plurality of tags of the document to be tagged may be determined based on the similarity between each keyword and each tag in the tag list, and the word frequency of each keyword in the document to be tagged.
Wherein, the label scores of the plurality of labels of the document to be annotated can be determined based on the following formula:
wherein $S_j$ represents the label score of the $j$-th label of the document to be annotated, $i$ represents the $i$-th keyword, $j$ represents the $j$-th label, $n$ represents the total number of keywords, $\mathrm{sim}(i,j)$ is the similarity between the $i$-th keyword and the $j$-th label, $tf_i$ is the word frequency of the $i$-th keyword in the document to be annotated, and $\widetilde{tf}_i$ is the word frequency obtained by normalizing $tf_i$.
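The formula itself is not reproduced in this text, so the following is only a sketch under stated assumptions. It assumes that the score of each label is the sum, over all keywords, of the keyword-label similarity multiplied by the normalized word frequency, and that normalization divides each word frequency by the total of all word frequencies. Both choices are consistent with the definitions above but are not confirmed by the source; the function names are hypothetical.

```python
from typing import Dict, List


def normalize_frequencies(word_freq: Dict[str, int]) -> Dict[str, float]:
    """Assumed normalization: divide each frequency by the sum of all frequencies."""
    total = sum(word_freq.values())
    if total == 0:
        return {kw: 0.0 for kw in word_freq}
    return {kw: freq / total for kw, freq in word_freq.items()}


def label_scores(
    similarity: Dict[str, Dict[str, float]],
    word_freq: Dict[str, int],
    labels: List[str],
) -> Dict[str, float]:
    """Assumed scoring: for each label, sum similarity * normalized frequency over keywords."""
    norm_tf = normalize_frequencies(word_freq)
    return {
        label: sum(similarity[kw][label] * norm_tf[kw] for kw in word_freq)
        for label in labels
    }


# Toy inputs; the similarity values are made up for illustration.
word_freq = {"neural network": 3, "gradient": 1}
similarity = {
    "neural network": {"deep learning": 0.8, "databases": 0.2},
    "gradient": {"deep learning": 0.4, "databases": 0.0},
}
scores = label_scores(similarity, word_freq, ["deep learning", "databases"])
```

With these toy inputs, the frequent keyword dominates: "deep learning" scores about 0.7 and "databases" about 0.15, which illustrates how word frequency weights each keyword's similarity contribution.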
In step 440, the plurality of labels may be screened based on their label scores for the document to be annotated, together with a score threshold and/or a preset number of labels, and the labels obtained by screening may be determined as the target labels of the document to be annotated.
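The screening in step 440 can be sketched as follows. This is a minimal illustration: the function name `screen_labels`, the parameter names, and the choice to keep labels whose score is greater than or equal to the threshold are assumptions, since the source does not specify how ties or boundary values are handled.

```python
from typing import Dict, List, Optional


def screen_labels(
    scores: Dict[str, float],
    threshold: Optional[float] = None,
    max_labels: Optional[int] = None,
) -> List[str]:
    """Keep labels that pass the score threshold and/or fit the preset label count."""
    # Rank labels from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        # Assumed rule: a label passes if its score is at least the threshold.
        ranked = [(label, score) for label, score in ranked if score >= threshold]
    if max_labels is not None:
        # Keep at most the preset number of top-scoring labels.
        ranked = ranked[:max_labels]
    return [label for label, _ in ranked]


example_scores = {"deep learning": 0.7, "databases": 0.15, "optimization": 0.5}
by_threshold = screen_labels(example_scores, threshold=0.3)
top_one = screen_labels(example_scores, max_labels=1)
```

Passing both arguments applies the threshold first and then truncates, which corresponds to the "and" case of the "and/or" wording.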
The document marking device provided by the invention is described below, and the document marking device described below and the document marking method described above can be referred to correspondingly.
Based on any one of the above embodiments, the present invention provides a document labeling device, and fig. 5 is a schematic structural diagram of the document labeling device provided by the present invention, as shown in fig. 5, where the device includes:
an obtaining unit 510, configured to obtain a document to be annotated and a tag list;
the keyword extraction unit 520 is configured to extract keywords from the document to be annotated, obtain a plurality of keywords, and count word frequencies of the keywords in the document to be annotated;
a tag determining unit 530, configured to determine a target tag of the document to be tagged based on the similarity between the keywords and the tags in the tag list, and the word frequency of the keywords in the document to be tagged.
The device provided by the embodiment of the invention combines the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be tagged to determine the target tag of the document to be tagged, and the combination of the similarity and the word frequency ensures the reliability and the accuracy of the determination of the target tag, is not limited by the acquisition quantity of the tagging sample, is easy to realize, and has strong reliability of the target tag.
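The word frequency counting performed by the keyword extraction unit can be sketched as follows. This is a simplified illustration: it assumes the keywords have already been extracted, and it counts case-insensitive, non-overlapping substring occurrences, which is one possible counting rule and not necessarily the one used in the patent. The function name is hypothetical.

```python
from typing import Dict, List


def count_keyword_frequencies(document: str, keywords: List[str]) -> Dict[str, int]:
    """Count how often each extracted keyword occurs in the document.

    Assumed rule: case-insensitive, non-overlapping substring matches.
    """
    text = document.lower()
    return {kw: text.count(kw.lower()) for kw in keywords}


document = (
    "Neural networks learn features. "
    "A neural network is trained by gradient descent."
)
freqs = count_keyword_frequencies(document, ["neural network", "gradient"])
```

Substring matching also counts inflected forms such as "neural networks"; a tokenizer-based count would be a stricter alternative.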
Based on any of the above embodiments, the tag determining unit specifically includes:
a tag score determining unit, configured to determine tag scores of a plurality of tags of the document to be tagged based on the similarity between each keyword and each tag in the tag list, and the word frequency of each keyword in the document to be tagged;
a target tag determining unit, configured to determine the target tag of the document to be tagged based on the tag scores of the plurality of tags.
Based on any of the above embodiments, the tag score determining unit is specifically configured to:
determining label scores of a plurality of labels of the document to be annotated based on the following formula:
wherein $S_j$ represents the label score of the $j$-th label of the document to be annotated, $i$ represents the $i$-th keyword, $j$ represents the $j$-th label, $n$ represents the total number of keywords, $\mathrm{sim}(i,j)$ is the similarity between the $i$-th keyword and the $j$-th label, $tf_i$ is the word frequency of the $i$-th keyword in the document to be annotated, and $\widetilde{tf}_i$ is the word frequency obtained by normalizing $tf_i$.
Based on any of the above embodiments, the target tag determining unit is specifically configured to:
screen the plurality of labels based on their label scores for the document to be annotated, together with a score threshold and/or a preset number of labels, and determine the labels obtained by screening as the target labels of the document to be annotated.
Based on any one of the above embodiments, the keyword extraction unit is specifically configured to perform the following:
keyword extraction is carried out on the document to be marked by applying a keyword extraction model, so that a plurality of keywords are obtained;
the keyword extraction model is trained based on sample texts and sample keywords corresponding to the sample texts.
Based on any one of the above embodiments, the device further includes the following units for obtaining the sample text and the sample keywords corresponding to the sample text:
the document obtaining unit is used for obtaining paper documents related to each tag in the tag list, wherein the paper documents carry paper keywords;
and the text and keyword determining unit is used for determining the sample text based on the paper document and determining the sample keyword corresponding to the sample text based on the paper keyword.
Based on any of the above embodiments, the text and keyword determining unit is specifically configured to:
determine the sample text based on the title and the abstract in the paper document.
Fig. 6 is a schematic diagram of the physical structure of an electronic device. As shown in fig. 6, the electronic device may include: a processor 610, a communication interface (Communications Interface) 620, a memory 630, and a communication bus 640, wherein the processor 610, the communication interface 620, and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a document annotation method comprising: acquiring a document to be marked and a tag list; extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequencies of the keywords in the document to be marked; and determining the target label of the document to be marked based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of executing the document labeling method provided by the above methods, the method comprising: acquiring a document to be marked and a tag list; extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequencies of the keywords in the document to be marked; and determining the target label of the document to be marked based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the document labeling method provided by the above methods, the method comprising: acquiring a document to be marked and a tag list; extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequencies of the keywords in the document to be marked; and determining the target label of the document to be marked based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (8)
1. A method for labeling a document, comprising:
acquiring a document to be marked and a tag list;
extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequencies of the keywords in the document to be marked;
determining a target label of the document to be marked based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked;
the determining the target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
determining a target label of the document to be marked based on the label scores of the plurality of labels;
the determining the label scores of the labels of the document to be labeled based on the similarity between the keywords and the labels in the label list and the word frequency of the keywords in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be annotated based on the following formula:
wherein $S_j$ represents the label score of the $j$-th label of the document to be annotated, $i$ represents the $i$-th keyword, $j$ represents the $j$-th label, $n$ represents the total number of keywords, $\mathrm{sim}(i,j)$ is the similarity between the $i$-th keyword and the $j$-th label, $tf_i$ is the word frequency of the $i$-th keyword in the document to be annotated, and $\widetilde{tf}_i$ is the word frequency obtained by normalizing $tf_i$.
2. The method for labeling a document according to claim 1, wherein determining the target label of the document to be labeled based on the label scores of the plurality of labels comprises:
screening the plurality of labels based on their label scores for the document to be marked, together with a score threshold and/or a preset number of labels, and determining the labels obtained by screening as the target labels of the document to be marked.
3. The method for labeling documents according to any one of claims 1 to 2, wherein the extracting keywords from the document to be labeled to obtain a plurality of keywords comprises:
keyword extraction is carried out on the document to be marked by applying a keyword extraction model, so that a plurality of keywords are obtained;
the keyword extraction model is trained based on sample texts and sample keywords corresponding to the sample texts.
4. The document labeling method according to claim 3, wherein the step of obtaining the sample text and the sample keyword corresponding to the sample text comprises:
acquiring paper documents related to each label in the label list, wherein the paper documents carry paper keywords;
and determining the sample text based on the paper document, and determining the sample keyword corresponding to the sample text based on the paper keyword.
5. The document labeling method of claim 4, wherein the determining the sample text based on the paper document comprises:
determining the sample text based on the title and the abstract in the paper document.
6. A document marking apparatus, comprising:
the acquisition unit is used for acquiring the document to be marked and the tag list;
the keyword extraction unit is used for extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequency of each keyword in the document to be marked;
a label determining unit, configured to determine a target label of the document to be labeled based on a similarity between each keyword and each label in the label list and a word frequency of each keyword in the document to be labeled;
the determining the target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
determining a target label of the document to be marked based on the label scores of the plurality of labels;
the determining the label scores of the labels of the document to be labeled based on the similarity between the keywords and the labels in the label list and the word frequency of the keywords in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be annotated based on the following formula:
wherein $S_j$ represents the label score of the $j$-th label of the document to be annotated, $i$ represents the $i$-th keyword, $j$ represents the $j$-th label, $n$ represents the total number of keywords, $\mathrm{sim}(i,j)$ is the similarity between the $i$-th keyword and the $j$-th label, $tf_i$ is the word frequency of the $i$-th keyword in the document to be annotated, and $\widetilde{tf}_i$ is the word frequency obtained by normalizing $tf_i$.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the document marking method of any one of claims 1 to 5 when the program is executed by the processor.
8. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the document annotation method according to any of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211592980.XA CN115659969B (en) | 2022-12-13 | 2022-12-13 | Document labeling method, device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211592980.XA CN115659969B (en) | 2022-12-13 | 2022-12-13 | Document labeling method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115659969A CN115659969A (en) | 2023-01-31 |
CN115659969B CN115659969B (en) | 2023-04-28 |
Family
ID=85017459
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211592980.XA Active CN115659969B (en) | 2022-12-13 | 2022-12-13 | Document labeling method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115659969B (en) |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103235774B (en) * | 2013-04-27 | 2016-04-06 | 杭州电子科技大学 | A kind of science and technology item application form Feature Words extracting method |
CN110717092A (en) * | 2018-06-27 | 2020-01-21 | 北京京东尚科信息技术有限公司 | Method, system, device and storage medium for matching objects for articles |
CN110489649B (en) * | 2019-08-19 | 2023-06-27 | 北京创鑫旅程网络技术有限公司 | Method and device for associating content with tag |
CN110781297B (en) * | 2019-09-18 | 2022-06-21 | 国家计算机网络与信息安全管理中心 | Classification method of multi-label scientific research papers based on hierarchical discriminant trees |
CN111967262B (en) * | 2020-06-30 | 2024-01-12 | 北京百度网讯科技有限公司 | Determination method and device for entity tag |
US20220019741A1 (en) * | 2020-07-16 | 2022-01-20 | Optum Technology, Inc. | An unsupervised approach to assignment of pre-defined labels to text documents |
Also Published As
Publication number | Publication date |
---|---|
CN115659969A (en) | 2023-01-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |