CN115659969B - Document labeling method, device, electronic equipment and storage medium - Google Patents

Publication number
CN115659969B
Authority
CN
China
Prior art keywords
document
label
keyword
keywords
marked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211592980.XA
Other languages
Chinese (zh)
Other versions
CN115659969A (en)
Inventor
郑玉玲
王凌云
王梓凝
刘兆蓬
宋丹丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengfang Financial Technology Co ltd
Original Assignee
Chengfang Financial Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengfang Financial Technology Co ltd
Priority to CN202211592980.XA
Publication of CN115659969A
Application granted
Publication of CN115659969B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of document labeling, and provides a document labeling method, a device, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a document to be marked and a tag list; extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequencies of the keywords in the document to be marked; and determining the target label of the document to be marked based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked. According to the method, the device, the electronic equipment and the storage medium, the target label of the document to be marked is determined by combining the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked, the reliability and the accuracy of determining the target label are ensured, the method is not limited by the acquisition quantity of marked samples, the implementation is easy, and the reliability of the target label is high.

Description

Document labeling method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of document labeling technologies, and in particular, to a method and apparatus for labeling a document, an electronic device, and a storage medium.
Background
Automatic labeling of documents aims at labeling a given document with one or more labels, facilitating subsequent classification, searching, summarization, etc. of the document.
In the prior art, both traditional machine learning and deep learning document labeling methods are supervised learning methods, and model training depends on a large amount of labeled data. In practice, however, some scenarios provide only a set of unlabeled documents and a label list, and others provide only the label list because of data privacy and similar constraints. This lack of labeled samples directly undermines the reliability of automatic document labeling.
Disclosure of Invention
The invention provides a document labeling method, a device, electronic equipment and a storage medium, which are used for solving the defect that a document labeling method with supervised learning in the prior art depends on a large amount of labeling data for training.
The invention provides a document labeling method, which comprises the following steps:
acquiring a document to be marked and a tag list;
extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequencies of the keywords in the document to be marked;
and determining the target label of the document to be marked based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked.
According to the document labeling method provided by the invention, the determining the target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
and determining the target label of the document to be marked based on the label scores of the labels.
According to the document labeling method provided by the invention, the label scores of a plurality of labels of the document to be labeled are determined based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled, and the method comprises the following steps:
determining label scores of a plurality of labels of the document to be annotated based on the following formula:
[Formula image not reproduced in this text; it defines the label score in terms of the quantities below. The symbol names used here are placeholders, since the original symbols were images.]
where $S_j$ denotes the label score of the $j$-th label of the document to be annotated, $i$ denotes the $i$-th keyword, $j$ denotes the $j$-th label, $n$ denotes the total number of keywords, $sim_{ij}$ is the similarity between the $i$-th keyword and the $j$-th label, $tf_i$ is the word frequency of the $i$-th keyword in the document to be annotated, and $\hat{tf}_i$ is the word frequency obtained by normalizing $tf_i$.
According to the document labeling method provided by the invention, the determining the target label of the document to be labeled based on the label scores of the labels comprises the following steps:
and screening the plurality of labels based on the label scores of the plurality of labels, the threshold scores and/or the preset label number of the document to be marked, and determining the label obtained by screening as the target label of the document to be marked.
According to the document labeling method provided by the invention, the keyword extraction is carried out on the document to be labeled to obtain a plurality of keywords, and the method comprises the following steps:
keyword extraction is carried out on the document to be marked by applying a keyword extraction model, so that a plurality of keywords are obtained;
the keyword extraction model is trained based on sample texts and sample keywords corresponding to the sample texts.
According to the document labeling method provided by the invention, the sample text and the sample keyword corresponding to the sample text are acquired, and the method comprises the following steps:
acquiring paper documents related to each label in the label list, wherein the paper documents carry paper keywords;
and determining the sample text based on the paper document, and determining the sample keyword corresponding to the sample text based on the paper keyword.
According to the document labeling method provided by the invention, the method for determining the sample text based on the paper document comprises the following steps:
the sample text is determined based on the title and abstract in the paper document.
The invention also provides a document labeling device, which comprises:
the acquisition unit is used for acquiring the document to be marked and the tag list;
the keyword extraction unit is used for extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequency of each keyword in the document to be marked;
and the label determining unit is used for determining the target label of the document to be marked based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the document marking method as described above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a document annotation method as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a document marking method as described in any one of the above.
According to the document labeling method, the device, the electronic equipment and the storage medium, the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled are combined, the target label of the document to be labeled is determined, the reliability and the accuracy of the determination of the target label are guaranteed through the combination of the similarity and the word frequency, the limitation of the acquisition quantity of labeling samples is avoided, the implementation is easy, and the reliability of the target label is high.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a document labeling method provided by the invention;
FIG. 2 is a flowchart of step 130 in the method for labeling documents according to the present invention;
FIG. 3 is a schematic flow chart of an acquisition step of a sample text and a sample keyword corresponding to the sample text provided by the invention;
FIG. 4 is a second flow chart of the document labeling method according to the present invention;
FIG. 5 is a schematic diagram of a document labeling apparatus according to the present invention;
FIG. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the related art, automatic labeling of documents aims at labeling a given document with one or more labels, so that the documents can be conveniently classified, searched, abstracted and the like. In a document management scenario, such as an artificial intelligence scenario, a big data scenario, a blockchain scenario, etc., there is usually an existing tag library, and when a new document is put in storage, the new document needs to be tagged with a tag in the existing tag library.
A common document labeling approach is text classification, which treats text labeling as a multi-class classification task. Traditional text classification first obtains text features using methods such as BoW (Bag of Words) and TF-IDF (Term Frequency-Inverse Document Frequency), and then builds a text classification model using machine learning algorithms such as Naive Bayes, SVM (Support Vector Machine) and random forest. Since the BERT (Bidirectional Encoder Representations from Transformers) model was proposed in 2019, deep learning text classification models based on BERT have become the mainstream text classification method.
For labeling English text, a text classification method that uses only label names and no labeled data has been proposed. However, that method relies on the BERT model to predict synonyms of the labels, and to obtain semantically correct synonyms the labels must be minimal, indivisible word units, such as the common words good, bad, commerce and economy.
When labeling Chinese text, however, a label is usually two or more characters long, for example "artificial intelligence" (人工智能). The BERT model splits "artificial intelligence" into 4 tokens, which makes it difficult for the model to produce a phrase with the correct semantics, so the method cannot be applied directly to labeling Chinese text.
In view of the above problems, the present invention provides a document labeling method. FIG. 1 is one of the schematic flow charts of the document labeling method provided by the present invention. As shown in FIG. 1, the method includes:
Step 110, acquiring a document to be annotated and a tag list.
Specifically, a document to be annotated and a tag list may be obtained. The document to be annotated is a document that still requires labels. It may be formed from text entered directly by a user, from text obtained by transcribing collected audio, or from text obtained by capturing an image with an image acquisition device such as a scanner, mobile phone or camera and performing OCR (Optical Character Recognition) on the image.
The tag list here refers to a set of each tag, and the tag list may be preset or crawled on a web page, which is not particularly limited in the embodiment of the present invention.
Step 120, extracting keywords from the document to be annotated to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be annotated.
Specifically, after the document to be annotated is obtained, keyword extraction can be performed on it to obtain a plurality of keywords. Keyword extraction may be performed with a keyword extraction model, which may be a BERT (Bidirectional Encoder Representations from Transformers) model, an LSTM-CRF (Long Short-Term Memory network with Conditional Random Field) model, a BERT-CRF model, or the like, which is not specifically limited in the embodiment of the present invention.
The keywords reflect the key points of the document to be annotated, and may be, for example, "artificial intelligence", "blockchain", "big data", "natural language processing", and the like.
After obtaining the keywords, the word frequency of each keyword in the document to be marked can be counted, where the word frequency refers to the number of times each keyword appears in the document to be marked, for example, the word frequency of each keyword in the document to be marked can be [ ("artificial intelligence", 5), ("big data", 2), ("natural language processing", 1) ] and the like.
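The word-frequency statistics described above can be sketched as follows. This is a minimal illustration, assuming that word frequency is counted by plain substring matching on lower-cased text; the function name `count_keyword_frequencies` and the sample document are hypothetical and do not come from the patent.

```python
def count_keyword_frequencies(document: str, keywords: list[str]) -> list[tuple[str, int]]:
    """Count non-overlapping occurrences of each keyword in the document."""
    # Deduplicate the extracted keywords while preserving extraction order.
    unique_keywords = list(dict.fromkeys(keywords))
    counts = [(kw, document.count(kw)) for kw in unique_keywords]
    # Sort by descending frequency; the sort is stable, so ties keep extraction order.
    return sorted(counts, key=lambda item: item[1], reverse=True)


# Hypothetical document, lower-cased so that matching is case-insensitive.
document = (
    "Artificial intelligence and big data are converging. "
    "Natural language processing is a branch of artificial intelligence."
).lower()
keywords = ["artificial intelligence", "big data", "natural language processing"]

frequencies = count_keyword_frequencies(document, keywords)
print(frequencies)
```

The result has the same shape as the example list above: one (keyword, count) pair per keyword.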
Step 130, determining a target label of the document to be annotated based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be annotated.
Specifically, after the word frequency of each keyword in the document to be annotated is obtained through statistics, the target label of the document to be annotated can be determined based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be annotated. The target label is the final label of the document to be annotated. There may be one target label, several target labels, or none, which is not specifically limited in the embodiment of the present invention.
The similarity between each keyword and each label in the label list may be calculated using methods such as cosine similarity or the Pearson correlation coefficient. Before the similarity is calculated, each keyword and each label in the label list may be encoded as word vectors using word2vec embeddings, and the similarity is then computed on the resulting vectors.
The degree of similarity between each keyword and each tag in the tag list reflects the degree of matching of each keyword and each tag in the tag list. It can be appreciated that the higher the similarity between each keyword and each tag in the tag list, the more matched each keyword and each tag in the tag list; the lower the similarity between each keyword and each tag in the tag list, the less matching each keyword and each tag in the tag list.
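The similarity computation described above can be sketched as follows. This is a minimal illustration, assuming cosine similarity over pre-computed word vectors; the three-dimensional vectors in `embeddings` are hypothetical stand-ins for word2vec embeddings, and `cosine_similarity` and `similarity_matrix` are illustrative names that do not come from the patent.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(keywords, tags, embeddings):
    """Map every (keyword, tag) pair to the cosine similarity of their vectors."""
    return {
        (kw, tag): cosine_similarity(embeddings[kw], embeddings[tag])
        for kw in keywords
        for tag in tags
    }


# Hypothetical toy vectors; a real system would look these up in a trained embedding model.
embeddings = {
    "natural language processing": [0.9, 0.1, 0.0],
    "artificial intelligence": [0.8, 0.2, 0.1],
    "blockchain": [0.0, 0.1, 0.9],
}

sims = similarity_matrix(
    ["natural language processing"],
    ["artificial intelligence", "blockchain"],
    embeddings,
)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

With these toy vectors the keyword "natural language processing" is far closer to the tag "artificial intelligence" than to "blockchain", which is the kind of matching signal the method relies on.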
The word frequency of each keyword in the document to be marked is considered to reflect the frequency of occurrence of each keyword in the document to be marked, and the frequency of occurrence of a certain keyword in the document to be marked can reflect the importance degree of the keyword in the document to be marked.
For example, the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be tagged can be used as the criterion of the target tag of the document to be tagged, so as to obtain the target tag of the document to be tagged.
According to the method provided by the embodiment of the invention, the target label of the document to be marked is determined by combining the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked, the reliability and the accuracy of the determination of the target label are ensured by combining the similarity and the word frequency, the method is not limited by the acquisition quantity of marked samples, the implementation is easy, and the reliability of the target label is strong.
Based on the above embodiment, fig. 2 is a schematic flow chart of step 130 in the document labeling method provided by the present invention, and as shown in fig. 2, step 130 includes:
step 131, determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
and step 132, determining the target label of the document to be marked based on the label scores of the labels.
Specifically, after the word frequency of each keyword in the document to be marked is obtained, the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be marked can be weighted to obtain tag scores of a plurality of tags of the document to be marked, wherein the tag scores reflect the score of each tag as a target tag or reflect the probability of each tag as a target tag, which can be 0.5, 0.8, 0.7, and the like.
The word frequency of each keyword in the document to be marked is considered to reflect the frequency of occurrence of each keyword in the document to be marked, and the frequency of occurrence of a certain keyword in the document to be marked can reflect the importance degree of the keyword in the document to be marked. It can be understood that the greater the word frequency of the keyword in the document to be annotated, the more the keyword can influence the label score of the label of the document to be annotated, which is similar to the keyword; the smaller the word frequency of the keyword in the document to be marked is, the less the keyword affects the label scores of labels of the document to be marked, which are similar to the keyword, so that the word frequency of each keyword in the document to be marked can be used as the judgment basis of the label scores of a plurality of labels of the document to be marked.
After the label scores of the plurality of labels of the document to be annotated are obtained, the target label of the document to be annotated can be determined based on the label scores of the plurality of labels. The target label here refers to the final label of the document to be annotated.
For example, the plurality of tags may be filtered based on their tag scores, and those tags having higher scores among the tag scores of the plurality of tags may be determined as target tags of the document to be annotated.
According to the method provided by the embodiment of the invention, the target label of the document to be marked is determined based on the label scores of the labels, and the label scores reflect the scores of the labels as the target labels or reflect the probabilities of the labels as the target labels, so that the reliability and the accuracy of the target label of the document to be marked are ensured.
Based on the above embodiment, step 131 includes:
determining label scores of a plurality of labels of the document to be annotated based on the following formula:
[Formula image not reproduced in this text; it defines the label score in terms of the quantities below. The symbol names used here are placeholders, since the original symbols were images.]
where $S_j$ denotes the label score of the $j$-th label of the document to be annotated, $i$ denotes the $i$-th keyword, $j$ denotes the $j$-th label, $n$ denotes the total number of keywords, $sim_{ij}$ is the similarity between the $i$-th keyword and the $j$-th label, $tf_i$ is the word frequency of the $i$-th keyword in the document to be annotated, and $\hat{tf}_i$ is the word frequency obtained by normalizing $tf_i$.
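One way to combine the quantities defined above is sketched below. The formula itself is not recoverable from this text, so the sketch rests on two assumptions: the label score is a sum over keywords of similarity multiplied by normalized word frequency, and normalization divides each word frequency by the sum of all word frequencies. The function names and the similarity values are hypothetical.

```python
def normalize_frequencies(freqs):
    """Scale word frequencies so they sum to 1 (assumed normalization scheme)."""
    total = sum(freqs.values())
    if total == 0:
        return {kw: 0.0 for kw in freqs}
    return {kw: f / total for kw, f in freqs.items()}


def label_scores(similarity, freqs, tags):
    """Assumed scoring rule: score(tag) = sum over keywords of similarity * normalized frequency."""
    weights = normalize_frequencies(freqs)
    return {
        tag: sum(similarity[(kw, tag)] * w for kw, w in weights.items())
        for tag in tags
    }


# Word frequencies taken from the example in the description.
freqs = {"artificial intelligence": 5, "big data": 2, "natural language processing": 1}
tags = ["machine learning", "databases"]
# Hypothetical similarities between each keyword and each tag.
similarity = {
    ("artificial intelligence", "machine learning"): 0.9,
    ("big data", "machine learning"): 0.5,
    ("natural language processing", "machine learning"): 0.8,
    ("artificial intelligence", "databases"): 0.2,
    ("big data", "databases"): 0.7,
    ("natural language processing", "databases"): 0.1,
}

scores = label_scores(similarity, freqs, tags)
print(scores)
```

Under these assumptions a frequent keyword such as "artificial intelligence" pulls the score of the tags it resembles upward more than a rare keyword does, which matches the role the description gives to word frequency.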
Based on the above embodiment, step 132 includes:
and screening the plurality of labels based on the label scores of the plurality of labels, the threshold scores and/or the preset label number of the document to be marked, and determining the label obtained by screening as the target label of the document to be marked.
Specifically, after the label scores of the plurality of labels are obtained, the labels can be screened in one of three ways, and the labels obtained by screening are determined as the target labels of the document to be annotated: the labels can be screened based on the label scores and the threshold score; the labels can be screened based on the label scores and the preset label number of the document to be annotated; or the labels can be screened based on the label scores together with both the threshold score and the preset label number of the document to be annotated.
The threshold score here refers to a threshold tag score, and may be preset or set according to actual situations. The preset number of labels of the document to be marked refers to the number of labels required by the document to be marked, which can be preset or set according to actual conditions, and the embodiment of the invention is not limited in particular.
For example, suppose the threshold score is 0.5, the preset label number of the document to be annotated is 5, and the label scores are 0.6 for "artificial intelligence", 0.7 for "support vector machine" and 0.8 for "natural language processing". All three scores exceed the threshold score and the number of labels does not exceed the preset label number, so screening determines "artificial intelligence", "support vector machine" and "natural language processing" as the target labels of the document to be annotated.
In addition, before the labels are screened based on the label scores of the labels, the threshold scores and/or the preset label number of the document to be marked, the label scores of the labels can be ranked, and the labels can be screened based on the ranked label scores of the labels. The label scores of the plurality of labels may be ranked from high to low, or from low to high, which is not particularly limited in the embodiment of the present invention.
According to the method provided by the embodiment of the invention, based on the label scores of the labels, the labels are screened by combining the threshold scores and/or the conditions of the preset label number of the document to be marked, and the label obtained by screening is determined to be the target label of the document to be marked, so that the accuracy of determining the target label of the document to be marked is ensured.
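The screening step can be sketched as follows. This is a minimal illustration, assuming that labels are ranked from high to low, that a label passes when its score is greater than or equal to the threshold score, and that the preset label number acts as an upper limit. The function name `select_target_labels` and the extra "blockchain" score are hypothetical.

```python
def select_target_labels(scores, threshold=None, max_labels=None):
    """Rank labels by score, then apply the threshold and/or the label-count limit."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        # Assumption: a score equal to the threshold passes.
        ranked = [(tag, score) for tag, score in ranked if score >= threshold]
    if max_labels is not None:
        ranked = ranked[:max_labels]
    return [tag for tag, _ in ranked]


scores = {
    "artificial intelligence": 0.6,
    "support vector machine": 0.7,
    "natural language processing": 0.8,
    "blockchain": 0.3,  # hypothetical low-scoring label, added to show the threshold at work
}

selected = select_target_labels(scores, threshold=0.5, max_labels=5)
print(selected)
```

Passing only `threshold` or only `max_labels` gives the two single-condition variants described above; passing both applies the threshold first and then truncates to the preset label number.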
Based on the above embodiment, step 120 includes:
step 121, keyword extraction is performed on the document to be annotated by applying a keyword extraction model, so as to obtain a plurality of keywords;
the keyword extraction model is trained based on sample texts and sample keywords corresponding to the sample texts.
Specifically, in order to extract keywords from the document to be annotated, the keyword extraction model needs to be obtained through the following steps before step 121:
Sample texts and the sample keywords corresponding to the sample texts can be collected in advance, and an initial keyword extraction model, that is, the initial model from which the keyword extraction model is trained, can be constructed. Here, the initial keyword extraction model may include a BERT model and a classification layer, where the classification layer may be a softmax layer or a CRF (Conditional Random Field) layer, which is not specifically limited in the embodiment of the present invention.
After the initial keyword extraction model is obtained, the pre-collected sample text and the sample keywords corresponding to the sample text can be applied to train the initial keyword extraction model:
the sample text can be input into an initial keyword extraction model, and the initial keyword extraction model is used for extracting keywords of the sample text to obtain and output predicted keywords of the sample text.
After obtaining the predicted keywords based on the initial keyword extraction model, the predicted keywords can be compared with sample keywords corresponding to the sample texts collected in advance, a loss function value is obtained through calculation according to the difference degree between the predicted keywords and the sample keywords, parameter iteration is carried out on the initial keyword extraction model based on the loss function value, and the initial keyword extraction model after parameter iteration is completed is recorded as a keyword extraction model.
It can be understood that the greater the degree of difference between the predicted keywords and the sample keywords corresponding to the sample text collected in advance, the greater the loss function value; the smaller the degree of difference of the predicted keyword and the sample keyword corresponding to the sample text collected in advance, the smaller the loss function value.
That is, during training the initial keyword extraction model learns how to extract keywords, so that the trained model can extract from the document to be annotated the keywords used to determine its target label.
In the related art, when a sample text and a sample keyword corresponding to the sample text are applied to perform keyword extraction model training, the sample keyword corresponding to the sample text is generally difficult to obtain, and in order to solve the above problem, in the embodiment of the present invention, the sample text is determined based on paper documents related to each tag in the tag list, and the sample keyword corresponding to the sample text is a paper keyword carried in the paper documents.
Based on the above embodiments, FIG. 3 is a schematic flow chart of the step of acquiring the sample text and the sample keywords corresponding to the sample text. As shown in FIG. 3, this acquisition step includes:
step 310, acquiring paper documents related to each tag in the tag list, wherein the paper documents carry paper keywords;
step 320, determining the sample text based on the paper document, and determining the sample keyword corresponding to the sample text based on the paper keyword.
Specifically, the paper documents related to each label in the label list can be obtained, and the paper documents carry paper keywords, namely the paper keywords do not need to be manually marked, so that a great amount of time cost is saved, and the obtaining efficiency of the follow-up sample text and the sample keywords corresponding to the sample text is improved.
It will be appreciated that after each label in the label list is obtained, the paper documents related to each label can be matched from an open-source data set, where the open-source data set may be crawled from the download websites of the paper documents.
After the paper documents related to each tag in the tag list are acquired, a sample text may be determined based on the paper documents. For example, the paper document may be directly taken as sample text, and for example, text that can represent the core ideas in the paper document may be taken as sample text.
Thereupon, sample keywords corresponding to the sample text may be determined based on the paper keywords. For example, the paper keywords carried by the paper document itself can be used as sample keywords corresponding to the sample text.
For example, the sample texts and their corresponding sample keywords may take the form [(sample text 1, [sample keyword 1 corresponding to sample text 1, ...]), (sample text 2, [sample keyword 1 corresponding to sample text 2, ...]), ..., (sample text N, [sample keyword 1 corresponding to sample text N, ...])], or the like.
According to the method provided by the embodiment of the invention, the sample text is determined based on the paper document, the paper document carries the paper keywords, and the sample keywords corresponding to the sample text are determined based on the paper keywords, namely, the sample keywords corresponding to the sample text do not need manual labeling, so that a great amount of time cost is saved.
In the related art, when a sample text and a sample keyword corresponding to the sample text are applied to perform keyword extraction model training, the whole document is generally used for the sample text, so that the training cost of the keyword extraction model is increased, and the training efficiency of the keyword extraction model is reduced.
Based on the above embodiment, step 320 includes:
determining the sample text based on the title and abstract in the paper document.
Specifically, after the paper documents related to each tag in the tag list are obtained, the sample text may be determined based on the title and abstract of each paper document. For example, the title and abstract of a paper document may be used directly as the sample text.
According to the method provided by the embodiment of the invention, the sample text is determined based on the title and abstract in the paper document. Compared with the conventional approach of using the whole document as the sample text, this improves the training efficiency of the keyword extraction model.
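As an illustration of how such training pairs might be assembled, the sketch below joins each paper's title and abstract into one sample text and reuses the paper's own keywords as the sample keywords. The function name `build_training_samples` and the fields `title`, `abstract`, and `keywords` are assumptions for illustration, as is the choice to skip papers that carry no keywords.

```python
from typing import List, Tuple


def build_training_samples(papers: List[dict]) -> List[Tuple[str, List[str]]]:
    """Return (sample text, sample keywords) pairs built from paper metadata."""
    samples: List[Tuple[str, List[str]]] = []
    for paper in papers:
        # Reuse the keywords the paper already carries; drop blank entries.
        keywords = [kw.strip() for kw in paper.get("keywords", []) if kw.strip()]
        if not keywords:
            continue  # Assumed policy: a paper without keywords gives no supervision.
        # Sample text is the title followed by the abstract, not the full document.
        sample_text = f"{paper['title'].strip()}. {paper['abstract'].strip()}"
        samples.append((sample_text, keywords))
    return samples
```

The output has the list-of-pairs shape described earlier in this section, so it can be passed to whatever training routine the keyword extraction model uses.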
Based on any one of the above embodiments, the present invention provides a document labeling method, and fig. 4 is a second schematic flow chart of the document labeling method provided by the present invention, as shown in fig. 4, the method includes:
Step 410: a document to be annotated and a tag list may be obtained.
Step 420: a keyword extraction model may be applied to extract keywords from the document to be annotated to obtain a plurality of keywords, and the word frequency of each keyword in the document to be annotated is counted. The keyword extraction model is trained based on sample texts and the sample keywords corresponding to the sample texts.
The step of obtaining the sample text and the sample keywords corresponding to the sample text comprises the following steps:
obtaining paper documents related to each label in the label list, wherein the paper documents carry paper keywords;
determining the sample text based on the titles and abstracts in the paper documents, and determining the sample keywords corresponding to the sample text based on the paper keywords.
Step 430: label scores of a plurality of labels of the document to be annotated may be determined based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be annotated.
The label scores of the plurality of labels of the document to be annotated can be determined based on the following formula:
[Formula shown as an image in the original publication]
where the symbols in the formula denote, respectively: the label score of the j-th label of the document to be annotated; the i-th keyword; the j-th label; the total number of keywords; the similarity between the i-th keyword and the j-th label; the word frequency of the i-th keyword in the document to be annotated; and the word frequency obtained by normalizing the word frequency of the i-th keyword.
Step 440: the plurality of labels may be screened based on the label scores of the plurality of labels of the document to be annotated, together with a score threshold and/or a preset number of labels, and the labels obtained by screening may be determined as the target labels of the document to be annotated.
The document labeling device provided by the invention is described below; the document labeling device described below and the document labeling method described above may be read in correspondence with each other.
Based on any one of the above embodiments, the present invention provides a document labeling device, and fig. 5 is a schematic structural diagram of the document labeling device provided by the present invention, as shown in fig. 5, where the device includes:
an obtaining unit 510, configured to obtain a document to be annotated and a tag list;
the keyword extraction unit 520 is configured to extract keywords from the document to be annotated, obtain a plurality of keywords, and count word frequencies of the keywords in the document to be annotated;
a label determining unit 530, configured to determine a target label of the document to be annotated based on the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be annotated.
The device provided by the embodiment of the invention determines the target label of the document to be annotated by combining the similarity between each keyword and each tag in the tag list with the word frequency of each keyword in the document to be annotated. Combining similarity and word frequency ensures that the target label is determined reliably and accurately; the approach does not depend on the number of annotated samples that can be collected and is easy to implement.
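To make the division of responsibilities concrete, the following sketch models units 510, 520, and 530 as three interchangeable callables wired together in sequence. The class name `DocumentLabelingDevice`, its field names, and the callable signatures are hypothetical; the text only fixes what each unit is configured to do, not how it is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class DocumentLabelingDevice:
    # Obtaining unit 510: returns the document to be annotated and the tag list.
    obtain: Callable[[], Tuple[str, List[str]]]
    # Keyword extraction unit 520: maps a document to {keyword: word frequency}.
    extract_keywords: Callable[[str], Dict[str, int]]
    # Label determining unit 530: maps keyword frequencies and tags to target labels.
    determine_labels: Callable[[Dict[str, int], List[str]], List[str]]

    def run(self) -> List[str]:
        """Run the three units in order and return the target labels."""
        document, labels = self.obtain()
        keyword_freqs = self.extract_keywords(document)
        return self.determine_labels(keyword_freqs, labels)
```

Keeping each unit behind a plain callable is a design choice of this sketch: it lets a trained extraction model or a different similarity measure be swapped in without touching the other units.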
Based on any of the above embodiments, the label determining unit specifically includes:
a label score determining unit, configured to determine label scores of a plurality of labels of the document to be annotated based on the similarity between each keyword and each tag in the tag list and the word frequency of each keyword in the document to be annotated;
a target label determining unit, configured to determine the target label of the document to be annotated based on the label scores of the plurality of labels.
Based on any of the above embodiments, the label score determining unit is specifically configured to:
determining label scores of a plurality of labels of the document to be annotated based on the following formula:
[Formula shown as an image in the original publication]
where the symbols in the formula denote, respectively: the label score of the j-th label of the document to be annotated; the i-th keyword; the j-th label; the total number of keywords; the similarity between the i-th keyword and the j-th label; the word frequency of the i-th keyword in the document to be annotated; and the word frequency obtained by normalizing the word frequency of the i-th keyword.
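Because the formula is only available as an image, the sketch below assumes one form that is consistent with the listed symbols: the score of a label is the sum, over all keywords, of the keyword-to-label similarity multiplied by the keyword's normalized word frequency. Normalizing by the sum of all frequencies is also an assumption, as is the toy `token_overlap` similarity, which merely stands in for whatever similarity measure (for example, one based on embeddings) an implementation would use.

```python
from typing import Callable, Dict, List


def label_scores(
    keyword_freqs: Dict[str, int],
    labels: List[str],
    similarity: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score each label as sum_i similarity(keyword_i, label) * normalized_tf_i."""
    total = sum(keyword_freqs.values())
    if total == 0:
        return {label: 0.0 for label in labels}
    # Assumed normalization: each frequency divided by the sum of all frequencies.
    normalized = {kw: freq / total for kw, freq in keyword_freqs.items()}
    return {
        label: sum(similarity(kw, label) * weight for kw, weight in normalized.items())
        for label in labels
    }


def token_overlap(a: str, b: str) -> float:
    """Toy Jaccard similarity over whitespace tokens (illustrative stand-in only)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0
```

Under these assumptions, a keyword that is both frequent in the document and close to a label contributes most to that label's score, which matches the stated intent of combining similarity with word frequency.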
Based on any of the above embodiments, the target label determining unit is specifically configured to:
screen the plurality of labels based on the label scores of the plurality of labels of the document to be annotated, together with a score threshold and/or a preset number of labels, and determine the labels obtained by screening as the target labels of the document to be annotated.
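The screening can be illustrated with a short sketch that supports a score threshold, a preset number of labels, or both. The function name `select_labels` and its parameters are hypothetical, and treating a score equal to the threshold as passing is an assumption, since the text does not say whether the comparison is strict.

```python
from typing import Dict, List, Optional


def select_labels(
    scores: Dict[str, float],
    threshold: Optional[float] = None,
    max_labels: Optional[int] = None,
) -> List[str]:
    """Return labels ranked by score, filtered by threshold and/or count."""
    # Rank labels from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        # Assumed comparison: keep labels whose score is at least the threshold.
        ranked = [(label, score) for label, score in ranked if score >= threshold]
    if max_labels is not None:
        ranked = ranked[:max_labels]
    return [label for label, _ in ranked]
```

When both criteria are given, the threshold is applied first and the count limit second, so the result never contains a label below the threshold even if fewer than `max_labels` remain.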
Based on any one of the above embodiments, the keyword extraction unit is specifically configured to:
extract keywords from the document to be annotated by applying a keyword extraction model to obtain a plurality of keywords;
the keyword extraction model is trained based on sample texts and sample keywords corresponding to the sample texts.
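Once the model has returned keywords, the keyword extraction unit also counts how often each keyword occurs in the document. A minimal sketch of that counting step follows; the function name `count_keyword_frequencies` is hypothetical, and case-insensitive substring counting is an assumption (a real implementation for Chinese text would more likely count over segmented tokens).

```python
from typing import Dict, List


def count_keyword_frequencies(document: str, keywords: List[str]) -> Dict[str, int]:
    """Count non-overlapping, case-insensitive occurrences of each keyword."""
    text = document.lower()
    # str.count returns the number of non-overlapping occurrences of a substring,
    # so a keyword is also counted when it appears inside a longer word.
    return {keyword: text.count(keyword.lower()) for keyword in keywords}
```

For example, calling it on "Credit risk models estimate credit risk for banks." with the keywords "credit risk", "banks", and "blockchain" yields counts of 2, 1, and 0 respectively.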
Based on any one of the above embodiments, the sample text and the sample keywords corresponding to the sample text are obtained by the following units:
a document obtaining unit, configured to obtain paper documents related to each tag in the tag list, wherein the paper documents carry paper keywords;
a text and keyword determining unit, configured to determine the sample text based on the paper document and to determine the sample keywords corresponding to the sample text based on the paper keywords.
Based on any of the above embodiments, the text and keyword determining unit is specifically configured to:
determine the sample text based on the title and abstract in the paper document.
FIG. 6 is a schematic diagram of the physical structure of an electronic device. As shown in FIG. 6, the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640, where the processor 610, the communications interface 620, and the memory 630 communicate with one another via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a document labeling method, the method comprising: acquiring a document to be annotated and a tag list; extracting keywords from the document to be annotated to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be annotated; and determining the target label of the document to be annotated based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be annotated.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product including a computer program that can be stored on a non-transitory computer-readable storage medium. When executed by a processor, the computer program performs the document labeling method provided above, the method comprising: acquiring a document to be annotated and a tag list; extracting keywords from the document to be annotated to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be annotated; and determining the target label of the document to be annotated based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be annotated.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon. When executed by a processor, the computer program performs the document labeling method provided above, the method comprising: acquiring a document to be annotated and a tag list; extracting keywords from the document to be annotated to obtain a plurality of keywords, and counting the word frequency of each keyword in the document to be annotated; and determining the target label of the document to be annotated based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be annotated.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for labeling a document, comprising:
acquiring a document to be marked and a tag list;
extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequencies of the keywords in the document to be marked;
determining a target label of the document to be marked based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be marked;
the determining the target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
determining a target label of the document to be marked based on the label scores of the plurality of labels;
the determining the label scores of the labels of the document to be labeled based on the similarity between the keywords and the labels in the label list and the word frequency of the keywords in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be annotated based on the following formula:
[Formula shown as an image in the original publication]
wherein the symbols in the formula denote, respectively: the label score of the j-th label of the document to be labeled; the i-th keyword; the j-th label; the total number of keywords; the similarity between the i-th keyword and the j-th label; the word frequency of the i-th keyword in the document to be labeled; and the word frequency obtained by normalizing the word frequency of the i-th keyword.
2. The method for labeling a document according to claim 1, wherein determining the target label of the document to be labeled based on the label scores of the plurality of labels comprises:
screening the plurality of labels based on the label scores of the plurality of labels of the document to be labeled, together with a score threshold and/or a preset number of labels, and determining the labels obtained by screening as the target labels of the document to be labeled.
3. The method for labeling a document according to claim 1 or 2, wherein the extracting keywords from the document to be labeled to obtain a plurality of keywords comprises:
extracting keywords from the document to be labeled by applying a keyword extraction model to obtain the plurality of keywords;
the keyword extraction model is trained based on sample texts and sample keywords corresponding to the sample texts.
4. The document labeling method according to claim 3, wherein the step of obtaining the sample text and the sample keyword corresponding to the sample text comprises:
acquiring paper documents related to each label in the label list, wherein the paper documents carry paper keywords;
and determining the sample text based on the paper document, and determining the sample keyword corresponding to the sample text based on the paper keyword.
5. The document labeling method of claim 4, wherein the determining the sample text based on the paper document comprises:
determining the sample text based on the title and abstract in the paper document.
6. A document labeling device, comprising:
the acquisition unit is used for acquiring the document to be marked and the tag list;
the keyword extraction unit is used for extracting keywords from the document to be marked to obtain a plurality of keywords, and counting word frequency of each keyword in the document to be marked;
a label determining unit, configured to determine a target label of the document to be labeled based on a similarity between each keyword and each label in the label list and a word frequency of each keyword in the document to be labeled;
the determining the target label of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be labeled based on the similarity between each keyword and each label in the label list and the word frequency of each keyword in the document to be labeled;
determining a target label of the document to be marked based on the label scores of the plurality of labels;
the determining the label scores of the labels of the document to be labeled based on the similarity between the keywords and the labels in the label list and the word frequency of the keywords in the document to be labeled comprises the following steps:
determining label scores of a plurality of labels of the document to be annotated based on the following formula:
[Formula shown as an image in the original publication]
wherein the symbols in the formula denote, respectively: the label score of the j-th label of the document to be labeled; the i-th keyword; the j-th label; the total number of keywords; the similarity between the i-th keyword and the j-th label; the word frequency of the i-th keyword in the document to be labeled; and the word frequency obtained by normalizing the word frequency of the i-th keyword.
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the document labeling method of any one of claims 1 to 5 when executing the program.
8. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the document labeling method according to any one of claims 1 to 5.
CN202211592980.XA 2022-12-13 2022-12-13 Document labeling method, device, electronic equipment and storage medium Active CN115659969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211592980.XA CN115659969B (en) 2022-12-13 2022-12-13 Document labeling method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211592980.XA CN115659969B (en) 2022-12-13 2022-12-13 Document labeling method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115659969A CN115659969A (en) 2023-01-31
CN115659969B true CN115659969B (en) 2023-04-28

Family

ID=85017459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211592980.XA Active CN115659969B (en) 2022-12-13 2022-12-13 Document labeling method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115659969B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant