CN112307210A

CN112307210A - A document label prediction method, system, medium and electronic device

Info

Publication number: CN112307210A
Application number: CN202011232409.8A
Authority: CN
Inventors: 李开兴; 邓黎; 唐建烊; 宗涵
Original assignee: CISDI Engineering Co Ltd; CISDI Technology Research Center Co Ltd
Current assignee: CISDI Engineering Co Ltd; CISDI Technology Research Center Co Ltd
Priority date: 2020-11-06
Filing date: 2020-11-06
Publication date: 2021-02-02
Anticipated expiration: 2040-11-06
Also published as: CN112307210B

Abstract

The invention provides a document label prediction method, system, medium and electronic device. The method comprises the following steps: extracting keywords from original documents to obtain a keyword set of the original documents; classifying the keywords in the keyword set to obtain a classification system of the documents corresponding to the keywords; labeling the classification system of the documents corresponding to the keywords to obtain a training data set; inputting the training data set into a document classification neural network for training to obtain a document label prediction model; and inputting a document to be predicted into the document label prediction model to perform label prediction on the document to be predicted. By processing the original documents to obtain the document label prediction model and inputting the document to be predicted into the trained model, the method realizes label prediction for the document to be predicted, with a high degree of matching between document and label, convenient implementation, and high accuracy.

Description

Document tag prediction method, system, medium and electronic device

Technical Field

The present invention relates to the field of electronics, and in particular, to a method, a system, a medium, and an electronic device for predicting a document tag.

Background

Text is a carrier of human civilization and contains a large amount of valuable information, but it is typical unstructured data, which makes it difficult to mark text content with corresponding labels. At present, labels are usually added to documents manually, so the matching degree between the labels and the document content is low, the accuracy is low, and the working efficiency is low.

Disclosure of Invention

The invention provides a document tag prediction method, a system, a medium and an electronic device, which aim to solve the problems that tags are not convenient to add to documents and the matching degree is low in the prior art.

The document tag prediction method provided by the invention comprises the following steps:

extracting keywords according to an original document to obtain a keyword set of the original document;

classifying the keywords in the keyword set to obtain a classification system of the documents corresponding to the keywords;

labeling a classification system of the document corresponding to the keyword to obtain a training data set;

inputting the training data set into a document classification neural network for training to obtain a document label prediction model;

and inputting the document to be predicted into the document label prediction model, and performing label prediction on the document to be predicted.

Optionally, the step of obtaining the training data set includes:

acquiring associated vocabularies of the keywords in the keyword set, wherein the associated vocabularies and the keywords in the keyword set have a hypernym-hyponym (superordinate-subordinate) relation;

classifying original documents according to keywords in a keyword set and the associated vocabularies to obtain a document classification system, and taking the keywords in the keyword set and the associated vocabularies as associated keywords in the document classification system;

and labeling the document classification system to obtain the training data set.

Optionally, the step of extracting the keyword according to the original document includes:

acquiring an original document;

performing word segmentation on the original document, acquiring a first original vocabulary set, and further acquiring word frequency of vocabularies in the first original vocabulary set;

determining irrelevant words according to the word frequency of the words in the first original vocabulary set, and thereby acquiring a stop word set;

and screening stop words in the keyword set according to the stop word set, and further determining the keyword set.

Optionally, the step of performing label prediction on the document to be predicted includes:

performing word segmentation on the document to be predicted and removing stop words so as to obtain a word set to be predicted;

vectorizing the vocabulary to be predicted to obtain the vectorized vocabulary to be predicted;

vectorizing the document to be predicted according to the vectorized vocabulary to be predicted to obtain a document vector to be predicted;

and inputting the document vector to be predicted into the trained document label prediction model, and performing label prediction on the document to be predicted.
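The vectorization steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the document has already been segmented into tokens, that word vectors come from a hypothetical lookup table `WORD_VECTORS`, and that the document vector is the average of its word vectors (the text does not state how word vectors are combined).

```python
def document_vector(tokens, word_vectors, stop_words):
    """Average the vectors of the tokens that survive stop-word removal.

    Averaging is an assumption made for this sketch. Returns None when no
    token has a vector.
    """
    kept = [word_vectors[t] for t in tokens
            if t not in stop_words and t in word_vectors]
    if not kept:
        return None
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Hypothetical two-dimensional word vectors, for illustration only.
WORD_VECTORS = {"steel": [1.0, 0.0], "furnace": [0.0, 1.0], "policy": [1.0, 1.0]}
STOP_WORDS = {"the"}

doc_vec = document_vector(["the", "steel", "furnace"], WORD_VECTORS, STOP_WORDS)
```

In practice the lookup table would be replaced by whatever word embedding the deployment uses; only the order of operations (segment, filter, vectorize words, vectorize document) follows the text.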

Optionally, the step of inputting the document vector to be predicted into the document tag prediction model and performing tag prediction on the document to be predicted includes:

extracting keywords from the document to be predicted according to the vector of the document to be predicted, and matching the obtained keywords with associated keywords in different categories to obtain a matching result;

classifying and labeling the documents to be predicted according to the matching result to obtain the categories of the documents to be predicted;

and performing label prediction on the document to be predicted according to the associated keywords corresponding to the category of the document to be predicted.

Optionally, the step of performing label prediction on the document to be predicted according to the associated keyword corresponding to the category of the document to be predicted includes:

acquiring the weight of the associated keywords corresponding to the category of the document to be predicted;

acquiring the scores of the associated keywords according to the weights of the associated keywords and the word frequencies of the associated keywords in the original document;

performing label prediction on the document to be predicted according to the scores of the associated keywords;

the mathematical expression of the weight of the associated keyword corresponding to the category of the document to be predicted is obtained as follows:

wherein w is the weight of the associated keyword, n_word is the number of occurrences of the associated keyword, and n_doc is the number of original documents corresponding to the associated keywords of the same category.

Optionally, the step of performing label prediction on the document to be predicted according to the score of the associated keyword includes:

when the score of the associated keyword is larger than the preset score threshold value, performing label prediction on the document to be predicted, wherein the mathematical expression of obtaining the score of the associated keyword is as follows:

wherein s is the association score, n is the number of high-frequency words, w_i is the weight corresponding to the i-th high-frequency word, x_i is the word frequency of the i-th high-frequency word, and i is the index of the word.

The invention also provides a document tag prediction system, comprising:

the preprocessing module is used for extracting keywords according to an original document to obtain a keyword set of the original document; classifying the keywords in the keyword set to obtain a classification system of the documents corresponding to the keywords;

the processing module is used for labeling the classification system of the document corresponding to the keyword to obtain a training data set; inputting the training data set into a document classification neural network for training to obtain a document label prediction model;

the prediction module is used for inputting the document to be predicted into the document label prediction model and performing label prediction on the document to be predicted; the preprocessing module, the processing module and the prediction module are connected.

The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method as defined in any one of the above.

The present invention also provides an electronic terminal, comprising: a processor and a memory;

the memory is adapted to store a computer program and the processor is adapted to execute the computer program stored by the memory to cause the terminal to perform the method as defined in any one of the above.

The invention has the following beneficial effects: in the document label prediction method, the original documents are processed to obtain the document label prediction model, and the document to be predicted is input into the trained document label prediction model, so that label prediction for the document to be predicted is realized, with a high degree of matching between document and label, convenient implementation, and high accuracy.

Drawings

FIG. 1 is a flow chart of a document tag prediction method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a document tag prediction method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a document tag prediction system in an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

The inventors have found that text is typical unstructured data, which makes it difficult to mark text content with corresponding labels. At present, labels are usually added to documents manually, so the matching degree between the labels and the document content is low, the accuracy is low, and the working efficiency is low.

As shown in fig. 1, the document tag prediction method in the present embodiment includes:

s101: extracting keywords according to an original document to obtain a keyword set of the original document; wherein the original document includes text data such as news, policies and comments, and the content of the original documents can be adjusted according to the application scenario. For example, when label prediction and/or recommendation is performed on policy files, related policy files can be used as the original documents, which increases the relevance of the original documents and improves the matching degree between the predicted labels and the documents to be predicted;

according to the original document, the step of extracting the key words comprises the following steps:

acquiring the word frequency and the inverse document frequency of the vocabulary in the original document, acquiring the weight corresponding to the vocabulary in the original document according to the word frequency and the inverse document frequency of the vocabulary in the original document, and extracting the keywords according to the acquired weight, wherein the mathematical expression of the weight corresponding to the vocabulary in the original document is as follows:

TF-IDF＝TF×IDF

the TF-IDF is the weight of the vocabulary in the original document, the TF is the word frequency of the vocabulary in the original document, and the IDF is the inverse document frequency corresponding to the vocabulary in the original document;

the mathematical expressions for obtaining the word frequency and the inverse document frequency are as follows:

calculating to obtain the TF-IDF value of each word according to the calculation formula, and extracting keywords according to the TF-IDF value, for example: taking a vocabulary with a larger TF-IDF value as a keyword of the document;
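The keyword extraction by TF-IDF described above can be sketched as follows. The expressions for TF and IDF are not reproduced in this text, so the sketch assumes the common definitions: TF is the count of a word in a document divided by the number of words in that document, and IDF is the logarithm of the number of documents divided by the number of documents containing the word. The function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: TF-IDF weight} dict per tokenized document.

    Assumed definitions (not taken from the patent text):
      TF  = occurrences of the word in the document / words in the document
      IDF = log(number of documents / number of documents containing the word)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def top_keywords(doc_weights, k=2):
    """Take the k words with the largest TF-IDF weight as keywords."""
    ranked = sorted(doc_weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


docs = [
    ["steel", "furnace", "policy", "steel"],
    ["policy", "tax", "subsidy"],
    ["furnace", "maintenance", "policy"],
]
weights = tf_idf(docs)
keywords = top_keywords(weights[0], k=1)
```

Here "policy" occurs in every document, so its IDF and therefore its weight is zero, while "steel" is frequent in the first document and absent elsewhere, which makes it that document's top keyword.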

s102: classifying the keywords in the keyword set to obtain a classification system of the documents corresponding to the keywords, labeling the classification system of the documents corresponding to the keywords to obtain a training data set;

the step of establishing a classification system comprises the following steps: merging identical and similar vocabularies to form different categories, and mining the hypernym-hyponym (superordinate-subordinate) relations among the different categories, for example, automobile is a superordinate concept of gearbox; collecting the categories of different topics to form classification trees of the different topics, and thereby forming a complete classification system;
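One way to represent such classification trees is a table that maps each category to its superordinate concept. The sketch below is illustrative only: the `HYPERNYM` table is hypothetical apart from the automobile/gearbox pair given in the text, and it assumes the relations contain no cycles.

```python
# Hypothetical hypernym table: each category maps to its superordinate concept.
HYPERNYM = {
    "gearbox": "automobile",
    "engine": "automobile",
    "automobile": "vehicle",
    "blast furnace": "ironmaking",
}


def build_tree(hypernym):
    """Invert child -> parent links into parent -> sorted list of children."""
    tree = {}
    for child, parent in hypernym.items():
        tree.setdefault(parent, []).append(child)
    return {parent: sorted(children) for parent, children in tree.items()}


def path_to_root(category, hypernym):
    """Walk upward from a category to the root of its classification tree."""
    path = [category]
    while path[-1] in hypernym:
        path.append(hypernym[path[-1]])
    return path


tree = build_tree(HYPERNYM)
gearbox_path = path_to_root("gearbox", HYPERNYM)
```

Each root (here "vehicle" and "ironmaking") corresponds to one topic tree, and the set of all trees plays the role of the complete classification system.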

s103: inputting the training data set into a document classification neural network for training to obtain a document label prediction model;

s104: inputting a document to be predicted into the document label prediction model, and performing label prediction on the document to be predicted. Keywords are extracted from the high-frequency vocabulary set of the original documents, and the keywords in the keyword set are classified and labeled to obtain the training data set, so that the training data set has comprehensive coverage and the document classification is accurate. The training data set is input into the document classification neural network for training to obtain the document label prediction model, which, through deep learning, can classify input documents and predict their labels. The document to be predicted is then input into the document label prediction model, so that label prediction for the document to be predicted is realized, with a high degree of matching between document and label, convenient implementation, high accuracy and low cost.

As shown in FIG. 2, a document tag prediction method in some embodiments includes:

s201: acquiring an original document, segmenting words of the original document, acquiring a first original vocabulary set, and further acquiring word frequency of the vocabulary in the first original vocabulary set;

s202: determining a keyword set of the original document according to the word frequency of the vocabulary in the first original vocabulary set; that is, determining irrelevant vocabulary according to the word frequency of the vocabulary in the first original vocabulary set so as to obtain a stop word set; extracting keywords from the original document to obtain a keyword set of the original document; and screening the keyword set against the stop word set to determine the final keyword set;
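The stop-word screening in s202 can be sketched as follows. The text says only that irrelevant vocabulary is determined from word frequency; the rule used here (a word is a stop word if it appears in more than a given share of the documents) and the `max_doc_ratio` parameter are assumptions made for illustration.

```python
from collections import Counter


def derive_stop_words(docs, max_doc_ratio=0.8):
    """Treat words present in more than max_doc_ratio of documents as stop words.

    The threshold rule is an assumption; the source does not specify how
    irrelevant words are derived from word frequency.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    return {w for w, df in doc_freq.items() if df / len(docs) > max_doc_ratio}


def filter_keywords(keywords, stop_words):
    """Drop stop words from a candidate keyword list, keeping the order."""
    return [w for w in keywords if w not in stop_words]


docs = [
    ["the", "furnace", "is", "hot"],
    ["the", "policy", "is", "new"],
    ["the", "gearbox", "is", "worn"],
]
stop_words = derive_stop_words(docs)
keywords = filter_keywords(["the", "furnace", "policy", "is"], stop_words)
```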

s203: classifying the original documents according to the keywords in the keyword set and the associated vocabularies having hypernym-hyponym relations with those keywords, to obtain a document classification system; that is, acquiring the associated vocabularies of the keywords in the keyword set, wherein the associated vocabularies have hypernym-hyponym relations with the keywords in the keyword set; classifying the original documents according to the keywords in the keyword set and the associated vocabularies to obtain the document classification system; and taking the keywords in the keyword set and the associated vocabularies as the associated keywords of the different categories in the document classification system. Classifying the original documents by both the keywords and their associated vocabularies improves the accuracy of the classification;
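A minimal sketch of s203, assuming each category starts from a few keywords and a hypothetical table `ASSOCIATED` supplies their hypernym-hyponym vocabulary. Assigning a document to the category with the most matching words is an assumption; the text does not fix the matching rule.

```python
def build_associated_keywords(category_keywords, associated):
    """Merge each category's keywords with their associated vocabulary."""
    merged = {}
    for category, keywords in category_keywords.items():
        words = set(keywords)
        for keyword in keywords:
            words.update(associated.get(keyword, ()))
        merged[category] = words
    return merged


def classify(tokens, associated_keywords):
    """Return the category with the most matching tokens, or None if no match.

    Ties go to the alphabetically first category (an arbitrary choice).
    """
    best, best_hits = None, 0
    for category in sorted(associated_keywords):
        hits = sum(1 for t in tokens if t in associated_keywords[category])
        if hits > best_hits:
            best, best_hits = category, hits
    return best


# Hypothetical categories and associated vocabulary.
CATEGORY_KEYWORDS = {"automotive": ["automobile"], "metallurgy": ["steel"]}
ASSOCIATED = {
    "automobile": ["gearbox", "engine"],
    "steel": ["blast furnace", "converter"],
}

associated_keywords = build_associated_keywords(CATEGORY_KEYWORDS, ASSOCIATED)
category = classify(["the", "gearbox", "engine", "failed"], associated_keywords)
```

Without the associated vocabulary, the example document would match nothing, because it never mentions "automobile" itself; this is the gain in classification accuracy that the step describes.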

s204: labeling the document classification system to obtain the training data set;

in some embodiments, the document classification system is also labeled to obtain a test data set; the test data set is input into the document label prediction model to test the model and thereby ensure its prediction accuracy;
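Holding out a test data set can be sketched as follows. The split ratio, the fixed random seed, and the function names are assumptions for illustration; the text only states that a labeled test data set is used to test the model.

```python
import random


def split_labeled(samples, test_ratio=0.2, seed=0):
    """Shuffle labeled (tokens, label) pairs and split into train and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict_fn, test_set):
    """Fraction of test documents whose predicted label equals the annotation."""
    correct = sum(1 for tokens, label in test_set if predict_fn(tokens) == label)
    return correct / len(test_set)


samples = [(["steel"], "metallurgy")] * 5 + [(["gearbox"], "automotive")] * 5
train_set, test_set = split_labeled(samples)


def toy_predict(tokens):
    """Stand-in for the trained model, used only to exercise accuracy()."""
    return "metallurgy" if "steel" in tokens else "automotive"


test_accuracy = accuracy(toy_predict, test_set)
```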

s205: building a document classification neural network based on deep learning, inputting the training data set into the document classification neural network for training, and obtaining a document label prediction model;
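The training step in s205 can be illustrated with a deliberately small stand-in. The text does not specify the network architecture, so the sketch below is an assumption: a single-layer softmax classifier over bag-of-words counts, trained by stochastic gradient descent in pure Python. A real deployment would use a deeper network and a deep learning framework; the training data here is hypothetical.

```python
import math


def vectorize(tokens, index):
    """Bag-of-words count vector over a fixed vocabulary index."""
    x = [0.0] * len(index)
    for token in tokens:
        if token in index:
            x[index[token]] += 1.0
    return x


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(samples, labels, epochs=50, lr=0.5):
    """Fit one weight row per label with stochastic gradient descent."""
    vocab = sorted({t for tokens, _ in samples for t in tokens})
    index = {word: i for i, word in enumerate(vocab)}
    weights = [[0.0] * len(vocab) for _ in labels]
    for _ in range(epochs):
        for tokens, label in samples:
            x = vectorize(tokens, index)
            probs = softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
            for k, row in enumerate(weights):
                # Gradient of the cross-entropy loss with respect to the score.
                error = probs[k] - (1.0 if labels[k] == label else 0.0)
                for j, value in enumerate(x):
                    row[j] -= lr * error * value
    return weights, index


def predict(tokens, weights, index, labels):
    """Return the label with the highest score for a tokenized document."""
    x = vectorize(tokens, index)
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return labels[scores.index(max(scores))]


LABELS = ["automotive", "metallurgy"]
TRAINING_SET = [
    (["steel", "furnace"], "metallurgy"),
    (["gearbox", "engine"], "automotive"),
    (["steel", "converter"], "metallurgy"),
    (["engine", "clutch"], "automotive"),
]
weights, index = train(TRAINING_SET, LABELS)
```

Calling `predict(["furnace", "steel"], weights, index, LABELS)` then returns the category learned for those words.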

s206: inputting the document to be predicted into the document label prediction model, and performing label prediction on the document to be predicted, wherein this step comprises: performing word segmentation on the document to be predicted and removing stop words to obtain a vocabulary set to be predicted; vectorizing the vocabulary to be predicted to obtain the vectorized vocabulary to be predicted; vectorizing the document to be predicted according to the vectorized vocabulary to be predicted to obtain a document vector to be predicted; inputting the document vector to be predicted into the document label prediction model, which extracts keywords from the document to be predicted according to the document vector and matches the obtained keywords with the associated keywords of the different categories to obtain a matching result; and classifying and labeling the document to be predicted according to the matching result to obtain the category of the document to be predicted;

performing label prediction on the document to be predicted according to the associated keywords corresponding to the category of the document to be predicted; acquiring the weight of the associated keywords corresponding to the category of the document to be predicted; acquiring the scores of the associated keywords according to the weights of the associated keywords and the word frequencies of the associated keywords in the original document; performing label prediction on the document to be predicted according to the scores of the associated keywords;

when the score of the associated keyword is greater than the preset score threshold, performing label prediction on the document to be predicted, wherein the mathematical expression of the weight of the associated keyword corresponding to the category of the document to be predicted is obtained as follows:

wherein w is the weight of the associated keyword, n_word is the number of occurrences of the associated keyword, and n_doc is the number of original documents corresponding to the associated keywords in the same category; in some embodiments, the obtained weights may be further normalized to obtain the normalized weights of the associated keywords in the keyword sets of the corresponding categories;

obtaining a mathematical expression of the score of the associated keyword as:

wherein s is the association score, n is the number of high-frequency words, w_i is the weight corresponding to the i-th high-frequency word, x_i is the word frequency of the i-th high-frequency word, and i is the index of the word.
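The weighting and scoring can be sketched as follows. The two expressions themselves are not reproduced in this text, so the forms used here are assumptions inferred from the variable definitions only: the weight is taken as w = n_word / n_doc, and the score as the weighted sum s = w_1·x_1 + ... + w_n·x_n. The patent's actual expressions may differ.

```python
def keyword_weights(occurrences, n_doc):
    """ASSUMED form w = n_word / n_doc; the source expression is not shown."""
    return {word: n_word / n_doc for word, n_word in occurrences.items()}


def normalize(weights):
    """Scale the weights of one category so that they sum to 1."""
    total = sum(weights.values())
    return {word: w / total for word, w in weights.items()}


def association_score(weights, frequencies):
    """ASSUMED form s = sum over i of w_i * x_i; the source expression is not shown."""
    return sum(w * frequencies.get(word, 0) for word, w in weights.items())


def passes_threshold(weights, frequencies, threshold):
    """Label prediction proceeds only when the score exceeds the preset threshold."""
    return association_score(weights, frequencies) > threshold


# Hypothetical counts for one category with 4 original documents.
weights = normalize(keyword_weights({"steel": 6, "furnace": 2}, n_doc=4))
s = association_score(weights, {"steel": 2, "furnace": 4})
```

With these numbers the normalized weights are 0.75 and 0.25, and the score is 0.75 × 2 + 0.25 × 4 = 2.5, which would pass a preset threshold of 2.0.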

The document tag prediction method provided by this embodiment may also be applied to a number of application scenarios, such as document search, personalized recommendation, and knowledge graph construction. For example, keywords are input into the document label prediction model for classification and matching, and documents with a high matching degree are selected for recommendation. Alternatively, a document to be predicted is input into the document label prediction model, the keywords of the document are extracted, a feature vector is obtained for each keyword, the keyword feature vectors are spliced together into a spliced feature vector of the document, and the spliced feature vector is input into the document label prediction model to perform label prediction on the document to be predicted.
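Splicing keyword feature vectors can be sketched as plain concatenation. The fixed keyword count with zero padding is an assumption added so that every document yields a vector of the same length; the text does not say how documents with different numbers of keywords are handled.

```python
def splice(keyword_vectors, max_keywords, dim):
    """Concatenate up to max_keywords vectors, zero-padding to a fixed length."""
    spliced = []
    for vec in keyword_vectors[:max_keywords]:
        spliced.extend(vec)
    spliced.extend([0.0] * (max_keywords * dim - len(spliced)))
    return spliced


# Two hypothetical 2-dimensional keyword vectors, padded to three slots.
spliced_vector = splice([[1.0, 2.0], [3.0, 4.0]], max_keywords=3, dim=2)
```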

In some embodiments, after label prediction has been performed on the document to be predicted, labels may also be recommended for the document by a label recommendation model. The user's feedback on the recommended labels is received, and the document label prediction model is updated according to the feedback so as to improve the accuracy of document label prediction. The label recommendation model recommends labels according to the label prediction results and serves to prompt and assist the user during labeling. The label recommendation model is obtained as follows: one or more label prediction results are obtained and input into a deep learning neural network for training to obtain the label recommendation model.

As shown in fig. 3, the present embodiment also provides a document tag prediction system, including:

the prediction module is used for inputting the document to be predicted into the document label prediction model and performing label prediction on the document to be predicted; the preprocessing module, the processing module and the prediction module are connected. The system processes the original documents to obtain the document label prediction model and inputs the document to be predicted into that model, so that label prediction and/or recommendation for the document to be predicted is realized, with a high degree of matching between document and label, convenient implementation, and high accuracy.

The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.

The present embodiment further provides an electronic terminal, including: a processor and a memory;

the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the method in the embodiment.

The computer-readable storage medium in the present embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The electronic terminal provided by the embodiment comprises a processor, a memory, a transceiver and a communication interface, wherein the memory and the communication interface are connected with the processor and the transceiver and are used for completing mutual communication, the memory is used for storing a computer program, the communication interface is used for carrying out communication, and the processor and the transceiver are used for operating the computer program so that the electronic terminal can execute the steps of the method.

In this embodiment, the memory may include a Random Access Memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A document tag prediction method, comprising:

2. The document tag prediction method of claim 1, wherein the step of obtaining a training data set comprises:

3. The method of claim 1, wherein the step of extracting keywords from the original document comprises:

acquiring an original document;

4. The document tag prediction method according to claim 1, wherein the step of performing tag prediction on the document to be predicted comprises:

5. The training set obtaining method according to claim 4, wherein the document vector to be predicted is input into the document label prediction model for training, and the step of performing label prediction on the document to be predicted comprises:

6. The document tag recommendation method according to claim 5, wherein the step of performing tag prediction on the document to be predicted according to the associated keywords corresponding to the category of the document to be predicted comprises:

7. The document tag prediction method according to claim 6, wherein the step of performing tag prediction on the document to be predicted according to the score of the associated keyword comprises:

8. A document tag prediction system, comprising:

9. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.

10. An electronic terminal, comprising: a processor and a memory;

the memory is for storing a computer program and the processor is for executing the computer program stored by the memory to cause the terminal to perform the method of any of claims 1 to 7.