CN112800223A - Content recall method and system based on long text labeling - Google Patents


Info

Publication number
CN112800223A
CN112800223A
Authority
CN
China
Prior art keywords
data set
news
model
content
textcnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110104006.3A
Other languages
Chinese (zh)
Inventor
陈倩倩
景艳山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd filed Critical Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202110104006.3A
Publication of CN112800223A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374Thesaurus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The application relates to a content recall method and a content recall system based on long text labeling, wherein the method comprises the following steps: a label system construction step for constructing a label system; a data preprocessing step for acquiring an original data set and constructing input data and a dictionary based on the original data set and the label system; a news classification step for constructing a TextCNN model, training the TextCNN model into a target TextCNN model using a test data set and a labeled data set, performing classification prediction on the test data set with the target TextCNN model, and outputting classification labels; and a content recall step for representing user data, news content and classification labels as a graph, then screening and recalling news nodes according to their relevance to the user node. By predicting classification labels with TextCNN and recalling news based on those labels, the method and the system improve the diversity of recall.

Description

Content recall method and system based on long text labeling
Technical Field
The application relates to the technical field of internet, in particular to a content recall method and system based on long text labeling.
Background
News content recall is an important task in the field of news recommendation. In the prior art, because news texts are long and carry a very large total number of features, the news recalled by traditional news recall methods often does not match user interests.
Multi-label text classification is an important part of natural language processing. Its main purpose is to divide news text content according to a constructed classification system and to label the news content so as to assist user analysis and user insight: it helps business personnel quickly understand news and discover its salient features, yields business inspiration, helps identify user preferences so that better and more suitable news can be recommended to users, enriches the dimensionality of the data, and supports business implementation.
At present, text classification mainly relies on machine learning and deep learning algorithms: word2vec (a tool for converting words into vector form) is used to extract semantic information from news content, and classification models based on machine learning and deep learning are built to automatically classify the news content of online news platforms. Text classification based on a BERT (Bidirectional Encoder Representations from Transformers) pre-trained model performs well, but because the model is complex, it is currently difficult to deploy in industrial settings.
Because of video memory and computing power limitations, the input of pre-trained language models such as BERT is generally capped at 512 tokens, so when handling long text classification BERT may be less effective than a CNN (Convolutional Neural Network). However, although CNNs perform well on text classification tasks, they struggle to capture long-range context and the correlations between non-consecutive words. BERT, though powerful, is generally infeasible to deploy directly as a classification model in low-latency scenarios with few machines; training a lightweight shallow BERT instead requires retraining, because the news domain differs from the original pre-training domain, and even then the inference time remains a problem once business-related features are added. In addition, unbalanced sample distributions cause severe bias in classification models.
Disclosure of Invention
The embodiment of the application provides a content recall method, a content recall system, computer equipment and a computer readable storage medium based on long text labeling.
In a first aspect, an embodiment of the present application provides a content recall method based on long text labeling, including:
a label system construction step, which is used for constructing a label system;
a data preprocessing step, which is used for acquiring an original data set and constructing input data and a dictionary based on the original data set and the label system;
a news classification step, which is used for constructing a TextCNN model, training the TextCNN model into a target TextCNN model by using a test data set and a labeled data set, performing classification prediction on the test data set by using the target TextCNN model, and outputting a classification label; and
a content recall step of representing the user data, the news content and the classification labels as a graph, and screening and recalling news nodes according to their relevance to the user node. Specifically, according to user behaviors, the user data, the news content and the classification labels are represented in graph form as user nodes, news nodes and label nodes, and weights are set between the nodes; the graph consists of vertices, edges and edge weights. The relevance of each news node to the user node is calculated on the graph with the PersonalRank algorithm, and news nodes are screened and recalled in descending order of relevance within a preset range. Specifically, the PersonalRank algorithm scores each node through its connected edges to obtain the relevance, and the top-N news items are recalled in descending order of relevance, where N is a natural number.
Based on the above steps, the present application uses the TextCNN model to effectively avoid the vanishing-gradient problem that arises when an LSTM model is used for long text classification in the prior art, and increases the diversity of recall by recalling news content with the graph recommendation algorithm PersonalRank.
In some of these embodiments, the news classifying step further comprises:
a model building step for building the TextCNN model with an attention mechanism, where the attention mechanism is introduced into the model to make it easier to introduce business-related features;
a data distillation step for training a BERT model on the test data set, labeling the unlabeled data set with the trained BERT model to obtain a labeled data set, and then training the TextCNN model with the test data set and the labeled data set to obtain the target TextCNN model, wherein the model inputs for the test data set and the unlabeled data set are constructed in the data preprocessing step;
and a data label obtaining step, which is used for carrying out classification prediction on the test data set by using the target TextCNN model to obtain a corresponding classification label.
In some of these embodiments, the TextCNN model further comprises:
the word vector layer is used for converting input data into word vectors and outputting the word vectors;
an attention mechanism layer for creating a context vector for each word, effectively capturing correlations between long-range context and non-consecutive words;
a convolutional layer, into which the word vectors and context vectors are input as word representations for the convolution operation, followed by activation with an activation function, where each convolution kernel produces a corresponding feature map and the layer comprises at least six convolution kernels of sizes 2 × 5, 3 × 5 and 5 × 5, two of each size;
a pooling layer for pooling the feature maps output by the convolutional layer; optionally, the pooling layer extracts the maximum value of each feature map using max pooling and then concatenates the maxima to obtain the feature representation;
and an output layer for concatenating features and classifying based on a Concat vector layer and a softmax layer, outputting the classification label.
In some of these embodiments, the data preprocessing step further comprises:
an original data set obtaining step of obtaining the original data set, the original data set further comprising: user data, news content, and news headlines;
an input data construction step, which is used for extracting the key words of the news content and combining the news title, the key words and the news content to obtain input data;
a dictionary construction step for segmenting the input data with a single word (character) as the unit to obtain a number of tokens, counting the frequency of each token, screening and sorting the tokens in descending order based on a frequency threshold min_freq, and building the dictionary based on a set dictionary size max_size; specifically, the tokens whose frequency exceeds the preset threshold min_freq are sorted in descending order of frequency, and the first max_size tokens are taken to build the dictionary.
In some of these embodiments, the data distillation step further comprises:
a BERT model training step of training the BERT model based on the test data set by using a BERT pre-training model;
a data labeling step of predicting the unlabeled data set with the BERT model and outputting the predictions whose confidence exceeds 0.9 as a supplementary corpus, obtaining the labeled data set;
and a textCNN model training step, which is used for training the textCNN model based on the test data set and the labeling data set to obtain a target textCNN model.
In order to solve the problem that the scale of unmarked data of a news recommendation scene is large, based on the steps, a BERT model is used for carrying out pseudo marking on the unmarked data, and then a TextCNN model is used for learning, so that the effectiveness and the accuracy of a classification result are effectively improved.
In some of these embodiments, when calculating the attention weights of the attention mechanism, the TextCNN model attenuates them with a Gaussian kernel centered on the current word. This improved attention mechanism effectively reduces the time complexity of the operation.
In a second aspect, an embodiment of the present application provides a content recall system based on long text tagging, including:
the label system building module is used for building a label system;
the data preprocessing module is used for acquiring an original data set and constructing input data and a dictionary based on the original data set and the label system;
the news classification module is used for constructing a TextCNN model, training the TextCNN model into a target TextCNN model by utilizing a test data set and a labeling data set, performing classification prediction on the test data set by utilizing the target TextCNN model and outputting a classification label;
the content recall module is used for representing the user data, the news content and the classification labels as a graph, and screening and recalling news nodes according to their relevance to the user node. Specifically, according to user behaviors, the user data, the news content and the classification labels are represented in graph form as user nodes, news nodes and label nodes, and weights are set between the nodes; the graph consists of vertices, edges and edge weights. The relevance of each news node to the user node is calculated on the graph with the PersonalRank algorithm, and news nodes are screened and recalled in descending order of relevance within a preset range. Specifically, the PersonalRank algorithm scores each node through its connected edges to obtain the relevance, and the top-N news items are recalled in descending order of relevance, where N is a natural number.
Based on the above modules, the present application uses the TextCNN model to effectively avoid the vanishing-gradient problem that arises when an LSTM model is used for long text classification in the prior art, and increases the diversity of recall by recalling news content with the graph recommendation algorithm PersonalRank.
In some of these embodiments, the news classification module further comprises:
the model building module for building the TextCNN model with an attention mechanism, which is introduced into the model to make it easier to introduce business-related features;
the data distillation module is used for training the BERT model based on the test data set, labeling the unlabeled data set by using the trained BERT model to obtain a labeled data set, and then training the TextCNN model by using the test data set and the labeled data set to obtain a target TextCNN model;
and the data label acquisition module is used for carrying out classification prediction on the test data set by using the target TextCNN model to obtain a corresponding classification label.
In some of these embodiments, the TextCNN model further comprises:
the word vector layer is used for converting input data into word vectors and outputting the word vectors;
an attention mechanism layer for creating a context vector for each word, effectively capturing correlations between long-range context and non-consecutive words;
a convolutional layer, into which the word vectors and context vectors are input as word representations for the convolution operation, followed by activation with an activation function, where each convolution kernel produces a corresponding feature map and the layer comprises at least six convolution kernels of sizes 2 × 5, 3 × 5 and 5 × 5, two of each size;
a pooling layer for pooling the feature maps output by the convolutional layer; optionally, the pooling layer extracts the maximum value of each feature map using max pooling and then concatenates the maxima to obtain the feature representation;
and an output layer for concatenating features and classifying based on a Concat vector layer and a softmax layer, outputting the classification label.
In some embodiments, the data preprocessing module further comprises:
an original data set obtaining module, configured to obtain the original data set, where the original data set further includes: user data, news content, and news headlines;
the input data construction module is used for extracting the keywords of the news content and combining the news title, the keywords and the news content to obtain input data;
the dictionary building module for segmenting the input data with a single word (character) as the unit to obtain a number of tokens, counting the frequency of each token, screening and sorting the tokens in descending order based on a frequency threshold min_freq, and building the dictionary based on a set dictionary size max_size; specifically, the tokens whose frequency exceeds the preset threshold min_freq are sorted in descending order of frequency, and the first max_size tokens are taken to build the dictionary.
In some of these embodiments, the data distillation module further comprises:
a BERT model training module to train the BERT model based on the test data set using a BERT pre-training model;
the data labeling module for predicting the unlabeled data set with the BERT model and outputting the predictions whose confidence exceeds 0.9 as a supplementary corpus, obtaining the labeled data set;
and the TextCNN model training module is used for training the TextCNN model based on the test data set and the labeling data set to obtain a target TextCNN model.
In order to solve the problem that the scale of unmarked data of a news recommendation scene is large, based on the modules, a BERT model is used for carrying out pseudo marking on the unmarked data, and then a TextCNN model is used for learning, so that the effectiveness and the accuracy of a classification result are effectively improved.
In some of these embodiments, when calculating the attention weights of the attention mechanism, the TextCNN model attenuates them with a Gaussian kernel centered on the current word. This improved attention mechanism effectively reduces the time complexity of the operation.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the content recall method based on long text tagging as described in the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the content recall method based on long text tagging as described in the first aspect above.
Compared with the related art, the content recall method, content recall system, computer device and computer-readable storage medium based on long text labeling use the TextCNN model to effectively avoid the vanishing-gradient problem that arises when an LSTM model is used for long text classification in the prior art, and increase the diversity of recall by recalling news content with the graph recommendation algorithm PersonalRank.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow diagram of a content recall method based on long text tagging according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a TextCNN model according to an embodiment of the present application;
fig. 3 is a block diagram of a content recall system based on long text tagging according to an embodiment of the present application.
Description of the drawings:
1. a tag system construction module; 2. a data preprocessing module; 3. a news classification module;
4. a content recall module; 21. an original data set acquisition module; 22. an input data construction module;
23. a dictionary construction module; 31. a model building module; 32. a data distillation module;
33. a data tag acquisition module; 321. a BERT model training module;
322. a data annotation module; 323. and a TextCNN model training module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. References to "a," "an," "the," and similar words in this application are not to be construed as limiting in number and may refer to the singular or the plural. In this application, the terms "including," "comprising," "having," and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus comprising a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. References to "connected," "coupled," and the like are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. The term "plurality" means two or more. "And/or" describes an association between associated objects and covers three cases; for example, "A and/or B" may mean: A alone, A and B together, or B alone. The character "/" generally indicates an "or" relationship between the preceding and following objects. The terms "first," "second," "third," and the like merely distinguish similar objects and do not denote a particular ordering.
The content recall method and system based on long text labeling of the present application can be applied to various recommendation systems, such as a news recommendation system or any other recommendation system that needs to label and classify content. To achieve the technical effects of predicting classification labels for data with TextCNN, recalling news based on the obtained labels, and improving the diversity of recall, this embodiment provides a content recall method based on long text labeling. Fig. 1 is a flowchart of a content recall method based on long text labeling according to an embodiment of the present application; referring to fig. 1, the flow includes the following steps:
a label system construction step S1 for constructing a label system. For example and without limitation, by crawling the public news data set of Toutiao ('Today's Headlines'), the obtained original data set is divided into 15 categories in total: people's livelihood, culture, entertainment, sports, finance, real estate, automobile, education, science and technology, military, tourism, international, securities, agriculture and esports. If, owing to the platform's unbalanced data, the number of securities news items is clearly smaller than that of the other categories, the securities category is merged into the finance category so that the data distribution becomes even, which effectively alleviates the classification-model bias caused by unbalanced sample distribution; unknown labels can be preliminarily divided by text clustering;
a data preprocessing step S2, configured to acquire an original data set and construct input data and a dictionary based on the original data set and a label system;
a news classification step S3, configured to construct a TextCNN model, train the TextCNN model as a target TextCNN model by using a test data set and a label data set, perform classification prediction on the test data set by using the target TextCNN model, and output a classification label;
a content recall step S4 for representing the user data, the news content and the classification labels as a graph, and screening and recalling news nodes according to their relevance to the user node. Specifically, according to user behaviors, the user data, the news content and the classification labels are represented in graph form as user nodes, news nodes and label nodes, and weights are set between the nodes; the graph consists of vertices, edges and edge weights. The relevance of each news node to the user node is calculated on the graph with the PersonalRank algorithm, and news nodes are screened and recalled in descending order of relevance within a preset range. Specifically, the top-N news items are recalled in descending order of relevance, where N is a natural number whose value may be 300-500 or another range set as required. As an example of this content recall step: if user u clicks a news item i whose label is b, this is recorded as (u, i, b); an edge is then added between the vertex v(u) corresponding to user u and the vertex v(i) corresponding to news i (note that if two vertices are already connected, the weight of the edge is increased by 1); similarly, an edge is added between v(u) and v(b), and between v(i) and v(b). A minimal sketch of this graph construction and PersonalRank scoring is given below.
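The following is a minimal PersonalRank sketch under stated assumptions: the tripartite graph is stored as an adjacency dict, and the damping factor alpha, the iteration count and the node names are illustrative choices rather than values fixed by this application.

# Hedged sketch: PersonalRank over the user / news / label graph.
def personal_rank(graph, root, alpha=0.85, iterations=100):
    """graph: {node: {neighbor: edge_weight}}; relevance of every node w.r.t. root."""
    rank = {node: 0.0 for node in graph}
    rank[root] = 1.0
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in graph}
        for node, neighbors in graph.items():
            total = sum(neighbors.values())
            for neighbor, weight in neighbors.items():
                # Continue the walk along an edge with probability proportional to its weight.
                new_rank[neighbor] += alpha * rank[node] * weight / total
        new_rank[root] += 1 - alpha  # restart at the root user node
        rank = new_rank
    return rank

# The (u, i, b) example above: user u clicked news i, whose label is b.
graph = {
    "user_u":  {"news_i": 1, "label_b": 1},
    "news_i":  {"user_u": 1, "label_b": 1},
    "label_b": {"user_u": 1, "news_i": 1},
}
scores = personal_rank(graph, "user_u")
top_n = sorted((n for n in scores if n.startswith("news_")),
               key=scores.get, reverse=True)[:300]  # e.g. N = 300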
Based on the above steps, the present application uses the TextCNN model to effectively avoid the vanishing-gradient problem that arises when an LSTM model is used for long text classification in the prior art, and increases the diversity of recall by recalling news content with the graph recommendation algorithm PersonalRank.
In some of these embodiments, the data preprocessing step S2 further includes:
an original data set obtaining step S21, configured to obtain an original data set, the original data set further including: user data, news content, and news headlines;
an input data construction step S22 for extracting keywords from the news content and combining the news title, the keywords and the news content to obtain the input data;
a dictionary construction step S23 for segmenting the input data with a single word (character) as the unit to obtain a number of tokens, counting the frequency of each token, screening and sorting the tokens in descending order based on a frequency threshold min_freq, and building the dictionary based on a set dictionary size max_size; specifically, the tokens whose frequency exceeds the preset threshold min_freq are sorted in descending order of frequency, and the first max_size tokens are taken to build the dictionary. By way of example and not limitation, the above steps yield a dictionary D; original text content such as the characters ['球', '员', '莱', '布', ...] is mapped to indices based on D, e.g. [4, 5, 6, 7], where 4 represents the position of '球' in D and 5 the position of '员'. The hyperparameters min_freq and max_size are set empirically according to the data volume and data distribution and can be adjusted to actual requirements. Sketches of steps S22 and S23 are given below.
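The following sketches illustrate steps S22 and S23 under stated assumptions: jieba's TF-IDF keyword extractor stands in for the unspecified keyword-extraction method, and the default min_freq and max_size values are placeholders to be tuned as described above.

# Hedged sketch of input construction (S22); jieba and top_k are assumptions,
# since the application only specifies "title + keywords + content".
import jieba.analyse
from collections import Counter

def build_input(title, content, top_k=10):
    keywords = jieba.analyse.extract_tags(content, topK=top_k)
    return title + "".join(keywords) + content

# Hedged sketch of dictionary construction (S23): character-level split,
# descending frequency order, min_freq filtering, truncation to max_size.
def build_dictionary(texts, min_freq=2, max_size=10000):
    counts = Counter(ch for text in texts for ch in text)
    vocab = [ch for ch, freq in counts.most_common(max_size) if freq > min_freq]
    return {ch: idx for idx, ch in enumerate(vocab)}

def encode(text, dictionary, unk=-1):
    # Map each character to its position in dictionary D, e.g. '球' -> 4.
    return [dictionary.get(ch, unk) for ch in text]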
CNNs, although effective on text classification tasks, have the drawback of being unintuitive and poorly interpretable, whereas the attention mechanism, a long-term memory modeling mechanism commonly used in natural language processing, can intuitively show each word's contribution to the result. Considering this, in some embodiments, the news classification step S3 further includes:
a model construction step S31 for constructing the TextCNN model with an attention mechanism; this step introduces the attention mechanism into the model to make it easier to introduce business-related features (a minimal sketch of the architecture is given after step S33 below). Fig. 2 is a schematic diagram of a TextCNN model according to an embodiment of the present application; referring to fig. 2, the TextCNN model further includes: a word vector layer for converting input data into word vectors and outputting them; an attention mechanism layer for creating a context vector for each word, effectively capturing correlations between long-range context and non-consecutive words; a convolutional layer, into which the word vectors and context vectors are input as word representations for the convolution operation, followed by activation with an activation function, where each convolution kernel produces a corresponding feature map and the layer comprises at least six convolution kernels of sizes 2 × 5, 3 × 5 and 5 × 5, two of each size; a pooling layer for pooling the feature maps output by the convolutional layer (optionally, the pooling layer extracts the maximum value of each feature map using max pooling and then concatenates the maxima to obtain the feature representation); and an output layer for concatenating features and classifying based on a Concat vector layer and a softmax layer, outputting the classification label.
A data distillation step S32, which is used for training the BERT model based on the test data set, labeling the unlabeled data set by using the BERT model obtained by training to obtain a labeled data set, and then training the TextCNN model by using the test data set and the labeled data set to obtain a target TextCNN model; the contents of the test data set and the unlabeled data set input into the model are constructed based on step S2.
And a data label obtaining step S33, which is used for carrying out classification prediction on the test data set by using the target TextCNN model to obtain a corresponding classification label.
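As referenced in step S31, the following is a minimal PyTorch sketch of the FIG. 2 architecture. It assumes an embedding width of 5 so that the 2 × 5, 3 × 5 and 5 × 5 kernels span the full word dimension, uses two kernels per size (six in total), and sums each word vector with its context vector as the word representation; the attention formulation and all hyperparameters are illustrative assumptions rather than details fixed by this application.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveTextCNN(nn.Module):
    def __init__(self, vocab_size, num_classes, embed_dim=5,
                 kernel_heights=(2, 3, 5), kernels_per_size=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # word vector layer
        # Six kernels in total: two each of size 2x5, 3x5 and 5x5.
        self.convs = nn.ModuleList(
            nn.Conv2d(1, kernels_per_size, (k, embed_dim)) for k in kernel_heights
        )
        self.fc = nn.Linear(kernels_per_size * len(kernel_heights), num_classes)

    def context_vectors(self, x):
        # Attention layer: one context vector per word via scaled dot-product
        # self-attention over the whole sequence.
        scores = x @ x.transpose(1, 2) / x.size(-1) ** 0.5
        return torch.softmax(scores, dim=-1) @ x

    def forward(self, token_ids):
        x = self.embedding(token_ids)    # (batch, seq, embed)
        x = x + self.context_vectors(x)  # word vector + context vector
        x = x.unsqueeze(1)               # (batch, 1, seq, embed)
        maps = [F.relu(conv(x)).squeeze(3) for conv in self.convs]  # feature maps
        # Max pooling over each feature map, then concatenation (Concat layer).
        pooled = [F.max_pool1d(m, m.size(2)).squeeze(2) for m in maps]
        return self.fc(torch.cat(pooled, dim=1))  # softmax is applied in the loss

model = AttentiveTextCNN(vocab_size=10000, num_classes=15)
logits = model(torch.randint(0, 10000, (8, 128)))  # 8 sequences of 128 tokens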
In some of these embodiments, the data distillation step S32 further comprises:
a BERT model training step S321 of training a BERT model based on the test data set by using a BERT pre-training model;
a data labeling step S322 for predicting the unlabeled data set with the BERT model and outputting the predictions whose confidence exceeds 0.9 as a supplementary corpus, obtaining the labeled data set;
and a TextCNN model training step S323, configured to train the TextCNN model based on the test data set and the labeling data set, to obtain a target TextCNN model.
In order to address the large scale of unlabeled data in news recommendation scenarios, the above steps use a BERT model to pseudo-label the unlabeled data and a TextCNN model to learn from it, which effectively improves the validity and accuracy of the classification results. A hedged sketch of this pseudo-labeling loop is given below.
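In the sketch below, bert_model and texts_to_tensor stand in for a fine-tuned teacher and its tokenization pipeline (both assumptions); the 0.9 threshold follows the text above.

import torch

def pseudo_label(bert_model, unlabeled_texts, texts_to_tensor, threshold=0.9):
    # The fine-tuned BERT teacher labels the unlabeled pool; only confident
    # predictions are kept as the supplementary corpus.
    labeled = []
    bert_model.eval()
    with torch.no_grad():
        for text in unlabeled_texts:
            logits = bert_model(texts_to_tensor(text))  # (1, num_classes)
            probs = torch.softmax(logits, dim=-1)
            confidence, label = probs.max(dim=-1)
            if confidence.item() > threshold:  # the 0.9 cut-off above
                labeled.append((text, label.item()))
    # The returned pairs are merged with the annotated set to train the TextCNN student.
    return labeled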
Intuitively, a pair of words far apart from each other is usually less related, so the embodiment of the present application adds distance attenuation to the attention mechanism layer; optionally, a Gaussian exponential decay is introduced. In some of these embodiments, when calculating the attention weights of the attention mechanism, the TextCNN model attenuates them with a Gaussian kernel centered on the current word. This improved attention mechanism effectively reduces the time complexity of the operation. A sketch of one possible formulation is given below.
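One possible formulation of this decay, assuming the Gaussian kernel is applied as a log-space penalty on the attention scores (which multiplies the post-softmax weights by a Gaussian in the distance between word positions); the bandwidth sigma is an illustrative assumption.

import torch

def gaussian_decayed_attention(x, sigma=5.0):
    # x: (batch, seq, embed). Self-attention with Gaussian distance decay.
    seq_len = x.size(1)
    pos = torch.arange(seq_len, dtype=torch.float, device=x.device)
    # Subtracting |i - j|^2 / (2 * sigma^2) from the scores multiplies the
    # post-softmax weights by a Gaussian kernel centered on word i.
    penalty = (pos[:, None] - pos[None, :]) ** 2 / (2 * sigma ** 2)
    scores = x @ x.transpose(1, 2) / x.size(-1) ** 0.5 - penalty
    return torch.softmax(scores, dim=-1) @ x

context = gaussian_decayed_attention(torch.randn(2, 16, 5))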
It should be noted that the steps illustrated in the above flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as one executing a set of computer-executable instructions, and that, although a logical order is shown in the flow diagrams, in some cases the steps may be performed in an order different from the one illustrated here.
This embodiment also provides a content recall system based on long text labeling. The system is used to implement the above embodiments and preferred embodiments, and what has already been described is not repeated. As used hereinafter, the terms "module," "unit," "subunit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. Although the means described in the following embodiments are preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a block diagram of a content recall system based on long text labeling according to an embodiment of the present application, and as shown in fig. 3, the apparatus includes:
the label system building module 1 is used for building a label system. For example and without limitation, by crawling the public news data set of Toutiao ('Today's Headlines'), the original data set is divided into 15 categories in total: people's livelihood, culture, entertainment, sports, finance, real estate, automobile, education, science and technology, military, tourism, international, securities, agriculture and esports. If, owing to the platform's unbalanced data, the number of securities news items is clearly smaller than that of the other categories, the securities category is merged into the finance category so that the data distribution becomes even, which effectively alleviates the classification-model bias caused by unbalanced sample distribution; unknown labels can be preliminarily divided by text clustering;
the data preprocessing module 2 is used for acquiring an original data set and constructing input data and a dictionary based on the original data set and a label system;
the news classification module 3 is used for constructing a TextCNN model, training the TextCNN model into a target TextCNN model by utilizing a test data set and a label data set, performing classification prediction on the test data set by utilizing the target TextCNN model, and outputting a classification label;
and the content recall module 4 is used for representing the user data, the news content and the classification labels as a graph, and screening and recalling news nodes according to their relevance to the user node. Specifically, according to user behaviors, the user data, the news content and the classification labels are represented in graph form as user nodes, news nodes and label nodes, and weights are set between the nodes; the graph consists of vertices, edges and edge weights. The relevance of each news node to the user node is calculated on the graph with the PersonalRank algorithm, and news nodes are screened and recalled in descending order of relevance within a preset range. Specifically, the top-N news items are recalled in descending order of relevance, where N is a natural number whose value may be 300-500 or another range set as required. For the above content recall module 4, for example and without limitation, if user u clicks a news item i whose label is b, this is recorded as (u, i, b); an edge is added between the vertex v(u) corresponding to user u and the vertex v(i) corresponding to news i (note that if two vertices are already connected, the weight of the edge is increased by 1); similarly, an edge is added between v(u) and v(b), and between v(i) and v(b).
Based on the above modules, the present application uses the TextCNN model to effectively avoid the vanishing-gradient problem that arises when an LSTM model is used for long text classification in the prior art, and increases the diversity of recall by recalling news content with the graph recommendation algorithm PersonalRank.
The data preprocessing module 2 may include: an original data set acquisition module 21 for acquiring the original data set, which further includes user data, news content and news titles; an input data construction module 22 for extracting keywords from the news content and combining the news title, the keywords and the news content to obtain the input data; and a dictionary building module 23 for segmenting the input data with a single word (character) as the unit to obtain a number of tokens, counting the frequency of each token, screening and sorting the tokens in descending order based on a frequency threshold min_freq, and building the dictionary based on a set dictionary size max_size; specifically, the tokens whose frequency exceeds the preset threshold min_freq are sorted in descending order of frequency, and the first max_size tokens are taken to build the dictionary. By way of example and not limitation, the above modules yield a dictionary D; original text content such as the characters ['球', '员', '莱', '布', ...] is mapped to indices based on D, e.g. [4, 5, 6, 7], where 4 represents the position of '球' in D and 5 the position of '员'. The hyperparameters min_freq and max_size are set empirically according to the data volume and data distribution and can be adjusted to actual requirements.
CNNs, although effective on text classification tasks, have the drawback of being unintuitive and poorly interpretable, whereas the attention mechanism, a long-term memory modeling mechanism commonly used in natural language processing, can intuitively show each word's contribution to the result. Considering this, in some embodiments, the news classification module 3 further includes: a model building module 31 for building the TextCNN model with an attention mechanism, which is introduced to make it easier to introduce business-related features; a data distillation module 32 for training the BERT model based on the test data set, labeling the unlabeled data set with the trained BERT model to obtain a labeled data set, and then training the TextCNN model with the test data set and the labeled data set to obtain the target TextCNN model; and a data label acquisition module 33 for performing classification prediction on the test data set with the target TextCNN model to obtain the corresponding classification labels. Fig. 2 is a schematic diagram of a TextCNN model according to an embodiment of the present application; referring to fig. 2, the TextCNN model further includes: a word vector layer for converting input data into word vectors and outputting them; an attention mechanism layer for creating a context vector for each word, effectively capturing correlations between long-range context and non-consecutive words; a convolutional layer, into which the word vectors and context vectors are input as word representations for the convolution operation, followed by activation with an activation function, where each convolution kernel produces a corresponding feature map and the layer comprises at least six convolution kernels of sizes 2 × 5, 3 × 5 and 5 × 5, two of each size; a pooling layer for pooling the feature maps output by the convolutional layer (optionally, the pooling layer extracts the maximum value of each feature map using max pooling and then concatenates the maxima to obtain the feature representation); and an output layer for concatenating features and classifying based on a Concat vector layer and a softmax layer, outputting the classification label. In addition, since intuitively a pair of words far apart from each other is usually less related, the embodiment of the present application adds distance attenuation to the attention mechanism layer; optionally, a Gaussian exponential decay is introduced: when calculating the attention weights, the TextCNN model attenuates them with a Gaussian kernel centered on the current word. This improved attention mechanism effectively reduces the time complexity of the operation.
The data distillation module 32 further includes: a BERT model training module 321 that trains the BERT model based on the test data set using a BERT pre-trained model; a data labeling module 322 for predicting the unlabeled data set with the BERT model and outputting the predictions whose confidence exceeds 0.9 as a supplementary corpus, obtaining the labeled data set; and a TextCNN model training module 323 for training the TextCNN model based on the test data set and the labeled data set to obtain the target TextCNN model.
In order to solve the problem that the scale of unmarked data of a news recommendation scene is large, based on the modules, a BERT model is used for carrying out pseudo marking on the unmarked data, and then a TextCNN model is used for learning, so that the effectiveness and the accuracy of a classification result are effectively improved.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
In addition, the content recall method based on long text labeling in the embodiment of the present application described in conjunction with fig. 1 can be implemented by a computer device. The computer device may include a processor and a memory storing computer program instructions.
In particular, the processor may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory may include, among other things, mass storage for data or instructions. By way of example and not limitation, the memory may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is non-volatile memory. In particular embodiments, the memory includes read-only memory (ROM) and random access memory (RAM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be static random-access memory (SRAM) or dynamic random-access memory (DRAM), where the DRAM may be fast page mode DRAM (FPMDRAM), extended data output DRAM (EDODRAM), synchronous DRAM (SDRAM), and the like.
The memory may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by the processor.
The processor reads and executes the computer program instructions stored in the memory to implement any one of the above-described embodiments of the method for recalling content based on long text tagging.
In addition, in combination with the content recall method based on long text labeling in the foregoing embodiments, embodiments of the present application may provide a computer-readable storage medium to implement. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the above embodiments of a long text tagging-based content recall method.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A content recall method based on long text labeling is characterized by comprising the following steps:
a label system construction step, which is used for constructing a label system;
a data preprocessing step, which is used for acquiring an original data set and constructing input data and a dictionary based on the original data set and the label system;
a news classification step, which is used for constructing a TextCNN model, training the TextCNN model into a target TextCNN model by utilizing a test data set and a label data set, performing classification prediction on the test data set by utilizing the target TextCNN model, and outputting a classification label;
and a content recall step, namely representing the user data, the news content and the classification labels as graphs, and screening and recalling the news nodes according to the relevance of the news nodes relative to the user nodes.
2. The method for recalling content based on long text labeling according to claim 1, wherein the news classification step further comprises:
a model construction step, which is used for constructing the TextCNN model with an attention mechanism;
a data distillation step, which is used for training a BERT model based on the test data set, labeling an unlabeled data set by using the BERT model obtained by training to obtain a labeled data set, and then training a TextCNN model by using the test data set and the labeled data set to obtain a target TextCNN model;
and a data label obtaining step, which is used for carrying out classification prediction on the test data set by using the target TextCNN model to obtain a corresponding classification label.
3. The method for recalling content based on long text labeling according to claim 1 or 2, wherein the data preprocessing step further comprises:
an original data set obtaining step of obtaining the original data set, the original data set further comprising: user data, news content, and news headlines;
an input data construction step, which is used for extracting the key words of the news content and combining the news title, the key words and the news content to obtain input data;
and a dictionary construction step, namely segmenting the input data by taking a word as a unit to obtain a plurality of word elements, counting the frequency of each word element, screening the word elements based on a frequency threshold, sequencing the word elements in a descending order, and constructing a dictionary based on a set dictionary size.
4. The method for recalling content based on long text labeling according to claim 2, wherein the data distilling step further comprises:
a BERT model training step of training the BERT model based on the test data set by using a BERT pre-training model;
a data labeling step, which is used for predicting the unmarked data set by using the BERT model to obtain a labeled data set;
and a textCNN model training step, which is used for training the textCNN model based on the test data set and the labeling data set to obtain a target textCNN model.
5. The method for recalling content based on long text labeling according to claim 1, wherein, when calculating the attention weights of the attention mechanism, the TextCNN model attenuates them using a Gaussian kernel centered on the current word.
6. A content recall system based on long text tagging, comprising:
the label system building module is used for building a label system;
the data preprocessing module is used for acquiring an original data set and constructing input data and a dictionary based on the original data set and the label system;
the news classification module is used for constructing a TextCNN model, training the TextCNN model into a target TextCNN model by utilizing a test data set and a labeling data set, performing classification prediction on the test data set by utilizing the target TextCNN model and outputting a classification label;
and the content recall module is used for representing the user data, the news content and the classification labels as graphs, and screening and recalling the news nodes according to the relevance of the news nodes relative to the user nodes.
7. The long-text tagging-based content recall system of claim 6 wherein the news classification module further comprises:
a model building module for building the TextCNN model with attention mechanism;
the data distillation module is used for training a BERT model based on the test data set, labeling an unlabeled data set by using the trained BERT model to obtain a labeled data set, and then training a TextCNN model by using the test data set and the labeled data set to obtain a target TextCNN model;
and the data label acquisition module is used for carrying out classification prediction on the test data set by using the target TextCNN model to obtain a corresponding classification label.
8. The system according to claim 6 or 7, wherein the data preprocessing module further comprises:
an original data set obtaining module, configured to obtain the original data set, where the original data set further includes: user data, news content, and news headlines;
the input data construction module is used for extracting the keywords of the news content and combining the news title, the keywords and the news content to obtain input data;
and the dictionary construction module is used for dividing the input data by taking a word as a unit to obtain a plurality of word elements, counting the frequency of each word element, screening the word elements based on a frequency threshold, sequencing the word elements in a descending order and constructing a dictionary based on a set dictionary size.
9. The long-text tagging-based content recall system of claim 7 wherein the data distillation module further comprises:
a BERT model training module to train the BERT model based on the test data set using a BERT pre-training model;
the data labeling module is used for predicting the unmarked data set by using the BERT model to obtain a labeled data set;
and the TextCNN model training module is used for training the TextCNN model based on the test data set and the labeling data set to obtain a target TextCNN model.
10. The long-text tagging-based content recall system of claim 6, wherein, when calculating the attention weights of the attention mechanism, the TextCNN model attenuates them using a Gaussian kernel centered on the current word.
CN202110104006.3A, priority and filing date 2021-01-26, Content recall method and system based on long text labeling, CN112800223A (pending)

Priority Applications (1)

Application Number: CN202110104006.3A; Priority/Filing Date: 2021-01-26; Title: Content recall method and system based on long text labeling

Publications (1)

Publication Number: CN112800223A; Publication Date: 2021-05-14

Family

ID=75811830

Family Applications (1)

Application Number: CN202110104006.3A (CN112800223A); Status: Pending; Priority/Filing Date: 2021-01-26; Title: Content recall method and system based on long text labeling

Country Status (1)

Country Link
CN (1) CN112800223A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209815A (en) * 2019-05-23 2019-09-06 国家计算机网络与信息安全管理中心 A kind of news Users' Interests Mining method of convolutional neural networks
CN110569353A (en) * 2019-07-03 2019-12-13 重庆大学 Attention mechanism-based Bi-LSTM label recommendation method
CN111008278A (en) * 2019-11-22 2020-04-14 厦门美柚股份有限公司 Content recommendation method and device
CN111309912A (en) * 2020-02-24 2020-06-19 深圳市华云中盛科技股份有限公司 Text classification method and device, computer equipment and storage medium
CN111444428A (en) * 2020-03-27 2020-07-24 腾讯科技(深圳)有限公司 Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
CN111460150A (en) * 2020-03-27 2020-07-28 北京松果电子有限公司 Training method, classification method and device of classification model and storage medium
CN111737476A (en) * 2020-08-05 2020-10-02 腾讯科技(深圳)有限公司 Text processing method and device, computer readable storage medium and electronic equipment
CN112183099A (en) * 2020-10-09 2021-01-05 上海明略人工智能(集团)有限公司 Named entity identification method and system based on semi-supervised small sample extension

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Liao Shenglan; Ji Jianmin; Yu Chang; Chen Xiaoping: "Intent Classification Method Based on the BERT Model and Knowledge Distillation", Computer Engineering *
Tang Renjie; Jiang Tao; Yang Qiaojie: "Intelligent Cloud Guidance Technology Based on Natural Language Learning", Telecommunications Science
Luo Yin: "Sentiment Analysis of Stock Comments Based on Neural Networks and Adaptive Fractal Analysis", CNKI Outstanding Master's Theses Full-text Database (Information Science and Technology), page 4 *
Ma Chenfeng: "Application of Hybrid Deep Learning Models in News Text Classification", CNKI Outstanding Master's Theses Full-text Database (Information Science and Technology)

Similar Documents

Publication Publication Date Title
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN106649818B (en) Application search intention identification method and device, application search method and server
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN107085581A (en) Short text classification method and device
CN109918560A (en) A kind of answering method and device based on search engine
CN107844533A (en) A kind of intelligent Answer System and analysis method
CN111539197A (en) Text matching method and device, computer system and readable storage medium
CN112232058A (en) False news identification method and system based on deep learning three-layer semantic extraction framework
JP6848091B2 (en) Information processing equipment, information processing methods, and programs
CN111783754B (en) Human body attribute image classification method, system and device based on part context
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
Chatfield et al. Efficient on-the-fly category retrieval using convnets and GPUs
CN115659008A (en) Information pushing system and method for big data information feedback, electronic device and medium
CN109271624A (en) A kind of target word determines method, apparatus and storage medium
CN109344246B (en) Electronic questionnaire generating method, computer readable storage medium and terminal device
CN112307048B (en) Semantic matching model training method, matching method, device, equipment and storage medium
CN113656563A (en) Neural network searching method and related equipment
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN113761291A (en) Processing method and device for label classification
CN116263785A (en) Training method, classification method and device of cross-domain text classification model
CN112800223A (en) Content recall method and system based on long text labeling
CN115618950A (en) Data processing method and related device
Demidova et al. Semantic image-based profiling of users’ interests with neural networks
Monteiro et al. Fish recognition model for fraud prevention using convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination