CN113590963A - Balanced text recommendation method - Google Patents

Balanced text recommendation method

Info

Publication number
CN113590963A
Authority
CN
China
Prior art keywords
texts
text
category
contrast
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110891346.5A
Other languages
Chinese (zh)
Inventor
罗列异
任益斌
程韶曦
王强
吴昭琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huang Jiqi
Zhejiang Xinlan Network Media Co ltd
Original Assignee
Huang Jiqi
Zhejiang Xinlan Network Media Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huang Jiqi and Zhejiang Xinlan Network Media Co., Ltd.
Priority to CN202110891346.5A
Publication of CN113590963A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention discloses a balanced text recommendation method, which comprises the following steps: acquiring a number of balanced training texts; pre-training a built Bert-embedding classification model through the training texts; classifying platform text data containing a number of first texts through the trained Bert-embedding classification model; acquiring the first text currently browsed by a user of the platform as a comparison text; extracting the other first texts in the same category as the comparison text to serve as a comparison text set; calculating a first similarity between each first text in the comparison text set and the comparison text; sorting the first texts in the comparison text set according to the calculated first similarity; and recommending the top-ranked first texts to the user as recommended texts. With this balanced text recommendation method, the classification model classifies more accurately, and, because similarity is computed at the sentence-vector level, related-content recommendation is more precise.

Description

Balanced text recommendation method
Technical Field
The invention relates to a balanced text recommendation method.
Background
Existing related-text recommendation generally classifies content and then recommends classified news to users. This approach suffers from imbalanced category sizes on the platform: when recommending related content to users, a category with too much news yields recommendations that are too broad, while a category with too little news yields too few recommendations to retain users. In addition, the classification itself is often not accurate enough.
Disclosure of Invention
The invention provides a balanced text recommendation method, which adopts the following technical scheme:
a balanced text recommendation method comprises the following specific steps:
acquiring a plurality of balanced training texts;
pre-training the built Bert-embedding classification model through a training text;
classifying the platform text data containing a plurality of first texts through a trained Bert-embedding classification model;
acquiring a first text currently browsed by a user of a platform as a contrast text;
extracting other first texts in the same category as the comparison texts to serve as comparison text sets;
calculating first similarity between all first texts in the contrast text set and the contrast texts;
sequencing all first texts under the comparison text set according to the calculated first similarity;
and recommending the sorted first texts to the user as recommended texts.
Further, the specific method for extracting the other first texts in the same category as the comparison text set is as follows:
extracting the other first texts in the same category as the comparison text;
counting the other first texts in the same category as the comparison text;
judging whether the count reaches a threshold;
when the count does not reach the threshold, extracting a number of first texts under the other category most relevant to the category of the comparison text;
and forming the comparison text set from all the extracted first texts.
Further, when the count does not reach the threshold, the specific method for extracting first texts under the other category most relevant to the category of the comparison text is as follows:
calculating a second similarity between each other category and the category of the comparison text;
sorting the other categories by the second similarity;
and taking the highest-ranked other category as the most relevant.
Further, after the platform text data containing a number of first texts is classified through the trained Bert-embedding classification model,
the categories of the first texts comprise a number of first-level categories and a number of second-level categories subordinate to the first-level categories.
Further, the specific method for extracting the other first texts in the same category as the comparison text set is as follows:
extracting the other first texts in the same second-level category as the comparison text;
counting the other first texts in the same second-level category as the comparison text;
judging whether the count reaches a threshold;
when the count does not reach the threshold, calculating second similarities between all other second-level categories under the first-level category of the comparison text and the second-level category of the comparison text;
sorting those other second-level categories by the second similarity;
extracting first texts from the other second-level categories in that order and adding them to the first texts in the same second-level category as the comparison text, until the total number of first texts reaches the threshold;
and combining all the extracted first texts into the comparison text set.
Further, if the number of other first texts under the first-level category of the comparison text is smaller than the threshold, the shortfall is set as a second number;
a third similarity between each other first-level category and the first-level category of the comparison text is calculated;
the other first-level categories are sorted by the third similarity;
the second number of first texts are randomly selected from the highest-ranked other first-level category;
and the first texts randomly selected from the highest-ranked other first-level category are combined with the other first texts under the first-level category of the comparison text to form the comparison text set.
Further, the specific method for calculating the first similarity between each first text in the comparison text set and the comparison text is as follows:
encoding all first texts in the comparison text set and the comparison text with Bert-server to obtain their corresponding sentence vectors;
and calculating, through cosine similarity, the first similarity between the sentence vector of each first text in the comparison text set and the sentence vector of the comparison text.
Further, the specific method for acquiring a number of balanced training texts is as follows:
acquiring a number of training documents from websites with mature classification;
cleaning the training documents;
and screening the cleaned training documents so that the data volume of the training documents under each category is the same.
Further, the specific method for cleaning the training documents is as follows:
cleaning the training documents through regular-expression rules.
Further, the specific method for cleaning the training documents may also be as follows:
cleaning the training documents through the regular-expression rules and through manual cleaning rules set by hand for the websites from which the training documents are acquired.
The text recommendation method has the beneficial effects that the classification model classifies more accurately, and, because similarity is computed at the sentence-vector level, related-content recommendation is more precise.
A further beneficial effect is that the balanced text recommendation method solves the problem of an insufficient number of recommendations when a category contains little sample data.
Drawings
FIG. 1 is a schematic diagram of a balanced text recommendation method of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the embodiments.
Fig. 1 shows a balanced text recommendation method of the present invention, which mainly comprises the following specific steps. Step S1: acquire a number of balanced training texts. Step S2: pre-train the built Bert-embedding classification model through the training texts. Step S3: classify the platform text data containing a number of first texts through the trained Bert-embedding classification model. Step S4: acquire the first text currently browsed by a user of the platform as the comparison text. Step S5: extract the other first texts in the same category as the comparison text to form the comparison text set. Step S6: calculate the first similarity between each first text in the comparison text set and the comparison text. Step S7: sort the first texts in the comparison text set according to the calculated first similarity. Step S8: recommend the top-ranked first texts to the user as recommended texts. With this balanced text recommendation method, the classification model classifies more accurately, and, because similarity is computed at the sentence-vector level, related-content recommendation is more precise. The steps are described in detail below.
For step S1: acquire a number of balanced training texts.
The specific method for acquiring the balanced training texts is as follows: acquire a number of training documents from websites with mature classification; clean the training documents; and screen the cleaned training documents so that each category contains the same amount of training-document data.
Specifically, news websites whose category systems are relatively mature are selected: the number of categories is moderate and the category meanings are clearly differentiated. Categories such as politics and society overlap heavily and differ little semantically, which interferes with the classification model. In this application, a website with mature classification means one whose news articles are accurately classified with a small overlap rate, so that the classification model trained in the subsequent step is more accurate.
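As an illustration of the screening step, the following sketch downsamples every category to the size of the smallest one so that each category contributes the same amount of training data; the docs_by_category structure, the helper name, and the fixed seed are assumptions made for the example, not part of the patent.

```python
import random

def balance_training_texts(docs_by_category: dict[str, list[str]],
                           seed: int = 42) -> dict[str, list[str]]:
    """Downsample every category to the size of the smallest category,
    so each category contributes the same amount of training data."""
    rng = random.Random(seed)
    min_count = min(len(docs) for docs in docs_by_category.values())
    return {category: rng.sample(docs, min_count)
            for category, docs in docs_by_category.items()}
```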
The specific method for cleaning the training documents is as follows: clean the training documents through regular-expression rules.
It can be understood that, since the content on a news website generally originates from different news sources, the HTML tags cannot be parsed in a source-specific way; a highly adaptive HTML-parsing step is therefore applied, leaving only the text content. Preferably, manual cleaning rules are additionally set by hand for each website, and the training documents are then cleaned with both the regular-expression rules and the manual cleaning rules, which yields cleaner text data.
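A minimal sketch of such a cleaning step is shown below, assuming generic regular-expression rules for stripping HTML plus an optional per-site list of extra patterns; the specific patterns and names are illustrative assumptions, not the patent's actual rules.

```python
import re

# Generic regular-expression rules: strip scripts/styles, then all tags,
# then collapse whitespace. These patterns are illustrative only.
GENERIC_RULES = [
    (re.compile(r"(?is)<(script|style).*?</\1>"), " "),
    (re.compile(r"(?s)<[^>]+>"), " "),
    (re.compile(r"\s+"), " "),
]

def clean_document(html: str, site_rules=None) -> str:
    """Apply generic regex rules, then any manually written per-site rules
    (a list of (compiled_pattern, replacement) pairs)."""
    text = html
    for pattern, repl in GENERIC_RULES:
        text = pattern.sub(repl, text)
    for pattern, repl in (site_rules or []):  # manual rules for one website
        text = pattern.sub(repl, text)
    return text.strip()
```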
For step S2: pre-train the built Bert-embedding classification model through the training texts.
Specifically, the Bert-embedding vector representation works at the character level, so the steps of word segmentation, stop-word removal, low-frequency-word removal and the like are unnecessary; this saves a great deal of time and avoids the negative effects that word segmentation may introduce. Bert-embedding specifically uses the ERNIE pre-trained model. ERNIE is an improved version of the BERT model with the same basic architecture, additionally optimized for Chinese corpora. Model training is based on a DNN: since the embedding already covers the semantic information of the news corpus, only a simple neural-network structure is needed to map to the final set of classes. The result improves substantially on traditional machine-learning methods while remaining much smaller in parameter scale than the commonly used Bi-LSTM models.
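As an illustration of this design, the sketch below trains a small feed-forward head over pre-computed sentence embeddings (e.g. from an ERNIE/BERT encoder run separately); the embedding dimension of 768, the hidden size, the class count, and the random stand-in data are assumptions for the example, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    """Simple DNN head over pre-computed sentence embeddings. The encoder
    is run separately; only this small network is trained for the final
    classification, which keeps the parameter count far below a Bi-LSTM."""
    def __init__(self, embed_dim: int = 768, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.net(embeddings)  # logits over categories

# Usage sketch: one training step with cross-entropy loss.
model = EmbeddingClassifier(num_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
batch = torch.randn(32, 768)          # stand-in for real sentence vectors
labels = torch.randint(0, 10, (32,))  # stand-in for real category labels
optimizer.zero_grad()
loss = loss_fn(model(batch), labels)
loss.backward()
optimizer.step()
```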
For step S3: classify the platform text data containing a number of first texts through the trained Bert-embedding classification model.
Specifically, the classification model is trained on the data obtained from the websites described above, and the platform text data that users browse and consult is then classified by the trained model.
For step S4: acquire the first text currently browsed by a user of the platform as the comparison text.
For step S5: extract the other first texts in the same category as the comparison text to form the comparison text set.
Specifically, the comparison text set is extracted as follows: extract the other first texts in the same category as the comparison text; count them; judge whether the count reaches the threshold; when it does not, extract a number of first texts under the other category most relevant to the category of the comparison text; and form the comparison text set from all the extracted first texts.
It will be appreciated that some categories on the platform hold relatively little data. In that case, if recommendations were computed only from the first texts under that category, too little news would leave too few recommendations to retain the user. Therefore, in this application, when the number of first texts in the same category as the comparison text is small, text data is drawn from the most similar other categories to expand the set. The threshold can be set according to the circumstances; in this application it is set to 200 texts.
As a preferred embodiment, when the count does not reach the threshold, the first texts under the most relevant other category are extracted as follows: calculate a second similarity between each other category and the category of the comparison text; sort the other categories by the second similarity; and take the highest-ranked other category as the most relevant.
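A minimal sketch of this fallback follows. The patent does not specify how the category-to-category second similarity is computed; here it is assumed, purely for illustration, to be the cosine similarity between mean sentence vectors of the categories (category_vectors), and the function and argument names are likewise hypothetical.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_comparison_set(category: str,
                         texts_by_category: dict[str, list[str]],
                         category_vectors: dict[str, np.ndarray],
                         threshold: int = 200) -> list[str]:
    """Collect texts from the comparison text's category; if fewer than
    `threshold`, add texts from the most similar other category."""
    pool = list(texts_by_category.get(category, []))
    if len(pool) < threshold:
        others = [c for c in texts_by_category if c != category]
        # Second similarity (assumed): cosine between mean category vectors.
        others.sort(key=lambda c: cosine(category_vectors[c],
                                         category_vectors[category]),
                    reverse=True)
        if others:
            pool.extend(texts_by_category[others[0]])
    return pool
```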
For step S6: calculate the first similarity between each first text in the comparison text set and the comparison text.
Specifically, the first similarity is calculated as follows:
encode all first texts in the comparison text set and the comparison text with Bert-server, obtaining their corresponding sentence vectors;
then calculate, through cosine similarity, the first similarity between the sentence vector of each first text and the sentence vector of the comparison text.
The invention uses Bert-server for sentence-level encoding. The direct advantage of sentence-level vector encoding is that the vector covers the semantic information of the whole sentence, rather than first obtaining word encodings and then deriving a sentence encoding by averaging or similar means. Computing cosine similarity over the sentence vectors surfaces content that is highly similar both at the character level and in semantics, which solves the problem of overly broad recommendations when a category contains too much content.
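The sketch below illustrates this step. It assumes that "Bert-server" refers to the bert-as-service tool (the bert_serving Python package) and that a bert-serving server is already running; that mapping is an inference from the name used in the patent, not something the patent states.

```python
import numpy as np
from bert_serving.client import BertClient  # requires a running bert-serving server

def rank_by_first_similarity(comparison_text: str,
                             candidate_texts: list[str]) -> list[tuple[str, float]]:
    """Encode the comparison text and all candidates into sentence vectors,
    then rank candidates by cosine similarity to the comparison text."""
    bc = BertClient()
    vectors = bc.encode([comparison_text] + candidate_texts)
    query, candidates = vectors[0], vectors[1:]
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    sims = candidates @ query / norms  # first similarity for each candidate
    order = np.argsort(-sims)          # largest similarity first
    return [(candidate_texts[i], float(sims[i])) for i in order]
```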
For step S7: sort the first texts in the comparison text set according to the calculated first similarity.
It is understood that the first texts are ordered by first similarity from largest to smallest.
For step S8: recommend the top-ranked first texts to the user as recommended texts.
Preferably, the top 10 first texts in the ranking are recommended to the user as the recommended texts.
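Continuing the sketch given for step S6, steps S7 and S8 then reduce to taking the head of the ranked list; comparison_text_set stands for the candidate list built in step S5 and is assumed to be available, and the top-10 cutoff follows the preferred embodiment above.

```python
# comparison_text_set: candidate first texts from step S5 (assumed available).
ranked = rank_by_first_similarity(
    comparison_text="text the user is currently browsing",
    candidate_texts=comparison_text_set,
)
recommended = [text for text, _ in ranked[:10]]  # top 10 as recommended texts
```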
In another preferred embodiment, when the platform text data containing a number of first texts is classified through the trained Bert-embedding classification model, the resulting categories of the first texts comprise a number of first-level categories and a number of second-level categories subordinate to the first-level categories.
In that case, the comparison text set is extracted as follows: extract the other first texts in the same second-level category as the comparison text; count them; judge whether the count reaches the threshold; when it does not, calculate the second similarities between all other second-level categories under the first-level category of the comparison text and the second-level category of the comparison text; sort those second-level categories by the second similarity; extract first texts from them in that order, adding them to the first texts in the same second-level category as the comparison text, until the total number of first texts reaches the threshold; and combine all the extracted first texts into the comparison text set.
It will be appreciated that in some application scenarios the classification model uses a two-level taxonomy: a number of first-level categories, each containing a number of second-level categories. In such a case, when recommending, texts with higher similarity are preferentially sought within the second-level category of the comparison text. If that second-level category holds little data, the set is expanded from the sibling second-level categories under the same first-level category, in descending order of second-level-category similarity: the higher the similarity, the earlier a category is drawn from.
As a preferred embodiment, if the number of other first texts under the first-level category of the comparison text is still smaller than the threshold, the shortfall is set as the second number. A third similarity between each other first-level category and the first-level category of the comparison text is calculated; the other first-level categories are sorted by the third similarity; the second number of first texts are randomly selected from the highest-ranked other first-level category; and these randomly selected first texts are combined with the other first texts under the first-level category of the comparison text to form the comparison text set.
It will be appreciated that when the total data volume of all second-level categories under the same first-level category does not reach the threshold, text data is drawn from other first-level categories to expand the comparison text set, again on the basis of similarity. In this application, the second number of first texts are randomly selected from the highest-ranked other first-level category.
It can also be understood that, alternatively, the similarities between the second-level categories under that other first-level category and the second-level category of the comparison text may be calculated, and first texts selected in that order until the number of first texts in the comparison text set reaches the threshold. A sketch of the whole two-level fallback is given below.
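The sketch assumes texts keyed by (first-level, second-level) category pairs and, as in the earlier sketch, treats the unspecified second and third similarities as cosine similarities between mean category vectors; all names are illustrative assumptions.

```python
import random
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_two_level_comparison_set(l1, l2, texts, vecs, threshold=200):
    """texts maps (first_level, second_level) -> list of first texts;
    vecs maps the same keys to mean sentence vectors of each category.
    Expands from (l1, l2) to sibling second-level categories, then to the
    most similar other first-level category for any remaining shortfall."""
    pool = list(texts.get((l1, l2), []))
    # 1) Sibling second-level categories under the same first-level category,
    #    in descending order of second similarity (assumed: cosine of means).
    siblings = [k for k in texts if k[0] == l1 and k != (l1, l2)]
    siblings.sort(key=lambda k: cosine(vecs[k], vecs[(l1, l2)]), reverse=True)
    for key in siblings:
        if len(pool) >= threshold:
            return pool
        pool.extend(texts[key][: threshold - len(pool)])
    # 2) Remaining shortfall (the "second number"): randomly drawn from the
    #    most similar other first-level category (third similarity, assumed:
    #    cosine of mean vectors aggregated over each first level's children).
    if len(pool) < threshold:
        def l1_vec(c):
            return np.mean([vecs[k] for k in vecs if k[0] == c], axis=0)
        others = {k[0] for k in texts if k[0] != l1}
        best = max(others, key=lambda c: cosine(l1_vec(c), l1_vec(l1)),
                   default=None)
        if best is not None:
            candidates = [t for k, ts in texts.items() if k[0] == best for t in ts]
            need = threshold - len(pool)
            pool.extend(random.sample(candidates, min(need, len(candidates))))
    return pool
```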
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It should be understood by those skilled in the art that the above embodiments do not limit the present invention in any way, and all technical solutions obtained by using equivalent alternatives or equivalent variations fall within the scope of the present invention.

Claims (10)

1. A balanced text recommendation method, characterized by comprising the following specific steps:
acquiring a plurality of balanced training texts;
pre-training a built Bert-embedding classification model through the training texts;
classifying platform text data containing a plurality of first texts through the trained Bert-embedding classification model;
acquiring the first text currently browsed by a user of the platform as a comparison text;
extracting the other first texts in the same category as the comparison text to serve as a comparison text set;
calculating a first similarity between each first text in the comparison text set and the comparison text;
sorting the first texts in the comparison text set according to the calculated first similarity;
and recommending a plurality of the top-ranked first texts to the user as recommended texts.
2. The balanced text recommendation method of claim 1, wherein
the specific method for extracting the other first texts in the same category as the comparison text set is as follows:
extracting the other first texts in the same category as the comparison text;
counting the other first texts in the same category as the comparison text;
judging whether the count reaches a threshold;
when the count does not reach the threshold, extracting a plurality of first texts under the other category most relevant to the category of the comparison text;
and forming the comparison text set from the extracted first texts.
3. The balanced text recommendation method of claim 2, wherein
the specific method for extracting the plurality of first texts under the other category most relevant to the category of the comparison text when the count does not reach the threshold is as follows:
calculating a second similarity between each other category and the category of the comparison text;
sorting the other categories according to the second similarity;
and taking the highest-ranked other category as the most relevant.
4. The balanced text recommendation method of claim 1, wherein
after the platform text data containing a plurality of first texts is classified through the trained Bert-embedding classification model,
the categories of the first texts comprise a plurality of first-level categories and a plurality of second-level categories subordinate to the first-level categories.
5. The balanced text recommendation method of claim 4, wherein
the specific method for extracting the other first texts in the same category as the comparison text set is as follows:
extracting the other first texts in the same second-level category as the comparison text;
counting the other first texts in the same second-level category as the comparison text;
judging whether the count reaches a threshold;
when the count does not reach the threshold, calculating second similarities between all other second-level categories under the first-level category of the comparison text and the second-level category of the comparison text;
sorting those other second-level categories according to the second similarity;
extracting first texts from the other second-level categories in that order and adding them to the first texts in the same second-level category as the comparison text, until the total number of first texts reaches the threshold;
and forming the comparison text set from all the extracted first texts.
6. The balanced text recommendation method of claim 5, wherein
if the number of other first texts under the first-level category of the comparison text is smaller than the threshold, the shortfall is set as a second number;
a third similarity between each other first-level category and the first-level category of the comparison text is calculated;
the other first-level categories are sorted according to the third similarity;
the second number of first texts are randomly selected from the highest-ranked other first-level category;
and the first texts randomly selected from the highest-ranked other first-level category are combined with the other first texts under the first-level category of the comparison text to form the comparison text set.
7. The balanced text recommendation method of claim 1, wherein
the specific method for calculating the first similarity between each first text in the comparison text set and the comparison text is as follows:
encoding all the first texts in the comparison text set and the comparison text with Bert-server to obtain their corresponding sentence vectors;
and calculating, through cosine similarity, the first similarity between the sentence vector of each first text in the comparison text set and the sentence vector of the comparison text.
8. The balanced text recommendation method of claim 1, wherein
the specific method for acquiring a plurality of balanced training texts is as follows:
acquiring a plurality of training documents from websites with mature classification;
cleaning the plurality of training documents;
and screening the cleaned training documents so that the data volume of the training documents under each category is the same.
9. The balanced text recommendation method of claim 8, wherein
the specific method for cleaning the training documents is as follows:
cleaning the training documents through regular-expression rules.
10. The balanced text recommendation method of claim 8, wherein
the specific method for cleaning the training documents is as follows:
cleaning the training documents through the regular-expression rules and through manual cleaning rules set by hand for the websites from which the training documents are acquired.
CN202110891346.5A 2021-08-04 2021-08-04 Balanced text recommendation method Pending CN113590963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110891346.5A CN113590963A (en) 2021-08-04 2021-08-04 Balanced text recommendation method

Publications (1)

Publication Number Publication Date
CN113590963A true CN113590963A (en) 2021-11-02

Family

ID=78254929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110891346.5A Pending CN113590963A (en) 2021-08-04 2021-08-04 Balanced text recommendation method

Country Status (1)

Country Link
CN (1) CN113590963A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170193291A1 (en) * 2015-12-30 2017-07-06 Ryan Anthony Lucchese System and Methods for Determining Language Classification of Text Content in Documents
US20180260490A1 (en) * 2016-07-07 2018-09-13 Tencent Technology (Shenzhen) Company Limited Method and system for recommending text content, and storage medium
CN110287312A (en) * 2019-05-10 2019-09-27 平安科技(深圳)有限公司 Calculation method, device, computer equipment and the computer storage medium of text similarity
CN110737839A (en) * 2019-10-22 2020-01-31 京东数字科技控股有限公司 Short text recommendation method, device, medium and electronic equipment
CN112784013A (en) * 2021-01-13 2021-05-11 北京理工大学 Multi-granularity text recommendation method based on context semantics
CN113204956A (en) * 2021-07-06 2021-08-03 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
詹新惠: "《网络与新媒体编辑运营实务》" (Practice of Network and New Media Editing and Operation), Xidian University Press (西安电子科技大学出版社) *

Similar Documents

Publication Publication Date Title
CN106156204B (en) Text label extraction method and device
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN106201465B (en) Software project personalized recommendation method for open source community
CN103870973B (en) Information push, searching method and the device of keyword extraction based on electronic information
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
CN107239529A (en) A kind of public sentiment hot category classification method based on deep learning
CN108228541B (en) Method and device for generating document abstract
CN103123633A (en) Generation method of evaluation parameters and information searching method based on evaluation parameters
CN104392006B (en) A kind of event query processing method and processing device
CN112507711A (en) Text abstract extraction method and system
CN106844632A (en) Based on the product review sensibility classification method and device that improve SVMs
US20140229486A1 (en) Method and apparatus for unsupervised learning of multi-resolution user profile from text analysis
CN107506472B (en) Method for classifying browsed webpages of students
CN109446423B (en) System and method for judging sentiment of news and texts
CN103810162A (en) Method and system for recommending network information
CN110516074A (en) Website theme classification method and device based on deep learning
CN110222172A (en) A kind of multi-source network public sentiment Topics Crawling method based on improvement hierarchical clustering
CN108681548A (en) A kind of lawyer's information processing method and system
CN111966832A (en) Evaluation object extraction method and device and electronic equipment
CN111368529B (en) Mobile terminal sensitive word recognition method, device and system based on edge calculation
CN113312476A (en) Automatic text labeling method and device and terminal
CN114722198A (en) Method, system and related device for determining product classification code
CN113010705B (en) Label prediction method, device, equipment and storage medium
CN110569351A (en) Network media news classification method based on restrictive user preference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination