CN113590963A - Balanced text recommendation method - Google Patents

Balanced text recommendation method

Info

Publication number
CN113590963A
Authority
CN
China
Prior art keywords
texts
text
category
contrast
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110891346.5A
Other languages
Chinese (zh)
Inventor
罗列异
任益斌
程韶曦
王强
吴昭琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huang Jiqi
Zhejiang Xinlan Network Media Co ltd
Original Assignee
Huang Jiqi
Zhejiang Xinlan Network Media Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huang Jiqi and Zhejiang Xinlan Network Media Co., Ltd.
Priority to CN202110891346.5A
Publication of CN113590963A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention discloses a balanced text recommendation method, which comprises the following steps: acquiring a number of balanced training texts; pre-training a built Bert-embedding classification model through the training texts; classifying platform text data containing a number of first texts through the trained Bert-embedding classification model; acquiring the first text currently browsed by a user of the platform as a comparison text; extracting the other first texts in the same category as the comparison text to serve as a comparison text set; calculating a first similarity between each first text in the comparison text set and the comparison text; sorting the first texts in the comparison text set according to the calculated first similarity; and recommending the top-ranked first texts to the user as recommended texts. With this balanced text recommendation method, the classification model classifies more accurately, and, because similarity is computed at the sentence-vector level, related-content recommendation is more precise.

Description

Balanced text recommendation method
Technical Field
The invention relates to a balanced text recommendation method.
Background
Existing related-text recommendation generally classifies content and then recommends classified news to users. This approach suffers from imbalanced category sizes on the platform: when recommending related content to users, a category with too much news yields recommendations that are too broad, while a category with too little news yields too few recommendations to retain users. In addition, the classification itself is often not accurate enough.
Disclosure of Invention
The invention provides a balanced text recommendation method, which adopts the following technical scheme:
a balanced text recommendation method comprises the following specific steps:
acquiring a plurality of balanced training texts;
pre-training the built Bert-embedding classification model through a training text;
classifying the platform text data containing a plurality of first texts through a trained Bert-embedding classification model;
acquiring a first text currently browsed by a user of a platform as a contrast text;
extracting other first texts in the same category as the comparison texts to serve as comparison text sets;
calculating first similarity between all first texts in the contrast text set and the contrast texts;
sequencing all first texts under the comparison text set according to the calculated first similarity;
and recommending the sorted first texts to the user as recommended texts.
Further, the specific method for extracting the other first texts in the same category as the comparison text set is as follows:
extracting the other first texts in the same category as the comparison text;
counting the other first texts in the same category as the comparison text;
judging whether the count reaches a threshold;
when the count does not reach the threshold, extracting a number of first texts under the other category most relevant to the category of the comparison text;
and forming the comparison text set from all the extracted first texts.
Further, when the count does not reach the threshold, the specific method for extracting first texts under the other category most relevant to the category of the comparison text is as follows:
calculating a second similarity between each other category and the category of the comparison text;
sorting the other categories by the second similarity;
and taking the highest-ranked other category as the most relevant.
Further, after the platform text data containing a number of first texts is classified through the trained Bert-embedding classification model,
the categories of the first texts comprise a number of first-level categories and a number of second-level categories subordinate to the first-level categories.
Further, the specific method for extracting the other first texts in the same category as the comparison text set is as follows:
extracting the other first texts in the same second-level category as the comparison text;
counting the other first texts in the same second-level category as the comparison text;
judging whether the count reaches a threshold;
when the count does not reach the threshold, calculating second similarities between all other second-level categories under the first-level category of the comparison text and the second-level category of the comparison text;
sorting those other second-level categories by the second similarity;
extracting first texts from the other second-level categories in that order and adding them to the first texts in the same second-level category as the comparison text, until the total number of first texts reaches the threshold;
and combining all the extracted first texts into the comparison text set.
Further, if the number of other first texts under the first-level category of the comparison text is smaller than the threshold, the shortfall is set as a second number;
a third similarity between each other first-level category and the first-level category of the comparison text is calculated;
the other first-level categories are sorted by the third similarity;
the second number of first texts are randomly selected from the highest-ranked other first-level category;
and the first texts randomly selected from the highest-ranked other first-level category are combined with the other first texts under the first-level category of the comparison text to form the comparison text set.
Further, the specific method for calculating the first similarity between each first text in the comparison text set and the comparison text is as follows:
encoding all first texts in the comparison text set and the comparison text with Bert-server to obtain their corresponding sentence vectors;
and calculating, through cosine similarity, the first similarity between the sentence vector of each first text in the comparison text set and the sentence vector of the comparison text.
Further, the specific method for acquiring a number of balanced training texts is as follows:
acquiring a number of training documents from websites with mature classification;
cleaning the training documents;
and screening the cleaned training documents so that the data volume of the training documents under each category is the same.
Further, the specific method for cleaning the training documents is as follows:
cleaning the training documents through regular-expression rules.
Further, the specific method for cleaning the training documents may also be as follows:
cleaning the training documents through the regular-expression rules and through manual cleaning rules set by hand for the websites from which the training documents are acquired.
The text recommendation method has the beneficial effects that the classification model classifies more accurately, and, because similarity is computed at the sentence-vector level, related-content recommendation is more precise.
A further beneficial effect is that the balanced text recommendation method solves the problem of an insufficient number of recommendations when a category contains little sample data.
Drawings
FIG. 1 is a schematic diagram of a balanced text recommendation method of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the embodiments.
Fig. 1 shows a balanced text recommendation method of the present invention, which mainly comprises the following specific steps. Step S1: acquire a number of balanced training texts. Step S2: pre-train the built Bert-embedding classification model through the training texts. Step S3: classify the platform text data containing a number of first texts through the trained Bert-embedding classification model. Step S4: acquire the first text currently browsed by a user of the platform as the comparison text. Step S5: extract the other first texts in the same category as the comparison text to form the comparison text set. Step S6: calculate the first similarity between each first text in the comparison text set and the comparison text. Step S7: sort the first texts in the comparison text set according to the calculated first similarity. Step S8: recommend the top-ranked first texts to the user as recommended texts. With this balanced text recommendation method, the classification model classifies more accurately, and, because similarity is computed at the sentence-vector level, related-content recommendation is more precise. The steps are described in detail below.
For step S1: acquire a number of balanced training texts.
The specific method for acquiring the balanced training texts is as follows: acquire a number of training documents from websites with mature classification; clean the training documents; and screen the cleaned training documents so that each category contains the same amount of training-document data.
Specifically, news websites whose category systems are relatively mature are selected: the number of categories is moderate and the category meanings are clearly differentiated. Categories such as politics and society overlap heavily and differ little semantically, which interferes with the classification model. In this application, a website with mature classification means one whose news articles are accurately classified with a small overlap rate, so that the classification model trained in the subsequent step is more accurate.
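As an illustration of the screening step, the following sketch downsamples every category to the size of the smallest one so that each category contributes the same amount of training data; the docs_by_category structure, the helper name, and the fixed seed are assumptions made for the example, not part of the patent.

```python
import random

def balance_training_texts(docs_by_category: dict[str, list[str]],
                           seed: int = 42) -> dict[str, list[str]]:
    """Downsample every category to the size of the smallest category,
    so each category contributes the same amount of training data."""
    rng = random.Random(seed)
    min_count = min(len(docs) for docs in docs_by_category.values())
    return {category: rng.sample(docs, min_count)
            for category, docs in docs_by_category.items()}
```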
The specific method for cleaning the training documents is as follows: clean the training documents through regular-expression rules.
It can be understood that, since the content on a news website generally originates from different news sources, the HTML tags cannot be parsed in a source-specific way; a highly adaptive HTML-parsing step is therefore applied, leaving only the text content. Preferably, manual cleaning rules are additionally set by hand for each website, and the training documents are then cleaned with both the regular-expression rules and the manual cleaning rules, which yields cleaner text data.
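A minimal sketch of such a cleaning step is shown below, assuming generic regular-expression rules for stripping HTML plus an optional per-site list of extra patterns; the specific patterns and names are illustrative assumptions, not the patent's actual rules.

```python
import re

# Generic regular-expression rules: strip scripts/styles, then all tags,
# then collapse whitespace. These patterns are illustrative only.
GENERIC_RULES = [
    (re.compile(r"(?is)<(script|style).*?</\1>"), " "),
    (re.compile(r"(?s)<[^>]+>"), " "),
    (re.compile(r"\s+"), " "),
]

def clean_document(html: str, site_rules=None) -> str:
    """Apply generic regex rules, then any manually written per-site rules
    (a list of (compiled_pattern, replacement) pairs)."""
    text = html
    for pattern, repl in GENERIC_RULES:
        text = pattern.sub(repl, text)
    for pattern, repl in (site_rules or []):  # manual rules for one website
        text = pattern.sub(repl, text)
    return text.strip()
```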
For step S2: pre-train the built Bert-embedding classification model through the training texts.
Specifically, the Bert-embedding vector representation works at the character level, so the steps of word segmentation, stop-word removal, low-frequency-word removal and the like are unnecessary; this saves a great deal of time and avoids the negative effects that word segmentation may introduce. Bert-embedding specifically uses the ERNIE pre-trained model. ERNIE is an improved version of the BERT model with the same basic architecture, additionally optimized for Chinese corpora. Model training is based on a DNN: since the embedding already covers the semantic information of the news corpus, only a simple neural-network structure is needed to map to the final set of classes. The result improves substantially on traditional machine-learning methods while remaining much smaller in parameter scale than the commonly used Bi-LSTM models.
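As an illustration of this design, the sketch below trains a small feed-forward head over pre-computed sentence embeddings (e.g. from an ERNIE/BERT encoder run separately); the embedding dimension of 768, the hidden size, the class count, and the random stand-in data are assumptions for the example, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    """Simple DNN head over pre-computed sentence embeddings. The encoder
    is run separately; only this small network is trained for the final
    classification, which keeps the parameter count far below a Bi-LSTM."""
    def __init__(self, embed_dim: int = 768, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.net(embeddings)  # logits over categories

# Usage sketch: one training step with cross-entropy loss.
model = EmbeddingClassifier(num_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
batch = torch.randn(32, 768)          # stand-in for real sentence vectors
labels = torch.randint(0, 10, (32,))  # stand-in for real category labels
optimizer.zero_grad()
loss = loss_fn(model(batch), labels)
loss.backward()
optimizer.step()
```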
For step S3: classify the platform text data containing a number of first texts through the trained Bert-embedding classification model.
Specifically, the classification model is trained on the data obtained from the websites described above, and the platform text data that users browse and consult is then classified by the trained model.
For step S4: acquire the first text currently browsed by a user of the platform as the comparison text.
For step S5: extract the other first texts in the same category as the comparison text to form the comparison text set.
Specifically, the comparison text set is extracted as follows: extract the other first texts in the same category as the comparison text; count them; judge whether the count reaches the threshold; when it does not, extract a number of first texts under the other category most relevant to the category of the comparison text; and form the comparison text set from all the extracted first texts.
It will be appreciated that some categories on the platform hold relatively little data. In that case, if recommendations were computed only from the first texts under that category, too little news would leave too few recommendations to retain the user. Therefore, in this application, when the number of first texts in the same category as the comparison text is small, text data is drawn from the most similar other categories to expand the set. The threshold can be set according to the circumstances; in this application it is set to 200 texts.
As a preferred embodiment, when the count does not reach the threshold, the first texts under the most relevant other category are extracted as follows: calculate a second similarity between each other category and the category of the comparison text; sort the other categories by the second similarity; and take the highest-ranked other category as the most relevant.
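A minimal sketch of this fallback follows. The patent does not specify how the category-to-category second similarity is computed; here it is assumed, purely for illustration, to be the cosine similarity between mean sentence vectors of the categories (category_vectors), and the function and argument names are likewise hypothetical.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_comparison_set(category: str,
                         texts_by_category: dict[str, list[str]],
                         category_vectors: dict[str, np.ndarray],
                         threshold: int = 200) -> list[str]:
    """Collect texts from the comparison text's category; if fewer than
    `threshold`, add texts from the most similar other category."""
    pool = list(texts_by_category.get(category, []))
    if len(pool) < threshold:
        others = [c for c in texts_by_category if c != category]
        # Second similarity (assumed): cosine between mean category vectors.
        others.sort(key=lambda c: cosine(category_vectors[c],
                                         category_vectors[category]),
                    reverse=True)
        if others:
            pool.extend(texts_by_category[others[0]])
    return pool
```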
For step S6: calculate the first similarity between each first text in the comparison text set and the comparison text.
Specifically, the first similarity is calculated as follows:
encode all first texts in the comparison text set and the comparison text with Bert-server, obtaining their corresponding sentence vectors;
then calculate, through cosine similarity, the first similarity between the sentence vector of each first text and the sentence vector of the comparison text.
The invention uses Bert-server for sentence-level encoding. The direct advantage of sentence-level vector encoding is that the vector covers the semantic information of the whole sentence, rather than first obtaining word encodings and then deriving a sentence encoding by averaging or similar means. Computing cosine similarity over the sentence vectors surfaces content that is highly similar both at the character level and in semantics, which solves the problem of overly broad recommendations when a category contains too much content.
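The sketch below illustrates this step. It assumes that "Bert-server" refers to the bert-as-service tool (the bert_serving Python package) and that a bert-serving server is already running; that mapping is an inference from the name used in the patent, not something the patent states.

```python
import numpy as np
from bert_serving.client import BertClient  # requires a running bert-serving server

def rank_by_first_similarity(comparison_text: str,
                             candidate_texts: list[str]) -> list[tuple[str, float]]:
    """Encode the comparison text and all candidates into sentence vectors,
    then rank candidates by cosine similarity to the comparison text."""
    bc = BertClient()
    vectors = bc.encode([comparison_text] + candidate_texts)
    query, candidates = vectors[0], vectors[1:]
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    sims = candidates @ query / norms  # first similarity for each candidate
    order = np.argsort(-sims)          # largest similarity first
    return [(candidate_texts[i], float(sims[i])) for i in order]
```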
For step S7: sort the first texts in the comparison text set according to the calculated first similarity.
It is understood that the first texts are ordered by first similarity from largest to smallest.
For step S8: recommend the top-ranked first texts to the user as recommended texts.
Preferably, the top 10 first texts in the ranking are recommended to the user as the recommended texts.
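Continuing the sketch given for step S6, steps S7 and S8 then reduce to taking the head of the ranked list; comparison_text_set stands for the candidate list built in step S5 and is assumed to be available, and the top-10 cutoff follows the preferred embodiment above.

```python
# comparison_text_set: candidate first texts from step S5 (assumed available).
ranked = rank_by_first_similarity(
    comparison_text="text the user is currently browsing",
    candidate_texts=comparison_text_set,
)
recommended = [text for text, _ in ranked[:10]]  # top 10 as recommended texts
```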
In another preferred embodiment, when the platform text data containing a number of first texts is classified through the trained Bert-embedding classification model, the resulting categories of the first texts comprise a number of first-level categories and a number of second-level categories subordinate to the first-level categories.
In that case, the comparison text set is extracted as follows: extract the other first texts in the same second-level category as the comparison text; count them; judge whether the count reaches the threshold; when it does not, calculate the second similarities between all other second-level categories under the first-level category of the comparison text and the second-level category of the comparison text; sort those second-level categories by the second similarity; extract first texts from them in that order, adding them to the first texts in the same second-level category as the comparison text, until the total number of first texts reaches the threshold; and combine all the extracted first texts into the comparison text set.
It will be appreciated that in some application scenarios the classification model uses a two-level taxonomy: a number of first-level categories, each containing a number of second-level categories. In such a case, when recommending, texts with higher similarity are preferentially sought within the second-level category of the comparison text. If that second-level category holds little data, the set is expanded from the sibling second-level categories under the same first-level category, in descending order of second-level-category similarity: the higher the similarity, the earlier a category is drawn from.
As a preferred embodiment, if the number of other first texts under the first-level category of the comparison text is still smaller than the threshold, the shortfall is set as the second number. A third similarity between each other first-level category and the first-level category of the comparison text is calculated; the other first-level categories are sorted by the third similarity; the second number of first texts are randomly selected from the highest-ranked other first-level category; and these randomly selected first texts are combined with the other first texts under the first-level category of the comparison text to form the comparison text set.
It will be appreciated that when the total data volume of all second-level categories under the same first-level category does not reach the threshold, text data is drawn from other first-level categories to expand the comparison text set, again on the basis of similarity. In this application, the second number of first texts are randomly selected from the highest-ranked other first-level category.
It can also be understood that, alternatively, the similarities between the second-level categories under that other first-level category and the second-level category of the comparison text may be calculated, and first texts selected in that order until the number of first texts in the comparison text set reaches the threshold. A sketch of the whole two-level fallback is given below.
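The sketch assumes texts keyed by (first-level, second-level) category pairs and, as in the earlier sketch, treats the unspecified second and third similarities as cosine similarities between mean category vectors; all names are illustrative assumptions.

```python
import random
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_two_level_comparison_set(l1, l2, texts, vecs, threshold=200):
    """texts maps (first_level, second_level) -> list of first texts;
    vecs maps the same keys to mean sentence vectors of each category.
    Expands from (l1, l2) to sibling second-level categories, then to the
    most similar other first-level category for any remaining shortfall."""
    pool = list(texts.get((l1, l2), []))
    # 1) Sibling second-level categories under the same first-level category,
    #    in descending order of second similarity (assumed: cosine of means).
    siblings = [k for k in texts if k[0] == l1 and k != (l1, l2)]
    siblings.sort(key=lambda k: cosine(vecs[k], vecs[(l1, l2)]), reverse=True)
    for key in siblings:
        if len(pool) >= threshold:
            return pool
        pool.extend(texts[key][: threshold - len(pool)])
    # 2) Remaining shortfall (the "second number"): randomly drawn from the
    #    most similar other first-level category (third similarity, assumed:
    #    cosine of mean vectors aggregated over each first level's children).
    if len(pool) < threshold:
        def l1_vec(c):
            return np.mean([vecs[k] for k in vecs if k[0] == c], axis=0)
        others = {k[0] for k in texts if k[0] != l1}
        best = max(others, key=lambda c: cosine(l1_vec(c), l1_vec(l1)),
                   default=None)
        if best is not None:
            candidates = [t for k, ts in texts.items() if k[0] == best for t in ts]
            need = threshold - len(pool)
            pool.extend(random.sample(candidates, min(need, len(candidates))))
    return pool
```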
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It should be understood by those skilled in the art that the above embodiments do not limit the present invention in any way, and all technical solutions obtained by using equivalent alternatives or equivalent variations fall within the scope of the present invention.

Claims (10)

1. A balanced text recommendation method, characterized by comprising the following specific steps:
acquiring a plurality of balanced training texts;
pre-training a built Bert-embedding classification model through the training texts;
classifying platform text data containing a plurality of first texts through the trained Bert-embedding classification model;
acquiring the first text currently browsed by a user of the platform as a comparison text;
extracting the other first texts in the same category as the comparison text to serve as a comparison text set;
calculating a first similarity between each first text in the comparison text set and the comparison text;
sorting the first texts in the comparison text set according to the calculated first similarity;
and recommending a plurality of the top-ranked first texts to the user as recommended texts.
2. The balanced text recommendation method of claim 1, wherein
the specific method for extracting the other first texts in the same category as the comparison text set is as follows:
extracting the other first texts in the same category as the comparison text;
counting the other first texts in the same category as the comparison text;
judging whether the count reaches a threshold;
when the count does not reach the threshold, extracting a plurality of first texts under the other category most relevant to the category of the comparison text;
and forming the comparison text set from the extracted first texts.
3. The balanced text recommendation method of claim 2, wherein
the specific method for extracting the plurality of first texts under the other category most relevant to the category of the comparison text when the count does not reach the threshold is as follows:
calculating a second similarity between each other category and the category of the comparison text;
sorting the other categories according to the second similarity;
and taking the highest-ranked other category as the most relevant.
4. The balanced text recommendation method of claim 1, wherein
after the platform text data containing a plurality of first texts is classified through the trained Bert-embedding classification model,
the categories of the first texts comprise a plurality of first-level categories and a plurality of second-level categories subordinate to the first-level categories.
5. The balanced text recommendation method of claim 4, wherein
the specific method for extracting the other first texts in the same category as the comparison text set is as follows:
extracting the other first texts in the same second-level category as the comparison text;
counting the other first texts in the same second-level category as the comparison text;
judging whether the count reaches a threshold;
when the count does not reach the threshold, calculating second similarities between all other second-level categories under the first-level category of the comparison text and the second-level category of the comparison text;
sorting those other second-level categories according to the second similarity;
extracting first texts from the other second-level categories in that order and adding them to the first texts in the same second-level category as the comparison text, until the total number of first texts reaches the threshold;
and forming the comparison text set from all the extracted first texts.
6. The balanced text recommendation method of claim 5, wherein
if the number of other first texts under the first-level category of the comparison text is smaller than the threshold, the shortfall is set as a second number;
a third similarity between each other first-level category and the first-level category of the comparison text is calculated;
the other first-level categories are sorted according to the third similarity;
the second number of first texts are randomly selected from the highest-ranked other first-level category;
and the first texts randomly selected from the highest-ranked other first-level category are combined with the other first texts under the first-level category of the comparison text to form the comparison text set.
7. The balanced text recommendation method of claim 1, wherein
the specific method for calculating the first similarity between each first text in the comparison text set and the comparison text is as follows:
encoding all the first texts in the comparison text set and the comparison text with Bert-server to obtain their corresponding sentence vectors;
and calculating, through cosine similarity, the first similarity between the sentence vector of each first text in the comparison text set and the sentence vector of the comparison text.
8. The balanced text recommendation method of claim 1, wherein
the specific method for acquiring a plurality of balanced training texts is as follows:
acquiring a plurality of training documents from websites with mature classification;
cleaning the plurality of training documents;
and screening the cleaned training documents so that the data volume of the training documents under each category is the same.
9. The balanced text recommendation method of claim 8, wherein
the specific method for cleaning the training documents is as follows:
cleaning the training documents through regular-expression rules.
10. The balanced text recommendation method of claim 8, wherein
the specific method for cleaning the training documents is as follows:
cleaning the training documents through the regular-expression rules and through manual cleaning rules set by hand for the websites from which the training documents are acquired.
CN202110891346.5A 2021-08-04 2021-08-04 Balanced text recommendation method Pending CN113590963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110891346.5A CN113590963A (en) 2021-08-04 2021-08-04 Balanced text recommendation method

Publications (1)

Publication Number Publication Date
CN113590963A true CN113590963A (en) 2021-11-02

Family

ID=78254929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110891346.5A Pending CN113590963A (en) 2021-08-04 2021-08-04 Balanced text recommendation method

Country Status (1)

Country Link
CN (1) CN113590963A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170193291A1 (en) * 2015-12-30 2017-07-06 Ryan Anthony Lucchese System and Methods for Determining Language Classification of Text Content in Documents
US20180260490A1 (en) * 2016-07-07 2018-09-13 Tencent Technology (Shenzhen) Company Limited Method and system for recommending text content, and storage medium
CN110287312A (en) * 2019-05-10 2019-09-27 平安科技(深圳)有限公司 Calculation method, device, computer equipment and the computer storage medium of text similarity
CN110737839A (en) * 2019-10-22 2020-01-31 京东数字科技控股有限公司 Short text recommendation method, device, medium and electronic equipment
CN112784013A (en) * 2021-01-13 2021-05-11 北京理工大学 Multi-granularity text recommendation method based on context semantics
CN113204956A (en) * 2021-07-06 2021-08-03 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
詹新惠: "《网络与新媒体编辑运营实务》" (Practice of Network and New Media Editing and Operation), Xidian University Press (西安电子科技大学出版社) *

Similar Documents

Publication Publication Date Title
CN106156204B (en) Text label extraction method and device
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN106201465B (en) Software project personalized recommendation method for open source community
CN103870973B (en) Information push, searching method and the device of keyword extraction based on electronic information
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
CN107239529A (en) A kind of public sentiment hot category classification method based on deep learning
CN108228541B (en) Method and device for generating document abstract
CN103123633A (en) Generation method of evaluation parameters and information searching method based on evaluation parameters
CN104392006B (en) A kind of event query processing method and processing device
CN112507711A (en) Text abstract extraction method and system
CN106844632A (en) Based on the product review sensibility classification method and device that improve SVMs
US20140229486A1 (en) Method and apparatus for unsupervised learning of multi-resolution user profile from text analysis
CN107506472B (en) Method for classifying browsed webpages of students
CN109446423B (en) System and method for judging sentiment of news and texts
CN103810162A (en) Method and system for recommending network information
CN110516074A (en) Website theme classification method and device based on deep learning
CN110222172A (en) A kind of multi-source network public sentiment Topics Crawling method based on improvement hierarchical clustering
CN108681548A (en) A kind of lawyer's information processing method and system
CN111966832A (en) Evaluation object extraction method and device and electronic equipment
CN111368529B (en) Mobile terminal sensitive word recognition method, device and system based on edge calculation
CN113312476A (en) Automatic text labeling method and device and terminal
CN114722198A (en) Method, system and related device for determining product classification code
CN113010705B (en) Label prediction method, device, equipment and storage medium
CN110569351A (en) Network media news classification method based on restrictive user preference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination