CN112270177A

CN112270177A - News cover mapping method and device based on content similarity and computing equipment

Info

Publication number: CN112270177A
Application number: CN201910611721.9A
Authority: CN
Inventors: 陈茂森; 罗玄; 黄君实
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2019-07-08
Filing date: 2019-07-08
Publication date: 2021-01-26

Abstract

The invention discloses a news cover mapping method, device, computing device and computer storage medium based on content similarity. The method comprises: extracting the news title and news content of the news to be matched to obtain a news corpus corresponding to the news to be matched; obtaining a bag-of-words vector corresponding to the news to be matched according to first word frequency data of each word in that news corpus; performing topic analysis on the bag-of-words vector to obtain a topic vector corresponding to the news to be matched; and searching a news sample library, according to the topic vector, for a news sample matching the news to be matched, and determining the news cover of the matched news sample as the news cover of the news to be matched. The scheme configures news covers automatically and accurately based on content similarity and improves the efficiency of news cover configuration.

Description

News cover mapping method and device based on content similarity and computing equipment

Technical Field

The invention relates to the field of internet technology, and in particular to a news cover mapping method and device based on content similarity, a computing device and a computer storage medium.

Background

News is a genre for recording society, transmitting information and reflecting the times. With the rapid development of information technology, a great amount of news is generated every moment. To present news vividly, a news cover is usually shown to the user first. A news cover is generally a picture that represents the whole news item or attracts attention; an appropriate news cover can increase the click rate and exposure of the news and plays an important role in its attention and dissemination.

However, in practice many news items have no corresponding news cover, which seriously reduces their click rate and exposure and hinders their dissemination. In addition, in the prior art, configuring a news cover depends mainly on careful selection by staff such as news website editors. Now that news media are so accessible, the volume of news is growing explosively, and selecting news covers inevitably costs editors and other staff a great deal of time. The prior art therefore lacks a method for automatically and accurately configuring news covers for news.

Disclosure of Invention

In view of the above, the present invention has been developed to provide a method, apparatus, computing device and computer storage medium for content similarity-based cover mapping that overcome or at least partially address the above-identified problems.

According to one aspect of the invention, a method for matching a news cover based on content similarity is provided, and the method comprises the following steps:

extracting the news title and news content of the news to be matched to obtain a news corpus corresponding to the news to be matched;

obtaining a bag-of-words vector corresponding to the news to be matched according to first word frequency data of each word in the news corpus corresponding to the news to be matched; performing topic analysis on the bag-of-words vector corresponding to the news to be matched to obtain a topic vector corresponding to the news to be matched;

and searching a news sample library, according to the topic vector corresponding to the news to be matched, for a news sample matching the news to be matched, and determining the news cover of the matched news sample as the news cover of the news to be matched.

According to another aspect of the present invention, there is provided a news cover mapping apparatus based on content similarity, the apparatus including:

a first generation module, adapted to extract the news title and news content of the news to be matched to obtain a news corpus corresponding to the news to be matched;

a first processing module, adapted to obtain a bag-of-words vector corresponding to the news to be matched according to first word frequency data of each word in the news corpus corresponding to the news to be matched, and to perform topic analysis on the bag-of-words vector corresponding to the news to be matched to obtain a topic vector corresponding to the news to be matched;

and a matching module, adapted to search a news sample library, according to the topic vector corresponding to the news to be matched, for a news sample matching the news to be matched, and to determine the news cover of the matched news sample as the news cover of the news to be matched.

According to yet another aspect of the present invention, there is provided a computing device comprising a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the corresponding operation of the news cover mapping method based on the content similarity.

According to yet another aspect of the present invention, a computer storage medium is provided, in which at least one executable instruction is stored, and the executable instruction causes a processor to perform operations corresponding to the content similarity-based news cover mapping method as described above.

According to the technical scheme provided by the invention, the topic vector corresponding to the news to be matched can be conveniently obtained by extracting the news title and news content of the news to be matched, processing the word frequency data and performing topic analysis. According to this topic vector, a news sample matching the news to be matched can be quickly found in the news sample library, and the news cover of the matched news sample is determined as the news cover of the news to be matched. News covers are thus configured automatically and accurately based on content similarity, the efficiency of news cover configuration is improved, and the configured news cover is strongly associated in content with the news to be matched and accurately reflects that content.

The foregoing is only an overview of the technical solutions of the present invention. Embodiments of the present invention are described below so that the technical means of the invention can be understood more clearly, and so that the above and other objects, features and advantages of the invention are more readily apparent.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 illustrates a flow diagram of a news cover mapping method based on content similarity according to one embodiment of the present invention;

FIG. 2 is a flow diagram illustrating a news cover mapping method based on content similarity according to another embodiment of the present invention;

FIG. 3 is a block diagram of a news cover mapping apparatus based on content similarity according to an embodiment of the present invention;

FIG. 4 shows a schematic structural diagram of a computing device according to an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Fig. 1 is a flow chart of a news cover mapping method based on content similarity according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:

Step S101, extracting the news title and news content of the news to be matched to obtain a news corpus corresponding to the news to be matched.

The news to be matched is news which contains news titles and news contents and is not provided with a news cover. In order to configure a suitable news cover related to the content of the news to be matched, in step S101, a news title and news content of the news to be matched are extracted, and then a news corpus corresponding to the news to be matched is obtained according to the extracted news title and news content of the news to be matched.

The extracted news title and news content of the news to be matched contain many words that carry no practical meaning, as well as words that appear so frequently in other news that they have no distinguishing power, such as 'yes'. Therefore, after the news title and news content are extracted, they can be processed according to a preset filtering strategy to obtain the news corpus corresponding to the news to be matched. The preset filtering strategy can be set by a person skilled in the art according to actual needs and is not specifically limited herein; for example, it may include a stop word filtering strategy, a preset common word filtering strategy, a word frequency filtering strategy, and the like. Processing the extracted news title and news content in this way effectively removes words with no actual meaning or distinguishing power; the retained words are keywords that reflect the essential content of the news and are used to form the news corpus corresponding to the news to be matched. This reduces the number of words in the news corpus, reduces the amount of data to be processed, and improves the efficiency of processing the news corpus.

Step S102, obtaining a bag-of-words vector corresponding to the news to be matched according to first word frequency data of each word in the news corpus corresponding to the news to be matched.

After the news corpus corresponding to the news to be matched is obtained, the frequency of occurrence of each word in that news corpus is counted to obtain the first word frequency data of each word. After the first word frequency data are obtained, a preset weighting model can be used to obtain the bag-of-words vector corresponding to the news to be matched according to the first word frequency data.

The bag-of-words vector is a high-dimensional vector. The dimension of the bag-of-words vector corresponding to the news to be matched is equal to the total number of words in the news corpus corresponding to the news to be matched, and its elements can be the word vectors of those words. In a specific embodiment, the word vector of each word may include the Term Frequency-Inverse Document Frequency (TF-IDF) value of the word, and the like, which is not specifically limited herein.

Step S103, performing topic analysis on the bag-of-words vector corresponding to the news to be matched to obtain a topic vector corresponding to the news to be matched.

After the bag-of-words vector corresponding to the news to be matched is obtained, a trained topic analysis model can be used to perform topic analysis on it. This reduces the dimension of the bag-of-words vector and yields the topic vector corresponding to the news to be matched, which is a low-dimensional vector. In a specific embodiment, the trained topic analysis model may be a Latent Semantic Indexing (LSI) model or the like, which is not limited herein.

Step S104, searching a news sample library, according to the topic vector corresponding to the news to be matched, for a news sample matching the news to be matched, and determining the news cover of the matched news sample as the news cover of the news to be matched.

The news sample library includes a large number of news samples; a news sample is news that serves as a sample and includes a news title, news content and a news cover. Each news sample is processed in advance to obtain its corresponding topic vector. In step S104, the news sample whose topic vector is closest to the topic vector corresponding to the news to be matched can be searched for in the news sample library; to speed up the search, the Faiss library can be used to perform the search at high speed on a GPU. The news sample matching the news to be matched is then determined from the news sample with the closest topic vector. Because the matched news sample is highly similar in content to the news to be matched, its news cover can be determined as the news cover of the news to be matched. News covers are thus configured automatically and accurately based on content similarity, and the configured news cover is strongly associated in content with the news to be matched and accurately reflects that content.

According to the news cover mapping method based on content similarity provided by this embodiment, the topic vector corresponding to the news to be matched can be conveniently obtained by extracting the news title and news content of the news to be matched, processing the word frequency data and performing topic analysis. According to this topic vector, a news sample matching the news to be matched can be quickly found in the news sample library, and the news cover of the matched news sample is determined as the news cover of the news to be matched. News covers are thus configured automatically and accurately based on content similarity, the efficiency of news cover configuration is improved, and the configured news cover is strongly associated in content with the news to be matched and accurately reflects that content.

Fig. 2 is a flow chart of a news cover mapping method based on content similarity according to another embodiment of the present invention, as shown in fig. 2, the method includes the following steps:

step S201, extracting the news title and the news content of each news sample from the news sample library to obtain a news corpus corresponding to each news sample.

The news sample library includes a large number of news samples; a news sample is news that serves as a sample and includes a news title, news content and a news cover. The number of news samples included in the news sample library can be set by those skilled in the art according to actual needs and is not specifically limited here. For example, the news sample library may include 3 million news samples.

The news title and news content of each news sample are extracted from the news sample library. Because the extracted news titles and news contents contain many words that have no practical meaning, or that appear so often in other news that they have no distinguishing power, a preset filtering strategy comprising a stop word filtering strategy, a preset common word filtering strategy and a word frequency filtering strategy is set, and the extracted news title and news content of each news sample are processed according to this preset filtering strategy to obtain the news corpus corresponding to each news sample.

Specifically, for each news sample in the news sample library, the stop words and preset common words contained in its extracted news title and news content are screened out to obtain the preprocessed corpus corresponding to the news sample. The stop words may include "yes", "except" and the like. A person skilled in the art can set the preset common words according to actual needs; for example, the preset common words may be words that, according to statistics over a large amount of news, occur in 50% of news items.
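By way of illustration only, the screening of stop words and preset common words might be sketched as follows. The stop word list, the function names and the reading of the 50% figure as "at least 50% of the news items" are assumptions of this sketch and are not specified by the embodiment.

```python
from collections import Counter

# Hypothetical stop word list; a practical list would be far larger.
STOP_WORDS = {"is", "the", "of", "except"}


def find_common_words(docs, ratio=0.5):
    """Return words occurring in at least `ratio` of the documents (assumed threshold)."""
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per document
    threshold = ratio * len(docs)
    return {w for w, n in doc_freq.items() if n >= threshold}


def preprocess(words, stop_words, common_words):
    """Screen out stop words and preset common words, keeping word order."""
    return [w for w in words if w not in stop_words and w not in common_words]


# Each inner list stands for the segmented title and content of one news sample.
docs = [
    ["the", "team", "wins", "the", "final", "report"],
    ["report", "of", "market", "rally"],
    ["storm", "hits", "coast", "report"],
    ["new", "phone", "is", "released"],
]
common = find_common_words(docs, ratio=0.5)
cleaned = [preprocess(d, STOP_WORDS, common) for d in docs]
```

Here "report" occurs in three of the four documents and is therefore treated as a preset common word, so it is removed together with the stop words.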

After the stop words and the preset common words are screened out, the obtained preprocessed corpus corresponding to the news sample still contains a large number of words, and in order to further simplify the preprocessed corpus and reduce the data processing amount in the process of processing the news corpus, the words contained in the preprocessed corpus corresponding to the news sample need to be further filtered according to a word frequency filtering strategy.

Specifically, the frequency of occurrence of each word in the preprocessed corpus corresponding to all news samples is counted, and third word frequency data of each word in the preprocessed corpus corresponding to all news samples is obtained through calculation. And after the third word frequency data are obtained, obtaining a news corpus corresponding to the news sample by using all words of which the third word frequency data in the preprocessed corpus accord with preset word frequency conditions.

In one embodiment, the preset word frequency condition may be set as follows: when all words in the preprocessed corpora corresponding to all news samples are ranked by third word frequency data from high to low, the word ranks within the top 30,000. Suppose the preprocessed corpus corresponding to a certain news sample contains 5 words, namely word 1, word 2, word 3, word 4 and word 5, so that the preprocessed corpus can be represented as (word 1, word 2, word 3, word 4, word 5). If, after all words in the preprocessed corpora corresponding to all news samples are sorted from high to low by third word frequency data, only word 1, word 2 and word 3 of this preprocessed corpus rank within the top 30,000, then only the third word frequency data of word 1, word 2 and word 3 meets the preset word frequency condition. The news corpus corresponding to the news sample is therefore obtained from word 1, word 2 and word 3 and can be represented as (word 1, word 2, word 3).
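The word frequency filtering described above might be sketched as follows. This is a minimal illustration in which the cutoff is reduced from 30,000 to 3 so that the example stays small; the function names and sample data are hypothetical.

```python
from collections import Counter


def top_k_vocabulary(preprocessed_docs, k):
    """Return the k words with the highest frequency over all preprocessed corpora."""
    freq = Counter()
    for words in preprocessed_docs:
        freq.update(words)  # third word frequency data: counts over all samples
    return {w for w, _ in freq.most_common(k)}


def to_news_corpus(words, vocabulary):
    """Keep only the words whose frequency rank meets the preset condition."""
    return [w for w in words if w in vocabulary]


preprocessed = [
    ["word1", "word2", "word3", "word4", "word5"],
    ["word1", "word2", "word3", "word1"],
    ["word2", "word3", "word1"],
]
vocab = top_k_vocabulary(preprocessed, k=3)
news_corpus = to_news_corpus(preprocessed[0], vocab)
```

With these counts, word1, word2 and word3 rank highest, so the first preprocessed corpus is reduced to (word1, word2, word3), mirroring the example in the text.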

Step S202, a news corpus is constructed by utilizing news corpora corresponding to all news samples.

After the news corpora corresponding to each news sample are obtained, the news corpora corresponding to all the news samples are collected, and a news corpus is obtained through construction, namely the news corpus comprises the news corpora corresponding to all the news samples.

Step S203, obtaining a bag-of-words vector corresponding to each news sample according to second word frequency data of each word in the news corpus corresponding to the news sample, the second word frequency data being the frequency of the word within that news corpus.

After the news corpus corresponding to each news sample is obtained, the frequency with which each word occurs within the news corpus corresponding to its news sample is counted to obtain the second word frequency data of the word. After the second word frequency data is obtained, the bag-of-words vector corresponding to each news sample can be obtained according to the second word frequency data of each word in the news corpus corresponding to that news sample.

In one embodiment, the bag-of-words vector corresponding to each news sample may be computed using a TF-IDF model. The principle of the TF-IDF model is that the importance of a word is proportional to the number of times the word appears in the news corpus of its own news sample and decreases the more widely the word appears across the overall news corpus constructed in step S202. The TF-IDF value of each word computed by the TF-IDF model is used to determine the word vector of the word, and the word vectors are used to determine the bag-of-words vector. In other words, the TF-IDF value of a word increases in proportion to its second word frequency data in the news corpus corresponding to the news sample and decreases as the word becomes more frequent across the overall news corpus.

To calculate the bag-of-words vector corresponding to each news sample accurately, both the second word frequency data of each word in the news corpus corresponding to the news sample and the inverse frequency data of each word must be determined.

Specifically, for each word in the news corpus corresponding to each news sample, the inverse frequency data of the word is calculated from the total number of news corpora in the overall news corpus and the number of news corpora that contain the word. In practical application, the total number of news corpora can be divided by the number of news corpora containing the word to obtain an intermediate result, and the base-10 logarithm of the intermediate result is taken as the inverse frequency data of the word. A word vector of each word is then obtained from the second word frequency data of the word in the news corpus corresponding to the news sample and the inverse frequency data of the word; for example, the two are multiplied, the product is determined as the TF-IDF value of the word, and the word vector of the word is determined from the TF-IDF value. Finally, the word vector of each word in the news corpus corresponding to the news sample is taken as an element of the bag-of-words vector corresponding to the news sample, so that the bag-of-words vector is obtained from the word vectors of all words in that news corpus. The bag-of-words vector is a high-dimensional vector whose dimension is equal to the total number of words in the news corpus corresponding to the news sample.
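The calculation above might be sketched as follows. Two points are assumptions of this sketch rather than statements of the embodiment: the word frequency is normalized by the length of the news corpus (raw counts could be substituted), and the vector is laid out over a shared, sorted vocabulary so that vectors of different news samples are comparable. The function names and sample data are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency(corpus):
    """Inverse frequency data: log10(total corpora / corpora containing the word)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for words in corpus:
        doc_freq.update(set(words))  # count each word once per news corpus
    return {w: math.log10(n_docs / n) for w, n in doc_freq.items()}


def tfidf_vector(words, idf, vocabulary):
    """One TF-IDF value per vocabulary word; 0.0 for words absent from `words`."""
    counts = Counter(words)
    total = len(words)
    if total == 0:
        return [0.0] * len(vocabulary)
    # Assumed normalization: word frequency = count / length of the news corpus.
    return [(counts[w] / total) * idf.get(w, 0.0) for w in vocabulary]


corpus = [
    ["goal", "match", "goal"],
    ["match", "market"],
    ["market", "rally"],
    ["storm", "coast"],
]
idf = inverse_frequency(corpus)
vocabulary = sorted(idf)
vector = tfidf_vector(corpus[0], idf, vocabulary)
```

In this example "goal" appears in one of four news corpora, so its inverse frequency data is log10(4/1), while "match" appears in two and receives log10(4/2).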

Step S204, performing topic analysis on the bag-of-words vector corresponding to each news sample to obtain a topic vector corresponding to each news sample.

The bag-of-words vector corresponding to each news sample can be input into the trained topic analysis model to obtain the topic vector corresponding to the news sample; the topic vector is a low-dimensional vector. Specifically, the trained topic analysis model may be an LSI model, which can infer the hidden meaning of a word from the context in which it occurs. The core idea of the LSI model is to map words into a latent topic space by an unsupervised method to generate latent topic vectors, and to reduce the dimension of the bag-of-words vector by Singular Value Decomposition (SVD) to obtain a low-dimensional topic vector, thereby effectively reducing the complexity of the data and the noise in the data.
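The SVD-based dimension reduction might be sketched with NumPy as follows. This is a simplified illustration of the LSI idea, not the trained model of the embodiment; the matrix layout (rows are news samples, columns are vocabulary words), the function names and the toy data are assumptions.

```python
import numpy as np


def fit_lsi(doc_term_matrix, num_topics):
    """Return the top `num_topics` right singular vectors as a projection basis."""
    # Rows are news samples, columns are vocabulary words.
    _, _, vt = np.linalg.svd(doc_term_matrix, full_matrices=False)
    return vt[:num_topics]


def to_topic_vector(bow_vector, basis):
    """Project a bag-of-words vector into the low-dimensional topic space."""
    return basis @ np.asarray(bow_vector, dtype=float)


# Toy matrix with two obvious latent topics (columns 0-1 and columns 2-3).
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 3.0],
])
basis = fit_lsi(X, num_topics=2)
topics = X @ basis.T  # topic vectors of all samples, shape (4, 2)
```

The same basis is reused for a new bag-of-words vector via `to_topic_vector`, which corresponds to feeding the news to be matched through the same trained model as the news samples.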

Step S205, extracting the news title and news content of the news to be matched to obtain a news corpus corresponding to the news to be matched.

The extracted news title and news content of the news to be matched are processed in the same way as described in step S201 for the news samples, to obtain the news corpus corresponding to the news to be matched. Specifically, stop words and preset common words contained in the extracted news title and news content of the news to be matched are screened out to obtain the preprocessed corpus corresponding to the news to be matched; third word frequency data of each word in the preprocessed corpora corresponding to all news samples is calculated; and the news corpus corresponding to the news to be matched is obtained from all words in its preprocessed corpus whose third word frequency data meets the preset word frequency condition.

Step S206, according to the first word frequency data of each word in the news corpus corresponding to the news to be matched, a word bag vector corresponding to the news to be matched is obtained.

The bag-of-words vector corresponding to the news to be matched is determined in the same way as described in step S203 for the news samples. Specifically, for each word in the news corpus corresponding to the news to be matched, the inverse frequency data of the word is calculated from the total number of news corpora in the overall news corpus and the number of news corpora containing the word; a word vector of each word is obtained from the first word frequency data of the word in the news corpus corresponding to the news to be matched and the inverse frequency data of the word; and the bag-of-words vector corresponding to the news to be matched is obtained from the word vectors of all words in that news corpus.

Step S207, performing topic analysis on the bag-of-words vector corresponding to the news to be matched to obtain a topic vector corresponding to the news to be matched.

The bag-of-words vector corresponding to the news of the graph to be matched can be input into the same trained topic analysis model adopted in step S204, so as to obtain the topic vector corresponding to the news of the graph to be matched.

Step S208, calculating the Euclidean distance between the theme vector corresponding to the news to be matched and the theme vector corresponding to each news sample in the news sample library.

The Euclidean distance, also known as the Euclidean metric, is the true distance between two points in an m-dimensional space, or the natural length of a vector. The Euclidean distance between the topic vector corresponding to the news to be matched and the topic vector corresponding to each news sample in the news sample library can be calculated using the known Euclidean distance formula of the prior art.
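For reference, the standard formula is given here, where the symbols x and y (the two m-dimensional topic vectors) are notation introduced only for this illustration:

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{m} \left( x_i - y_i \right)^2}
```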

Step S209, selecting from all the news samples the news sample whose topic vector has the smallest Euclidean distance to the topic vector corresponding to the news to be matched.

The smaller the Euclidean distance between the topic vector corresponding to the news to be matched and the topic vector corresponding to a news sample, the more similar the two are in content. The news sample whose topic vector has the smallest Euclidean distance to the topic vector corresponding to the news to be matched can therefore be selected from all the news samples; this news sample is the one most similar in content to the news to be matched.

There may be one news sample with the smallest Euclidean distance, or there may be several. If only one news sample has the smallest Euclidean distance to the topic vector corresponding to the news to be matched, one news sample is selected and step S210 is executed; if several news samples share the smallest Euclidean distance, all of them are selected and step S211 is executed.
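Steps S208 and S209, including the case where several news samples share the smallest distance, might be sketched as follows. The brute-force NumPy search, the tolerance `tol` used to treat distances as equal, and the toy vectors are assumptions of this sketch; it stands in for, and is not, a Faiss-based search.

```python
import numpy as np


def nearest_samples(query, sample_vectors, tol=1e-9):
    """Return indices of all samples at the minimum Euclidean distance, plus all distances."""
    dists = np.linalg.norm(sample_vectors - query, axis=1)
    # `tol` is a hypothetical tolerance for treating two distances as tied.
    indices = np.flatnonzero(dists <= dists.min() + tol).tolist()
    return indices, dists


# Toy topic vectors of four news samples and of the news to be matched.
samples = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 1.0]])
query = np.array([1.0, 1.0])
indices, dists = nearest_samples(query, samples)

# One index leads to step S210; several indices lead to step S211.
needs_title_ranking = len(indices) > 1
```

In this toy data the samples at indices 2 and 3 are both at distance 1 from the query, so more than one sample is selected and the title-based ranking would follow.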

Step S210, determining the selected news sample as a news sample matched with the news of the to-be-matched picture, and determining a news cover of the matched news sample as a news cover of the news of the to-be-matched picture.

If the number of the selected news samples is one, the selected news samples are directly determined as the news samples matched with the news of the to-be-matched graph, the matched news samples and the news of the to-be-matched graph have higher similarity in content, and then the news cover of the matched news samples can be determined as the news cover of the news of the to-be-matched graph.

Step S211, calculating the similarity between the news headline of the news to be matched and the news headline of each selected news sample.

If the number of the selected news samples is multiple, the selected news samples need to be further finely ranked based on news titles, and news samples which are most similar to the news to be matched in news titles are determined from the selected news samples. Specifically, considering that the nouns in the news headlines can reflect the headline content to a large extent, part-of-speech tagging may be performed on each word in the news headlines of the to-be-matched news and each word in the news headlines of each selected news sample, the nouns in the news headlines of the to-be-matched news and the nouns in the news headlines of each selected news sample may be extracted, then the nouns in the news headlines of the to-be-matched news and the nouns in the news headlines of each selected news sample may be compared, the similarity between the nouns in the news headlines of the to-be-matched news and the nouns in the news headlines of each selected news sample may be calculated, and the similarity between the news headlines of the to-be-matched news and the news headlines of each selected news sample may be determined according to the calculation result.

Step S212, determining the news sample with the maximum similarity as the news sample matched with the news to be matched, and determining the news cover of the matched news sample as the news cover of the news to be matched.

After the similarity between the news title of the news to be matched and the news title of each selected news sample has been calculated, the selected news samples are sorted in ascending order of similarity. The news sample with the highest similarity is the one whose news title is most similar to that of the news to be matched; it is determined as the news sample matched with the news to be matched, and its news cover is determined as the news cover of the news to be matched.
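Steps S211 and S212 can be illustrated with the following sketch. The text does not specify how the similarity between nouns is computed, so the Jaccard overlap used here is an assumption; the nouns are also assumed to have been extracted already by a part-of-speech tagger, and the function names and example nouns are hypothetical.

```python
def noun_similarity(nouns_a, nouns_b):
    """Jaccard overlap between two collections of nouns (0.0 when both are empty)."""
    a, b = set(nouns_a), set(nouns_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_by_title(query_nouns, candidate_nouns):
    """Score each candidate title against the query title and return the best match."""
    scores = {
        sample_id: noun_similarity(query_nouns, nouns)
        for sample_id, nouns in candidate_nouns.items()
    }
    # The sample with the maximum similarity is the matched news sample (step S212).
    best_id = max(scores, key=scores.get)
    return best_id, scores


# Hypothetical nouns taken from the title of the news to be matched
# and from the titles of two tied news samples.
query_nouns = ["city", "marathon", "record"]
candidates = {
    "s1": ["city", "council", "budget"],
    "s3": ["marathon", "record", "runner"],
}
best_id, scores = pick_by_title(query_nouns, candidates)
```

Any other set or vector similarity could be substituted for `noun_similarity` without changing the selection logic.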

According to the content similarity-based news cover mapping method provided by this embodiment, news titles and news contents are processed with a stop-word filtering strategy, a preset common-word filtering strategy and a word frequency filtering strategy. Words that carry no actual meaning or have no distinguishing capability are thereby removed from the news titles and news contents, which reduces the number of words in the news corpora, reduces the amount of data to be processed, and improves the efficiency of processing the news corpora. Moreover, the bag-of-words vector of each news item can be accurately calculated from the word frequency data of each word in its news corpus, and the trained topic analysis model then reduces the dimensionality of the bag-of-words vector to obtain a low-dimensional topic vector, which reduces both the complexity of the data and the noise in it. Using the topic vector corresponding to the news to be matched and the topic vectors corresponding to the news samples, the news sample most similar in content to the news to be matched can be quickly found in the news sample library. In addition, when several news samples are equally most similar in content to the news to be matched, they are further finely ranked based on news titles, so that the matched news sample is accurately determined. The news cover of the matched news sample is determined as the news cover of the news to be matched, so the configured news cover is strongly associated with the content of the news to be matched and can accurately reflect that content.

Fig. 3 is a block diagram illustrating the structure of a news cover mapping apparatus based on content similarity according to an embodiment of the present invention. As shown in Fig. 3, the apparatus includes: a first generation module 301, a first processing module 302 and a matching module 303.

The first generation module 301 is adapted to: extract the news title and news content of the news to be matched to obtain a news corpus corresponding to the news to be matched.

The first processing module 302 is adapted to: obtain a bag-of-words vector corresponding to the news to be matched according to first word frequency data of each word in the news corpus corresponding to the news to be matched; and perform topic analysis on the bag-of-words vector corresponding to the news to be matched to obtain a topic vector corresponding to the news to be matched.

The matching module 303 is adapted to: search the news sample library for a news sample matched with the news to be matched according to the topic vector corresponding to the news to be matched, and determine the news cover of the matched news sample as the news cover of the news to be matched.

Optionally, the apparatus may further comprise: a second generation module 304, a construction module 305 and a second processing module 306.

The second generation module 304 is adapted to: extract the news title and news content of each news sample from the news sample library to obtain a news corpus corresponding to each news sample.

The construction module 305 is adapted to: construct a news corpus by using the news corpora corresponding to all news samples; wherein each news sample contains a news cover.

The second processing module 306 is adapted to: obtain a bag-of-words vector corresponding to each news sample according to second word frequency data of each word within the news corpus corresponding to that news sample; and perform topic analysis on the bag-of-words vector corresponding to each news sample to obtain a topic vector corresponding to each news sample.

Optionally, the second generation module 304 is further adapted to: for each news sample in the news sample library, filter out the stop words and preset common words contained in the extracted news title and news content of the news sample to obtain a preprocessed corpus corresponding to the news sample; calculate third word frequency data of each word in the preprocessed corpora corresponding to all news samples; and obtain the news corpus corresponding to the news sample from all words in the preprocessed corpus whose third word frequency data meets a preset word frequency condition.
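The three filtering steps above can be sketched as follows. The preset word frequency condition is not specified in this passage, so the `min_count` and `max_count` bounds are assumptions, as are the function name, the pre-tokenized input and the example word lists.

```python
from collections import Counter


def build_news_corpora(tokenized_samples, stop_words, common_words,
                       min_count=2, max_count=None):
    """Filter each sample's tokens and keep words meeting an assumed frequency condition."""
    excluded = set(stop_words) | set(common_words)

    # Step 1: remove stop words and preset common words to get the preprocessed corpora.
    preprocessed = {
        sample_id: [w for w in tokens if w not in excluded]
        for sample_id, tokens in tokenized_samples.items()
    }

    # Step 2: third word frequency data, counted over the preprocessed corpora of all samples.
    frequency = Counter(w for tokens in preprocessed.values() for w in tokens)

    # Step 3: keep only words whose frequency meets the (assumed) condition.
    def keep(word):
        count = frequency[word]
        return count >= min_count and (max_count is None or count <= max_count)

    return {
        sample_id: [w for w in tokens if keep(w)]
        for sample_id, tokens in preprocessed.items()
    }


# Hypothetical tokenized titles and contents of two news samples.
tokenized = {
    "s1": ["the", "team", "won", "the", "final", "today"],
    "s2": ["team", "lost", "final", "report"],
}
corpora = build_news_corpora(tokenized, stop_words={"the"},
                             common_words={"today", "report"})
```

With these inputs, words that appear only once across all samples ("won", "lost") are dropped, which is one plausible reading of removing words with no distinguishing capability.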

Optionally, the second processing module 306 is further adapted to: for each word in the news corpus corresponding to each news sample, calculate reverse frequency data of the word from the total number of news corpora in the news corpus and the number of news corpora containing the word; obtain a word vector of each word according to the second word frequency data of the word in the news corpus corresponding to the news sample and the reverse frequency data of the word; and obtain the bag-of-words vector corresponding to the news sample from the word vectors of all words in the news corpus corresponding to the news sample.
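The calculation above resembles the weighting commonly called TF-IDF, and can be sketched under that assumption. The exact formulas are not given in this passage: the relative-frequency form of the word frequency, the logarithmic form `log(N / n)` of the reverse frequency, and the product of the two as each word's weight are assumptions, as are the function name and the example data.

```python
import math
from collections import Counter


def tfidf_bag_of_words(corpora, vocabulary):
    """Build one weighted bag-of-words vector per news sample over a fixed vocabulary."""
    total = len(corpora)
    # Number of news corpora that contain each word.
    containing = Counter(w for tokens in corpora.values() for w in set(tokens))
    # Reverse frequency: log(total corpora / corpora containing the word).
    reverse_freq = {
        w: math.log(total / containing[w]) for w in vocabulary if containing[w]
    }

    vectors = {}
    for sample_id, tokens in corpora.items():
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for an empty corpus
        # Each word's weight is its word frequency times its reverse frequency.
        vectors[sample_id] = [
            counts[w] / length * reverse_freq.get(w, 0.0) for w in vocabulary
        ]
    return vectors


# Hypothetical news corpora after filtering.
corpora = {
    "s1": ["team", "final", "team"],
    "s2": ["market", "final"],
}
vocabulary = ["final", "market", "team"]
vectors = tfidf_bag_of_words(corpora, vocabulary)
```

In this toy data "final" occurs in every corpus, so its reverse frequency is zero and it contributes nothing to either vector, while "team" and "market" dominate their respective samples.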

Optionally, the second processing module 306 is further adapted to: input the bag-of-words vector corresponding to each news sample into the trained topic analysis model to obtain the topic vector corresponding to each news sample.
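As a rough illustration of how a bag-of-words vector becomes a lower-dimensional topic vector, the sketch below applies a linear projection. This passage does not name the topic analysis model, so the linear form is only a stand-in for whatever trained model is used; the function name `to_topic_vector` and all matrix values are made up for the example.

```python
def to_topic_vector(bow_vector, topic_word_matrix):
    """Project a bag-of-words vector onto topics, one dot product per topic row."""
    if any(len(row) != len(bow_vector) for row in topic_word_matrix):
        raise ValueError("each topic row needs one weight per vocabulary word")
    return [
        sum(weight * value for weight, value in zip(row, bow_vector))
        for row in topic_word_matrix
    ]


# Hypothetical stand-in for trained model parameters: 2 topics over a 4-word vocabulary.
topic_word_matrix = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
bow = [0.5, 0.0, 0.25, 0.0]

# The 4-dimensional bag-of-words vector is reduced to a 2-dimensional topic vector.
topic_vector = to_topic_vector(bow, topic_word_matrix)
```

The point of the example is the change in dimensionality: distances are later computed between these short topic vectors rather than between the full bag-of-words vectors.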

Optionally, the matching module 303 is further adapted to: calculate the Euclidean distance between the topic vector corresponding to the news to be matched and the topic vector corresponding to each news sample in the news sample library; select, from all news samples, the news sample whose topic vector has the minimum Euclidean distance to the topic vector corresponding to the news to be matched; if one news sample is selected, determine the selected news sample as the news sample matched with the news to be matched; if several news samples are selected, calculate the similarity between the news title of the news to be matched and the news title of each selected news sample, and determine the news sample with the maximum similarity as the news sample matched with the news to be matched.

According to the content similarity-based news cover mapping apparatus provided by this embodiment, news titles and news contents are processed with a stop-word filtering strategy, a preset common-word filtering strategy and a word frequency filtering strategy. Words that carry no practical meaning or have no distinguishing capability are thereby removed from the news titles and news contents, which reduces the number of words in the news corpora, reduces the amount of data to be processed, and improves the efficiency of processing the news corpora. Moreover, the bag-of-words vector of each news item can be accurately calculated from the word frequency data of each word in its news corpus, and the trained topic analysis model then reduces the dimensionality of the bag-of-words vector to obtain a low-dimensional topic vector, which reduces both the complexity of the data and the noise in it. Using the topic vector corresponding to the news to be matched and the topic vectors corresponding to the news samples, the news sample most similar in content to the news to be matched can be quickly found in the news sample library. In addition, when several news samples are equally most similar in content to the news to be matched, they are further finely ranked based on news titles, so that the matched news sample is accurately determined. The news cover of the matched news sample is determined as the news cover of the news to be matched, so the configured news cover is strongly associated with the content of the news to be matched and can accurately reflect that content.

The invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores at least one executable instruction, and the executable instruction can execute the news cover mapping method based on content similarity in any method embodiment.

Fig. 4 is a schematic structural diagram of a computing device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computing device.

As shown in Fig. 4, the computing device may include: a processor 402, a communication interface 404, a memory 406, and a communication bus 408.

Wherein:

the processor 402, communication interface 404, and memory 406 communicate with each other via a communication bus 408.

A communication interface 404 for communicating with network elements of other devices, such as clients or other servers.

The processor 402 is configured to execute the program 410, and may specifically execute the relevant steps in the above embodiments of the content similarity-based news cover mapping method.

In particular, program 410 may include program code comprising computer operating instructions.

The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.

A memory 406 for storing the program 410. The memory 406 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.

The program 410 may be specifically configured to cause the processor 402 to execute the content similarity-based news cover mapping method in any of the above method embodiments. For the specific implementation of each step in the program 410, reference may be made to the corresponding steps and unit descriptions in the above embodiments of content similarity-based news cover mapping, which are not repeated here. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not repeated here.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in accordance with embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.

The invention discloses: A1. a method for content similarity-based mapping of news covers, the method comprising:

obtaining a bag-of-words vector corresponding to the news to be matched according to first word frequency data of each word in the news corpus corresponding to the news to be matched; performing topic analysis on the bag-of-word vector corresponding to the to-be-matched graph news to obtain a topic vector corresponding to the to-be-matched graph news;

and searching a news sample matched with the news of the graph to be matched from a news sample library according to the theme vector corresponding to the news of the graph to be matched, and determining the news cover of the matched news sample as the news cover of the news of the graph to be matched.

A2. The method according to A1, wherein before the extracting of the news headlines and the news contents of the news to be matched and obtaining of the news corpus corresponding to the news to be matched, the method further includes:

extracting the news title and the news content of each news sample from the news sample library to obtain a news corpus corresponding to each news sample, and constructing a news corpus by using the news corpora corresponding to all the news samples; wherein the news sample contains a news cover;

obtaining a word bag vector corresponding to each news sample according to second word frequency data of each word in the news corpus corresponding to each news sample in the news corpus corresponding to the news sample; and performing theme analysis on the bag-of-word vector corresponding to each news sample to obtain a theme vector corresponding to each news sample.

A3. The method according to A2, wherein the extracting the news headlines and the news contents of each news sample from the news sample library to obtain the news corpus corresponding to each news sample further includes:

screening out extracted stop words and preset common words contained in news titles and news contents of the news samples aiming at each news sample in the news sample library to obtain a preprocessed corpus corresponding to the news samples;

calculating third word frequency data of each word in the preprocessed corpus corresponding to all news samples;

and obtaining a news corpus corresponding to the news sample by using all words of which the third word frequency data in the preprocessed corpus accord with preset word frequency conditions.

A4. The method according to A2 or A3, wherein the obtaining the bag-of-word vector corresponding to each news sample according to the second word frequency data of each word within the news corpus corresponding to that news sample further includes:

calculating to obtain reverse frequency data of each word in the news corpus corresponding to each news sample by utilizing the total quantity of the news corpora in the news corpus and the quantity of the news corpora containing the word;

obtaining a word vector of each word according to second word frequency data of each word in a news corpus corresponding to the news sample and reverse frequency data of each word;

and obtaining a word bag vector corresponding to the news sample by using the word vectors of all words in the news corpus corresponding to the news sample.

A5. The method according to any one of A2 to A4, wherein the performing topic analysis on the bag-of-word vector corresponding to each news sample to obtain a topic vector corresponding to each news sample further comprises:

and inputting the bag-of-word vector corresponding to each news sample into the trained topic analysis model to obtain the topic vector corresponding to each news sample.

A6. The method according to any one of A1 to A5, wherein the searching the news sample matching the news to be matched from the news sample library according to the topic vector corresponding to the news to be matched further comprises:

calculating the Euclidean distance between the theme vector corresponding to the news of the graph to be matched and the theme vector corresponding to each news sample in the news sample library;

selecting a news sample with the minimum Euclidean distance between topic vectors corresponding to the news to be matched with the graph from all news samples;

if the number of the selected news samples is one, determining the selected news samples as news samples matched with the news of the graph to be matched;

if the number of the selected news samples is multiple, calculating the similarity between the news headline of the news to be matched with the image and the news headline of each selected news sample; and determining the news sample with the maximum similarity as the news sample matched with the news to be matched.

The invention also discloses: B7. a content similarity-based news cover mapping apparatus, the apparatus comprising:

the first processing module is suitable for obtaining a word bag vector corresponding to the news to be matched according to first word frequency data of each word in the news corpus corresponding to the news to be matched; performing topic analysis on the bag-of-word vector corresponding to the to-be-matched graph news to obtain a topic vector corresponding to the to-be-matched graph news;

and the matching module is suitable for searching a news sample matched with the news of the graph to be matched from a news sample library according to the theme vector corresponding to the news of the graph to be matched and determining the news cover of the matched news sample as the news cover of the news of the graph to be matched.

B8. The apparatus of B7, wherein the apparatus further comprises:

the second generation module is suitable for extracting the news title and the news content of each news sample from the news sample library to obtain news corpora corresponding to each news sample;

the building module is suitable for building a news corpus by utilizing news corpora corresponding to all news samples; wherein the news sample contains a news cover;

the second processing module is suitable for obtaining a bag-of-words vector corresponding to each news sample according to second word frequency data of each word in the news corpus corresponding to each news sample in the news corpus corresponding to the news sample; and performing theme analysis on the bag-of-word vector corresponding to each news sample to obtain a theme vector corresponding to each news sample.

B9. The apparatus of B8, wherein the second generating module is further adapted to:

B10. The apparatus of B8 or B9, wherein the second processing module is further adapted to:

B11. The apparatus of any one of B8-B10, wherein the second processing module is further adapted to:

B12. The apparatus of any one of B8-B11, wherein the matching module is further adapted to:

The invention also discloses: C13. a computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the content similarity-based news cover mapping method as described in any one of A1-A6.

The invention also discloses: D14. a computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the content similarity-based news cover mapping method as described in any one of A1-A6.

Claims

1. A method for content similarity-based mapping of news covers, the method comprising:

2. The method according to claim 1, wherein before the extracting of the news headlines and the news contents of the news to be matched to obtain the news corpus corresponding to the news to be matched, the method further comprises:

3. The method of claim 2, wherein the extracting the news headlines and the news content of each news sample from the news sample library to obtain the news corpus corresponding to each news sample further comprises:

4. The method according to claim 2 or 3, wherein the obtaining of the bag-of-word vector corresponding to each news sample according to the second word frequency data of each word in the news corpus corresponding to each news sample in the news corpus corresponding to the news sample further comprises:

5. The method of any one of claims 2 to 4, wherein the performing topic analysis on the bag-of-word vector corresponding to each news sample to obtain a topic vector corresponding to each news sample further comprises:

6. The method according to any one of claims 1 to 5, wherein the searching the news sample matching the news to be matched from the news sample library according to the topic vector corresponding to the news to be matched further comprises:

7. A content similarity-based news cover mapping apparatus, the apparatus comprising:

8. The apparatus of claim 7, wherein the apparatus further comprises:

9. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the content similarity-based news cover mapping method according to any one of claims 1-6.

10. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the content similarity-based news cover mapping method of any one of claims 1-6.